Proj 1: Data Query

Posts: 160
Joined: Wed Sep 22, 2010 12:26 pm

Proj 1: Data Query

Post by glegrady » Tue Jan 14, 2014 2:44 pm

Please review previous posts at this forum from 2013 and 2012 prior to starting yours.

Your query should be in the "spl2" table at and can be in "inraw" or in any other tables. Please answer the following:

State an interesting Question: It can be cultural, system based, compare data, etc. The query should be exploratory and/or compelling in some way.

Give the MySQL Query:

Explain the Query if necessary:

Provide Results through a table or as you wish: (a Data printout to .csv file)

Give the processing Time: How long the search?

Comment & Analysis: DIscuss what you found through the query process in the data

Click "PostReply" and paste your info there. Attach pdfs files, or anything else. If you use give the link but make sure to duplicate your data into the Post Reply

Assignment is due Tues Jan 21

PS Contact us if you dont have your password.
George Legrady

Posts: 5
Joined: Tue Jan 14, 2014 11:46 am

Re: Proj 1: Data Query

Post by mohithingorani » Sat Jan 18, 2014 11:16 pm

Assignment 1 by Mohit Hingorani

Harry Potter is one of the best selling fantasy literature novels of all time. It is estimated that over 400 million books have been sold till date!
Question is how many kids read the entire series? How many kids lose interest? Which ones are the most popular?
The last book was published in July 2007. Hence I can safely assume that all the readers are reading the books for the first time.I am interested in finding out how many kids started reading Harry Potter books through the library in the last year, and how many finished the entire series.

here is a link to the CSV files: ... xmjIY2/csv
(53.61 KiB) Downloaded 238 times

Posts: 4
Joined: Tue Jan 14, 2014 11:47 am

Re: Proj 1: Data Query

Post by grant.mckenzie » Sun Jan 19, 2014 5:31 pm

One of the most interesting aspects of a dataset such as the one provided by the Seattle Public Library, is its ability (when asked the proper questions) to speak on user behavior . Given the completeness of the dataset, one can explore trends in the data and arguably infer user-interests based on these trends.

For this assignment I chose to explore how interest in subject matter changes over time. To do this I decided to investigate the variance in the items that users checked-out based on the Dewey Decimal Class of the medium. Variance is computed as it allows one to compare values between temporal units (e.g., month or day). A large variance indicates that media related to a wide range of classes were checked-out of the library, while a small (relative) variance indicates that the checkout history during the specified time period did not vary much in terms of subject (Dewey Class). It is important to note here that the Dewey Decimal Class system is not assigned to all media in the Library dataset. Given this, the results of the queries, as well as any inferences, are based on a subset of the library data.

Since Variance alone can often be misleading, it was important to also list the total number of items checked out as well as the count for each Dewey Class.

The two queries shown in the attached PDF show classification Variance by Month and by Day of the Week.
(81.81 KiB) Downloaded 267 times

Posts: 1
Joined: Tue Jan 14, 2014 11:45 am

Re: Proj 1: Data Query

Post by bhargavi19 » Sun Jan 19, 2014 7:10 pm

Question: What is the impact on the popularity of a book after the release of its film adaptation? If the book and film adaptation are part of a series, what is the impact on the sequels or prequels?

As a pilot, I decided to work on 'The Hunger Games' by Suzanne Collins and 'The Twilight Saga' by Stephenie Meyer. I chose these two texts because of the timeline constraints of the database. 'The Hunger Games Trilogy' were published in 2008, which makes it possible to track the record of transactions in the database since its publication. The corresponding movies however, do not fit in the timeline of the database (the second movie was released in November 2013 and the last of the series is still under production). In the case of 'The Twilight Saga', the first book was published a year before the database begins, but the subsequent books and the corresponding films have all been released around the timeline of the database. Both these texts are comparable as artifacts of popular culture. A comparison of the popularity of the these two book series and their corresponding films, taken together, can circumvent the constraints of timelines in the database.

Based on the Seattle Public Library's database, it is possible to obtain a record of the number of transactions of the books since their time of publication (from the library). A comparison of the release dates of the films (which are easily available) and the transactions of the books over time can give a clue as to whether the film adaptation has impacted (and whether the impact is positive,negative; positive impact is manifest as an increase in readership of the book and negative impact is a decrease in readership) the popularity of the book or not.

MySQL query:
Select title, year(cout), month(cout), count(*), SUM(CASE WHEN itemtype ='accd' THEN 1 ELSE 0 END) CD, SUM(CASE WHEN itemtype ='acdvd' THEN 1 ELSE 0 END) DVD, SUM(CASE WHEN itemtype ='acbk' THEN 1 ELSE 0 END) Book from inraw where date (cout)!='1970-01-01' and year (cout)>='2006' and year (cout)<='2014'and title like 'Twilight' group by year(cout), Month (cout) order by cout ASC;

The query is to display in ascending order of check out time, the number of transactions of each title (in this case it is 'Twilight') and to divide the transactions based on the type of item (as books, CDs or DVDs) and to display each of those and the corresponding transactions in separate columns.

Processing Time:
About 70 secs for each query.

Observations, Comments and Analysis:

In the case of the first book in the series (Twilight, 2005) there is an increase in the number of transactions even before the release of the movie (on Nov 21, 2008) and the popularity of the book reaches its peak in the early months of 2009. After that, the release of subsequent films of the sequel does not seem to have raised the popularity of the book. In fact, the number of transactions steadily drop.
The popularity of the second book (New Moon, 2006)) in the series also seems to increase around the same time (November 2008 onwards), reaching an all-time high of 377 transactions around May 2009. There is another peak in its popularity around November 2009, which is the release date of its corresponding movie (though this number does not exceed the first peak in 2008) and after that, the popularity slowly fades. The release of the subsequent movies does not correspond with an increased transaction.
The third book (Eclipse, 2007) also shows an increase in the number of transactions around the date of release of the first movie in November 2008, but as in the previous cases, the peak is in early months of 2009. There is an increase in the popularity around the time of release of its corresponding movie (in June 2010), but its peak seems to be in January 2011 (and this is marginally lower than its first peak in May 2009). After this, its popularity steadily declines. The release of the last movie in 2011 and 12 do not correspond with increased transactions (at least not in the scale previously observed).
The last book (Breaking Dawn, 2008) is more popular than the prequels at the time of their release and its popularity is highest in early months of 2009 (as in the case of other books). The release of subsequent movies does not correspond with an increased popularity. Infact, the number of transactions are quite low at the time of release of its corresponding movie.
Though it is impossible to establish a cause and effect relationship between the increase in number of transactions of each book and the release of the film adaptation, the data shows interesting peaks in their popularity around the time of release. It provokes research into the nature and timeline of publicity and advertising strategy of the books and movies.
It is important to note that the number of transactions is not necessarily an accurate marker of the popularity or demand of the book since the database does not record the number of ‘requests’ submitted by users when all the books are in circulation. A dataset that maps the increase (if any) in the acquisition of the books would be a useful supplement to mark the popularity of the book. Also, it does not reflect the number of books that could have been bought or read online as ebooks or kindle editions, which are also important markers of popularity of a text. The data set should be further supplemented with sales or popularity charts of the book from online and physical bookstores (likewise, pre-orders of sequels should also be taken into account).
The first novel of the ‘Twilight’ Series was released a year before the start of the database. There is no way to compare the popularity of the series at the time of its publication with its popularity at the time of release of the movie. This may or may not be a serious lack, depending on how the book was received. This makes inclusion of data outside the library database even more crucial for analysis. If we think of ‘The Hunger Games’ and ‘The Twilight Saga’ as comparable artifacts, then we can offset the lack of data after the publication of the book by using that of ‘The Hunger Games’. As is obvious from the data, the book was quite popular on its release (after the first three months atleast) and its popularity only increases steadily till the release of its film adaptation. The month after the release of the movie The Hunger Games (March, 2012), the number of transactions peaks to more than 400 and stays there for a few months. Though the popularity of the book does not fade away completely, the release of the second movie in November 2013 does not increase the number of transactions as drastically as the first film. All of this data should of course, be supplemented with an evaluation of each of the films on their own merit and with data about the popularity of each film and how well it was received, to be conclusive.
Lastly, it is crucial to note that the data obtained is noisy. The book ‘Eclipse’ for example, was released only in 2007. The data however, shows transactions under that name from the year 2006. The query in this case is not sharply defined to include only the exact novels by Stephanie Meyer. The marketing strategy of the book could probably be responsible for some amount of the noise. Since the release of the first movie ‘Twilight’, the entire series was dubbed ‘The Twilight Saga’. So independent titles such as ‘Eclipse’ could be dubbed under ‘Twilight Saga’, blurring categories based on titles, which could make this particular query, quite noisy.
Although the data shows interesting peaks of correspondence, one should be careful not to base any definite conclusions about the relationship between books and film adaptations based on this dataset alone. This dataset merely offers interesting points of entry into the analysis of different media objects and their co-existence. It can be a useful starting point to study strategies of marketing and advertisement for film adaptations of books.

I was unable to upload the csv files as attachments to this page. Here are the links to view the files. ... sp=sharing ... sp=sharing ... sp=sharing ... sp=sharing ... sp=sharing
Last edited by bhargavi19 on Mon Jan 20, 2014 7:34 pm, edited 1 time in total.

Posts: 2
Joined: Tue Jan 14, 2014 11:41 am

Re: Proj 1: Data Query

Post by siladitya_dey » Mon Jan 20, 2014 6:09 pm

Please find attached my response.

Side-note: I've been plagued with quite a few MySQL Error 2013's (connection time-out to database). Is there a flexibility in increasing the allowable number of connections in the pool?
(55.88 KiB) Downloaded 223 times

Posts: 4
Joined: Tue Jan 14, 2014 11:48 am

Re: Proj 1: Data Query

Post by songgaogeo » Mon Jan 20, 2014 10:04 pm

Question: Does the rainy weather in Seattle influence the check-in and check-out behavior in the Seattle Public Library?
Seattle is famous for its rainy weather and the average annual rainfall is approximately 38 inches. According to the Weatherdb website , Seattle has a relatively high amount of days per year with precipitation (158), compared to such rainy places as New York (119), Boston (127), and Nashville (119). All cities that get an average of about 16% more rain per year than Seattle, but also average between them about 36 less days a year of precipitation. Because of the entertainment industry producing things like Sleepless in Seattle, Frasier, and the works which portray it as such, most people think to go outside in Seattle without an umbrella is tantamount to committing troublesome. It inspired me to investigate whether the rainy weather did affect the users’ check-in and check-out behavior in the Seattle Public Library. The intuition is that the probability of check behavior will decrease if it does rain heavily outside in Seattle, which has influenced on the urban transportation and daily commuting behavior of citizens. To this end, I need to compare the number of total check-outs and check-ins with the seasonal, monthly and daily patterns of rainfalls in Seattle. The weather data can be queried and downloaded from the U.S. National Weather Service Forecast Office website , but it only limits on latest five years, i.e. starting from 1 January, 2009. Therefore, I will also query the library data from then on. Note that the Winter season usually starts from last December, so we also need to include the data during that time period.

The report is attached in the PDF file.
(403.21 KiB) Downloaded 295 times
Query Error Note: 2013. Lost connection to MySQL server during query 600.495 sec
Is possible to increase the time-out limits? It took very long time to make a simple query if there are multiple connections to the database.

Posts: 4
Joined: Tue Jan 14, 2014 11:43 am

Re: Proj 1: Data Query

Post by m_uppal » Mon Jan 20, 2014 11:52 pm

The 9/11 attacks on the US were one of the biggest attacks by the terrorist group, Al Queda, which still remains a mystery for many.
Examining the impact of 9/11 attacks on the US soil, I look deep into people's proclivity towards gaining knowledge about the attack and related. I look into books, and other media like CD and DVD's.
I divide my query into 2 main domains, primarily 2001- 2002-2006 and 2006-2011; ten years after the attack!
This gives a very interesting result specifically, a sharp surge towards end of every year, around September to December, from 2003 to 2011. This definitely has a great scope for further research and Visualization
Rest of the study is shared via google document: ... sp=sharing

Posts: 4
Joined: Tue Jan 14, 2014 11:54 am

Re: Proj 1: Data Query

Post by hellobuaazl » Tue Jan 21, 2014 3:23 am

Sports are very popular in the US, so I want to find if the quantity of books checked out is in accordance with the popularity and game time of several popular sports.
(268.23 KiB) Downloaded 287 times

Posts: 4
Joined: Tue Jan 14, 2014 11:50 am

Re: Proj 1: Data Query

Post by currier » Tue Jan 21, 2014 10:00 am

How do the item number and bar code relate to physical resources at the Seattle Public Library?

The barcode field only appears in two tables of the spl2 database: inraw and outraw. Tables derived from these (e.g. activity, callnum, collection) include the item number but not the bar code as columns, suggesting that the bar code is redundant information. To investigate this I asked several questions (all queried on inraw).
(285.27 KiB) Downloaded 201 times
(330.53 KiB) Downloaded 295 times
Last edited by currier on Thu Feb 27, 2014 6:25 pm, edited 2 times in total.

Posts: 2
Joined: Tue Jan 14, 2014 11:43 am

Re: Proj 1: Data Query

Post by kevin.g.deweese » Tue Jan 21, 2014 10:43 am

Last week it was brought up in class that inter library loans (ILL) were a source of anomalies in the data. I was curious just how much ILL activity has occured throughout the data collection period and how this activity has changed over time. I was also curious what the net loaned/borrowed ratio would be for a sizeable library.

SELECT count( * )
FROM activity
INNER JOIN kind ON kind.item = activity.item
WHERE kind
IN ("dcillb")
AND o BETWEEN "date1" AND "date2"

This query simply counts the number of ILL borrowed "dcillb", which are checked out between 2 dates.

Query Time:
I think the database was being finicky because some of these queries took around 16 minutes to run while others ran in a few seconds.

See the attached .pdf file for number of checkouts in each 6-month interval.

There is a sharp increase in the number of checkouts between 2005 and 2006, and then checkouts rise steadily. Strangely the number of checkouts drops back down an order of mangitude in the latter half of 2009 and remained that way for the rest of the time intervals. I wonder if this is due to some change in how checkouts are recorded as it is hard to believe that the demand for ILL dropped by that much. If the demand did actually drop this much perhaps this is due to e-books? What I really don't understand is that I also ran a query just counting the total number of ILL lent "dcilll" and only 1 result was returned on the entire dataset. Is this code not actually being used and there is another way I should be tracking lent material?
(12.22 KiB) Downloaded 270 times

Post Reply