1) Track the checkout history of the item or topic over the 10 year history to make sure there is adequate activity to study.
2) Compare the item/topic’s history to other similar ones. Select one or more items to see if performance patterns are comparable. Identify repeating patterns, and irregularities. Look for interesting, unexpected occurrences.
What to look for:
. Search for (unexpected) patterns in the data
. How often something happens (frequency)
. Anomalies in the system (errors, outliers, etc.) http://en.wikipedia.org/wiki/Anomaly_detection
. Association: Search for relationships between variables
. Do statistical analysis using MySQL aggregate functions: http://dev.mysql.com/doc/refman/5.6/en/ ... tions.html
Your Query Search Details
. Download a minimum 3, preferably 4 columns of metadata – this will be used for horizontal, vertical, and depth values for your visualization
. Best to sequence columns to sequence of how they may be used: (col1: vertical, col2: horizontal, col3: pixel value, etc.)
. Your search results should be sufficient to feature subtleties in the data to be expressed visually within a screen size between 1920x 1280 to 2560 x 1850 pixels
. Output to csv file
. Imagine a question
. Explore all fields of the database
. Test the query for logical and other errors
. Start with Booleans (AND, OR NOT), wildcards
. Make the query efficient (finetune)
. Interpret the results (data analysis)
. Export to .csv file to be used in visualization
Your Assignment Report:
. Concept/Question (describe what question you are exploring)
. Provide the Query
. Explain the Query
. Provide the results
. Give Processing Time
. Give an Analysis (report back on your results, explain how you did it, and what your outcomes are)
Post your Project as a PostReply to this description
. Standard Completion of Project: B
. Revised, advanced functions: B+, A-
. Innovative query with significant retrieved data: A
I wanted to look into the effect of the movie releases of the Harry Potter films has on the checkout frequency of the books and the films. Do the release of the movies increase the checkout frequency of the books? Is there a delay on the interest from the time the film comes out? How does this change with the different books?
I searched through the different item types for Harry Potter related material, and only found significant frequencies in books, dvds, and cds. I wanted to look at the books/films after and including the Goblet of Fire, which was released in theaters in November 2005.
First, we run the following query
Code: Select all
SELECT * FROM spl_2016.inraw WHERE (title LIKE '%harry potter and the goblet of fire%') AND (year(cout) = 2006) AND (itemType LIKE '%jccd%' OR itemType LIKE'%accd%') ORDER BY year(cout), month(cout)
The next query collects sums on all the titles that match the book/film title and the different item types and groups and orders them by month and year.
Code: Select all
SELECT year(cout), month(cout), sum( case when itemType LIKE '%bk%' then 1 else 0 end) as book, sum( case when itemType LIKE '%dvd%' then 1 else 0 end) as dvd, sum( case when itemType='jccd' then 1 else 0 end) as cd, sum( case when itemType='accd' then 1 else 0 end) as soundtrack FROM spl_2016.inraw WHERE (title LIKE '%harry potter and the goblet of fire%') AND (year(cout) > 1970) GROUP BY month(cout), year(cout) ORDER BY year(cout), month(cout)
We can find that the DVD release date for this movie is March 2006, and we see that interest in the DVD checkouts peaks in July 2006, 4 months after the DVD release.
We can run the same query as the second one but replace "the goblet of fire" with "the order of the phoenix" to see the results for the next movie, which was released in theaters in July 2007 and released to DVD in November 2007.
Interest in the book peaks during the same month as the theater release, but the cd interest peaks a few months earlier, possibly due to children (main consumers of harry potter audio books) needing more time to finish the audio book before the film comes out. As for interest in the DVDs, it jumps from 0 to 191, so it must have been made available that same month, and the peak interest occurs 3 months after the DVD release.
We can again run the same query for "the half blood prince" which was released in theaters in July 2009, and release to DVD in December 2009.
We can see that interest in the book peaks the month before the theater release, whereas the interest in the audio book on cd seems to have a relative dip during this period. Interest in the DVD peaks 3 months after the dvd release, which is consistent with the last two films.
The final book in the series is split into two films, so we will perform one MySQL command to see how both movies affected the interest in the book.
Code: Select all
SELECT year(cout), month(cout), sum( case when itemType LIKE '%bk%' then 1 else 0 end) as book, sum( case when itemType LIKE '%dvd%' then 1 else 0 end) as dvd, sum( case when itemType='jccd' then 1 else 0 end) as cd, sum( case when itemType='accd' then 1 else 0 end) as soundtrack FROM spl_2016.inraw WHERE (title LIKE '%harry potter and the deathly hallows%') AND (year(cout) > 1970) GROUP BY month(cout), year(cout) ORDER BY year(cout), month(cout)
Interest in the books peaks on both theater releases dates, with the first one being 10% higher and about a month longer. The audio book cd interest frequency is similar to the books, but checked out about half as frequently. Interest in the DVDs peaked only a month after the DVD release of part1, but 2 months after the DVD release of part2.
Analysis: From all this we can see that people like to read the books just as the movies are being released to theaters, with audio books sometime peaking sooner. The relative popularity of books and movies remains relatively constant over time. We also see that interest in the DVDs peaks just few months after the DVD release. Furthermore, interest for all Harry Potter items seems to slowly decrease and plateau from 2012 until now. The processing time for these queries each took about 30 seconds.
Code: Select all
Query1: select a.dewey, avg(case when a.itemtype like '%bk%' then a.length else 0 end) as book, avg(case when a.itemtype like '%cas%' or a.itemtype like '%cd%' or a.itemtype like '%disk%' or a.itemtype like '%dvd%' or a.itemtype like '%rec%' or a.itemtype like '%vhs%' then a.length else 0 end) as media from (select distinct bibNumber, itemtype, floor(deweyClass) as dewey, length(title) as length from spl_2016.inraw where cout > '2006-01-01 00:00:00' and floor(deweyClass) != '') as a group by a.dewey;
Explain: I did 3 queries based on different dewey class level: dewey classes, dewey divisons and dewey sections. The queries are very similar. First, I calculate the title length of all distinct items checkout after 2006(10-year data), then I calculate the average title length of different items (I classify items into two groups: books and media) and grouped them by dewey class. And finally I calculate the standard deviation of title length for query3 using stddev() function.
I just divided items into two groups and left misc out because of two reasons: first, the data volume of misc is not a lot, so the result is not convincing from statistical perspective. Second, misc contains many different item types (acfold, acart, etc.) in the database, it slowed down the query.
Standard deviation for book is 8.60, media is 2.59 in query3. The average title length for book is 41.58, and average for media is 3.72.
Processing time: 147.521s for query1, 147.683s for query2, 146.649s for query3. For query1, 916 columns were returned (not 1000 columns), because some sections are unassigned, and some are no longer used (I left out the data whose dewey class is null in my query).
(I used INNER JOIN at first to join spl_2016.deweyClass with spl_2016.title, but I find INNER JOIN is much slower than using function floor() to calculate dewey classes in spl_2016.deweyClass. So INNER JOIN is not very time-efficient here.)
Analysis: Generally speaking, the title length of book is obviously greater than title length of media just as I expected (books tend to have longer name than movies or songs).
At ‘class’ level, what’s interesting is class 400(language) and class 700(art&recreation) have both shorter title in books and longer title in media compared to other classes. Class 800 also has short title in book.
At ‘division’ level, 410-460(English and other European languages), 780(music) are very different from the overall tendency, they are the reason class 400 and 700 are different from other classes. 920(Biography & genealogy) also has a longer title.
At ‘section’ level, more details are observed(although there is no data in some sections in media). Some outliers are found: 7, 39, 75, 83, 114, 127, 167, 219, 240, 441, 781, 789 and etc. And section 7-9 is unassigned, 39 is about language, 75 is newspaper about Italy, 83 is about Germanic language, 114 is about space. It seems the title length has something to do with language, as many of these outliers are related to languages. And there are more information to be explored.
- (58.33 KiB) Downloaded 73 times
Query: The query uses a series of SUM(IF...) statements to get a frequency count of 13 common hobbies. I used this page to find Dewey classifications that relate to common hobbies... http://bpeck.com/references/ddc/ddc.htm. I grouped the observations by the day, so there are 3936 observations. There is also an overall count for all materials--the daily average is around 40,000 checkouts! The hobbies I decided to track are...
- - Cooking
- Office Skills
- Performing Music
- Outdoor Activities
- Indoor Activities
Code: Select all
# Jason Freeberg # January 15th, 2017 # Query for first project: 2D Frequency Mapping SELECT DAY(t.checkOut) AS dayOut, MONTH(t.checkOut) AS monthOut, YEAR(t.checkOut) AS yearOut, SUM(IF(FLOOR(d.deweyClass) = 641, 1, 0)) AS cooking, SUM(IF(FLOOR(d.deweyClass) = 651, 1, 0)) AS officeSkills, SUM(IF(FLOOR(d.deweyClass) = 694, 1, 0)) AS woodWorking, SUM(IF(FLOOR(d.deweyClass) BETWEEN 783 AND 788, 1, 0)) AS musicPerf, SUM(IF(FLOOR(d.deweyClass) BETWEEN 796 AND 799, 1, 0)) AS outdoorActivities, SUM(IF(FLOOR(d.deweyClass) = 794, 1, 0)) AS indoorActivies, SUM(IF((FLOOR(d.deweyClass) BETWEEN 4 AND 6) OR (d.deweyClass = 519.7), 1, 0)) AS programming, SUM(IF((FLOOR(d.deweyClass) = 646) OR (FLOOR(d.deweyClass) = 648), 1, 0)) AS houseKeeping, SUM(IF(FLOOR(d.deweyClass) = 657, 1, 0)) AS accounting, SUM(IF((FLOOR(d.deweyClass) = 631) OR (FLOOR(d.deweyClass) = 635), 1, 0)) AS gardening, SUM(IF(FLOOR(d.deweyClass) BETWEEN 740 AND 743, 1, 0)) AS drawing, SUM(IF(FLOOR(d.deweyClass) BETWEEN 750 AND 751, 1, 0)) AS painting, SUM(IF(FLOOR(d.deweyClass) = 910, 1, 0)) AS travel, COUNT(t.checkOut) AS dailyTotal FROM spl_2016.transactions AS t JOIN spl_2016.deweyClass AS d ON d.bibNumber = t.bibNumber WHERE YEAR(t.checkOut) between 2006 AND 2017 GROUP BY dayOut , monthOut , yearOut ORDER BY yearOut ASC , monthOut ASC , dayOut ASC;
Caveats: The query took longer than the 10 minute limit, so I broke the query into 4 smaller queries and concatenated the .csv files in R. Thanks for the idea, Professor! I don't know how long the query would take in its entirety.
- Full file, 2006 - 2017
- (252.53 KiB) Downloaded 85 times
Concept/Question: What fiction items are checked out once, or checked out once and never returned?
Provide the Query:
Code: Select all
SELECT c, checkout, checkin, title, callNumber, CASE WHEN callNumber LIKE '%SCI-FIC%' THEN 1 ELSE 0 END AS sci, CASE WHEN callNumber LIKE '%MYSTERY%' THEN 1 ELSE 0 END AS mys, CASE WHEN callNumber LIKE '%WESTERN%' THEN 1 ELSE 0 END AS wes, CASE WHEN callNumber LIKE 'FIC%' or callNumber LIKE 'CAS FIC%' THEN 1 ELSE 0 END AS fic, CASE WHEN callNumber LIKE 'YA%' or callNumber LIKE 'CAS YA%' THEN 1 ELSE 0 END AS YA, t.bibNumber FROM (SELECT COUNT(checkOut) AS c, spl_2016.itemToBib.bibNumber FROM spl_2016.callNumber join spl_2016.itemToBib on spl_2016.callNumber.itemNumber = spl_2016.itemToBib.itemNumber INNER JOIN spl_2016.transactions ON spl_2016.transactions.bibNumber = spl_2016.itemToBib.bibNumber where callNumber LIKE '%SCI-FIC%' OR callNumber LIKE '%MYSTERY%' OR callNumber LIKE '%WESTERN%' OR callNumber LIKE 'FIC%' OR callNumber LIKE 'YA%' OR callNumber LIKE 'CAS YA%' OR callNumber LIKE 'CAS FIC%' GROUP BY spl_2016.itemToBib.bibNumber ORDER BY c) t INNER JOIN spl_2016.transactions ON spl_2016.transactions.bibNumber = t.bibNumber LEFT JOIN spl_2016.callNumber ON spl_2016.callNumber.itemNumber = spl_2016.transactions.itemNumber LEFT JOIN spl_2016.title ON spl_2016.transactions.bibNumber = spl_2016.title.bibNumber LEFT JOIN spl_2016.deweyClass ON spl_2016.transactions.bibNumber = spl_2016.deweyClass.bibNumber WHERE c = 1 AND YEAR(checkout) > 2004 and deweyClass = "" ORDER BY checkout;
First I wrote a subquery that counts how many times each fiction book is checked out. I joined the callNumber table with the transactions table where they callNumber was like one of the book types I was searching for. I grouped the query by bibNumber counting the checkout of each book. I really wanted to group the query by itemNumber but the query takes too long to complete.
Next I joined the subquery to transactions, callNumber, title and deweyClass to get data about each book where there had only been one checkout the year that the book was checked out was greater than 2004 and it had no dewey class (because it was fiction). I wrote a case statement to categorize the fiction based on the call number.
I found that having a join instead of a left join with itemToBib table in the subquery dramatically increased efficiency. This dropped it from 340.044 seconds to .069 seconds, though it increased the fetch from .078 to .242 seconds.
Provide the results:
Give Processing Time:
runtime duration: .069 sec, fetch: 0.242 sec, 1460 rows
Give an Analysis:
My outcome is a table of 1460 entries, each with a bib entry that has only been checked out once. I really wanted to do the analysis with itemNumbers instead of bibNumbers, to see which books had multiple items that were checked out only once, or stolen repeatedly. I could not manage to write an efficient enough query to get that data and the queries would take over 10 minutes to execute.
I found that out of the 1460 bib entries checked out only once 76 were science fiction 233 were mysteries 81 were westerns 1012 were just "fiction" and 58 were YA fiction. Also 875 of these books have not been returned. Some of that may be due to being books that were just checked out for the first time, but according to the Seattle public library website any book can be checked out for up to 63 days (including renewals), only 123 books are within that timeframe. That means that 51% of the books that were checked out only once are overdue even if the patron renewed the book as many times as possible.
For the visualization I plan to display this data with dates on the x axis and items in order by cal number on the y axis. The item would have a line from it's checkout to checkin date. The color would be the genre. The items would be visually different if the book was never returned.
Originally I looked at all the books that were checked out and had not been returned but the data was very messy and there were 26224 entries. When I tried to split them up categorically many books were ILL books (10266) or could not categorized by the dewey decimal system (4896). The interlibrary loan books were the worst because they had no metadata. I went through the call numbers of the 4896 entries to see if there was any way to categorize them, I decided to focus on fiction. I thought that focusing on fiction would give me a smaller subset of data.
Update: 1/19/2016 1:58 pm
I completed the query using itemNumbers instead of bibNumbers. I also noticed I wasn't collecting all of the audiobooks so I updated the query to handle them.
Code: Select all
SELECT c, checkout, checkin, title, callNumber, CASE WHEN callNumber LIKE '%SCI-FIC%' THEN 1 ELSE 0 END AS sci, CASE WHEN callNumber LIKE '%MYSTERY%' THEN 1 ELSE 0 END AS mys, CASE WHEN callNumber LIKE '%WESTERN%' THEN 1 ELSE 0 END AS wes, CASE WHEN callNumber LIKE 'FIC%' or callNumber LIKE 'CAS FIC%' or callNumber LIKE 'CD FIC%' THEN 1 ELSE 0 END AS fic, CASE WHEN callNumber LIKE 'YA%' or callNumber LIKE 'CAS YA%' or callNumber LIKE 'CD YA%' THEN 1 ELSE 0 END AS YA, spl_2016.transactions.bibNumber, t.itemNumber FROM (SELECT COUNT(checkOut) AS c, spl_2016.callNumber.itemNumber FROM spl_2016.callNumber INNER JOIN spl_2016.transactions ON spl_2016.transactions.itemNumber = spl_2016.callNumber.itemNumber where callNumber LIKE '%SCI-FIC%' OR callNumber LIKE '%MYSTERY%' OR callNumber LIKE '%WESTERN%' or callNumber LIKE 'FIC%' or callNumber LIKE 'CAS FIC%' or callNumber LIKE 'CD FIC%' or callNumber LIKE 'YA%' or callNumber LIKE 'CAS YA%' or callNumber LIKE 'CD YA%' GROUP BY spl_2016.callNumber.itemNumber ORDER BY c) t INNER JOIN spl_2016.transactions ON spl_2016.transactions.itemNumber = t.itemNumber LEFT JOIN spl_2016.callNumber ON spl_2016.callNumber.itemNumber = spl_2016.transactions.itemNumber LEFT JOIN spl_2016.title ON spl_2016.transactions.bibNumber = spl_2016.title.bibNumber LEFT JOIN spl_2016.deweyClass ON spl_2016.transactions.bibNumber = spl_2016.deweyClass.bibNumber WHERE c = 1 AND YEAR(checkout) > 2004 and deweyClass = "" ORDER BY checkout;
Give Processing Time:
runtime duration: 684.374 sec, fetch: 0.170 sec, 18757 rows
Give an Analysis:
There are books that have been checked out repeatedly and never returned. For example there are 58 items for "Moonglow a novel" by Chabon, that have been checked out and never returned. 25 items of "books on bikes that were checked out only once. 24 items of "best American short stories and the yearbook of the American short story" that have only been checked out once. 23 copies of "sellout" have only been checked out once. There is a chance that these are errors, but according to the catalog there are 108 copies of "best american short stories", so it doesn't seem unreasonable that 1/4 of them have only been checked out once. There are 14890 distinct bibNumbers for the 18757 items. 2469 are books that have multiple items that has only been checked out once for the same bibNumber.
I still would like to expand this query to all items, so for example it would grab all the entries for "catcher in the rye" because there is one of the items that has only been checked out once.
I think that this is an interesting dataset, but there is still a lot of room to think about how to organize it.
Patrons may turn to the library for resources to assist in acquiring new skills, such as mastering new programming languages. In the midst of a financial crisis in mid-2008 and a growing tech boom in the Pacific Northwest that began a meteoric rise around 2010 and continues to this day.
Code: Select all
The results are binned by programming language and are summarized by month of each year. Each row shows the sum of books checked out each month for the respective programming language. The totalPYPL column adds the sum of all six classes of programming languages for a total number per month. The monthly005Total returns the number of checked out items . Using these columns, both an intra-language rank frequency and intra-Dewey classification frequency can be calculated at a monthly and annual temporal resolution. This means that popularity of:
- 1. programming languages can be tracked by comparing monthly totals against themselves.
2. all languages, given their share of total checkouts, can be tracked by comparing monthly totals as “shares” of popularity across programming language categories.
3. the six selected programming languages can be compared against all other computer science reference books (which include manuals for other programming languages not retrieved here).
The duration of the query is 4 seconds. This means that the query is relatively straightforward.
- (4.56 KiB) Downloaded 72 times
Code: Select all
SELECT year(cout), COUNT(CASE WHEN deweyClass >= 000 AND deweyClass<100 THEN 1 END) as Info, COUNT(CASE WHEN deweyClass >= 100 AND deweyClass<200 THEN 1 END) as Phil_Psych, COUNT(CASE WHEN deweyClass >= 200 AND deweyClass<300 THEN 1 END) as Religion, COUNT(CASE WHEN deweyClass >= 300 AND deweyClass<400 THEN 1 END) as SS, COUNT(CASE WHEN deweyClass >= 400 AND deweyClass<500 THEN 1 END) as Lang, COUNT(CASE WHEN deweyClass >= 500 AND deweyClass<600 THEN 1 END) as Science, COUNT(CASE WHEN deweyClass >= 600 AND deweyClass<700 THEN 1 END) as Tech, COUNT(CASE WHEN deweyClass >= 700 AND deweyClass<800 THEN 1 END) as Arts, COUNT(CASE WHEN deweyClass >= 800 AND deweyClass<900 THEN 1 END) as Lit, COUNT(CASE WHEN deweyClass >= 900 AND deweyClass<1000 THEN 1 END) as History, SUM(CASE WHEN itemtype = 'accas' OR itemtype = 'arcas' OR itemtype = 'bccas' OR itemtype = 'jccas' OR itemtype = 'jrcas' OR itemtype = 'accd' OR itemtype = 'arcd' OR itemtype = 'jccd' OR itemtype = 'jrcd' OR itemtype = 'accdrom' OR itemtype = 'arcdrom' OR itemtype = 'bccdrom' OR itemtype = 'drcdrom' OR itemtype = 'jccdrom' OR itemtype = 'acdisk' OR itemtype = 'ardisk' OR itemtype = 'jrdisk' OR itemtype = 'acdvd' OR itemtype = 'ardvd' OR itemtype = 'bcdvd'OR itemtype = 'jcdvd' OR itemtype = 'jrdvd' OR itemtype = 'xrcdrom' OR itemtype = 'ucflpdr' THEN 1 ELSE 0 END) AS Media, SUM(CASE WHEN itemtype = 'acvhs' OR itemtype = 'alvhs' OR itemtype = 'bcvhs' OR itemtype = 'blvhs' OR itemtype = 'jcvhs' OR itemtype = 'jlvhs' OR itemtype = 'jrvhs' OR itemtype = 'xrvhs' OR itemtype = 'scmed' OR itemtype = 'acvid' THEN 1 ELSE 0 END) AS Video, SUM(CASE WHEN itemtype = 'acbk' OR itemtype = 'arbk ' OR itemtype = 'bcbk' OR itemtype = 'drbk' OR itemtype = 'jcbk' OR itemtype = 'jrbk' OR itemtype = 'bccd' THEN 1 ELSE 0 END) AS Book FROM spl_2016.inraw WHERE year(cout) >= "2005" and year(cout) < 2017 GROUP BY year(cout) ORDER BY year(cout);
Processing time: 285 seconds, 12 rows affected
Analysis: Looking at the table, the Dewey classes that are the most popular are in the first 100 block, the 600 to 700 block, and the 700 to 800 block. It is interesting to note that these Dewey classes are related to computer science, information and general works, technology, and arts and recreation. From 2005 to 2009 there was a steady increase in the number of checkouts per Dewey class, but from 2010 to 2016 the checkouts quickly decreased. Also, when looking at the time scale of how many of the item types were checked out per year, one can notice that they also increase from 2005 to 2009 and decreased from 2010 to 2016. Only one video was checked out in all of 2016.
Because my question is asking how technology has influenced people’s checkouts at the Seattle Public Library, I began to do some research on the technology advancements that happened near 2009 that might have decreased the popularity in the checkouts at the library. Interestingly, I found that during this period, the App Store was launched. This provided people with quicker access to music, movies, and books at the touch of their finger instead of getting it from a library. Spotify also had a big technology advancement allowing not only people to listen to music through streaming, but being able to download songs onto your device. Spotify even offers audio books, which further demonstrates why many people might not still be going to the library.
This information then led me to look at the stock market around 2009. I focused in on Amazon, Google, and Netflix. Consistently these three companies were low in stocks in 2009, but have been increasing ever since. This shows the popularity with these companies grew. Amazon creating the Kindle allowed easier access to a reader, making it more appealing to some over a hard copy. Google created a similar concept using Google Books and Netflix uses its ability of streaming movies instantly that may detract many former library attendees. Blockbuster on the other hand has been at a steady low since 2009 because of these advancements where everyone can download almost anything anywhere.
- (1.2 KiB) Downloaded 70 times
Code: Select all
SELECT year(cout), floor((deweyClass)/100), sum(case when itemtype LIKE "%bk%" then 1 else 0 end) as Books, sum(case when itemtype LIKE "%cd%" then 1 else 0 end) as CDs, sum(case when itemtype LIKE "%dvd%" then 1 else 0 end) as DVDs, sum(case when itemtype LIKE "%vhs%" then 1 else 0 end) as VHS, sum(case when itemtype LIKE "%cas%" then 1 else 0 end) as Cassettes FROM outraw WHERE year(cout) > 2006 and year(cout) < 2017 and deweyClass > 0 GROUP BY year(cout), floor((deweyClass)/100)
However, I thought I needed an additional element to make a more dynamic set of data to present in my frequency map, and so I wanted to also look at how their subjects (determined by Dewey class) were being affected by those same changes - did some categories maintain their popularity while the rest of the medium format was abandoned? With this purpose, I also selected deweyClass to make a total of 4 elements being examined. I also needed them grouped up into their major classes, so I used the FLOOR function to help me out here.
The query took 108 seconds.
I was forced to implement a parameter in that the Dewey class number had to be greater than 0 because many have no values set for the column and show up as 0 after using the floor function . This meant that the first section or 000-100s would always show up as the most popular class of items when it was generally always the 700s. Some patterns are ultimately changed: without removing those entries, DVDs are even more popular now than they were in 2007, but after removing them, DVDs have experienced a steady drop-off over the years.
In the end, we see quite a dramatic change in the popularity of various media formats. There was a sharp drop-off in the checkouts of cassettes - from 12,994 in 2007 to absolutely 0 in 2016 (5 in 2015, though). Similarly, 70,109 VHS tapes were checked out in 2007 but none were checked out in 2016 (2 in 2015). At the same time, books have only gained in popularity - 1.68mil and 1.75mil in the last two years as opposed to 2007 when 1.4mil were checked out. Other formats like DVDs have basically the same checkouts, with occasional dips and rises looking at year to year.
The most popular categories for each media type have generally remained the same - 700s for books, CDs, and DVDs, 800s for cassettes, and 900s for VHS tapes. The different categories, however, have different proportions during the year, which I hope to highlight in the model itself.
- (3.58 KiB) Downloaded 76 times
Process: I chose a few countries that sounded interesting to explore: Vietnam, China, Russia, Egypt, Mexico, Brazil, Cuba, New Zealand, Germany, Iraq, Israel, Hungary, Sweden, Spain, Ireland, Japan. First I wrote a simple query to look at all the items with one specific country in the title or subject, ordered by number of checkouts. I made sure to allow for both the country name and its adjective — e.g., both “ireland” and “irish.” Here is the example for Ireland.
Code: Select all
SELECT COUNT(cout), cout, spl_2016.title.title, spl_2016.subject.subject, deweyClass FROM spl_2016.outraw INNER JOIN spl_2016.title ON spl_2016.outraw.bibNumber = spl_2016.title.bibNumber INNER JOIN spl_2016.subject ON spl_2016.outraw.bibNumber = spl_2016.subject.bibNumber WHERE (spl_2016.title.title LIKE “%ireland%” OR spl_2016.subject.subject LIKE "%ireland%" OR spl_2016.title.title LIKE “%irish%” OR spl_2016.subject.subject LIKE "%irish%”) AND year(cout) > 2005 GROUP BY spl_2016.title.title ORDER BY COUNT(cout) DESC;
I used two “inner join” statements in this query, because it seemed to run faster than other methods. However, I was having a lot of variation in run times overall, possibly because of an inconsistent connection, so I am not sure whether this is actually the fastest way.
It is interesting to look at the titles and subjects of popular Ireland media, to get a sense of what is most appealing. A lot of novels and literature, as well as song and dance media, are popular here.
I checked the counts of all transactions involving items associated with each country since 2006, to make sure there are enough records to work with. All countries listed above have around 1,000 titles or more, except for Hungary and New Zealand; all have many thousands or millions of checkouts, which should be enough to work with.
Next, for each country, I broke up the number of checkouts by year and binned by dewey class. Here is the example for Mexico and Ireland.
Code: Select all
SELECT year(cout), floor(deweyClass/10)*10 as deweyBin, sum(CASE WHEN spl_2016.outraw.title LIKE "%mexic%" THEN 1 ELSE 0 END) as Mexico, sum(CASE WHEN spl_2016.outraw.title LIKE "%ireland%" OR spl_2016.outraw.title LIKE "%irish%" THEN 1 ELSE 0 END) as Ireland FROM spl_2016.outraw WHERE year(cout) > 2005 GROUP BY floor(deweyClass/10)*10, year(cout) ORDER BY floor(deweyClass/10)*10, year(cout);
This query searches on title alone, since it is already listed in the “outraw” table and thus is more efficient than having to join the table with the “subject” table. Initial attempts to run while also searching on “subject” took much longer -- a little over ten minutes just for these two countries.
The results are interesting even for these two countries. For example, Mexico is much more highly represented in the food (640) and science (500s) categories, while Ireland is much more highly represented in religion / Christianity categories (200s) and English literature (820). In some categories there are trends that are hard to explain — for example, Mexican architecture seems to have gotten slowly less popular over the decade, while Irish architecture gained popularity. Similarly, Mexico gained checkouts over time in category 360 (social problems and social services) while Ireland started out with more but showed a downward trend, perhaps related to the reporting of crime and other issues in those countries by US media.
Finally, I wrote a query to complete this table for all my chosen countries, searching on title. I would have liked to search on subject as well, but queries joining the two tables in addition to completing all the CASE statements take much too long to run. I also removed some of the countries with the fewest entries, which I decided not to include in my project because they have a lot less interesting data compared to most of the other countries.
Code: Select all
SELECT year(cout), floor(deweyClass/10)*10 as deweyBin, sum(CASE WHEN spl_2016.outraw.title LIKE "%mexic%" THEN 1 ELSE 0 END) as Mexico, sum(CASE WHEN spl_2016.outraw.title LIKE "%ireland%" OR spl_2016.outraw.title LIKE "%irish%" THEN 1 ELSE 0 END) as Ireland, sum(CASE WHEN spl_2016.outraw.title LIKE "%spain%" OR spl_2016.outraw.title LIKE "%spanish%" THEN 1 ELSE 0 END) as Spain, sum(CASE WHEN spl_2016.outraw.title LIKE "%israel%" THEN 1 ELSE 0 END) as Israel, sum(CASE WHEN spl_2016.outraw.title LIKE "%egypt%" THEN 1 ELSE 0 END) as Egypt, sum(CASE WHEN spl_2016.outraw.title LIKE "%viet%nam%" THEN 1 ELSE 0 END) as Vietnam, sum(CASE WHEN spl_2016.outraw.title LIKE "%german%" THEN 1 ELSE 0 END) as Germany, sum(CASE WHEN spl_2016.outraw.title LIKE "%russia%" THEN 1 ELSE 0 END) as Russia, sum(CASE WHEN spl_2016.outraw.title LIKE "%china%" OR spl_2016.outraw.title LIKE "%chinese%" THEN 1 ELSE 0 END) as China, sum(CASE WHEN spl_2016.outraw.title LIKE "%cuba%" THEN 1 ELSE 0 END) as Cuba, sum(CASE WHEN spl_2016.outraw.title LIKE "%iraq%" THEN 1 ELSE 0 END) as Iraq, sum(CASE WHEN spl_2016.outraw.title LIKE "%japan%" THEN 1 ELSE 0 END) as Japan FROM spl_2016.outraw WHERE year(cout) > 2005 GROUP BY deweyBin, year(cout) ORDER BY deweyBin, year(cout);
This query is not especially efficient, but I don't yet know enough about how MySQL works under the hood to tell how best to make it more efficient. For example, if the query requires searching through a large table 16 separate times (once for each country), is there a faster way to search for multiple strings with only a single pass through the table? How much time does joining take compared to the summing of different cases? I did manage to shave off about one minute of time by adding the condition that a country's name was in the title to the WHERE statement. This creates a smaller table sum over during the CASE statements.
The results include a lot of interesting information. Apart from basic history and literature divisions, which follow predictable lines, some countries are highly represented in certain areas — such as Vietnam, Mexico, Japan, and China in food/cookbooks; Egypt in ancient history; Israel in “the Bible,” and Iraq in war/military. Other cultural influences appear as well, such as that of Spain on South American history. It will be interesting to show differences between the Dewey profile of different countries, as well as trends over time.
In this phase, I am questioning whether film adaptations of particular titles promote their printed counterparts? It is common notion that an average audience of big budget productions later being informed that the story is based on either fiction or nonfiction books. I want to examine when this revelation kicks in and pushes people to read books, and what kinds of parameters play an important role on this timeframe between movie popularity and book resurrection.
Code: Select all
SELECT MONTH(c.checkOut) 'Month', YEAR(c.checkOut) 'Year', COUNT(*) '# Checkouts', i.itemType 'Type', t.title 'Title' FROM title t JOIN itemToBib b ON t.bibNumber = b.bibNumber JOIN itemType i ON b.itemNumber = i.itemNumber JOIN transactions c ON c.itemNumber = b.itemNumber WHERE t.title LIKE 'game of thrones%' AND YEAR(c.checkOut) > 1995 GROUP BY MONTH(c.checkOut) , YEAR(c.checkOut) , i.itemType HAVING Year > 1995
Results: Processing times of these queries vary from 200 seconds to 300 seconds. The wildcard optimization increased the speed by a factor of 5.
The general trend of movie checkout vs book checkout relation seems to be that the more a movie becomes popular the quicker its original book attracts attention on itself.
For Hunger Games, the trend reveals that before the movie release people were actively reading the book. After two months from the release, the book checkouts dropped drastically and left its place to DVD and cd checkouts.
As for Game of Thrones, while reading rates progressing slowly but steadily, there is a quick jump when the shows announced. However, it could not maintain this popularity and left its portion to the media disks.
The similar trend is also apparent for the other titles. My main intuition that the motion picture alternatives of the titles could potentially increase the book checkouts proved wrong in the sense that when the alternative is announced, there is indeed a jump in the printed medium checkouts, however, as the time goes by, the motion picture alternatives suppress them and claim their victory over printed medium. I would expect a steady increase in the printed medium since people may want to learn the real story behind the scenario.