1 Advanced MySQL: Frequency Pattern

glegrady
Posts: 138
Joined: Wed Sep 22, 2010 12:26 pm

1 Advanced MySQL: Frequency Pattern

Post by glegrady » Wed Jan 04, 2017 6:25 pm

The assignment in MySQL is to select a topic of interest and to retrieve data that can show some frequency pattern variations. Part of the assignment can be to discover some interesting cultural content, trends and/or patterns within the data. Your MySQL data will be visualized in a 2D matrix.

Methodologies:
1) Track the checkout history of the item or topic over the 10 year history to make sure there is adequate activity to study.

2) Compare the item/topic’s history to other similar ones. Select one or more items to see if performance patterns are comparable. Identify repeating patterns, and irregularities. Look for interesting, unexpected occurrences.

What to look for:
. Search for (unexpected) patterns in the data
. How often something happens (frequency)
. Anomalies in the system (errors, outliers, etc.) http://en.wikipedia.org/wiki/Anomaly_detection
. Association: Search for relationships between variables
. Do statistical analysis using MySQL aggregate functions: http://dev.mysql.com/doc/refman/5.6/en/ ... tions.html

Your Query Search Details
. Download a minimum 3, preferably 4 columns of metadata – this will be used for horizontal, vertical, and depth values for your visualization
. Best to sequence columns to sequence of how they may be used: (col1: vertical, col2: horizontal, col3: pixel value, etc.)
. Your search results should be sufficient to feature subtleties in the data to be expressed visually within a screen size between 1920x 1280 to 2560 x 1850 pixels
. Output to csv file

Process:
. Imagine a question
. Explore all fields of the database
. Test the query for logical and other errors
. Start with Booleans (AND, OR NOT), wildcards
. Make the query efficient (finetune)
. Interpret the results (data analysis)
. Export to .csv file to be used in visualization

Your Assignment Report:
. Concept/Question (describe what question you are exploring)
. Provide the Query
. Explain the Query
. Provide the results
. Give Processing Time
. Give an Analysis (report back on your results, explain how you did it, and what your outcomes are)

Post your Project as a PostReply to this description

Grading
. Standard Completion of Project: B
. Revised, advanced functions: B+, A-
. Innovative query with significant retrieved data: A
George Legrady
legrady@mat.ucsb.edu

brooksjaredc
Posts: 5
Joined: Wed Jan 11, 2017 10:39 am

Re: 1b Advanced MySQL: Frequency Pattern

Post by brooksjaredc » Sun Jan 15, 2017 4:16 pm

Harry Potter: The Book or the Movie?

I wanted to look into the effect of the movie releases of the Harry Potter films has on the checkout frequency of the books and the films. Do the release of the movies increase the checkout frequency of the books? Is there a delay on the interest from the time the film comes out? How does this change with the different books?

I searched through the different item types for Harry Potter related material, and only found significant frequencies in books, dvds, and cds. I wanted to look at the books/films after and including the Goblet of Fire, which was released in theaters in November 2005.

First, we run the following query

Code: Select all

SELECT *
FROM spl_2016.inraw WHERE (title LIKE '%harry potter and the goblet of fire%') AND (year(cout) = 2006) AND (itemType LIKE '%jccd%' OR itemType LIKE'%accd%')
ORDER BY year(cout), month(cout)
to see that items with item type 'jccd' all have "Harry Potter and the Goblet of Fire" as the title and are most likely audio books, but all the items with 'accd' as the type have the term "soundtrack" in the title, so we separate the two.

The next query collects sums on all the titles that match the book/film title and the different item types and groups and orders them by month and year.

Code: Select all

SELECT year(cout), month(cout), sum( case when itemType LIKE '%bk%' then 1 else 0 end) as book, 
sum( case when itemType LIKE '%dvd%' then 1 else 0 end) as dvd,
sum( case when itemType='jccd' then 1 else 0 end) as cd,
sum( case when itemType='accd' then 1 else 0 end) as soundtrack
FROM spl_2016.inraw WHERE (title LIKE '%harry potter and the goblet of fire%') AND (year(cout) > 1970)
GROUP BY month(cout), year(cout)
ORDER BY year(cout), month(cout)
goblet_of_fire.csv
(2.5 KiB) Downloaded 77 times
The film was released in theaters in November 2005 and released to DVD in March 2006. We can see the interest in the books peaks just a few months after the theater release, and the interest in cds peaks around the same time. The fact that the book interest and audio book (cd) interest peak around the same time is not surprising.

We can find that the DVD release date for this movie is March 2006, and we see that interest in the DVD checkouts peaks in July 2006, 4 months after the DVD release.

We can run the same query as the second one but replace "the goblet of fire" with "the order of the phoenix" to see the results for the next movie, which was released in theaters in July 2007 and released to DVD in November 2007.
order_of_the_phoenix.csv
(2.55 KiB) Downloaded 68 times
Interest in the book peaks during the same month as the theater release, but the cd interest peaks a few months earlier, possibly due to children (main consumers of harry potter audio books) needing more time to finish the audio book before the film comes out. As for interest in the DVDs, it jumps from 0 to 191, so it must have been made available that same month, and the peak interest occurs 3 months after the DVD release.

We can again run the same query for "the half blood prince" which was released in theaters in July 2009, and release to DVD in December 2009.
half_blood_prince.csv
(2.45 KiB) Downloaded 72 times
We can see that interest in the book peaks the month before the theater release, whereas the interest in the audio book on cd seems to have a relative dip during this period. Interest in the DVD peaks 3 months after the dvd release, which is consistent with the last two films.

The final book in the series is split into two films, so we will perform one MySQL command to see how both movies affected the interest in the book.

Code: Select all

SELECT year(cout), month(cout), sum( case when itemType LIKE '%bk%' then 1 else 0 end) as book, 
sum( case when itemType LIKE '%dvd%' then 1 else 0 end) as dvd,
sum( case when itemType='jccd' then 1 else 0 end) as cd,
sum( case when itemType='accd' then 1 else 0 end) as soundtrack
FROM spl_2016.inraw WHERE (title LIKE '%harry potter and the deathly hallows%') AND (year(cout) > 1970)
GROUP BY month(cout), year(cout)
ORDER BY year(cout), month(cout)
Part 1 was release to theaters in November 2010 and released to DVD in April 2011, and Part 2 was released to theaters in July 2011 and released to DVD in November 2011.
deathly_hallows.csv
(2.12 KiB) Downloaded 74 times
Interest in the books peaks on both theater releases dates, with the first one being 10% higher and about a month longer. The audio book cd interest frequency is similar to the books, but checked out about half as frequently. Interest in the DVDs peaked only a month after the DVD release of part1, but 2 months after the DVD release of part2.

Analysis: From all this we can see that people like to read the books just as the movies are being released to theaters, with audio books sometime peaking sooner. The relative popularity of books and movies remains relatively constant over time. We also see that interest in the DVDs peaks just few months after the DVD release. Furthermore, interest for all Harry Potter items seems to slowly decrease and plateau from 2012 until now. The processing time for these queries each took about 30 seconds.

jingyi_xiao
Posts: 5
Joined: Wed Jan 11, 2017 10:43 am

Re: 1b Advanced MySQL: Frequency Pattern

Post by jingyi_xiao » Mon Jan 16, 2017 3:34 pm

Question: What is the average title length of different kinds of items (book, media) in different dewey classes? Does title length vary in different dewey classes and in different itemes?

Code: Select all

Query1:
select a.dewey, 
avg(case when a.itemtype like '%bk%' then a.length else 0 end) as book, 
avg(case when a.itemtype like '%cas%' or a.itemtype like '%cd%' or 
a.itemtype like '%disk%' or a.itemtype like '%dvd%' or a.itemtype like '%rec%'
or a.itemtype like '%vhs%' then a.length else 0 end) as media 
from
(select distinct bibNumber, itemtype, floor(deweyClass) as dewey, 
length(title) as length 
from spl_2016.inraw
where cout > '2006-01-01 00:00:00' and floor(deweyClass) != '') as a
group by a.dewey;
Query2 and query3 are almost the same as query1, the only difference is floor(deweyClass) as dewey was replaced by floor(deweyClass/10)*10 as dewey and floor(deweyClass/100)*100 as dewey. The results of queries are in different sheets in a single .csv file.

Explain: I did 3 queries based on different dewey class level: dewey classes, dewey divisons and dewey sections. The queries are very similar. First, I calculate the title length of all distinct items checkout after 2006(10-year data), then I calculate the average title length of different items (I classify items into two groups: books and media) and grouped them by dewey class. And finally I calculate the standard deviation of title length for query3 using stddev() function.

I just divided items into two groups and left misc out because of two reasons: first, the data volume of misc is not a lot, so the result is not convincing from statistical perspective. Second, misc contains many different item types (acfold, acart, etc.) in the database, it slowed down the query.

Standard deviation for book is 8.60, media is 2.59 in query3. The average title length for book is 41.58, and average for media is 3.72.

Processing time: 147.521s for query1, 147.683s for query2, 146.649s for query3. For query1, 916 columns were returned (not 1000 columns), because some sections are unassigned, and some are no longer used (I left out the data whose dewey class is null in my query).

(I used INNER JOIN at first to join spl_2016.deweyClass with spl_2016.title, but I find INNER JOIN is much slower than using function floor() to calculate dewey classes in spl_2016.deweyClass. So INNER JOIN is not very time-efficient here.)

Analysis: Generally speaking, the title length of book is obviously greater than title length of media just as I expected (books tend to have longer name than movies or songs).

At ‘class’ level, what’s interesting is class 400(language) and class 700(art&recreation) have both shorter title in books and longer title in media compared to other classes. Class 800 also has short title in book.

At ‘division’ level, 410-460(English and other European languages), 780(music) are very different from the overall tendency, they are the reason class 400 and 700 are different from other classes. 920(Biography & genealogy) also has a longer title.

At ‘section’ level, more details are observed(although there is no data in some sections in media). Some outliers are found: 7, 39, 75, 83, 114, 127, 167, 219, 240, 441, 781, 789 and etc. And section 7-9 is unassigned, 39 is about language, 75 is newspaper about Italy, 83 is about Germanic language, 114 is about space. It seems the title length has something to do with language, as many of these outliers are related to languages. And there are more information to be explored.
Attachments
titleLength.xlsx
(58.33 KiB) Downloaded 73 times

freeberg
Posts: 5
Joined: Wed Jan 11, 2017 10:39 am

Re: 1b Advanced MySQL: Frequency Pattern

Post by freeberg » Wed Jan 18, 2017 10:36 pm

Question: How often do residents of the greater Seattle area check out materials relating to common hobbies or skills? Some examples of hobbies and skills include gardening, cooking, programming, and housekeeping. (I am an avid hobbyist. I like to learn new disciplines deep enough to make them enjoyable. I was curious if residents of the Seattle area use the library to get materials relating to common hobbies.)

Query: The query uses a series of SUM(IF...) statements to get a frequency count of 13 common hobbies. I used this page to find Dewey classifications that relate to common hobbies... http://bpeck.com/references/ddc/ddc.htm. I grouped the observations by the day, so there are 3936 observations. There is also an overall count for all materials--the daily average is around 40,000 checkouts! The hobbies I decided to track are...
  • - Cooking
    - Office Skills
    - Woodworking
    - Performing Music
    - Outdoor Activities
    - Indoor Activities
    - Programming
    - Housekeeping
    - Accounting
    - Gardening
    - Painting
    - Drawing
    - Travel

Code: Select all

# Jason Freeberg
# January 15th, 2017
# Query for first project: 2D Frequency Mapping

SELECT 
    DAY(t.checkOut) AS dayOut,
    MONTH(t.checkOut) AS monthOut,
    YEAR(t.checkOut) AS yearOut,
    SUM(IF(FLOOR(d.deweyClass) = 641, 1, 0)) AS cooking,
    SUM(IF(FLOOR(d.deweyClass) = 651, 1, 0)) AS officeSkills,
    SUM(IF(FLOOR(d.deweyClass) = 694, 1, 0)) AS woodWorking,
    SUM(IF(FLOOR(d.deweyClass) BETWEEN 783 AND 788, 1, 0)) AS musicPerf,
    SUM(IF(FLOOR(d.deweyClass) BETWEEN 796 AND 799, 1, 0)) AS outdoorActivities,
    SUM(IF(FLOOR(d.deweyClass) = 794, 1, 0)) AS indoorActivies,
    SUM(IF((FLOOR(d.deweyClass) BETWEEN 4 AND 6) 
		OR (d.deweyClass = 519.7), 1, 0)) AS programming,
    SUM(IF((FLOOR(d.deweyClass) = 646)
		OR (FLOOR(d.deweyClass) = 648), 1, 0)) AS houseKeeping,
    SUM(IF(FLOOR(d.deweyClass) = 657, 1, 0)) AS accounting,
    SUM(IF((FLOOR(d.deweyClass) = 631)
		OR (FLOOR(d.deweyClass) = 635), 1, 0)) AS gardening,
    SUM(IF(FLOOR(d.deweyClass) BETWEEN 740 AND 743, 1, 0)) AS drawing,
    SUM(IF(FLOOR(d.deweyClass) BETWEEN 750 AND 751, 1, 0)) AS painting,
    SUM(IF(FLOOR(d.deweyClass) = 910, 1, 0)) AS travel,
    COUNT(t.checkOut) AS dailyTotal
FROM
    spl_2016.transactions AS t
        JOIN
    spl_2016.deweyClass AS d ON d.bibNumber = t.bibNumber
WHERE
    YEAR(t.checkOut) between 2006 AND 2017
GROUP BY dayOut , monthOut , yearOut
ORDER BY yearOut ASC , monthOut ASC , dayOut ASC;
Noticeable Trends: Cooking and music consistently top the list. Outdoor activities are a close second. Materials on programming and housekeeping average around 150/day. I am interested to see if materials related to accounting go up during tax season (late January to April). I got these these preliminary results using summary functions in R.

Caveats: The query took longer than the 10 minute limit, so I broke the query into 4 smaller queries and concatenated the .csv files in R. Thanks for the idea, Professor! I don't know how long the query would take in its entirety.
Attachments
hobbies.csv
Full file, 2006 - 2017
(252.53 KiB) Downloaded 85 times

wolfe
Posts: 5
Joined: Wed Jan 11, 2017 10:43 am

Re: 1b Advanced MySQL: Frequency Pattern

Post by wolfe » Thu Jan 19, 2017 1:18 am

Original Post: Jan 19, 2017 1:18 am

Concept/Question: What fiction items are checked out once, or checked out once and never returned?

Provide the Query:

Code: Select all

SELECT 
    c, checkout, checkin, title, callNumber,
    CASE WHEN callNumber LIKE '%SCI-FIC%' THEN 1 ELSE 0 END AS sci,
    CASE WHEN callNumber LIKE '%MYSTERY%' THEN 1 ELSE 0 END AS mys,
    CASE WHEN callNumber LIKE '%WESTERN%' THEN 1 ELSE 0 END AS wes,
    CASE WHEN callNumber LIKE 'FIC%' or callNumber LIKE 'CAS FIC%' THEN 1 ELSE 0 END AS fic,
    CASE WHEN callNumber LIKE 'YA%' or callNumber LIKE 'CAS YA%' THEN 1 ELSE 0 END AS YA,
    t.bibNumber
FROM
    (SELECT 
        COUNT(checkOut) AS c, spl_2016.itemToBib.bibNumber
	FROM spl_2016.callNumber 
	join spl_2016.itemToBib on spl_2016.callNumber.itemNumber = spl_2016.itemToBib.itemNumber
	INNER JOIN spl_2016.transactions ON spl_2016.transactions.bibNumber = spl_2016.itemToBib.bibNumber
	where callNumber LIKE '%SCI-FIC%' OR callNumber LIKE '%MYSTERY%' OR callNumber LIKE '%WESTERN%'
		OR callNumber LIKE 'FIC%' OR callNumber LIKE 'YA%' OR callNumber LIKE 'CAS YA%' OR callNumber LIKE 'CAS FIC%'
	GROUP BY spl_2016.itemToBib.bibNumber
	ORDER BY c) t
        INNER JOIN spl_2016.transactions ON spl_2016.transactions.bibNumber = t.bibNumber
        LEFT JOIN spl_2016.callNumber ON spl_2016.callNumber.itemNumber = spl_2016.transactions.itemNumber
        LEFT JOIN spl_2016.title ON spl_2016.transactions.bibNumber = spl_2016.title.bibNumber
        LEFT JOIN spl_2016.deweyClass ON spl_2016.transactions.bibNumber = spl_2016.deweyClass.bibNumber
WHERE
    c = 1 AND YEAR(checkout) > 2004 and deweyClass = ""
ORDER BY checkout; 
Explain the Query:

First I wrote a subquery that counts how many times each fiction book is checked out. I joined the callNumber table with the transactions table where they callNumber was like one of the book types I was searching for. I grouped the query by bibNumber counting the checkout of each book. I really wanted to group the query by itemNumber but the query takes too long to complete.

Next I joined the subquery to transactions, callNumber, title and deweyClass to get data about each book where there had only been one checkout the year that the book was checked out was greater than 2004 and it had no dewey class (because it was fiction). I wrote a case statement to categorize the fiction based on the call number.

I found that having a join instead of a left join with itemToBib table in the subquery dramatically increased efficiency. This dropped it from 340.044 seconds to .069 seconds, though it increased the fetch from .078 to .242 seconds.

Provide the results:
empty5.csv
(131.79 KiB) Downloaded 76 times
Give Processing Time:

runtime duration: .069 sec, fetch: 0.242 sec, 1460 rows

Give an Analysis:

My outcome is a table of 1460 entries, each with a bib entry that has only been checked out once. I really wanted to do the analysis with itemNumbers instead of bibNumbers, to see which books had multiple items that were checked out only once, or stolen repeatedly. I could not manage to write an efficient enough query to get that data and the queries would take over 10 minutes to execute.

I found that out of the 1460 bib entries checked out only once 76 were science fiction 233 were mysteries 81 were westerns 1012 were just "fiction" and 58 were YA fiction. Also 875 of these books have not been returned. Some of that may be due to being books that were just checked out for the first time, but according to the Seattle public library website any book can be checked out for up to 63 days (including renewals), only 123 books are within that timeframe. That means that 51% of the books that were checked out only once are overdue even if the patron renewed the book as many times as possible.

For the visualization I plan to display this data with dates on the x axis and items in order by cal number on the y axis. The item would have a line from it's checkout to checkin date. The color would be the genre. The items would be visually different if the book was never returned.

Originally I looked at all the books that were checked out and had not been returned but the data was very messy and there were 26224 entries. When I tried to split them up categorically many books were ILL books (10266) or could not categorized by the dewey decimal system (4896). The interlibrary loan books were the worst because they had no metadata. I went through the call numbers of the 4896 entries to see if there was any way to categorize them, I decided to focus on fiction. I thought that focusing on fiction would give me a smaller subset of data.

---------------------------------------------------

Update: 1/19/2016 1:58 pm

I completed the query using itemNumbers instead of bibNumbers. I also noticed I wasn't collecting all of the audiobooks so I updated the query to handle them.

Code: Select all

SELECT 
    c, checkout, checkin, title, callNumber,
    CASE WHEN callNumber LIKE '%SCI-FIC%' THEN 1 ELSE 0 END AS sci,
    CASE WHEN callNumber LIKE '%MYSTERY%' THEN 1 ELSE 0 END AS mys,
    CASE WHEN callNumber LIKE '%WESTERN%' THEN 1 ELSE 0 END AS wes,
    CASE WHEN callNumber LIKE 'FIC%' or callNumber LIKE 'CAS FIC%' or callNumber LIKE 'CD FIC%' THEN 1 ELSE 0 END AS fic,
    CASE WHEN callNumber LIKE 'YA%' or callNumber LIKE 'CAS YA%' or callNumber LIKE 'CD YA%' THEN 1 ELSE 0 END AS YA,
    spl_2016.transactions.bibNumber, t.itemNumber
FROM
    (SELECT 
        COUNT(checkOut) AS c, spl_2016.callNumber.itemNumber
	FROM spl_2016.callNumber
	INNER JOIN spl_2016.transactions ON spl_2016.transactions.itemNumber = spl_2016.callNumber.itemNumber
	where callNumber LIKE '%SCI-FIC%' OR callNumber LIKE '%MYSTERY%' OR callNumber LIKE '%WESTERN%'
		or callNumber LIKE 'FIC%' or callNumber LIKE 'CAS FIC%' or callNumber LIKE 'CD FIC%'
        or callNumber LIKE 'YA%' or callNumber LIKE 'CAS YA%' or callNumber LIKE 'CD YA%'
	GROUP BY spl_2016.callNumber.itemNumber
	ORDER BY c) t
        INNER JOIN spl_2016.transactions ON spl_2016.transactions.itemNumber = t.itemNumber
        LEFT JOIN spl_2016.callNumber ON spl_2016.callNumber.itemNumber = spl_2016.transactions.itemNumber
        LEFT JOIN spl_2016.title ON spl_2016.transactions.bibNumber = spl_2016.title.bibNumber
        LEFT JOIN spl_2016.deweyClass ON spl_2016.transactions.bibNumber = spl_2016.deweyClass.bibNumber
WHERE
    c = 1 AND YEAR(checkout) > 2004 and deweyClass = ""
ORDER BY checkout; 
Provide the results:
empty6.csv
(1.74 MiB) Downloaded 79 times
Give Processing Time:

runtime duration: 684.374 sec, fetch: 0.170 sec, 18757 rows

Give an Analysis:

There are books that have been checked out repeatedly and never returned. For example there are 58 items for "Moonglow a novel" by Chabon, that have been checked out and never returned. 25 items of "books on bikes that were checked out only once. 24 items of "best American short stories and the yearbook of the American short story" that have only been checked out once. 23 copies of "sellout" have only been checked out once. There is a chance that these are errors, but according to the catalog there are 108 copies of "best american short stories", so it doesn't seem unreasonable that 1/4 of them have only been checked out once. There are 14890 distinct bibNumbers for the 18757 items. 2469 are books that have multiple items that has only been checked out once for the same bibNumber.

I still would like to expand this query to all items, so for example it would grab all the entries for "catcher in the rye" because there is one of the items that has only been checked out once.
I think that this is an interesting dataset, but there is still a lot of room to think about how to organize it.
Last edited by wolfe on Thu Jan 19, 2017 1:58 pm, edited 1 time in total.

sara.lafia
Posts: 5
Joined: Wed Jan 11, 2017 10:41 am

Re: 1b Advanced MySQL: Frequency Pattern

Post by sara.lafia » Thu Jan 19, 2017 8:16 am

How have the relative popularities of programming languages shifted over a decade?
Patrons may turn to the library for resources to assist in acquiring new skills, such as mastering new programming languages. In the midst of a financial crisis in mid-2008 and a growing tech boom in the Pacific Northwest that began a meteoric rise around 2010 and continues to this day.

Query

Code: Select all

SELECT MONTH(t.checkOut) AS monthOut, YEAR(t.checkOut) AS yearOut,
SUM(i.title LIKE '%java%') AS Java, SUM(i.title LIKE '%python%') AS Python, 
SUM(i.title LIKE '%php%') AS PHP, SUM(i.title LIKE '%javascript%') AS JavaScript,
SUM(i.title LIKE '%perl%') AS Perl, SUM(i.title LIKE '%ruby%') AS Ruby,
SUM((i.title LIKE '%java%') + (i.title LIKE '%python%')+(i.title LIKE '%php%')+(i.title LIKE '%javascript%')+(i.title LIKE '%perl%')+
(i.title LIKE '%ruby%')) as totalPYPL, SUM(d.deweyClass = 005) AS monthly005Total
FROM (spl_2016.transactions as t INNER JOIN spl_2016.title as i ON t.bibNumber= i.bibNumber)
INNER JOIN spl_2016.deweyClass as d ON(t.bibNumber= d.bibNumber)
WHERE (d.deweyClass = 005) AND (YEAR(t.checkOut) between 2006 AND 2016)
group by yearOut, monthOut
order by yearOut, monthOut
Query
This query filters all reference books to return only those with a Dewey classification of 005 (Computer programming, programs & data). These books are then aggregated to total checkouts per month and then are sorted by programming language. This is achieved by summing all string matches for various programming languages. The choice for the six retrieved programming languages (Java, Python, PHP, JavaScript, Perl, and Ruby) is inspired by Google’s popularity of Programming Languages (http://pypl.github.io/PYPL.html) measure which is derived from top Google searches for programming languages each year. This measure calculates the share of popularity each language receives annually by comparing the number of searches per language against the total number of searches for all languages. Each checkout for the given programming language that appears in the book’s title, such as “Java”, is summed and reported at a monthly resolution. The transactions, titles, and dewey class tables are joined by bibliographic number. The transactions provide the checkout date, the titles provide the text strings that contain programming languages, and the dewey class provides the dewey range for computer science books. Filtering the results to only return data between 2006 and 2016 removes spurious data (such as records for January 1970) and incomplete data (partial totals for January 2017, which is the current month of this year). Grouping the results by year out and month out provides a sum for each month of each year, resulting in 132 rows. Ordering by year out and month out returns a sequential snapshot of total checkouts for each programming language beginning with January 2006 and ending with December 2016.

Results
The results are binned by programming language and are summarized by month of each year. Each row shows the sum of books checked out each month for the respective programming language. The totalPYPL column adds the sum of all six classes of programming languages for a total number per month. The monthly005Total returns the number of checked out items . Using these columns, both an intra-language rank frequency and intra-Dewey classification frequency can be calculated at a monthly and annual temporal resolution. This means that popularity of:
  • 1. programming languages can be tracked by comparing monthly totals against themselves.
    2. all languages, given their share of total checkouts, can be tracked by comparing monthly totals as “shares” of popularity across programming language categories.
    3. the six selected programming languages can be compared against all other computer science reference books (which include manuals for other programming languages not retrieved here).
Processing Time
The duration of the query is 4 seconds. This means that the query is relatively straightforward.

Analysis
Comparing the popularity of the programming languages to Google’s PYPL measure, we see similar results. Beginning in 2006, established languages like Java, PHP, and Perl were solidly in the lead, with nascent languages like JavaScript, Python, and Ruby enjoyed less of a share of checkouts. Limitations to this analysis include a lack of normalization for each language, lack of accounting for new titles introduced each month, and unknown demand for the items. For example, if the library carries 500 titles of Java books and adds new materials at a higher rate (i.e. 10 new titles per month), compared with 15 Python books, which are less frequently updated (i.e. 1 new title per month), this skew is not accounted for. We don’t know how many titles are available. This is why comparing programming languages against one another relatively is safer. Also, demand for the items is unknown, since hundreds of people may have searched for and planned to checkout a book on Python for example, making it highly popular, but couldn’t since it was out of circulation for many months. We cannot account for this either. Overall, the monthly trends for Java show that its popularity has remained fairly stable over the past decade, which languages like PHP and Perl clearly indicate declining popularity. Languages like Python, around the first quarter of 2008, and JavaScript, around the second quarter of 2009, begin to surge in popularity. Languages like Ruby only seem to have taken off substantially in the second quarter of 2015. Of course, these qualitative assessments should also be tested to determine if the growth in the number of checkouts is statistically significant.
Attachments
1b_Lafia.csv
(4.56 KiB) Downloaded 72 times

ariellalgilmore
Posts: 7
Joined: Wed Jan 11, 2017 10:40 am

Re: 1b Advanced MySQL: Frequency Pattern

Post by ariellalgilmore » Thu Jan 19, 2017 8:17 am

Question(s): How has the increase in technology effected the items being checked out from 2005 to 2016? How is each dewey class being effected and what item types are still being checked out?

Code: Select all

SELECT
year(cout),
COUNT(CASE WHEN deweyClass >= 000 AND deweyClass<100 THEN 1 END) as
Info,
COUNT(CASE WHEN deweyClass >= 100 AND deweyClass<200 THEN 1 END) as
Phil_Psych,
COUNT(CASE WHEN deweyClass >= 200 AND deweyClass<300 THEN 1 END) as
Religion,
COUNT(CASE WHEN deweyClass >= 300 AND deweyClass<400 THEN 1 END) as
SS,
COUNT(CASE WHEN deweyClass >= 400 AND deweyClass<500 THEN 1 END) as
Lang,
COUNT(CASE WHEN deweyClass >= 500 AND deweyClass<600 THEN 1 END) as
Science,
COUNT(CASE WHEN deweyClass >= 600 AND deweyClass<700 THEN 1 END) as
Tech,
COUNT(CASE WHEN deweyClass >= 700 AND deweyClass<800 THEN 1 END) as
Arts,
COUNT(CASE WHEN deweyClass >= 800 AND deweyClass<900 THEN 1 END) as
Lit,
COUNT(CASE WHEN deweyClass >= 900 AND deweyClass<1000 THEN 1 END) as
History,
SUM(CASE WHEN itemtype = 'accas' OR itemtype = 'arcas' OR itemtype = 'bccas' OR itemtype = 'jccas' OR itemtype = 'jrcas' OR itemtype = 'accd' OR itemtype = 'arcd' OR itemtype = 'jccd'
OR itemtype = 'jrcd' OR itemtype = 'accdrom' OR itemtype = 'arcdrom' OR itemtype = 'bccdrom'
OR itemtype = 'drcdrom' OR itemtype = 'jccdrom' OR itemtype = 'acdisk' OR itemtype = 'ardisk' OR itemtype = 'jrdisk' OR itemtype = 'acdvd' OR itemtype = 'ardvd' OR itemtype = 'bcdvd'OR itemtype = 'jcdvd' OR itemtype = 'jrdvd' OR itemtype = 'xrcdrom' OR itemtype = 'ucflpdr' THEN 1 ELSE 0 END) AS Media,
SUM(CASE WHEN itemtype = 'acvhs' OR itemtype = 'alvhs' OR itemtype = 'bcvhs' OR itemtype = 'blvhs' OR itemtype = 'jcvhs' OR itemtype = 'jlvhs' OR itemtype = 'jrvhs' OR itemtype = 'xrvhs' OR itemtype = 'scmed' OR itemtype = 'acvid' THEN 1 ELSE 0 END) AS Video,
SUM(CASE WHEN itemtype = 'acbk' OR itemtype = 'arbk ' OR itemtype = 'bcbk' OR itemtype = 'drbk' OR itemtype = 'jcbk' OR itemtype = 'jrbk' OR itemtype = 'bccd' THEN 1 ELSE 0 END) AS Book
FROM spl_2016.inraw
WHERE year(cout) >= "2005" and year(cout) < 2017 
GROUP BY year(cout)
ORDER BY year(cout);
Explanation: This query is using the table from inraw. It is counting how many checkouts happened during each year in each Dewey class; and it is also counting how many checkouts each year are either media, video, or book. I am gathering this information to see if the increase in technology from 2006 to 2017 would affect a decrease/increase in certain Dewey classes and if the item type decreases. Before running the query, I assumed that the number of checkouts in each Dewey class and item type would decrease because everything can be so quickly accessed by the internet in today’s age.

Processing time: 285 seconds, 12 rows affected

Analysis: Looking at the table, the Dewey classes that are the most popular are in the first 100 block, the 600 to 700 block, and the 700 to 800 block. It is interesting to note that these Dewey classes are related to computer science, information and general works, technology, and arts and recreation. From 2005 to 2009 there was a steady increase in the number of checkouts per Dewey class, but from 2010 to 2016 the checkouts quickly decreased. Also, when looking at the time scale of how many of the item types were checked out per year, one can notice that they also increase from 2005 to 2009 and decreased from 2010 to 2016. Only one video was checked out in all of 2016.

Because my question is asking how technology has influenced people’s checkouts at the Seattle Public Library, I began to do some research on the technology advancements that happened near 2009 that might have decreased the popularity in the checkouts at the library. Interestingly, I found that during this period, the App Store was launched. This provided people with quicker access to music, movies, and books at the touch of their finger instead of getting it from a library. Spotify also had a big technology advancement allowing not only people to listen to music through streaming, but being able to download songs onto your device. Spotify even offers audio books, which further demonstrates why many people might not still be going to the library.

This information then led me to look at the stock market around 2009. I focused in on Amazon, Google, and Netflix. Consistently these three companies were low in stocks in 2009, but have been increasing ever since. This shows the popularity with these companies grew. Amazon creating the Kindle allowed easier access to a reader, making it more appealing to some over a hard copy. Google created a similar concept using Google Books and Netflix uses its ability of streaming movies instantly that may detract many former library attendees. Blockbuster on the other hand has been at a steady low since 2009 because of these advancements where everyone can download almost anything anywhere.
Attachments
techvslib.csv
(1.2 KiB) Downloaded 70 times

christopherchen0
Posts: 5
Joined: Wed Jan 11, 2017 10:44 am

Re: 1b Advanced MySQL: Frequency Pattern

Post by christopherchen0 » Thu Jan 19, 2017 4:50 pm

As the number and formats of media change - especially in the last 20 or 30 years, some are eventually phased out while others remain popular. When it comes to this dataset, I wanted to investigate those sorts of changes within a decade in the context of library checkouts, from 2006-2016. Ultimately, my question was how have media format changes affected checkouts within the last 10 years ?

Code: Select all

SELECT year(cout), floor((deweyClass)/100),
sum(case when itemtype LIKE "%bk%" then 1 else 0 end) as Books, 
sum(case when itemtype LIKE "%cd%" then 1 else 0 end) as CDs,
sum(case when itemtype LIKE "%dvd%" then 1 else 0 end) as DVDs,
sum(case when itemtype LIKE "%vhs%" then 1 else 0 end) as VHS,
sum(case when itemtype LIKE "%cas%" then 1 else 0 end) as Cassettes
FROM outraw
WHERE year(cout) > 2006 and year(cout) < 2017 and deweyClass > 0
GROUP BY year(cout), floor((deweyClass)/100)
I started out by filtering the major types of media and arrived at these 5 types: Books, CDs, DVDs, VHS, and Cassettes. I intentionally left out periodicals, music scores/other music materials, and kits because none of those had significant numbers and were seen more as outliers. With that, I had 3 dimensions: the various types of media, the individual years, and number of checkouts.

However, I thought I needed an additional element to make a more dynamic set of data to present in my frequency map, and so I wanted to also look at how their subjects (determined by Dewey class) were being affected by those same changes - did some categories maintain their popularity while the rest of the medium format was abandoned? With this purpose, I also selected deweyClass to make a total of 4 elements being examined. I also needed them grouped up into their major classes, so I used the FLOOR function to help me out here.

The query took 108 seconds.

I was forced to implement a parameter in that the Dewey class number had to be greater than 0 because many have no values set for the column and show up as 0 after using the floor function . This meant that the first section or 000-100s would always show up as the most popular class of items when it was generally always the 700s. Some patterns are ultimately changed: without removing those entries, DVDs are even more popular now than they were in 2007, but after removing them, DVDs have experienced a steady drop-off over the years.

In the end, we see quite a dramatic change in the popularity of various media formats. There was a sharp drop-off in the checkouts of cassettes - from 12,994 in 2007 to absolutely 0 in 2016 (5 in 2015, though). Similarly, 70,109 VHS tapes were checked out in 2007 but none were checked out in 2016 (2 in 2015). At the same time, books have only gained in popularity - 1.68mil and 1.75mil in the last two years as opposed to 2007 when 1.4mil were checked out. Other formats like DVDs have basically the same checkouts, with occasional dips and rises looking at year to year.

The most popular categories for each media type have generally remained the same - 700s for books, CDs, and DVDs, 800s for cassettes, and 900s for VHS tapes. The different categories, however, have different proportions during the year, which I hope to highlight in the model itself.
Attachments
OptMediaDeweyRollups.csv
(3.58 KiB) Downloaded 76 times

kschlesi
Posts: 6
Joined: Wed Jan 11, 2017 10:42 am

Re: 1b Advanced MySQL: Frequency Pattern

Post by kschlesi » Tue Jan 24, 2017 8:05 am

Question: What are the classifications of items related to other countries or cultures in the library? How does the popularity of these items change over the last decade, and does it vary by country and classification? The goal is to explore the interactions of the people of Seattle with other countries’ cultures, and identify changing interest in different aspects of these cultures.

Process: I chose a few countries that sounded interesting to explore: Vietnam, China, Russia, Egypt, Mexico, Brazil, Cuba, New Zealand, Germany, Iraq, Israel, Hungary, Sweden, Spain, Ireland, Japan. First I wrote a simple query to look at all the items with one specific country in the title or subject, ordered by number of checkouts. I made sure to allow for both the country name and its adjective — e.g., both “ireland” and “irish.” Here is the example for Ireland.

Code: Select all

SELECT COUNT(cout), cout, spl_2016.title.title, spl_2016.subject.subject, deweyClass
FROM spl_2016.outraw 
INNER JOIN spl_2016.title
ON spl_2016.outraw.bibNumber = spl_2016.title.bibNumber 
INNER JOIN spl_2016.subject 
ON spl_2016.outraw.bibNumber = spl_2016.subject.bibNumber 
WHERE (spl_2016.title.title LIKE “%ireland%” OR spl_2016.subject.subject LIKE "%ireland%"
             OR spl_2016.title.title LIKE “%irish%” OR spl_2016.subject.subject LIKE "%irish%”)
  AND year(cout) > 2005
GROUP BY spl_2016.title.title
ORDER BY COUNT(cout) DESC;
(11.4 seconds, 3731 rows)

I used two “inner join” statements in this query, because it seemed to run faster than other methods. However, I was having a lot of variation in run times overall, possibly because of an inconsistent connection, so I am not sure whether this is actually the fastest way.
q1_ireland_items.csv
(355.62 KiB) Downloaded 80 times
It is interesting to look at the titles and subjects of popular Ireland media, to get a sense of what is most appealing. A lot of novels and literature, as well as song and dance media, are popular here.

I checked the counts of all transactions involving items associated with each country since 2006, to make sure there are enough records to work with. All countries listed above have around 1,000 titles or more, except for Hungary and New Zealand; all have many thousands or millions of checkouts, which should be enough to work with.

Next, for each country, I broke up the number of checkouts by year and binned by dewey class. Here is the example for Mexico and Ireland.

Code: Select all

SELECT year(cout), floor(deweyClass/10)*10 as deweyBin, 
       sum(CASE WHEN spl_2016.outraw.title LIKE "%mexic%" THEN 1 ELSE 0 END) as Mexico, 
       sum(CASE WHEN spl_2016.outraw.title LIKE "%ireland%" OR spl_2016.outraw.title LIKE "%irish%" THEN 1 ELSE 0 END) as Ireland
FROM spl_2016.outraw
WHERE year(cout) > 2005
GROUP BY floor(deweyClass/10)*10, year(cout)
ORDER BY floor(deweyClass/10)*10, year(cout);
(1188 rows, 127 sec)

This query searches on title alone, since it is already listed in the “outraw” table and thus is more efficient than having to join the table with the “subject” table. Initial attempts to run while also searching on “subject” took much longer -- a little over ten minutes just for these two countries.

The results are interesting even for these two countries. For example, Mexico is much more highly represented in the food (640) and science (500s) categories, while Ireland is much more highly represented in religion / Christianity categories (200s) and English literature (820). In some categories there are trends that are hard to explain — for example, Mexican architecture seems to have gotten slowly less popular over the decade, while Irish architecture gained popularity. Similarly, Mexico gained checkouts over time in category 360 (social problems and social services) while Ireland started out with more but showed a downward trend, perhaps related to the reporting of crime and other issues in those countries by US media.
q2_mexico_ireland.csv
(15.73 KiB) Downloaded 68 times
Finally, I wrote a query to complete this table for all my chosen countries, searching on title. I would have liked to search on subject as well, but queries joining the two tables in addition to completing all the CASE statements take much too long to run. I also removed some of the countries with the fewest entries, which I decided not to include in my project because they have a lot less interesting data compared to most of the other countries.

Code: Select all

SELECT year(cout), floor(deweyClass/10)*10 as deweyBin, 
       sum(CASE WHEN spl_2016.outraw.title LIKE "%mexic%" THEN 1 ELSE 0 END) as Mexico, 
       sum(CASE WHEN spl_2016.outraw.title LIKE "%ireland%" OR spl_2016.outraw.title LIKE "%irish%" THEN 1 ELSE 0 END) as Ireland,
       sum(CASE WHEN spl_2016.outraw.title LIKE "%spain%" OR spl_2016.outraw.title LIKE "%spanish%" THEN 1 ELSE 0 END) as Spain,
       sum(CASE WHEN spl_2016.outraw.title LIKE "%israel%" THEN 1 ELSE 0 END) as Israel,
       sum(CASE WHEN spl_2016.outraw.title LIKE "%egypt%" THEN 1 ELSE 0 END) as Egypt,
       sum(CASE WHEN spl_2016.outraw.title LIKE "%viet%nam%" THEN 1 ELSE 0 END) as Vietnam,
       sum(CASE WHEN spl_2016.outraw.title LIKE "%german%" THEN 1 ELSE 0 END) as Germany,
       sum(CASE WHEN spl_2016.outraw.title LIKE "%russia%" THEN 1 ELSE 0 END) as Russia,
       sum(CASE WHEN spl_2016.outraw.title LIKE "%china%" OR spl_2016.outraw.title LIKE "%chinese%" THEN 1 ELSE 0 END) as China,
       sum(CASE WHEN spl_2016.outraw.title LIKE "%cuba%" THEN 1 ELSE 0 END) as Cuba,
       sum(CASE WHEN spl_2016.outraw.title LIKE "%iraq%" THEN 1 ELSE 0 END) as Iraq,
       sum(CASE WHEN spl_2016.outraw.title LIKE "%japan%" THEN 1 ELSE 0 END) as Japan
FROM spl_2016.outraw
WHERE year(cout) > 2005
GROUP BY deweyBin, year(cout)
ORDER BY deweyBin, year(cout);
(1188 rows, 8 minutes)

This query is not especially efficient, but I don't yet know enough about how MySQL works under the hood to tell how best to make it more efficient. For example, if the query requires searching through a large table 16 separate times (once for each country), is there a faster way to search for multiple strings with only a single pass through the table? How much time does joining take compared to the summing of different cases? I did manage to shave off about one minute of time by adding the condition that a country's name was in the title to the WHERE statement. This creates a smaller table sum over during the CASE statements.
q3_country_dewey_yr.csv
(42.88 KiB) Downloaded 77 times
The results include a lot of interesting information. Apart from basic history and literature divisions, which follow predictable lines, some countries are highly represented in certain areas — such as Vietnam, Mexico, Japan, and China in food/cookbooks; Egypt in ancient history; Israel in “the Bible,” and Iraq in war/military. Other cultural influences appear as well, such as that of Spain on South American history. It will be interesting to show differences between the Dewey profile of different countries, as well as trends over time.

merttoka
Posts: 6
Joined: Wed Jan 11, 2017 10:42 am

Re: 1b Advanced MySQL: Frequency Pattern

Post by merttoka » Wed Feb 01, 2017 9:14 pm

Do Film Adaptations Promote Reading?

In this phase, I am questioning whether film adaptations of particular titles promote their printed counterparts? It is common notion that an average audience of big budget productions later being informed that the story is based on either fiction or nonfiction books. I want to examine when this revelation kicks in and pushes people to read books, and what kinds of parameters play an important role on this timeframe between movie popularity and book resurrection.

Queries:

Code: Select all

SELECT 
    MONTH(c.checkOut) 'Month',
    YEAR(c.checkOut) 'Year',
    COUNT(*) '# Checkouts',
    i.itemType 'Type',
    t.title 'Title'
FROM
    title t
        JOIN
    itemToBib b ON t.bibNumber = b.bibNumber
        JOIN
    itemType i ON b.itemNumber = i.itemNumber
        JOIN
    transactions c ON c.itemNumber = b.itemNumber
WHERE
    t.title LIKE 'game of thrones%'
        AND YEAR(c.checkOut) > 1995
GROUP BY MONTH(c.checkOut) , YEAR(c.checkOut) , i.itemType
HAVING Year > 1995
The year and title are adjusted depending on the item. This example is only for the Game of Thrones. The query joins title, itemToBib, itemType and transactions tables in order to be able to retrieve checkout month and year, the number of checkouts, itemType, and title. The grouping on checkout month, year, and item type enables COUNT(*) to work for a specific month of a year. Also, the wildcard search on the title is performed as only right wildcards because the MySQL server applies indexing to fasten right-wildcard searches but it is particularly vulnerable for the left ones. I ran separate queries to validate accuracy for this assumption and found out that every copy of the book or motion picture is, in fact, starts with the original title. There are some titles that have additional suffixes, such as "episode 1" but this search term can capture those as well.

Results:
twilight.csv
(11.96 KiB) Downloaded 71 times
orangeIsTheNewBlack.csv
(11.1 KiB) Downloaded 71 times
mazeRunner.csv
(5.63 KiB) Downloaded 71 times
hungerGames.csv
(7.59 KiB) Downloaded 70 times
gameOfThrones.csv
(12.14 KiB) Downloaded 68 times
divergent.csv
(3.81 KiB) Downloaded 62 times
Processing times of these queries vary from 200 seconds to 300 seconds. The wildcard optimization increased the speed by a factor of 5.

Analysis:
The general trend of movie checkout vs book checkout relation seems to be that the more a movie becomes popular the quicker its original book attracts attention on itself.

For Hunger Games, the trend reveals that before the movie release people were actively reading the book. After two months from the release, the book checkouts dropped drastically and left its place to DVD and cd checkouts.

As for Game of Thrones, while reading rates progressing slowly but steadily, there is a quick jump when the shows announced. However, it could not maintain this popularity and left its portion to the media disks.

The similar trend is also apparent for the other titles. My main intuition that the motion picture alternatives of the titles could potentially increase the book checkouts proved wrong in the sense that when the alternative is announced, there is indeed a jump in the printed medium checkouts, however, as the time goes by, the motion picture alternatives suppress them and claim their victory over printed medium. I would expect a steady increase in the printed medium since people may want to learn the real story behind the scenario.

Post Reply