PROJ 0: CULTURE ANALYTICS - MYSQL

glegrady
Posts: 203
Joined: Wed Sep 22, 2010 12:26 pm

PROJ 0: CULTURE ANALYTICS - MYSQL

Post by glegrady » Sat Dec 26, 2015 4:42 pm

CULTURE ANALYTICS - MYSQL

Media Theorist Lev Manovich defines Culture Analytics as "the analysis of massive cultural data sets and flows using computational and visualization techniques."

Create a MySQL query that creatively and insightfully reveals something about the Seattle Public Library dataset over a 10 year period (2006-2015)
----
CULTURE ANALYTICS - 2D VISUALIZATION SCHEDULE

1.05 Introduction to course and project
1.07 Software setup MYSQL, Processing, Introduction to MYSQL queries
1.12 Students present: MYSQL query, Lev Manovich article discussion
1.14 Intro to 2D demo in processing
1.19 Review of visual language basics: form, color, movement, etc.
1.21 Presentation of Culture Analytics project in 2D
----
The assignment in MySQL is to dig into the database of the Seattle Public Library consisting of over 80 million entries since September 2005 and to select a topic that reveals some cultural trends taking place in the past 10 years. Your assignment is to discover some interesting cultural content, trends and/or patterns within the data and then to visualize it in a 2D matrix. The library includes over 2 million items of books, cds, DVDs, and other forms of published material.

You will need to become familiar with 1) the actual data content of the library database, 2) the system by which the data is organized, including irregularities in the organizational structure 3) and how the items are used by the public – what is checked out, for how long, and how this may reveal cultural trends.

Approach:
. Get an overview of the dataset
. Explore topic specifics of personal interest
. Embrace detail to integrate complexity

Begin by asking what are people checking out, what may be of personal interest to you in books, movies, music, topics, etc. Go to the library site: https://seattle.bibliocommons.com/searc ... mit=search and search for items of interest to you, be it “Global warming”, “nuclear war”, “chinese cooking”, “beatles”, “Downtown Abbey”, etc. If there is some activity, then proceed to MySQL.

Methodologies:
1) Track the checkout history of the item or topic over the 10 year history to make sure there is adequate activity to study. Your queries should provide both general overview to details.

2) Compare the item/topic’s history to other similar ones. Select one or more items to see if performance patterns are comparable. Identify repeating patterns, and irregularities. Look for interesting, unexpected occurrences.

What to look for:
. Search for (unexpected) patterns in the data
. How often something happens (frequency)
. Anomalies in the system (errors, outliers, etc.) http://en.wikipedia.org/wiki/Anomaly_detection
. Association: Search for relationships between variables
. Do statistical analysis using MySQL aggregate functions: http://dev.mysql.com/doc/refman/5.6/en/ ... tions.html

Your Query Search Details
. Download a minimum 3 columns of metadata – this will be used for horizontal, vertical, and depth values for your visualization
. Standardize columns for output to .csv file: (vertical, horizontal, pixel value + more)
. Your search results should be sufficient to feature subtleties in the data to be expressed visually within a screen size between 1920x 1280 to 2560 x 1850 pixels

Process:
. Imagine a question
. Explore all fields of the database
. Test the query for logical and other errors
. Start with Booleans (AND, OR NOT), wildcards
. Make the query efficient (finetune)
. Interpret the results (data analysis)
. Export to .csv file to be used in visualization

Your Assignment Report:
. Concept/Question (describe what question you are exploring)
. Provide the Query
. Explain the Query
. Provide the results
. Give Processing Time
. Give an Analysis (report back on your results, explain how you did it, and what your outcomes are)

Post your Project:
. Print out as pdf and attach. Keep csv files in the ".csv" format. Post your report at http://www.mat.ucsb.edu/forum/ at MAT 259/Winter 2016
. PostReply to Proj 1 - Attach the pdf.

Grading
. Standard Completion of Project: B
. Revised, advanced functions: B+, A-
. Innovative question: A
George Legrady
legrady@mat.ucsb.edu

changhe
Posts: 6
Joined: Wed Jan 06, 2016 1:39 pm

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by changhe » Wed Jan 13, 2016 6:59 pm

When “New” Books Are No Longer New

It’s a common situation that library buys new books every several months or each year, so does Seattle Public Library. The library has its own collection code defined. With the call number together with the collection code, it’s easy to know whether the book is a new one bought this year or not. Therefore the question is how does these “new” book attract readers over time. People might guess that those new books won’t be as popular as they arrived after 3 years. But is that the truth? How fast does the frequency pattern drop? Does the category of books affect as well? These are the concepts I try to explore.
Attachments
MAT259_HildaChangHE_2016_NewBooks.pdf
Report of concept and analysis about new arriving books from Hilda
(38.71 KiB) Downloaded 466 times
HildaChangHE_2016hw1_newBooks.csv
Query result about new books from Hilda Chang HE
(1.06 KiB) Downloaded 443 times
Last edited by changhe on Thu Jan 14, 2016 10:05 am, edited 1 time in total.

akibmayadav
Posts: 3
Joined: Tue Jan 12, 2016 11:57 am

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by akibmayadav » Wed Jan 13, 2016 8:30 pm

Is The Sun Out ?
Visiting a library is a recreational activity for some ,academic pursuit for some and research for a few others. But whether we humans take the step ahead and go visit the library for these activities depends on a lot of factors. A major factor that I thought would play a crucial role in this decision at an individual level would be the weather.I wanted to see how the number of checkouts in Seattle Public Library ,which is dependent on the peoples’ presence in library is affected by weather conditions.
ITERATION ONE :

What was done here ?
I started with comparing the the number of checkouts every month from 2006-2010.

What was observed?
There was a variation in the number of books being checked out every month.Two major conclusions were drawn from here.

1. The trend of decline of the number of books checked out in the months of December and January and the hike in the number of books checked out in the months of June-July has prevailed through years. This was because December and January are the months of extreme weather and June-July is when the weather is better.

2. There was a constant increase in the number of books checked out as the years passed.Particularly there was a hike in the number of books checked out during July 2009. July 2009 is historically known as the hottest month Seattle has ever seen.

In the month of July 2009, another aspect ,that I found out about when I searched about the 2009 Seattle heat wave , was that people found Seattle Public library to be the "one of the coolest spot to hang out in" during the Summer of 2009. It is evident from this post : https://weatherspark.com/history/29735/ ... ted-States

SQL QUERY :

Code: Select all

SELECT
 YEAR(checkout) AS Year,
 SUM(CASE
 WHEN Month(checkout)=1 THEN 1
 ELSE 0
 END) AS 'Jan',
 SUM(CASE
 WHEN Month(checkout)=2 THEN 1
 ELSE 0
 END) AS 'Feb',
 SUM(CASE
 WHEN Month(checkout)=3 THEN 1
 ELSE 0
 END) AS 'March',
 SUM(CASE
 WHEN Month(checkout)=4 THEN 1
 ELSE 0
 END) AS 'April',
 SUM(CASE
 WHEN Month(checkout)=5 THEN 1
 ELSE 0
 END) AS 'May',
 SUM(CASE
 WHEN Month(checkout)=6 THEN 1
 ELSE 0
 END) AS 'Jun',
 SUM(CASE
 WHEN Month(checkout)=7 THEN 1
 ELSE 0
 END) AS 'July',
 SUM(CASE
 WHEN Month(checkout)=8 THEN 1
 ELSE 0
 END) AS 'Aug',
 SUM(CASE
 WHEN Month(checkout)=9 THEN 1
 ELSE 0
 END) AS 'Sep',
 SUM(CASE
 WHEN Month(checkout)=10 THEN 1
 ELSE 0
 END) AS 'Oct',
 SUM(CASE
 WHEN Month(checkout)=11 THEN 1
 ELSE 0
 END) AS 'Nov',
 SUM(CASE
 WHEN Month(checkout)=12 THEN 1
 ELSE 0
 END) AS 'Dec'
 
FROM
 spl3._rawXmlDataCheckOuts
WHERE
 itemtype = 'acbk' 
 AND YEAR(checkOut) >= '2006'
 AND YEAR(checkOut) < '2015'
 GROUP BY YEAR(checkOut)
 ORDER BY YEAR(checkOut) 
OUTPUT :
Iteration_1.csv
(856 Bytes) Downloaded 354 times
ITERATION TWO:

What was done here ?
I started with comparing the the number of checkouts every season (Fall,Winter,Spring,Summer) from 2006-2010.

What was observed?
There was a variation in the number of books being checked out every season,but it was not as evident as the check Outs were every month.Two major conclusions were drawn from here as well :

1.In the duration of 9 years (2006-2014) , maximum books have been checked out in Summer and then in Spring. That implies people prefer visiting and coming out of their houses in Summer and Springs more when compared to Winters and Fall.

2.The Fall and Winter the trend has not been very continuos . From 2006-2008 and 2012-2014 people preferred going to the library in the Fall more than Winter. Whereas from 2009-2011people preferred going to the library in Winter more than Fall.

SQL QUERY:

Code: Select all

SELECT
 YEAR(checkout) AS Year,
 -- MONTH(checkout) AS Month,
 SUM(CASE
 WHEN Month(checkout)=3 OR Month(checkout)=4 OR Month(checkout)=5  THEN 1
 ELSE 0
 END) AS 'Spring',
 SUM(CASE
 WHEN Month(checkout)=6 OR Month(checkout)=7 OR Month(checkout)=8  THEN 1
 ELSE 0
 END) AS 'Summer',
 SUM(CASE
 WHEN Month(checkout)=9 OR Month(checkout)=10 OR Month(checkout)=11  THEN 1
 ELSE 0
 END) AS 'Fall',
 SUM(CASE
 WHEN Month(checkout)=12 OR Month(checkout)=1 OR Month(checkout)=2  THEN 1
 ELSE 0
 END) AS 'Winter'
FROM
 spl3._rawXmlDataCheckOuts
WHERE
 itemtype = 'acbk' AND Year(Checkout)=2006
OUTPUT :
Iteration_2.csv
(328 Bytes) Downloaded 359 times
ITERATION THREE :

What was done here ?
Since the comparison is seasonal Checkouts did not reveal a lot , I went on to do a daily analysis of the number of books checked out on a daily basis and compared it to the weather conditions (temperature and rainfall on those days).

How was it done ?
STEP 1
I extracted daily data from http://www.wunderground.com/ of the temperature and precipitation using Beautiful Soup and Python . This technique is known as Data Crawling. The output was taken in a csv format.

DATA CRAWLER CODES :

1.Precipitation

Code: Select all

from requests import session
import os.path
import subprocess
import pickle
import sys
from bs4 import BeautifulSoup, Comment
import HTMLParser
import urllib2
import csv
with session() as c:
    # Looping over the year and months to get the data over the months from 2006 to 2015.
 for yr in range (2006,2016):
    for mon in range (1,13) :
       if ((mon == 1) or (mon == 3) or (mon == 5) or (mon == 7) or (mon == 8) or (mon == 10) or (mon == 12)):
        last=32
       else :
        if ((mon == 4) or (mon == 6) or (mon == 9) or (mon == 7) or (mon == 11)):
         last=31
        else :
         if (mon == 2) and ((yr == 2008) or (yr == 2012)):
           last=30
         else :
           last=31
       for date in range (1,last) :
            datestr = str(yr)+"/"+str(mon)+"/"+str(date)
            datenew = str(yr)+"-"+str(mon)+"-"+str(date)
            weather_url = "http://www.wunderground.com/history/airport/KBFI/"+datestr+"/DailyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo=&MR=1"
            response = c.get(weather_url)
            soup =  BeautifulSoup(response.content, "html.parser")
            #print soup
            table_all = soup.find_all('table', {"id" : "historyTable"})
            #print table_all
            for table in table_all:
              tbodies = table.findAll('tbody')
              for tbody in tbodies:
                tries = tbody.findAll('tr')
        tr = tries[12]
        td=tr.findAll('td')
        print td
        avg_prep=td[1]
        avg_prep=str(avg_prep.text.encode('utf-8').strip())[:-4]
        print avg_prep
        break
            
            with open("weather_prep.csv","a+") as a:
             fieldnames = ['Date','Avg_Prep']
             writer = csv.DictWriter(a, fieldnames=fieldnames)
             writer.writerow({'Date':datenew,'Avg_Prep':avg_prep})
2. Temperature :

Code: Select all

from requests import session
import os.path
import subprocess
import pickle
import sys
from bs4 import BeautifulSoup, Comment
import HTMLParser
import urllib2
import csv
with session() as c:
    # Looping over the year and months to get the data over the months from 2006 to 2015.
 for yr in range (2006,2016):
    for mon in range (1,13) :
       if ((mon == 1) or (mon == 3) or (mon == 5) or (mon == 7) or (mon == 8) or (mon == 10) or (mon == 12)):
        last=32
       else :
        if ((mon == 4) or (mon == 6) or (mon == 9) or (mon == 7) or (mon == 11)):
         last=31
        else :
         if (mon == 2) and ((yr == 2008) or (yr == 2012)):
           last=30
         else :
           last=31
       for date in range (1,last) :
            datestr = str(yr)+"/"+str(mon)+"/"+str(date)
            dateprint=str(yr)+"-"+str(mon)+"-"+str(date)
            weather_url = "http://www.wunderground.com/history/airport/KBFI/"+datestr+"/DailyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo=&MR=1"
            response = c.get(weather_url)
            soup =  BeautifulSoup(response.content, "html.parser")
            #print soup
            table_all = soup.find_all('table', {"id" : "historyTable"})
            #print table_all
            for table in table_all:
              tbodies = table.findAll('tbody')
              for tbody in tbodies:
                     td= tbody.findAll("td")
                     avg_temp=td[3]
                     avg_temp=str(avg_temp.text.encode('utf-8').strip())[:-3]
                     break


            with open("weather_temp.csv","a+") as a:
             fieldnames = ['Date','Avg_Temp']
             writer = csv.DictWriter(a, fieldnames=fieldnames)
             writer.writerow({'Date':dateprint,'Avg_Temp':avg_temp})
OUTPUT:
weather_prep .csv
(50.78 KiB) Downloaded 358 times
weather_temp.csv
(50.01 KiB) Downloaded 362 times
STEP 2
Next using SQL Workbench and the Seattle Public Library ,I took out the data for every day .This included the total number of media, dewey, non-dewey and books checked out on a daily basis by looking into the itemtype.

SQL QUERY:

Code: Select all

SELECT
 date(CheckOut) AS Date, 
 COUNT(Date(CheckOut)) AS TotalCheckouts,
 SUM(CASE
 WHEN deweyClass = ' ' THEN 1
 ELSE 0
 END) AS 'Fiction Items',
 SUM(CASE
 WHEN deweyClass != ' ' THEN 1
 ELSE 0
 END) AS 'Non-Fiction Items',
 SUM(CASE
 WHEN itemtype = 'accas ' 
 OR itemtype = 'arcas ' 
 OR itemtype = 'bccas '
 OR itemtype = 'jccas '
 OR itemtype = 'jrcas '
 OR itemtype = 'accd '
 OR itemtype = 'arcd '
 OR itemtype = 'jccd '
 OR itemtype = 'jrcd '
 OR itemtype = 'accdrom '
 OR itemtype = 'arcdrom '
 OR itemtype = 'bccdrom '
 OR itemtype = 'drcdrom '
 OR itemtype = 'jccdrom '
 OR itemtype = 'acdisk '
 OR itemtype = 'ardisk '
 OR itemtype = 'jrdisk '
 OR itemtype = 'acdvd '
 OR itemtype = 'ardvd'
 OR itemtype = 'bcdvd '
 OR itemtype = 'jcdvd '
 OR itemtype = 'jrdvd '
 OR itemtype = 'xrcdrom '
 OR itemtype = 'ucflpdr'
 OR itemtype = 'acvhs '
 OR itemtype = 'alvhs '
 OR itemtype = 'bcvhs '
 OR itemtype = 'blvhs '
 OR itemtype = 'jcvhs '
 OR itemtype = 'jlvhs '
 OR itemtype = 'jrvhs '
 OR itemtype = 'xrvhs '
 OR itemtype = 'scmed '
 OR itemtype = 'acvid '
 THEN 1
 ELSE 0
 END) AS 'Media',
 SUM(CASE
 WHEN itemtype = 'acvhs '
 OR itemtype = 'alvhs '
 OR itemtype = 'bcvhs '
 OR itemtype = 'blvhs '
 OR itemtype = 'jcvhs '
 OR itemtype = 'jlvhs '
 OR itemtype = 'jrvhs '
 OR itemtype = 'xrvhs '
 OR itemtype = 'scmed '
 OR itemtype = 'acvid '
 THEN 1
 ELSE 0
 END) AS 'Video',
 SUM(CASE
 WHEN itemtype = 'acbk '
 OR itemtype = 'arbk '
 OR itemtype = 'bcbk '
 OR itemtype = 'drbk'
 OR itemtype = 'jcbk '
 OR itemtype = 'jrbk '
 OR itemtype = 'bccd '
 THEN 1
 ELSE 0
 END) AS 'Book'
FROM
 spl3._rawXmlDataCheckOuts
GROUP BY date
ORDER BY date
LIMIT 3700;
OUTPUT
library_data.csv
(130.14 KiB) Downloaded 369 times
STEP 3
Data from the last two steps was used to make a new database called Assign1. I developed SQL Queries using this new database to see the checkouts and the weather conditions on that particular day.

SQL QUERY :

1. Temperature Vs. Checkouts

Code: Select all

SELECT Assign1.library_data.Date as Date ,
 Assign1.library_data.TotalCheckouts as TotalCheckouts ,
 Assign1.weather_temp.Average_Temp as Average_Temp
FROM Assign1.library_data
LEFT JOIN Assign1.weather_temp
ON Assign1.library_data.date=Assign1.weather_temp.date
ORDER BY Assign1.library_data.Date

2.RainFall Vs. Checkouts

Code: Select all

SELECT Assign1.library_data.Date as Date ,
 Assign1.library_data.TotalCheckouts as TotalCheckouts ,
 Assign1.weather_temp.Average_Temp as Average_Temp
FROM Assign1.library_data
LEFT JOIN Assign1.weather_temp
ON Assign1.library_data.date=Assign1.weather_temp.date
ORDER BY Assign1.library_data.Date
OUTPUT
weather_temp_comparingcheckouts.csv
(61.34 KiB) Downloaded 362 times
weather_prep_comparingcheckouts.csv
(62.21 KiB) Downloaded 375 times
What was observed?
Though it was difficult to comprehend this data, as it was very huge, it could be observed that people preferred to stay home during not so pleasant weathers.I was able to properly comprehend this data, only once I implemented the visualization for this particular query.

WHAT DO I WANT IMPLEMENT AHEAD OF THIS ?
1. I want to look into the number of media items being checked out in comparison to books for different weather conditions. A question like : Do people like reading books or do they like watching movies when it rains?
2. I want to know about what do people like reading during a particular weather .Do they like reading thriller fiction during winters , do they like reading books on environment more then others during the summer months?
Last edited by akibmayadav on Wed Jan 27, 2016 10:07 am, edited 3 times in total.

qiu0717
Posts: 9
Joined: Wed Jan 06, 2016 1:44 pm

Re: [Weihao Qiu] Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by qiu0717 » Wed Jan 13, 2016 8:31 pm

How do Barack Obama 'show up' and 'fade away' ?


Weihao Qiu, Jan.13 , 2016
// Concept/Question

Barack Obama, the 44th president of the United States, is going to end his tenure of presidency this year. As the first African American to hold the office, lots of attention and wonders was casted on him.
• What time during his term do people interest in him most?
• What aspect do people mostly care about him?
• How did the main events happened in his term affect affect people’s attention and
approval of his job?
In the light that reading books and looking over materials related to him is definitely a way to know him, I utilize the data queries about the checkout records of those books to explore those questions and wish to get the answers.


//
Attachments
MAT259_WeihaoQiu_Project1_RawCsvFiles.zip
The query results as CSV files.
(44.01 KiB) Downloaded 459 times
MAT259_WeihaoQiu_Project 1_Report.pdf
The report.
(279.51 KiB) Downloaded 478 times

lliu
Posts: 9
Joined: Wed Jan 06, 2016 1:41 pm

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by lliu » Wed Jan 13, 2016 9:58 pm

Winter 2016 MAT 259
Visualizing Information

Media Arts Technology
Lu Liu

First Assignment: Culture Analytics – MySQL

Digging into the database of the Seattle Public Library select a topic that reveals some cultural trends taking place in the past 10 years.

Question
Books and movies remain the most popular cultural represent forms. Will they have mutual impact on each other? For example: When a film adapted from a book released, will the checkout number of this book in library increase?


Assumption

Books and movies offer different experience to the audience. After watching a great film, people may want to get more details about the story. And books also provide more space for people to image. So I expect the release of these films lead to an increase of checkout number of the original book.

Approach
I chose several adapted films respectively released on 2006, 2009, 2012 and 2014. Compare the checkout number of specific book in the normal time and the time after adaptation film released to find the trend over months and years. And one of them is a series film in order to discover if the influence will continue.


Query
Code:

Code: Select all

SELECT 
    YEAR(checkOut) AS CheckoutYear,
    MONTH(checkout) AS CheckoutMonth,
    COUNT(checkOut) AS CheckoutCount
FROM
    spl3._rawXmlDataCheckIns

WHERE
    title LIKE '%Hunger Games%'
	   AND SUBSTRING(itemType, 3, 4) = 'bk'
GROUP BY CheckoutYear , CheckoutMonth
Query Time: 405.019 sec (the second run time will be shorter)

Explanation:
The substring ensures all the items found in MySQL are books. And I set the COUNT(checkOut) to directly get the checkout number of specific book every month.


Results


The Da Vinci Code Book Checkout in 2006
The Da Vinci Code is a mystery-detective novel by Dan Brown written in 2003. The adaptation film was released on May 19, 2006. (The position of red line on horizontal axis)
1.png
The Time Traveler’s Wife Checkout in 2009
The Time Traveler’s Wife is a novel by Audrey Niffenegger published in 2003. The adaptation film was released on 14 August 2009.
2.png
The Hunger Games Checkout in 2012
The Hunger Games is a series of three adventure novels written by Suzanne Collins, which respectively published in 2008, 2009 and 2010. And the corresponding adaptation film was respectively released on March 23, 2012, 2013, 2014 and 2015 (The third one Mockingjay divided into two parts)
3.png
The Fault in Our Stars Checkout in 2014
The Fault in Our Stars is a novel by John Green published in 2012. The adaptation film was released on June 6, 2014.
4.png
Analysis
1. From the four charts above, we can see that after the release of the film, the checkout number of corresponding book more or less has increased. So the hypothesis is basically confirmed. The adaptation films indeed stimulate the desire of people to read the original book.

2. This impact is time limited. It’s obviously that the checkout number of book will decrease after several months of film released. And the chart below is more persuasive on this property. Since the Hunger Games is a series of novel. So I compare the checkout number in different years. Only when the first adaptation film of this trilogy released, the checkout number of the novel has clearly increased. However, Part of the reason is that when the first film released, the book has ended (2010). So even if the second film released on 2013, people already read the whole book after the first film released on 2012.
5.png
3. The chart below reveals the trend from five months before the release time to five months after the release time. Besides the increase and decrease of checkout number, we can also find that the time of maximum checkout is not the release the duration of film showing. And people need some time to respond and decide whether to see the original book.
6.png
4. I tried to search some films adapted from documentary book than novel. But it doesn't have any obvious change. It seems that novels are more attractive to people. And it is also conform the common sense.
Attachments
hunger games.csv
(883 Bytes) Downloaded 436 times
LuLiu HW1.pdf
(119.06 KiB) Downloaded 464 times
Last edited by lliu on Mon Jan 25, 2016 12:20 pm, edited 3 times in total.

thomasahervey
Posts: 4
Joined: Wed Jan 06, 2016 12:50 pm

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by thomasahervey » Wed Jan 13, 2016 10:33 pm

Assessing the Temporal Popularity of New Titles

The clean, rich and large Seattle Public Library dataset that we have access to has a number of valuable temporal attributes and trends to be studied. Instead of looking at particular media and their relation to outside events, I've opted to investigate temporal checkout frequencies and fluctuations based on their age. This can be seen as a checkout history of media at the Seattle Public Library.
Attachments
MAT259_HERVEY_TemporalPopularity_1.csv
(1.19 MiB) Downloaded 440 times
MAT259_HERVEY_TemporalPopularity_1.pdf
(520.6 KiB) Downloaded 537 times

theuniqueeye
Posts: 4
Joined: Wed Jan 06, 2016 12:51 pm

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by theuniqueeye » Thu Jan 14, 2016 12:46 am

How does the dead rock star stay alive?
:: Jing Yan

On January 10, 2016, David Bowie, one of the most influential musicians of his era, whose persistent innovations and personal reinventions transformed him into a larger-than-life rock star and cultural icon, passed away.

When rock stars are dead, they are never gone. They keep being memorized by millions of their fans, music lovers, and the public. Their songs went on again and again. And as culture icons, their life stories are discovered, passed on, and exaggerated year by year.

Question
I am interested in the topic: how the dead rock star, like David Bowie, stays “alive”. If the users in the Seattle Public Library can be considered as a miniature of society, I’d like to examine the aliveness of David Bowie through analyzing the checkout information of his CDs and related books.
From 2005-2015, how does the times verify of CDs being checked out? Will the results have some patterns related to specific important dates such as birthday? What influence would happen after the musician released a new album? Is there anything interesting that the different checkout duration of books and CDs can reveal? Those questions make me curious, and based on them I set up my query.

Approach
In my view, the “aliveness” of David Bowie can be examined through two dimension: the quantity and quality. While the checkout times could be regarded as the quantity index, the checkout duration could be considered as the quality index.
I select the data of David Bowie’s CDs’ and related books’ checkout times of every day and month and year separately, and each items’(with the same title) checkout duration.

Here is the Query code.
all CDs’ bibNumber of David Bowie’s

Code: Select all

SELECT DISTINCT bibNumber, title
FROM spl3._rawXmlDataCheckIns
WHERE
callNumber LIKE '%CD 782.42166 B679%'
AND deweyClass < 789 AND deweyClass >= 780
all related books’ bibNumber of David Bowie’s

Code: Select all

SELECT DISTINCT bibNumber, title
FROM spl3._rawXmlDataCheckIns
WHERE
callnumber LIKE '%782.42166 B679%'
AND SUBSTRING(itemType, 3, 4) = 'bk'
David Bowie’s CDs checkout times by year/month/day

Code: Select all

SELECT 
    DATE_FORMAT(checkout, '20%y-%m-%d') AS formatted_date,
    COUNT(checkout) AS checkoutTimes
FROM
    spl3._rawXmlDataCheckIns
WHERE
    bibNumber = '1952906'
        OR bibNumber = '2148651'
        OR bibNumber = '2023159'
        OR bibNumber = '2362135'
        OR bibNumber = '3029080'
        OR bibNumber = '2023157'
        OR bibNumber = '2299335'
        OR bibNumber = '1954184'
        OR bibNumber = '1736211'
        OR bibNumber = '2627462'
        OR bibNumber = '2128387'
        OR bibNumber = '3061214'
        OR bibNumber = '1959048'
        OR bibNumber = '1950899'
        OR bibNumber = '2444123'
        OR bibNumber = '2515298'
        OR bibNumber = '1976192'
        OR bibNumber = '2086964'
        OR bibNumber = '1953017'
        OR bibNumber = '2867039'
        OR bibNumber = '3057303'
        OR bibNumber = '1639110'
        OR bibNumber = '2363644'
        OR bibNumber = '2567243'
        OR bibNumber = '2211550'
        OR bibNumber = '3067034'
        OR bibNumber = '2631168'
        OR bibNumber = '1954182'
        OR bibNumber = '1880152'
        OR bibNumber = '2412531'
        OR bibNumber = '1976321'
        OR bibNumber = '2296937'
        OR bibNumber = '1972971'
        OR bibNumber = '2448150'
        OR bibNumber = '2612658'
        OR bibNumber = '1953009'
GROUP BY formatted_date
David Bowie’s related books’ checkout times by year/month/day

Code: Select all

SELECT 
    DATE_FORMAT(checkout, '20%y-%m-%d') AS formatted_date,
    COUNT(checkout) AS checkoutTimes
FROM
    spl3._rawXmlDataCheckIns
WHERE
    bibNumber = '1836438'
        OR bibNumber = '1965047'
        OR bibNumber = '2277690'
        OR bibNumber = '1823619'
        OR bibNumber = '2613601'
        OR bibNumber = '2707908'
        OR bibNumber = '2723037'
        OR bibNumber = '2815071'
        OR bibNumber = '2917442'
        OR bibNumber = '2941304'
        OR bibNumber = '3029321'
GROUP BY formatted_date
David Bowie’s CD + related books’ checkout duration

Code: Select all

SELECT 
    bibNumber,
    title,
    AVG(TIMESTAMPDIFF(DAY, checkout, checkin)) AS averageDuration
FROM
    spl3._rawXmlDataCheckIns
WHERE
TIMESTAMPDIFF(DAY, checkout, checkin)<365 ANd
    bibNumber = '1836438'
        OR bibNumber = '1965047'
        OR bibNumber = '2277690'
        OR bibNumber = '1823619'
        OR bibNumber = '2613601'
        OR bibNumber = '2707908'
        OR bibNumber = '2723037'
        OR bibNumber = '2815071'
        OR bibNumber = '2917442'
        OR bibNumber = '2941304'
        OR bibNumber = '3029321'
        OR bibNumber = '1952906'
        OR bibNumber = '2148651'
        OR bibNumber = '2023159'
        OR bibNumber = '2362135'
        OR bibNumber = '3029080'
        OR bibNumber = '2023157'
        OR bibNumber = '2299335'
        OR bibNumber = '1954184'
        OR bibNumber = '1736211'
        OR bibNumber = '2627462'
        OR bibNumber = '2128387'
        OR bibNumber = '3061214'
        OR bibNumber = '1959048'
        OR bibNumber = '1950899'
        OR bibNumber = '2444123'
        OR bibNumber = '2515298'
        OR bibNumber = '1976192'
        OR bibNumber = '2086964'
        OR bibNumber = '1953017'
        OR bibNumber = '2867039'
        OR bibNumber = '3057303'
        OR bibNumber = '1639110'
        OR bibNumber = '2363644'
        OR bibNumber = '2567243'
        OR bibNumber = '2211550'
        OR bibNumber = '3067034'
        OR bibNumber = '2631168'
        OR bibNumber = '1954182'
        OR bibNumber = '1880152'
        OR bibNumber = '2412531'
        OR bibNumber = '1976321'
        OR bibNumber = '2296937'
        OR bibNumber = '1972971'
        OR bibNumber = '2448150'
        OR bibNumber = '2612658'
        OR bibNumber = '1953009'
GROUP BY bibNumber
Analysis + Development
As the data located in a very large scale (every day from 10 years), it’s hard to examine the detail of them through a static chart. An interactive chart that allows you to zoom in will be much better to examine the changing pattern with complexity.
The timeline of all the charts above are unevenly distributed due to the limit of excel chart. Moreover, as the data of CD and Book are largely different in scale, they can be hardly compared through this way. These problems will be solved later in processing.
As two dimensions of David Bowie’s “aliveness”, the data of times and duration should be combined together into one chart. And more detailed analysis should be carried out about the correlation between the pattern and some particular reasons and dates.
Attachments
HowDeadRockStarsStayAlive_xs.pdf
report pdf
(1.12 MiB) Downloaded 497 times
HowDeadRockStarsStayAlive.zip
csv files
(24.93 KiB) Downloaded 387 times
Last edited by theuniqueeye on Sat Jan 23, 2016 12:53 am, edited 1 time in total.

d.aleman.24.da
Posts: 4
Joined: Wed Jan 06, 2016 1:40 pm

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by d.aleman.24.da » Thu Jan 14, 2016 6:54 am

Technology Through the Years
by David Aleman

With the creation of new technology and the demand for engineers for future innovation I decided to look into the popularity of books that involve technology and computer science and other related themes. Using the deweyClass numbers I was able to see the total amount of check outs from 4 different Dewey Numbers and see how they changed in a nine year span (2006-2015).

This is the query below.

Code: Select all

SELECT 
    YEAR(checkOut) AS Year,
    SUM(CASE
        WHEN deweyClass = 600 THEN 1
        ELSE 0
    END) AS 'Technology(600)',
    SUM(CASE
        WHEN deweyClass = 000 THEN 1
        ELSE 0
    END) AS 'Computer Science(000)',
    SUM(CASE
        WHEN deweyClass = 005 THEN 1
        ELSE 0
    END) AS 'Computer Programming(005)',
    SUM(CASE
        WHEN deweyClass = 776 THEN 1
        ELSE 0
    END) AS 'Computer Art(776)'
    
FROM
    spl3._rawXmlDataCheckOuts
    WHERE
    YEAR(checkOut) >= '2006'
        AND YEAR(checkOut) < '2015'
GROUP BY YEAR(checkOut)
ORDER BY YEAR(checkOut)
Attachments
davidAleman_Project1_Data.csv
(303 Bytes) Downloaded 365 times
davidAleman_Project1_Report.pdf
(124.06 KiB) Downloaded 363 times
Last edited by d.aleman.24.da on Tue Jan 26, 2016 10:01 am, edited 2 times in total.

markhirsch8
Posts: 3
Joined: Wed Jan 06, 2016 1:43 pm

God Will Save Seattle From Financial Disaster

Post by markhirsch8 » Thu Jan 14, 2016 8:22 am

As an institution based on the collection and dissemination of knowledge in the service of the public, our access to the data set of the Seattle Public Library provides an insightful window into the behaviors and patterns of the public’s acquisition of knowledge over time. The time span of this collected data, 2005-2015, naturally encapsulates one of the most important events of recent history: the bursting of the American housing market bubble and resulting financial crisis of 2008.

Given this glimpse into a people’s means for seeking & acquiring information, we can begin to focus one lens for studying the public’s reaction to the 2008 financial crisis. By studying relationships of check-out frequency for particular types of library resources in relation to the financial fulcrum (the year 2008), it is possible to make assumptions on the way a particular public (that of Seattle) reacts in the wake of financial crisis.

Query :

Code: Select all

SELECT 
	
    SUM(CASE
        WHEN YEAR(checkout) = '2006' THEN 1
        ELSE 0
    END) AS '2006',
    SUM(CASE
        WHEN YEAR(checkout) = '2007' THEN 1
        ELSE 0
    END) AS '2007',
    SUM(CASE
        WHEN YEAR(checkout) = '2008' THEN 1
        ELSE 0
    END) AS '2008',
    SUM(CASE
        WHEN YEAR(checkout) = '2009' THEN 1
        ELSE 0
    END) AS '2009',
    SUM(CASE
        WHEN YEAR(checkout) = '2010' THEN 1
        ELSE 0
    END) AS '2010',
    SUM(CASE
        WHEN YEAR(checkout) = '2011' THEN 1
        ELSE 0
    END) AS '2011',
    SUM(CASE
        WHEN YEAR(checkout) = '2012' THEN 1
        ELSE 0
    END) AS '2012',
    SUM(CASE
        WHEN YEAR(checkout) = '2013' THEN 1
        ELSE 0
    END) AS '2013',
    SUM(CASE
        WHEN YEAR(checkout) = '2014' THEN 1
        ELSE 0
    END) AS '2014'
FROM
    spl3._rawXmlDataCheckOuts
WHERE
    title LIKE '%finance%' 

UNION

SELECT 
	
    SUM(CASE
        WHEN YEAR(checkout) = '2006' THEN 1
        ELSE 0
    END) AS '2006',
    SUM(CASE
        WHEN YEAR(checkout) = '2007' THEN 1
        ELSE 0
    END) AS '2007',
    SUM(CASE
        WHEN YEAR(checkout) = '2008' THEN 1
        ELSE 0
    END) AS '2008',
    SUM(CASE
        WHEN YEAR(checkout) = '2009' THEN 1
        ELSE 0
    END) AS '2009',
    SUM(CASE
        WHEN YEAR(checkout) = '2010' THEN 1
        ELSE 0
    END) AS '2010',
    SUM(CASE
        WHEN YEAR(checkout) = '2011' THEN 1
        ELSE 0
    END) AS '2011',
    SUM(CASE
        WHEN YEAR(checkout) = '2012' THEN 1
        ELSE 0
    END) AS '2012',
    SUM(CASE
        WHEN YEAR(checkout) = '2013' THEN 1
        ELSE 0
    END) AS '2013',
    SUM(CASE
        WHEN YEAR(checkout) = '2014' THEN 1
        ELSE 0
    END) AS '2014'
FROM
    spl3._rawXmlDataCheckOuts
WHERE
    deweyClass = 220 
    
    UNION

SELECT 
	
    SUM(CASE
        WHEN YEAR(checkout) = '2006' THEN 1
        ELSE 0
    END) AS '2006',
    SUM(CASE
        WHEN YEAR(checkout) = '2007' THEN 1
        ELSE 0
    END) AS '2007',
    SUM(CASE
        WHEN YEAR(checkout) = '2008' THEN 1
        ELSE 0
    END) AS '2008',
    SUM(CASE
        WHEN YEAR(checkout) = '2009' THEN 1
        ELSE 0
    END) AS '2009',
    SUM(CASE
        WHEN YEAR(checkout) = '2010' THEN 1
        ELSE 0
    END) AS '2010',
    SUM(CASE
        WHEN YEAR(checkout) = '2011' THEN 1
        ELSE 0
    END) AS '2011',
    SUM(CASE
        WHEN YEAR(checkout) = '2012' THEN 1
        ELSE 0
    END) AS '2012',
    SUM(CASE
        WHEN YEAR(checkout) = '2013' THEN 1
        ELSE 0
    END) AS '2013',
    SUM(CASE
        WHEN YEAR(checkout) = '2014' THEN 1
        ELSE 0
    END) AS '2014'
FROM
    spl3._rawXmlDataCheckOuts
WHERE
    title like '%depression%' 
—————————

This query selects three categories of library resources to probe the public’s reaction to the financial crisis. The checkout frequencies of the selected resources—on economics, on depression, and bibles—are summed with respect to each year of the data set. Using the ‘UNION’ method, each yearly sum is arranged in one table, offering a direct comparison between each categories relationship to the year in question (2008).

In selecting these particular categories, there was an inherent assumption that they would have a some correlation to the financial crisis of 2008. Indeed, the correlation appears to be obvious: for each, a sharp increase in frequency appeared in the year 2008.

What is more interesting is to look at the amount of increase relative to the two other categories. Did more people reach for the bible after the crisis? Did more people wish to learn more about economics? Did more people encounter depression and seek means for understanding and guidance?

From this query, it would appear that, relatively speaking, there was a greater increase in the need for spiritual guidance in the form of biblical text. The number of bibles checked out in 2008 increase by almost 5 times the amount in 2006 and 2007. While there was also a marked increase in the checkout frequency of materials on economics and depression, this increase was limited to a factor of two.

Processing time: 144.704 seconds

Other analysis notes:
The method of this query is quite inefficient. A 'for loop', iterating a counter which correlates to the year of the data frequency count, would have provided far fewer lines of code and would likely have shaved time off the processing time. However, it appears that due to permission issues within the database, running queries that utilize these methods are inconsistent at best and completely impossible at other times.
Attachments
HIRSCH_financial_crisis_analytics.csv
(169 Bytes) Downloaded 351 times

junxiangyao
Posts: 10
Joined: Wed Jan 06, 2016 1:38 pm

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by junxiangyao » Thu Jan 14, 2016 9:53 am

Project 1: CULTURE ANALYTICS & DATA MINING
Concept
In this assignment, I tried to figure out whether the sports events will impact the book checkout in the Seattle Public Library. I chose Olympics, the largest sport events all over the world, as the first clew to digging the data. I wanted to find out if the data in 2008 and 2012, which are the years Olympic Games were hosted by Beijing and London, are different from other years. Besides, if that were not the case, will other sport events create particular patterns in the chart ?

Process
At the beginning, I chose Olympics as the clew to digging the data. I studied the Dewey Decimal Classification, and the Dewey class of Olympics is 796.48. And the query is showing below.

Code: Select all

SELECT
YEAR(checkout) AS Year,
SUM(CASE WHEN deweyClass >= 796.48 AND deweyClass < 796.49 Then 1 ELSE 0 END) AS 'Olympics'
FROM s pl3._rawXmlDataCheckOuts
WHERE
itemtype = 'acbk'
AND YEAR(checkOut) >= '2006' AND YEAR(checkOut) < '2015' GROUP BY
YEAR(checkOut) ORDER BY
YEAR(checkOut) Query 
Time: 76.006 sec
屏幕快照 2016-02-07 下午10.38.47.png
It is really obvious that in 2008 and 2012, the years of Beijing and London Olympics, the check out times of books within “Olympics” class increased by nearly 3 times. However, even in 2008, the total number of check out times is smaller than 200, which is quite a small number considering the population scale of Seattle. Therefore, I decided to explore further to see if there is any impact on check out data of a specific sports in 2008 and 2012. Thus, I wrote another query.

Code: Select all

SELECT
YEAR(checkout) AS Year,
SUM(CASE WHEN deweyClass >= 796.312 AND deweyClass < 796.313 Then 1 ELSE 0 END) AS 'Handball',
SUM(CASE WHEN deweyClass >= 796.315 AND deweyClass < 796.316 Then 1 ELSE 0 END) AS 'Bowling',
SUM(CASE WHEN deweyClass >= 796.323 AND deweyClass < 796.324 Then 1 ELSE 0 END) AS 'Basketball',
SUM(CASE WHEN deweyClass >= 796.325 AND deweyClass < 796.326 Then 1 ELSE 0 END) AS 'Volleyball',
SUM(CASE WHEN deweyClass >= 796.33 AND deweyClass < 796.34 Then 1 ELSE 0 END) AS 'Football',
SUM(CASE WHEN deweyClass >= 796.342 AND deweyClass < 796.343 Then 1 ELSE 0 END) AS 'Tennis',
SUM(CASE WHEN deweyClass >= 796.345 AND deweyClass < 796.346 Then 1 ELSE 0 END) AS 'Badminton',
SUM(CASE WHEN deweyClass >= 796.346 AND deweyClass < 796.347 Then 1 ELSE 0 END) AS 'Table Tennis',
SUM(CASE WHEN deweyClass >= 796.352 AND deweyClass < 796.353 Then 1 ELSE 0 END) AS 'Golf',
SUM(CASE WHEN deweyClass >= 796.357 AND deweyClass < 796.358 Then 1 ELSE 0 END) AS 'Baseball',
SUM(CASE WHEN deweyClass >= 796.4 AND deweyClass < 796.5 Then 1
ELSE 0 END) AS 'Weight Lifting,Track&Field, Gymnastics',
SUM(CASE WHEN deweyClass >= 796.6 AND deweyClass < 796.7 Then 1 ELSE 0 END) AS 'Cycling',
SUM(CASE WHEN deweyClass >= 796.7 AND deweyClass < 796.8 Then 1 ELSE 0 END) AS 'Motor',
SUM(CASE WHEN deweyClass >= 796.8 AND deweyClass < 796.9 Then 1 ELSE 0 END) AS 'Combat',
SUM(CASE WHEN deweyClass >= 796.9 AND deweyClass < 797 Then 1 ELSE 0 END) AS 'Ice&Snow',
SUM(CASE WHEN deweyClass >= 797.1 AND deweyClass < 797.2 Then 1 ELSE 0 END) AS 'Boating',
SUM(CASE WHEN deweyClass >= 797.2 AND deweyClass < 797.3 Then 1 ELSE 0 END) AS 'Swimming & Diving',
SUM(CASE WHEN deweyClass >= 796.312 AND deweyClass < 798 Then 1 ELSE 0 END) AS 'Sports'
FROM spl3._rawXmlDataCheckOuts
WHERE
itemtype = 'acbk'
AND YEAR(checkOut) >= '2006' AND YEAR(checkOut) < '2015' GROUP BY
YEAR(checkOut) ORDER BY
YEAR(checkOut)

Query Time: 199.735 sec

The charts below are the results of this query.
屏幕快照 2016-02-07 下午10.39.02.png
This chart shows the total check out number of books from all the classes I used in the query. The result is quite different from the first query as the measuring number is really large this time.
屏幕快照 2016-02-07 下午10.39.14.png
Although this query didn’t show the pattern I expected, it shows Seattle Citizens preference of sports to some level. Still, most of the items in this chart is not showing a particular pattern illustrating changes caused by the Olympic Games. Only the pattern of “swimming & diving” and “Weight Lifting, Track & Field, Gymnastics” seems like responding to the Olympics a little bit.
屏幕快照 2016-02-07 下午10.39.31.png
In my perspective, since there are organizations like NFL and NBA, people are able to watch football games and basketball games everyday. They don’t need the Olympics to remind them of these kind of sports. This might be one of the reasons why those lines are not showing any pattern related to the Olympics.
However, for the fact that some lines in the chart are showing dramatic changes still, what could be the factors of that?
My first guess is the record of those sports teams in Seattle.
屏幕快照 2016-02-07 下午10.39.45.png
After referencing the record of Seattle Seahawks in NFL and Seattle Mariners in MLB, the record cannot explain those fold lines precisely. But for the line of Football in the chart, the data in 2014 is quite larger than 2013. I think the Super Bowl championship Seattle Seahawks won on Feb. 2nd, 2014, should be a main reason.
My second guess is the local large scale competition.
屏幕快照 2016-02-07 下午10.39.55.png
From the former chart, according to the total check out number, we can tell that Seattle citizens like riding bikes. Also, there is a bicycle race in Seattle every August. To prove the check out data is impacted by this event, I wrote another query.

Code: Select all

SELECT
MONTH(checkout) AS Month,
SUM(CASE WHEN MONTH(checkout) = 1 Then 1 ELSE 0 END) AS '1',
 SUM(CASE WHEN MONTH(checkout) = 2 Then 1 ELSE 0 END) AS '2', 
SUM(CASE WHEN MONTH(checkout) = 3 Then 1 ELSE 0 END) AS '3', 
SUM(CASE WHEN MONTH(checkout) = 4 Then 1 ELSE 0 END) AS '4', 
SUM(CASE WHEN MONTH(checkout) = 5 Then 1 ELSE 0 END) AS '5',
 SUM(CASE WHEN MONTH(checkout) = 6 Then 1 ELSE 0 END) AS '6', 
SUM(CASE WHEN MONTH(checkout) = 7 Then 1 ELSE 0 END) AS '7', 
SUM(CASE WHEN MONTH(checkout) = 8 Then 1 ELSE 0 END) AS '8', 
SUM(CASE WHEN MONTH(checkout) = 9 Then 1 ELSE 0 END) AS '9', 
SUM(CASE WHEN MONTH(checkout) = 10 Then 1 ELSE 0 END) AS '10', 
SUM(CASE WHEN MONTH(checkout) = 11 Then 1 ELSE 0 END) AS '11', 
SUM(CASE WHEN MONTH(checkout) = 12 Then 1 ELSE 0 END) AS '12'
FROM spl3._rawXmlDataCheckOuts
WHERE
itemtype = 'acbk'
And deweyClass >= 796.6 AND deweyClass < 796.7 AND YEAR(checkOut) >= '2006'
AND YEAR(checkOut) < '2015'
ORDER BY MONTH(checkOut)
Query Time: 40.650 sec

I summed up the data from different years but same month. And the chart below
is the result.
屏幕快照 2016-02-07 下午10.40.08.png
In this chart, the highest check out number is in July. If the condition that the bicycle race is the main reason of data fluctuation, the highest number should
locate on August in the chart. But considering the features of bike-riding, climate might be a reason. People prefer riding bikes in summer to riding in Seattle’s rainy winter.

Supplemental Research
There might be other reasons which can impact the check out number of books from certain sports class. Like “boating”, there might be two reasons. Firstly, the size of check out number in this class is caused by geography reasons. Seattle is a coastal city with perfect bays, which allowed people boating in the sea. In addition, I think this sport is getting popular according to the chart below. Since the equipment boating need is not cheap, economy could be the second reason shaping the pattern of this data.
屏幕快照 2016-02-07 下午10.40.20.png
屏幕快照 2016-02-07 下午10.40.32.png
resource:http://www.statista.com/statistics/1838 ... etro-area/

Conclusion
Although Olympics data shows a particular pattern, it is not quite convincing after comparing with other classes. And other sport events also didn’t show clear connections with the data in the chart. However, there are some other factors that impact the patterns. From my point of view, knowing life in Seattle better will be helpful to find those main factors.
Last edited by junxiangyao on Sun Feb 07, 2016 11:03 pm, edited 8 times in total.

Post Reply