PROJ 0: CULTURE ANALYTICS - MYSQL

xindi
Posts: 8
Joined: Wed Jan 06, 2016 1:39 pm

The Trend in Popularity of World Language Learning

Post by xindi » Thu Jan 14, 2016 10:11 am

M259 Project 1 Xindi (Cindy) Kang
Project Report
------ The Trend in Popularity of World Language Learning
. Question
- What languages are the most popular languages among language learners who borrow books from Seattle Public Library? (Referred to as SPL in this report)
- What does the trend in the popularity of major languages like English, Chinese, Spanish, and Japanese look like in the past decade?
- What about minority languages and Hawaiian, Icelandic and Swedish?
. The Query
Majority languages:
The following is the query for the number of checkouts for Japanese language related books at SPL (in the language deweyClass 490-499) from 2006 to 2016: -----------------------------------------------------------------------------------
SELECT
FLOOR(deweyClass) AS Dewey,
SUM(CASE
WHEN YEAR(checkOut) = 2006 THEN 1
ELSE 0
END) AS '2006',
SUM(CASE
WHEN YEAR(checkOut) = 2007 THEN 1
ELSE 0
END) AS '2007',
SUM(CASE
WHEN YEAR(checkOut) = 2008 THEN 1
ELSE 0
END) AS '2008',
SUM(CASE
WHEN YEAR(checkOut) = 2009 THEN 1
ELSE 0
END) AS '2009',
SUM(CASE
WHEN YEAR(checkOut) = 2010 THEN 1
ELSE 0
END) AS '2010',
SUM(CASE

WHEN YEAR(checkOut) = 2011 THEN 1
ELSE 0
END) AS '2011',
SUM(CASE
WHEN YEAR(checkOut) = 2012 THEN 1
ELSE 0
END) AS '2012',
SUM(CASE
WHEN YEAR(checkOut) = 2013 THEN 1
ELSE 0
END) AS '2013',
SUM(CASE
WHEN YEAR(checkOut) = 2014 THEN 1
ELSE 0
END) AS '2014',
SUM(CASE
WHEN YEAR(checkOut) = 2014 THEN 1
ELSE 0
END) AS '2015',
SUM(CASE
WHEN YEAR(checkOut) = 2014 THEN 1
ELSE 0
END) AS '2016'
FROM
spl3._rawXmlDataCheckOuts
WHERE
deweyClass >= 400 AND deweyClass < 500
AND title LIKE '%japan%'
GROUP BY FLOOR(deweyClass)
ORDER BY SUM(CASE
WHEN
(YEAR(checkOut) >= 2006
OR YEAR(checkOut) <= 2016)
THEN 1
ELSE 0
END) DESC
----------------------------------------------------------------------------------------
(The queries for other languages are the same except for the title keyword)
Results and processing times for Japanese and other major languages:
Japanese: 39.182s
06 07 08 09 10 111213141516 Chinese:
39.612s
06 07 08 09 10 111213141516 English:
39.768s

06 07 08 09 10 111213141516 Spanish:
39.593
06 07 08 09 10 111213141516
Popularity Ranking:
1. English 2. Spanish 3. Chinese 4. Japanese
Minority Languages: Hawaiian:
39.421s

06 Icelandic:
39.693s
07 08 09 10 11 12 13 14 15 16
06 Swedish:
39.761
07 08 09 10 11 12 13 14 15 16
06 07 08 09 10 1112 1314 1516
Popularity Ranking:
1. Swedish 2. Hawaiian 3. Icelandic
Trending:
The popularity of the learning of every majority language researched in this project reached their peak during 2009 and 2010 and then fell back to normal level since then.
The popularity of the learning of the minority languages researched in this project each has different performances over the past decade.
The popularity for Hawaiian learning showed a rather unstable performance whereas Swedish fell back to a descending trend after a peak in 2009.
Icelandic showed a significant increase in 2011 and has remained at that level since then.
Attachments
Swedish.csv
(161 Bytes) Downloaded 281 times
Spanish.csv
(613 Bytes) Downloaded 252 times
Japan.csv
(182 Bytes) Downloaded 258 times
Icelandic.csv
(92 Bytes) Downloaded 276 times
Hawaiian.csv
(157 Bytes) Downloaded 279 times
English.csv
(1.65 KiB) Downloaded 171 times
Chinese.csv
(266 Bytes) Downloaded 266 times
Last edited by xindi on Wed Jan 20, 2016 8:50 pm, edited 1 time in total.

joo
Posts: 4
Joined: Thu Jan 07, 2016 10:19 am

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by joo » Thu Jan 14, 2016 9:31 pm

Checkout rates in travel books, from month to month, year to year
It comes from the idea that specific seasons can affect the direction of travel. Some people might want to travel to warm places in winter, and vice versa. Others prefer just well-known places to visit regardless of weather. By using the data sets in the Seattle Public Library, I am going to see the number of checkouts in particular range of Dewey Classification (from 914 to 919) during 5 years (2010 to 2015). Based on checkout counts, I may assume that which area attracts people’s attention the most in a particular month and year.
Attachments
Woohun_assignment1.pdf
(197.09 KiB) Downloaded 182 times
Woohun_assignment1.csv
(5.47 KiB) Downloaded 246 times

mayelsherif
Posts: 4
Joined: Wed Jan 06, 2016 1:42 pm

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by mayelsherif » Thu Jan 14, 2016 11:58 pm

MAT 259 – Winter 2016 – Project 1
May ElSherif


Do people in Seattle read generic or specialized books? Can labeling be used to increase the checkout rate for items?

Initial question: For non-fictional items, is there a correlation between the number of subject entries (labels/tags) that describe an item in the Seattle Public Library and the number of checkouts? Does it differ across different Dewey classes?

Example:
As shown in Fig.1, the item with bibNumber = 3 has only one entry that describes it while the item with bibNumber = 7 has more than one entry. Could this discrepancy lead to a difference in the number of checkouts as well?
subjectTable.png
Figure 1. Different items with a different number of entries in the subject table.

Labeling and tagging items with keywords are important aspects in data organization. It is used to create search indexes that help users, especially on the web retrieve the information they need.

Methodology:
To tackle the aforementioned question, I began by investigating the deweyClass table, the x_checkOutCountBib, and the subject table. To acquire the data in the needed form to answer the question, a triple join was needed between the previously mentioned tables.

Code: Select all

SELECT 
    deweyClass AS Dewey,
    spl3.deweyClass.bibNumber,
    checkOutCount,
    COUNT(spl3.subject.bibNumber) AS SubjectEntriesCount
FROM
    spl3.x_checkOutCountBib,
    spl3.deweyClass,
    spl3.subject
WHERE
    deweyClass > 0
        AND spl3.x_checkOutCountBib.bibNumber = spl3.deweyClass.bibNumber
        AND spl3.x_checkOutCountBib.bibNumber = spl3.subject.bibNumber
        AND (spl3.subject.subject != ‘’
        OR spl3.subject.subject IS NOT NULL)
GROUP BY bibNumber
ORDER BY deweyClass;
Query Explanation:
The query captures for every bibNumber, the number of subject entries that describe this item and its deweyClass through a triple join. The deweyClass > 0 eliminates fictional items. The conditions spl3.x_checkOutCountBib.bibNumber = spl3.deweyClass.bibNumber
AND spl3.x_checkOutCountBib.bibNumber = spl3.subject.bibNumber
join the three tables. The important condition is that to check for empty string subjects or null subject which is captured by the last condition spl3.subject.subject != ‘’ OR spl3.subject.subject IS NOT NULL. This last condition provoked me to do a separate analysis for items with bibNumbers that has no subject entries.

Results:
Figure 2 depicts a sample of the output CSV file which shows for each bibNumber: the checkOutCount, the number of subject entries and its Dewey classification. To better understand these numbers, a scatter plot that incorporates the number of subject entries on the x-axis, the number of checkouts on the y-axis and the Dewey classification grouped into 10 classes is shown in Figure 3.

Processing time:
For the previously described query, the result contained 476, 936 rows, the duration was 54.448 seconds and the fetch time was 6.249 seconds.


outputTable.png
Figure 2. A sample of the output CSV file.
labesVsCheckOuts.png
Figure 3. Number of checkouts vs. number of subject entries.

Results explanation and further analysis:
As we can see from Figure 3, there is a trend of a decreasing number of checkouts when the number of labels increases. Although many aspects affect the number of checkouts, this result can be interpreted in the sense that people in the Seattle public library tend to check out very specific books. Since a generic book incorporates multiple topics, the probability of this book having more than one subject entry increases. A specialty book (i.e. a book that focuses on fewer points) has the probability of having fewer labels. The above result suggests that people tend to check specialty books more than generic books. On the other hand, the question still remains: “What about unlabeled items?” (i.e. items without subject entries).
I was interested to know how many entries in the subject table contain an empty subject (select * from spl3.subject where subject = ' ' ) and the number of rows was 134, 966 unlabeled items. From these entries, most of them belonged to Dewey Classes from 800 to 900 (as shown in Figure. 4) which suggests that the majority of unlabeled items are Literature items.
UntaggedDewey.png
Figure 4. Histogram of Dewey class for untagged items.
Another question that seems compelling was “Do unlabeled items tend to have lower or higher checkout rate?”
To answer this question, I constructed another query to check for entries with empty string subject entries in the subject table as follows.

Code: Select all

SELECT 
    spl3.deweyClass.bibNumber,
    deweyClass AS Dewey,
    checkOutCount
FROM
    spl3.x_checkOutCountBib,
    spl3.deweyClass,
    spl3.subject
WHERE
    deweyClass > 0
        AND spl3.x_checkOutCountBib.bibNumber = spl3.deweyClass.bibNumber
        AND spl3.x_checkOutCountBib.bibNumber = spl3.subject.bibNumber
        AND spl3.subject.subject = ''
ORDER BY deweyClass;
This query returned 25, 933 rows which suggests that there are 109, 027 entries with zero check out rate. The checkout rate of the rest of the entries is depicted in Figure 5.
If we compare Figure 3 and Figure 5, we can see that the rate of checkout for labeled items is on average higher than the rate of checkout for unlabeled items. This indicates that labeling an item could result in higher checkout rates for that item.
UnlabeledCheckOuts.png
Figure 5. Checkout rate for unlabeled items.
Attachments
checkOutAndSubjectEntries.csv.zip
(2.39 MiB) Downloaded 176 times
MayElSherif - MAT 259 – Winter 2016 – Project 1.pdf
(266.21 KiB) Downloaded 187 times
null-Info.csv
(348.77 KiB) Downloaded 176 times
Last edited by mayelsherif on Thu Jan 21, 2016 12:06 pm, edited 2 times in total.

hkung
Posts: 4
Joined: Wed Jan 06, 2016 12:47 pm

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by hkung » Fri Jan 15, 2016 12:30 am

I would like to investigate possible relation of the number of checkout travel books and months among a year. There are peak season, shoulder season, and off-peak season to travel during a year. Before running the query, I predict that the number would increase prior to peak season while drops afterwards.

***UPDATE***
Following up to my last queries, I found that there were falls in the numbers of travel books checked out during 2010 and 2012. Therefore I wondered if we can also observe similar phenomena among other book categories.
Attachments
UPDATE MAT 259 Data Query.xlsx
(55.2 KiB) Downloaded 175 times
[UPDATE MAT 259] Data Query.pdf
(362.24 KiB) Downloaded 241 times
mat259_kung_hw1_data.zip
(93.25 KiB) Downloaded 168 times
mat259_kung_hw1.pdf
(231.31 KiB) Downloaded 284 times
Last edited by hkung on Mon Jan 25, 2016 11:31 am, edited 2 times in total.

woodfin
Posts: 1
Joined: Wed Jan 06, 2016 12:46 pm

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by woodfin » Sat Jan 16, 2016 2:00 am

Math is very widely despised subject. Even people who work in scientific fields that require the use of math often cringe at the thought of having to evaluate a nasty looking integral. What type of masochist would subject themselves to reading a book on mathematics when there are so many other topics in the library?
I plan to find out.
Let’s study the behavior of people who checkout books on mathematics and see what else they checkout. Are they shut-ins that only focus on pure math? Are they more open and interest themselves in a wide range of subjects? Let’s explore.
Attachments
MAT259_JamesonWoodfin_Projet1.pdf
(202.98 KiB) Downloaded 258 times
NormalDeweyClassFrequencies.csv
(162 Bytes) Downloaded 170 times
MathDeweyClassFrequencies.csv
(152 Bytes) Downloaded 269 times

jordanhughes
Posts: 4
Joined: Wed Jan 06, 2016 12:48 pm

Re: PROJ 1: CULTURE ANALYTICS - MYSQL

Post by jordanhughes » Sat Jan 16, 2016 8:16 pm

I chose to explore the relationship between "banned" books and their popularity before, during, and after being ”banned”. The American Library Association puts out a list every year of the top ten most frequently challenged books and their reasons for being challenged. My goal was to glean any kind of connection between the popularity of a particular title and it’s addition or placement on the list. The yearly list of the top most frequently challenged books can be found at http://www.ala.org/bbooks/frequentlycha ... ooks/top10. My query involved each title on that list between the years 2006-2014.
Attachments
Mat259Hw1.pdf
(1.35 MiB) Downloaded 165 times
Banned Books.xlsx
(20.32 KiB) Downloaded 135 times
challenged_list_final.csv
(4.55 MiB) Downloaded 148 times
challenged_list_final.xlsx
(2.02 MiB) Downloaded 133 times

Post Reply