Proj 1 - Data Exploration Concept & MySQL Query

Professor George Legrady
Post Reply
glegrady
Posts: 203
Joined: Wed Sep 22, 2010 12:26 pm

Proj 1 - Data Exploration Concept & MySQL Query

Post by glegrady » Fri Dec 23, 2022 7:33 am

Proj 1 - Data Exploration Concept & MySQL Query

This is the 1st of 3 assignments to be realized in the MAT 259 Data Visualization course: https://www.mat.ucsb.edu/~g.legrady/aca ... 3w259.html
----------------------------------
PROJECT ASSIGNMENT: The assignment is to come up with an interesting MySQL query with results exploring the Seattle Public Library (SPL) database. Click on the red POST REPLY to post your pdf of your assignment.

In the posting provide:
a) A one paragraph description of your SQL data search project which cna be repeated in the attached pdf file
b) Add a PDF that documents the work you did. The PDF should have your name in the title
c) Add the results of your query as a CSV file(s).
----------------------------------
The PDF should include the following:
. A one-paragraph description of your query search
. Concept description
. MySQL Query
. The data and data results
. Discussion/Analysis of results
----------------------------------
WHAT SHOULD THE QUERY BE ABOUT?
. The query can be about a topic of interest to you that might be represented in the library database
. The query can explore the structure of how the database is organized, how things are classified
. The query can reveal anomalies, errors, outliers in either the data or how the data is organized, or how things are classified
. Consider that this is a warm-up exercise to produce data to be used to create a 3D visualization.

The 3D visualization project will need 5 columns of values:
1) A value for horizontal position
2) A value for vertical position
3) A value for depth position
4) A value for scaling or color for each cell in the visualization
5) Possibly a string value for labeling each indivisual cell
----------------------------------
SOME PREVIOUS STUDENT EXAMPLES
. PREDICTION: https://vislab.mat.ucsb.edu/2020/p1/Gua ... rName.html
. ITEMNUMBER: https://vislab.mat.ucsb.edu/2019/p1/Jia ... index.html
. PROBLEMS WITH DATA: https://vislab.mat.ucsb.edu/2019/p1/Ale ... eport.html
. CORRELATION: https://vislab.mat.ucsb.edu/2022/p1/Lij ... Cheng.html
. INTEREST IN A TOPIC: https://vislab.mat.ucsb.edu/2021/p1/Lar ... hetty.html

REVIEW OTHER TOPICS in previous first assignements: https://vislab.mat.ucsb.edu/courses.html
Additional examples from the 2022 Fall M265 class: viewforum.php?f=87
----------------------------------
EVALUATION CRITERIA
. The topic should be relevant or interesting. Describe why.
. The query should result in extensive data saved as a csv file.
. Consider the whole database, going beyond the Dewey numeric classification
----------------------------------
SCHEDULE
January 10, 2023, Tues - Course Overview & Introduction to MySQL
--
January 12, 2023, Thurs - Review & Test MySQL examples
https://www.mat.ucsb.edu/~g.legrady/aca ... wMysql.pdf (Examples for each of the SPL metadata)
https://www.mat.ucsb.edu/~g.legrady/aca ... MySQL2.pdf
https://www.mat.ucsb.edu/~g.legrady/aca ... tudent.pdf
--
January 17, 2023, Tues - Review of the MySQL Assignment
--
January 19, 2023, Thurs - Further examples of previous project documentations and In Lab work production
--
January 24, 2023, Thurs - In Class result presentations, Introduction to Processing
----------------------------------
FURTHER DISCUSSION: Data Exploration & Knowledge Discovery through a large multivariate dataset

The first assignment is to explore a large database consisting of multivariate data with the intent to discover and extract patterns with MySQL that may reveal something of interest. SQL is the standardized language used to access the database: https://www.mysqltutorial.org/what-is-mysql/


We have access to a unique database consisting of checkouts of books, cds, dvd's documented by the hour since January 1, 2006 that represents the aggregated cultural interests of downtown Seattle, but also of the larger national interests. The database currently consists of over 101 million checkouts and returns.
----------------------------------
THE CONCEPT: In any database, there lies hidden knowledge. What does a database contain, and what can MySQL queries reveal? Your first assignment is to find something of interest based on your own interests and skillsets. Here are some options:

1) Patterns, Probability & Prediction
. What is the performance of a topic(s), a title(s), media over time, or by volume, or by trends?
. What can be predicted based on previous performance?
. Can it be predicted how a sequence of data change over time?

2) The Database Organizational Structure
. What anomalies, errors, outliers, illogical classification methods, etc. may be revealed within the organizational and classification of how items are encoded
(All databases have outliers, anomalies, errors in the system as its impossible to precisely classify all things within a structured form)

3) Additional Irregularities in the Data Situation
. 2020 has had an unprecedented impact on the database as the library was closed for over 5 months between March to September. Nonetheless items were circulating.
. Electronic books have been in the collection since 2009 but these are not recorded in the database we receive. Nonetheless they can be reviewed through other means:

4) Data Analytics Query Methods
. Explore statistical methods or algorithms to retrieve or process data
. Are there any machine-learning opportunities in analyzing the data?
----------------------------------
The database consists of multivariate data. For each checkout there exists the following metadata:

Ordinal (In a numeric sequence)
ID: Assigned by the database to keep track of each entry
ItemNumber: Assigned by the library when an object enters the system
Dewey Classification (Dewey numeric) The item's dewey classification if it is recorded as a Dewey (non-fiction) item

Interval Scale (Time-Stamp)
Check-out/check-in in minutes, hour, day, month, year

Categorical (Not necessarily numerically orderable)
BibNumber: Each title has a specific number, copies of titles all have same number. Defined by the LIbrary of Congress
Barcode: Each item has a unique number on RFID sticker
CallNumber: by which to locate items on shelves - Ordinal if Dewey, otherwise categorical. Multiple copies of same item may share same call number but have different barcodes and itemNumber
CollCode: What the item is and where its located: https://data.seattle.gov/Community/Libr ... /6vkj-f5xf

Semantic (Text-based)
Title: Each Item has a title
ItemType: books, cds, dvds, music sheets, etc.
Subjects: Keywords (arbitrary labeling). These are located in a separate database:
----------------------------------
The library uses the Dewey Decimal system by which to organize non-fiction item but the majority of the items in the library collection do not have Dewey classification labels. For instance, music, movies, seem to be distributed in both Dewey and non-Dewey ways of classification. The most popular Dewey tend to be comic books, cookbooks, health, travel books, etc. A daily insight to the Dewey performance can be tracked at: http://128.111.26.109/parsing/index.php
---------------------------------
Specific labeling of your Documents
Please make sure to label your documents like csv files by the name of your project, or your name so we can identify where they come from
George Legrady
legrady@mat.ucsb.edu

lu_yang
Posts: 9
Joined: Mon Sep 26, 2022 10:23 am

Concurrency Based Unidirectional Correlations

Post by lu_yang » Sun Jan 22, 2023 1:59 pm

The Question
I wanted to look at pairs of bibliographical items that were borrowed together and measurement that can be established based on this concurrency. I believe this is interesting as it will review circulation history and correlate items even though they have irrelevant metadata. From an agent-based modelling perspective, I’m more interested in creating unidirectional correlations and measure concurrence frequency based on Dewey class and subjects.

The Approach
In order to find items that were borrowed together, I assumed that items that had the same checkout and checkin timestamp were likely to be checked out by the same person. It should be noted that it is possible that
1) two individuals checked in/out books at the exact same time
2) librarians hold books at a rotation interval, so batches of books checked in/out by different person at different time may have same timestamp
3) processing time by the same librarian can lead to a few minutes difference on the timestamp
4) someone checked out multiple books but did not return them all at the same time

The Query
In order to perform the search, I looked at the cartesian product of the inraw table with itself and selected bibliographical items that had the same checkout and checkin times as pairs. To measure concurrence frequency, I also looked at 3-digit Dewey class, which contains 1) main class; 2) division; 3) section; and subjects, which are retrieved as a concatenated string from the subject table.

Code: Select all

SELECT 
    t1.bibNumber AS A_bib,
    FLOOR(t1.deweyClass) AS A_dewey,
    t3.subject AS subject,
    t2.bibNumber AS B_bib,
    FLOOR(t2.deweyClass) AS B_dewey,
    t4.subject AS subject
	
FROM
    (SELECT 
        bibNumber,
            GROUP_CONCAT(subject
                SEPARATOR ';') AS subject
    FROM
        spl_2016.subject
    WHERE
        spl_2016.subject.subject != ''
    GROUP BY bibNumber) AS t3,
    
    (SELECT 
        bibNumber,
            GROUP_CONCAT(subject
                SEPARATOR ';') AS subject
    FROM
        spl_2016.subject
    WHERE
        spl_2016.subject.subject != ''
    GROUP BY bibNumber) AS t4,
    
    spl_2016.inraw t1
        INNER JOIN
    spl_2016.inraw t2 ON t1.cout = t2.cout AND t1.cin = t2.cin
        AND t1.bibNumber != t2.bibNumber
        AND t1.deweyClass != ''
        AND t2.deweyClass != ''
        AND YEAR(t1.cout) > 2017

WHERE
    t1.bibNumber = t3.bibNumber
        AND t2.bibNumber = t4.bibNumber
Findings
I have queried the entire dataset and retrieved 3,207,172 rows in 307 seconds. Attached is a sample of the data. The data shows correlation between two identical bibliographical items A and B with unique bibNumbers (copies with different format are treated as one bibliographical item). From this spreadsheet, further concurrence frequency can be counted such as A_dewey to B_dewey, A_dewey to B_subject, A_subject to B_subject.
1.png
These three frequencies are important to me because I’m interested in visualizing these correlations in an agent-based swarm simulation. The aggregation of these frequencies will determine the attraction/repulsion force from A to B, not necessarily from B to A, when two random bibliographical items meet.
One thing to notice is that Unicode character in subjects is creating duplicated information(such as below), and I will try to clean up the subject lists.
2.png
Output https://ucsb.box.com/s/s0jwt29kn13dznpnxoqocekidv7sjs88

briannagriffin
Posts: 11
Joined: Fri Sep 23, 2022 10:04 am

Re: Proj 1 - Data Exploration Concept & MySQL Query

Post by briannagriffin » Mon Jan 23, 2023 1:39 pm

For the first project, I will be looking at irregularities in the data set through outliers. Specifically, I will be looking at checkout time in days for the first book in the Twilight series by Stephenie Meyer, namely "Twilight". Published in 2005, the book is based in Forks, Washington around 100 miles away from Seattle. Thus, it has a lot of prevalence within the Seattle Public Library with a lot of years of checkout and checkin data. Specifically, I would like to answer the following questions:

- Are there any observations that do not fit the rest of the data set?
- When and why do these observations occur?
- How do the outliers affect statistics of the sample?

I will address the above questions in my analysis below. This will include SQL codes, corresponding CSV outputs, descriptions, and visualizations.

Here is the pdf containing my project:
Project1_Mat259_BriannaG.pdf
(1.14 MiB) Downloaded 52 times
Below are the CSV files containing the data:
num_of_checkouts.csv
(943.74 KiB) Downloaded 39 times
avg_time_CO_per_year.csv
(286 Bytes) Downloaded 45 times
avg_stdev.csv
(55 Bytes) Downloaded 42 times
num_outliers.csv
(68 Bytes) Downloaded 38 times
outliers_by_year.csv
(111 Bytes) Downloaded 44 times
outliers_08_09_10.csv
(348 Bytes) Downloaded 45 times
3movies_checkout.csv
(953 Bytes) Downloaded 51 times
stats_not_an_outlier.csv
(56 Bytes) Downloaded 49 times
SPL_overall_couts_yearly.csv
(394 Bytes) Downloaded 45 times
SPL_overall_cins_yearly.csv
(395 Bytes) Downloaded 42 times
3D_visualization_prep_5columns.csv
(465.22 KiB) Downloaded 49 times

jhutson
Posts: 3
Joined: Thu Jan 12, 2023 1:08 pm

Re: Proj 1 - Data Exploration Concept & MySQL Query

Post by jhutson » Mon Jan 23, 2023 3:54 pm

For my project, I am curious if the titles of books checked out in 2022 can be mapped to happiness level throughout the year, inspired by the Hedonometer (https://hedonometer.org/timeseries/en_a ... =wordshift). Please see further explanation and analysis in the attached PDF.
Attachments
2022bookcouts.csv
(139.93 MiB) Downloaded 49 times
Hedonometer.csv
(389 KiB) Downloaded 43 times
HappinessPerDay.csv
(4.14 KiB) Downloaded 36 times
JenniHutsonMySQLAssignment.pdf
(293.99 KiB) Downloaded 54 times

hoverbye
Posts: 1
Joined: Thu Jan 12, 2023 1:16 pm

Re: Proj 1 - Data Exploration Concept & MySQL Query

Post by hoverbye » Mon Jan 23, 2023 4:21 pm

Do People Read Sequels… And other questions

For my first project, I wanted to ask two questions

1.) Do people read sequels?
2.) What is the impact for film adaptations on sequels?

To answer the first question, I pulled information from seven different best-selling book series (Dune, Fifty Shades, The Hunger Games, Twilight, Percy Jackson, Lord of the Rings, Harry Potter and the Chronicles of Narnia)

Code: Select all

SELECT
title, itemType, YEAR(cout), COUNT(bibNumber) AS Counts 
FROM
spl_2016.outraw
WHERE title IN ('dune', 'dune messiah', 'Children of Dune', 'God Emperor of Dune', 'Heretics of Dune', 'Chapterhouse', 
'Harry Potter and the sorcerers stone', 'Harry Potter and the Chamber of Secrets', 'Harry Potter and the Prisoner of Azkaban', 'Harry Potter and the Goblet of Fire', 'Harry Potter and the Order of the Phoenix', 'Harry Potter and the Half-Blood Prince', 'Harry Potter and the Deathly Hallows', 
'fellowship of the ring being the first part of The lord of the rings', 'two towers being the second part of The lord of the rings', 'return of the king being the third part of The Lord of the rings', 
'Lightning Thief', 'Sea of Monsters', 'Titan’s Curse', 'Battle of the Labyrinth', 'Last Olympian', 
'C S Lewis The lion the witch and the wardrobe', 'Prince Caspian the return to Narnia', 'Voyage of the Dawn Treader', 'Silver Chair', 'Horse and His Boy','Last Battle', 
'Twilight', 'New Moon', 'Eclipse', 'Breaking Dawn',
'Hunger Games', 'Catching Fire', 'Mockingjay',
'Fifty Shades of Grey', 'Fifty Shades Darker', 'Fifty Shades Freed')

AND (itemType LIKE '%acbk%' OR 'arbk' OR 'bcbk' OR 'bccd')
GROUP BY itemType, title, YEAR(cout)
ORDER BY Counts DESC;
After pulling the books, the Percy Jackson Series was dropped from the analysis due to a lack of data. Next, the data was processed and analyzed in R.

Code: Select all

# Data Vis HW 1 DUNE 
library(ggplot2) 
library(dplyr) 
library(tidyverse) 
library(readr)
library(jtools)

all_books <- read_csv("book_series_data.csv")

#### All Books #### 
all_books <- read_csv("book_series_data.csv")
names(all_books) <- c("title", "itemType", "year", "count")

##### Data Cleaning #### 

all_books$title <- tolower(all_books$title)
unique(all_books$title)

hungergames <- c("hunger games", "catching fire", "mockingjay")
twilight <- c("twilight", "new moon", "eclipse", "breaking dawn") 
fiftyshades <- c("fifty shades of grey", "fifty shades darker", "fifty shades freed") 
dune <- c("dune", "dune messiah", "children of dune","god emperor of dune", "heretics of dune", "chapterhouse")
narnia <- c("c s lewis the lion the witch and the wardrobe","horse and his boy", "prince caspian the return to narnia","voyage of the dawn treader", "silver chair")
LOTR <- c("fellowship of the ring being the first part of the lord of the rings", "two towers being the second part of the lord of the rings", "return of the king being the third part of the lord of the rings")
harrypotter <- c("harry potter and the sorcerers stone","harry potter and the chamber of secrets", "harry potter and the prisoner of azkaban")

first <- c("hunger games", "twilight", "fifty shades of grey", "dune", "c s lewis the lion the witch and the wardrobe", "fellowship of the ring being the first part of the lord of the rings", "harry potter and the sorcerers stone") 
second <- c("catching fire", "new moon", "dune messiah", "horse and his boy", "two towers being the second part of the lord of the rings", "harry potter and the chamber of secrets", "fifty shades darker")
third <- c("mockingjay","eclipse", "children of dunee", "prince caspian the return to narnia", "return of the king being the third part of the lord of the rings", "harry potter and the prisoner of azkaban", "fifty shades freed") 
fourth <- c("breaking dawn", "god emperor of dun", "voyage of the dawn treader")
fifth <- c("heretics of dune","silver chair") 
sixth <- c("chapterhouse")

all_books <- all_books %>% 
  mutate(series = ifelse(title %in% hungergames, "hungergames", 
                         ifelse(title %in% twilight, "twilight",
                                ifelse(title %in% fiftyshades, "fiftyshades",
                                       ifelse(title %in% dune, "dune",
                                              ifelse(title %in% narnia, "narnia",
                                                     ifelse(title %in% LOTR, "LOTR",
                                                            ifelse(title %in% harrypotter, "harrypotter", NA)))))))) %>% 
  mutate(book_order = ifelse(title %in% first, "first", 
                             ifelse(title %in% second, "second",
                                    ifelse(title %in% third, "third",
                                           ifelse(title %in% fourth, "fourth",
                                                  ifelse(title %in% fifth, "fifth",
                                                         ifelse(title %in% sixth, "sixth", NA)))))))

all_books$book_order <- factor(all_books$book_order, ordered = T, levels = c("first", "second", "third", "fourth", "fifth", "sixth"))


all_sequels <- all_books %>% 
  select(series, count, book_order) %>% 
  group_by(series, book_order) %>% 
  summarise(count = sum(count)) %>%
  na.omit()


ggplot(all_sequels, aes(as.numeric(book_order), count)) + 
  geom_line(aes(color = series)) + 
  geom_point(aes(color = series)) +
  ylab("Number of Books Checked Out") + 
  xlab("Book Position in Series") +
  theme_apa()
For plot, please see attached presentation.

My conclusion for my first question is that the first book in a series tends to be read the most and is followed by a steep drop in readership.

RQ2

Next, I pulled just the Dune Data from SQL

Code: Select all

SELECT
title, itemType, YEAR(cout), COUNT(bibNumber) AS Counts 
FROM
spl_2016.outraw
WHERE title IN ('dune', 'dune messiah', 'Children of Dune', 'God Emperor of Dune', 'Heretics of Dune', 'Chapterhouse')
AND (itemType LIKE '%acbk%' OR 'arbk' OR 'bcbk' OR 'bccd')
GROUP BY itemType, title, YEAR(cout)
ORDER BY Counts DESC;
And then processed and analyzed it using r

Code: Select all

dune <- read_csv("dune.csv")
dune$title <- factor(dune$title, levels = c("Dune", "Dune Messiah", "Children of Dune", "God Emperor of Dune", "Heretics of Dune"))

ggplot(dune, aes(title, Counts)) + 
  geom_col(aes(fill = title), show.legend = F) +
  scale_fill_brewer(palette = "Oranges") +
  theme_apa() + 
  xlab("Book Title")

dune_year <- read_csv("dune_year.csv")
names(dune_year) <- c("title", "itemType", "year", "count")
dune_year <- dune_year %>% 
  filter(year > 2005) %>% 
  filter(title == "Dune")

ggplot(dune_year, aes(year, count)) + 
  geom_col(aes(fill = title), show.legend = F) + 
  scale_fill_manual(values = c("#FFCC96")) +
  theme_apa()

sum(dune_year$count)

dune_books <- all_books %>% 
  filter(title %in% c("Dune", 'dune messiah', 'Children of Dune', 'God Emperor of Dune', 'Heretics of Dune', 'Chapterhouse'))

ggplot(dune_books, aes(year, count)) + 
  geom_line(aes(color = title), size = 2) + 
  scale_x_continuous(breaks = unique(dune_books$year)) + 
  ylab("Number of Books Checked Out") + 
  theme_apa()
For plot, see attached presentation.

According to this data, it seems as though the Dune movie did impact dune readership.
Attachments
book_series_dataOverbye.csv
(14.22 KiB) Downloaded 50 times
dune_yearOverbye.csv
(2.8 KiB) Downloaded 47 times
datavis_project1_overbye.pdf
(2.34 MiB) Downloaded 72 times

paulawang
Posts: 1
Joined: Thu Jan 12, 2023 4:15 pm

Re: Proj 1 - Data Exploration Concept & MySQL Query

Post by paulawang » Tue Jan 24, 2023 9:27 am

Project Description
I am a student in the Communication department. In general, the term ‘communication’ is very broadly defined and can be interpreted in many ways. In the social science field, communication research encompasses a variety of areas such as interpersonal communication (e.g., parent-child, couples etc.), mass communication (e.g., social media, news etc.), organizational communication (e.g., workplace, political entities etc.), and computer-mediated communication (e.g., online messaging, video calling etc.). My research aim was to identify which topics of communication are most interesting to the general public.

My analysis was done in two steps:
(1) Identifying how Dewey categories of communication-related books performed over time
(2) Extracting the subtopics of communication that were checked out within the most popular Dewey categories

I found that (in descending order) the general public checked out communication-related books in the Dewey categories of technology, philosophy and psychology, social science, and finally language. Upon further analysis, I identified that titles in the technology category focused on communication technologies in the workplace, in broadcasting contexts, and in parenting. Titles in the philosophy and psychology category focused on conflict communication, defensive communication, and communication more broadly across animals and people. Titles in the social science category focused on political communication, internet communication, and the history of the communication industry. Finally, titles in the language category focused on infant communication, sign language, and speaking clearly and naturally.

Attachments
- Assignment1.pdf (assignment writeup)
- Assignment1_presentation.pdf (presentation slides)
- communication.csv (query 1 output)
- technology.csv (query 2 output)
- psych.csv (query 3 output)
- social.csv (query 4 output)
- language.csv (query 5 output)
Attachments
Assignment1.pdf
(613.39 KiB) Downloaded 43 times
Assignment1_presentation.pdf
(1.01 MiB) Downloaded 53 times
communication.csv
(681 Bytes) Downloaded 40 times
language.csv
(2.61 KiB) Downloaded 44 times
social.csv
(10.47 KiB) Downloaded 45 times
psych.csv
(3.99 KiB) Downloaded 41 times
technology.csv
(16.03 KiB) Downloaded 38 times
Last edited by paulawang on Tue Jan 24, 2023 1:14 pm, edited 1 time in total.

arnavkumar
Posts: 3
Joined: Fri Jan 07, 2022 12:14 pm

Re: Proj 1 - Data Exploration Concept & MySQL Query

Post by arnavkumar » Tue Jan 24, 2023 11:40 am

Arnav Kumar’s Project 1

Initial Concept:

The focus of this analysis is on the distribution of checkouts between electronic and physical books in the field of computer science, which is known for the fact that a wealth of valuable information could be gathered about the field online, especially for free, which is likely due to its groundings in the theme of technology and the Internet in general. Thus, it is of value to attempt to discover whether there is a higher frequency of checkouts for electronic books about computer science as compared to physical books, and even more interesting would be to see the differentials during the COVID-19 pandemic as compared between the two modalities. Due to limitations with the data, as well as the relative lack of popularity of specialized topics in computer science with the general public, this analysis opted to focus on introductory topics in computer science, which is the vast majority of the content of such material in any case.
Attachments
Arnav Kumar MAT259 Project 1 Report.pdf
(276.47 KiB) Downloaded 29 times
ArnavKumarMAT259MySQLResults.csv
(544 Bytes) Downloaded 43 times
ArnavKumarMAT259SocrataAPIResults.csv
(717 Bytes) Downloaded 34 times
Last edited by arnavkumar on Mon Mar 06, 2023 10:56 am, edited 1 time in total.

yanchenlu
Posts: 3
Joined: Thu Jan 12, 2023 4:20 pm

Re: Proj 1 - Data Exploration Concept & MySQL Query

Post by yanchenlu » Tue Jan 24, 2023 12:01 pm

For this project, I want to look at how books regarding LGBTQ topics are represented and classified in the Seattle Public Library, and how has the public's interest in them has changed throughout the years. Particularly, I'm curious about which Dewey Classes do these books fall in. Additionally, since the Seattle Public Library maintained records from 2006 to today, I've also aggregated the numbers of checkouts throughout the decades of books on different groups of queer subjects to see what have gained more interests from readers and what has fallen out of popularity.

Link to write-up webpage version: https://cold-chocolate.notion.site/Proj ... 36ea547a7b
Attachments
graphs.zip
Contains 4 graphs used in the project report
(484.04 KiB) Downloaded 35 times
code.zip
Contains 3 files:
- proj1_queries.sql: all MySQL queries I've conducted for this project
- viz_dewey.py: code to generate the Dewey classification distribution bar charts, using matplotlib and pandas
- viz_stacked_trend.py: code to generate the checkout trends as stacked area charts, using matplotlib and pandas
(2.73 KiB) Downloaded 44 times
data.zip
Contains 3 sub-folders: joined_datasets, dewey, and trend.
Each sub-folder contains datasets generated from corresponding queries, used for data extraction and visualization.
- joined_datasets: joining outraw and subjects databases
- dewey: extracted books with their Dewey classification
- trend: checkout records of subgroups of books from 2006 to 2023
(973.75 KiB) Downloaded 43 times
Project_1_Write-up.pdf
Project report, pdf version
(747.55 KiB) Downloaded 41 times
Last edited by yanchenlu on Wed Jan 25, 2023 11:22 pm, edited 6 times in total.

zeyuwang
Posts: 3
Joined: Thu Jan 12, 2023 4:26 pm

Re: Proj 1 - Data Exploration Concept & MySQL Query

Post by zeyuwang » Tue Jan 24, 2023 12:20 pm

This project aims to explore the popularity of Chinese-related media in the Seattle Public Library by analyzing checkout numbers for various item types, such as books, CDs, and music, from 2006 to 2022. Using SQL to query data, this study provides a detailed examination of trends and category percentages over time, as well as an analysis of the Dewey Decimal System classification of Chinese culture. Through data visualization, this project aims to gain a comprehensive understanding of the representation and popularity of Chinese-related media in the Seattle Public Library, which can provide valuable insights for cultural studies, library management, and Seattle Public Library users. The project will compare Dewey Classification Checkout Numbers for Chinese-related books in 2006 and 2022, which can give an idea about how Chinese culture is represented and how it has evolved in the Seattle Public Library.
Attachments
mat259_project1_zeyu.zip
(912.29 KiB) Downloaded 43 times

qinghuang
Posts: 5
Joined: Thu Jan 12, 2023 1:09 pm

Re: Proj 1 - Data Exploration Concept & MySQL Query

Post by qinghuang » Tue Jan 24, 2023 2:02 pm

For this assignment, I want to examine whether the development history of social media and artificial intelligence are aligned with check-out book records in the Seattle library. Both social media and artificial intelligence are emerging technologies that have become known to the public and led to some heated discussions.

By comparing these two concept-related books' check-out records together and aligning them with the significant events in both fields, this analysis can provide us some insights into how the public gets to learn about new technologies through borrowing books from the library.

Code: Select all

SELECT
YEAR(cout) AS years,
COUNT(cout) AS counts
FROM
spl_2016.outraw
WHERE
((LOWER(title) LIKE '%social media%')
        OR '%social networking site%'
        OR '%social networking sites%')

AND YEAR(cout) <= '2022'
GROUP BY years
ORDER BY years ASC;

SELECT
YEAR(cout) AS years,
COUNT(cout) AS counts
FROM
spl_2016.outraw
WHERE
(( LOWER(title) LIKE '%artificial intelligence%')
OR '%ai%')
AND YEAR(cout) <= '2022'
GROUP BY years
ORDER BY years ASC;

SELECT
YEAR(cout) AS years,
MONTH(cout) AS months,
COUNT(cout) AS counts
FROM
spl_2016.outraw
WHERE
((LOWER(title) LIKE '%social media%')
        OR '%social networking site%'
        OR '%social networking sites%')
AND YEAR(cout) <= '2022'
GROUP BY years, months;

SELECT
YEAR(cout) AS years,
MONTH(cout) AS months,
COUNT(cout) AS counts
FROM
spl_2016.outraw
WHERE
(( LOWER(title) LIKE '%artificial intelligence%')
OR '%ai%')
AND YEAR(cout) <= '2022'
GROUP BY years, months
Attachments
social-media-year.csv
(158 Bytes) Downloaded 42 times
ai-year.csv
(166 Bytes) Downloaded 44 times
social-media-month.csv
(1.79 KiB) Downloaded 31 times
ai-month.csv
(1.99 KiB) Downloaded 41 times
Presentation-Qing.pdf
(279.35 KiB) Downloaded 46 times
MAT259-Assignment1-Qing Huang.pdf
(363.74 KiB) Downloaded 47 times

Post Reply