Proj 1 - Data Mining, Knowledge Discovery

glegrady
Posts: 203
Joined: Wed Sep 22, 2010 12:26 pm

Proj 1 - Data Mining, Knowledge Discovery

Post by glegrady » Mon Dec 30, 2019 5:53 pm

Tues - January 07, 2020 - Introduction to MySQL
Thur - January 09, 2020 - Review MySQL Assignment (this)
Tues - January 14, 2020 - MySQL discussion and examples
Thur - January 16, 2020 - Student presentation

DATA MINING / KNOWLEDGE DISCOVERY:
The assignment is to explore the Seattle Public Library (SPL) database, which holds hourly checkouts (and returns) of books, CDs, DVDs, and other items since January 1, 2006. Our approach is "Knowledge Discovery in Databases" (KDD), also called "data mining": the discovery and extraction of patterns of interest to gain knowledge. On a broad level, the database provides a historical overview of the cultural interests of the Seattle downtown community, but it can also be seen as a sampling of North American cultural interests over a fifteen-year period. For those interested in the structure of the database, there is much to explore in how the data is organized and in the anomalies in the system. Please review the discussion at: https://www.mat.ucsb.edu/~g.legrady/aca ... _notes.pdf

There are approx. 97 million records in the database, representing 15 years of activity. The data is multivariate and is classified in two main categories. The library uses the Dewey Decimal system to organize non-fiction items: https://en.wikipedia.org/wiki/List_of_D ... al_classes but over 50% of the items in the library collection are fiction and are therefore coded as non-Dewey items, which have no systematic numeric classification. Music and movies, for instance, seem to be distributed across both Dewey and non-Dewey classification. The most popular Dewey items can be viewed hourly at: http://128.111.26.109/parsing/index.php ... &d=05&h=17 The most popular items include comic books, cookbooks, health books, travel books, etc.

--------------------------------------------------------------

MYSQL ASSIGNMENT - THE CONCEPT: In any database, there lies hidden knowledge. What does a database contain, and what can MySQL queries reveal? Your first assignment is to find something of interest based on your own cultural / knowledge interests. Here are some options:

1) Topics of Cultural Content:
. How does a topic, a title, or a medium perform over time, by volume, by trend, or in relation to news events?
. How does the collection change over time?

2) The Database Organizational Structure
. What classification approach does the library use?
. What anomalies, errors, outliers, illogical classification methods, etc. are hidden and can be revealed?
(All databases have these, as it is impossible to precisely classify all things within a structured system)

3) Data Analytics Query Methods
. Explore statistical methods or algorithms to retrieve or process data

--------------------------------------------------------------

GETTING THE DATA: Use MySQL Workbench to write a query that retrieves the data from the SPL database. Guidelines for running the query are here: https://www.mat.ucsb.edu/~g.legrady/aca ... LayOut.pdf

Use the spl_2016 database, which is updated daily. Become familiar with the metadata that describes each checkout: https://www.mat.ucsb.edu/~g.legrady/aca ... ataDef.pdf

Decide which metadata you need; these will become the columns in the csv file. "SELECT *" on the inraw or outraw table returns 11 columns:

Categorical:
bibNumber: Defined by Library of Congress - multiple copies of the same title have the same bibnumber
collcode: Collection type, name and physical branch location
itemtype: What media (book, cd, dvd, etc.)
callNumber: Multiple copies of the same item share the same call number but have different barcodes and itemNumbers

Date:
cout: checkout date/timestamp
cin: checkin date/timestamp

Numeric / Time Series:
ID: The numeric location in the database
itemNumber: The item's acquisition date
DeweyClass: "" (null) if not Dewey, otherwise https://en.wikipedia.org/wiki/Dewey_Dec ... sification

Text:
title: title of items
subj: Keywords

In the example below, the query returns 4 columns: each distinct bibNumber matching the title (17 were found); its itemtype (book, dvd, etc.); the item's title; and the total count for each bibNumber, ordered from most to least.

SELECT
    bibNumber, itemType, title, COUNT(bibNumber) AS Counts
FROM
    spl_2016.inraw
WHERE
    title LIKE '%Blade Runner%'
GROUP BY bibNumber, itemType, title
ORDER BY Counts DESC;
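
To see the shape of the result before connecting to the server, the query above can be tried against a tiny stand-in table. The sketch below is not the SPL server. It assumes Python's built-in sqlite3 module in place of MySQL, attaches an in-memory database under the name spl_2016 so the FROM clause can stay unchanged, and fills a reduced three-column inraw table with made-up rows (the bibNumbers, titles, and item types are invented for illustration only).

```python
import sqlite3

# Stand-in for the real server: an in-memory SQLite database attached under
# the name spl_2016 so that "spl_2016.inraw" resolves as it does in MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS spl_2016")
conn.execute(
    "CREATE TABLE spl_2016.inraw (bibNumber INTEGER, itemType TEXT, title TEXT)"
)

# Made-up rows for illustration only; each row stands for one checkout.
toy_rows = [
    (1001, "dvd", "Blade Runner"),
    (1001, "dvd", "Blade Runner"),
    (1001, "dvd", "Blade Runner"),
    (1002, "book", "The making of Blade Runner"),
    (1002, "book", "The making of Blade Runner"),
    (1003, "book", "An unrelated title"),
]
conn.executemany("INSERT INTO spl_2016.inraw VALUES (?, ?, ?)", toy_rows)

# Same query as in the assignment text.
QUERY = """
SELECT bibNumber, itemType, title, COUNT(bibNumber) AS Counts
FROM spl_2016.inraw
WHERE title LIKE '%Blade Runner%'
GROUP BY bibNumber, itemType, title
ORDER BY Counts DESC
"""

# One output row per distinct bibNumber, most checked-out first.
results = conn.execute(QUERY).fetchall()
for row in results:
    print(row)
```

Each output row is one bibNumber with its checkout count, and the row whose title does not match the LIKE pattern is filtered out before grouping. Date functions differ between SQLite and MySQL, so queries that group by year or hour should be tested in MySQL Workbench directly.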

--------------------------------------------------------------

DO THE ASSIGNMENT:
Run MySQL queries against the Seattle database "spl_2016" using a "knowledge discovery" approach. State a question of interest and describe the steps in your exploration. Include your MySQL queries and results. You can conclude with an analysis commentary.

The assignment does not require visualization, but keep in mind that for the 3D project and your final project you will need 4 columns of numeric values, for instance time-based or ordered data (itemNumber, dewey, volume, etc.). Three of these will be used for the horizontal, vertical, and depth positions, and the fourth will define the color density of the positioned item.
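
As a rough sketch of the four-column idea above, the following Python snippet rescales four numeric columns of a csv export to the range 0..1 so they can serve as x, y, z, and color values. The column names (itemNumber, deweyClass, checkoutYear, Counts), the sample values, and the helper functions are all assumptions made for illustration; a real export from MySQL Workbench will have whatever columns your query selects.

```python
import csv
import io

# Made-up export for illustration; a real file would come from your own query.
SAMPLE_CSV = """itemNumber,deweyClass,checkoutYear,Counts
1000,641.5,2006,10
3000,791.43,2012,40
5000,910.2,2019,25
"""


def normalize(values):
    """Rescale a list of numbers linearly to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no positional information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def to_points(csv_text, x_col, y_col, z_col, color_col):
    """Return one (x, y, z, color) tuple per csv row, each value in 0..1."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = [
        normalize([float(row[name]) for row in rows])
        for name in (x_col, y_col, z_col, color_col)
    ]
    return list(zip(*columns))


points = to_points(SAMPLE_CSV, "checkoutYear", "deweyClass", "itemNumber", "Counts")
for x, y, z, color in points:
    print(f"x={x:.2f} y={y:.2f} z={z:.2f} color={color:.2f}")
```

Which column drives which axis is a design choice. Here time is placed on the horizontal axis and checkout volume on color, but any four numeric columns can be swapped in.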

Once you have all the material, click "POST REPLY" in this thread, add your info, and attach the completed assignment as a pdf.

--------------------------------------------------------------

Case Study of a Topic: The Sci-fi Movie Blade Runner

Rodger Liu assembled a case study a few years ago that can be used as an example: https://www.mat.ucsb.edu/~g.legrady/aca ... Report.pdf

--------------------------------------------------------------

View previous student examples in the student forum: viewtopic.php?f=75&t=303#p2015 (2018) and viewtopic.php?f=77&t=313 (2019)

A few interesting ones from last year:
Chantal Nguyen was interested in the performance of cookbooks: viewtopic.php?f=77&t=313#p2127
Sandy Schoettler was interested in a statistical analysis: viewtopic.php?f=77&t=313#p2128

--------------------------------------------------------------
George Legrady
legrady@mat.ucsb.edu

evgenynoi
Posts: 3
Joined: Wed Jan 08, 2020 10:54 am

Re: Proj 1 - Data Mining, Knowledge Discovery

Post by evgenynoi » Tue Jan 14, 2020 11:38 pm

For my project with Seattle Public Library (SPL) I decided to visualize a journey of a book in space and time. Unfortunately, the provided dataset does not have any information on the readers or their places of residence, so I decided to employ a Gravity model prevalent in geography and spatial analysis as well as other probabilistic methods to assess the service areas of SPL branches and hypothesize about the trajectory of the book given random or pseudorandom processes.
Attachments
asgn1_noi.pdf
(1.62 MiB) Downloaded 116 times
Last edited by evgenynoi on Thu Jan 16, 2020 12:26 pm, edited 2 times in total.

lisahan
Posts: 1
Joined: Wed Jan 08, 2020 10:59 am

Re: Proj 1 - Mediating the Oceans

Post by lisahan » Wed Jan 15, 2020 10:13 am

One of the core lessons in media studies is Marshall McLuhan’s oft-quoted claim that “the medium is the message.” That is to say, the forms of media and processes of mediation that we engage with are as important as the content that we look at. For this assignment, I am interested in exploring the types of media that are most frequently associated with the topic of our oceans. On a conceptual level, I am investigating how public imaginaries of the ocean are structured through media. On a more practical level, I hope to answer the following by engaging with the Seattle Public Library dataset:

1. What forms of media (item types) do people check out from the Seattle Public Library most in relation to the topic of the oceans/sea?
2. How does that blend of media types change over time?
3. Relatedly, how does engagement with the most popular media texts related to oceans change over a decade (2009-2019)?

Assignment is attached below.
Attachments
Han_Assignment 1.pdf
(622.34 KiB) Downloaded 135 times

erinpwoo
Posts: 3
Joined: Wed Jan 08, 2020 11:02 am

Re: Proj 1 - Data Mining, Knowledge Discovery

Post by erinpwoo » Wed Jan 15, 2020 6:32 pm

I thought it would be very interesting to see how Michael Jackson's reputation and popularity have evolved since the start of his career, or at least since the start of the SPL database in 2006. My question of interest asks how various controversies and the death of Michael Jackson influenced the popularity of his music and the type of media published about him over time. There is no doubt that Michael Jackson’s music has had an extremely positive influence on the music industry as a whole, but do people separate the art from the artist in times of controversy?
Attachments
Assignment 1 Writeup.pdf
(1.07 MiB) Downloaded 134 times

jingxuan
Posts: 3
Joined: Wed Jan 08, 2020 11:00 am

Re: Proj 1 - Data Mining, Knowledge Discovery

Post by jingxuan » Wed Jan 15, 2020 8:46 pm

I am interested in researching which jobs have been trending over the last few years.
Attachments
Jingxuan_assignment1.pdf
(147.18 KiB) Downloaded 120 times

dongyumeng
Posts: 1
Joined: Wed Jan 08, 2020 11:01 am

Re: Proj 1 - Data Mining, Knowledge Discovery

Post by dongyumeng » Wed Jan 15, 2020 10:03 pm

In this assignment I tried to find out whether there is a correlation between the theme of the books people borrow and the time of day at which they are borrowed. For example, is it true that people tend to borrow technical books in the morning and novels at night? Questions like this give insight into the activity patterns of different readers or even professions.
Attachments
dongyu_meng_mat259_hw1.pdf
(458.49 KiB) Downloaded 132 times

chuanxiuyue
Posts: 3
Joined: Wed Jan 08, 2020 10:53 am

Re: Proj 1 - Data Mining, Knowledge Discovery

Post by chuanxiuyue » Wed Jan 15, 2020 11:53 pm

In this assignment, I investigated how the HBO TV series "Game of Thrones" influenced people's interest in the fantasy novels "A Song of Ice and Fire", from which the TV series was adapted.
Attachments
MAT259-Assignment1-HE-20Winter.pdf
(7.93 MiB) Downloaded 121 times

yuleiyuan
Posts: 3
Joined: Wed Jan 08, 2020 10:50 am

Re: Proj 1 - Data Mining, Knowledge Discovery

Post by yuleiyuan » Thu Jan 16, 2020 5:55 am

My topic of interest is what is popular in astronomy, as reflected in the related item types in the library. I will first explore what’s popular in different item types. The next step is to see the most checked-out works within a specific “sub-Dewey” class (525), and the last exploration is the popularity over time of a classic book on cosmology, A Brief History of Time.
Attachments
Universe_HW1_MAT259.pdf
(16.66 MiB) Downloaded 115 times

boningdong
Posts: 3
Joined: Thu Jan 09, 2020 4:46 pm

Re: Proj 1 - Data Mining, Knowledge Discovery

Post by boningdong » Thu Jan 16, 2020 10:18 am

I am a game lover, so in this assignment I want to investigate the popularity of game-development topics. I also want to figure out where to start as a beginner in the game development field, so I plan to find the most popular material in the Seattle Public Library to answer my question.
Boning - MySQL Assignment.pdf
MySQL Assignment
(437.71 KiB) Downloaded 116 times

ziyanlin
Posts: 4
Joined: Wed Jan 08, 2020 10:55 am

Re: Proj 1 - Data Mining, Knowledge Discovery

Post by ziyanlin » Thu Jan 16, 2020 12:09 pm

I would like to use data from SPL to trace the trend of the iPhone, based on its popularity among media items and readers. By comparing this with the product's actual sales data, I can draw conclusions about the current situation.
proj1.pdf
(245.44 KiB) Downloaded 111 times
