Proj 1 - Data Mining, Knowledge Discovery
Posted: Mon Dec 30, 2019 5:53 pm
Proj 1 - Data Mining, Knowledge Discovery
Tues - January 07, 2019 - Introduction to MySQL
Thur - January 09, 2019 - Review MySQL Assignment (this)
Tues - January 14, 2019 - MySQL discussion and examples
Thur - January 16, 2019 - Student presentation
DATA MINING / KNOWLEDGE DISCOVERY:
The assignment is to explore the Seattle Public Library database for which we have hourly checkouts (and returns) of books, cds, dvds, and other items from the Seattle Public Library (SPL) since January 1, 2006. Our approach is "Knowledge Discovery in Databases" or KDD or "Data mining". The goal is the discovery and extraction of patterns of interest to gain knowledge. On a broad level, the database provides a historical overview of cultural interests of the Seattle downtown community but it can also be seen as a sampling of North American cultural interests over a fifteen year period. For those interested in the structure of the database, there is much to explore in discovering how the data is organized, or anomalies in the system. Please review the discussion at: https://www.mat.ucsb.edu/~g.legrady/aca ... _notes.pdf
There are approx. 97 million datasets in the database representing 15 year's of activity. The data consists of multivariate data and is
classified in two main categories. The library uses the Dewey Decimal system by which to organize non-fiction items: https://en.wikipedia.org/wiki/List_of_D ... al_classes but over 50% of the items in the library collection are fiction, and therefore are coded as non-Dewey items which do not have numeric systematic classification. For instance, music, movies, seem to be distributed in both Dewey and non-Dewey ways of classification. The most popular Dewey items can be viewed hourly at: http://128.111.26.109/parsing/index.php ... &d=05&h=17 The top most popular items are comic books, cookbooks, health, travel books, etc.
--------------------------------------------------------------
MYSQL ASSIGNMENT - THE CONCEPT: In any database, there lies hidden knowledge. What does a database contain, and what can MySQL queries reveal? Your first assignment is to find something of interest based on your own cultural / knowledge interests. Here are some options:
1) Topics of Cultural Content:
. What is the performance of a topic, a title, - a media over time, or by volume, or by trends, or by news events?
. How does the collection change over time?
2) The Database Organizational Structure
. What is the classification approach that the library uses
. What anomalies, errors, outliers, illogical classification methods, etc. are hidden and can be revealed.
(All databases have these as its impossible to precisely classify all things within a structured system)
3) Data Analytics Query Methods
. Explore statistical methods or algorithms to retrieve or process data
--------------------------------------------------------------
GETTING THE DATA: Use the MySQL Workbench to write a query by which to retrieve the data from the SPL database. Here are the guidelines of how to run the query: https://www.mat.ucsb.edu/~g.legrady/aca ... LayOut.pdf
Use the spl_2016 database which gets updated daily. Become familiar with the metadata that describes each checkout: https://www.mat.ucsb.edu/~g.legrady/aca ... ataDef.pdf
Decide which metadata you need/want - these will become columns in the csv file. "Select *" from the inraw or outraw tables will get you 11 columns. They are:
Categorical:
bibNumber: Defined by Library of Congress - multiple copies of the same title have the same bibnumber
collcode: Collection type, name and physical branch location
itemtype: What media (book, cd, dvd, etc.)
callNumber: Multiple copies of same item will share same call number but have different barcodes and itemNumber
Date:
cout: checkout date/timestamp
cin: checkin date/timestamp
Numeric / Time Series:
ID: The numeric location in the database
itemNumber: The items' acquisition date
DeweyClass: "" (null) if not Dewey, otherwise https://en.wikipedia.org/wiki/Dewey_Dec ... sification
Text:
title: title of items
subj: Keywords
In the example below, the query will return 4 vertical columns consisting of all the different bibnumbers (17 have been found); their itemtype (book, dvd, etc.); item's titles; and the total count for each bibnumber sequenced from most to least.
SELECT
bibNumber, itemType, title, COUNT(bibNumber) AS Counts
FROM
spl_2016.inraw
WHERE
title LIKE '%Blade Runner%'
GROUP BY bibNumber , itemType , title
ORDER BY Counts DESC
--------------------------------------------------------------
DO THE ASSIGNMENT:
Do a MySQL query of the Seattle database "spl_2016" with the approach of "knowledge discovery". Provide a question of interest and describe the steps in your exploration. Provide MySQL queries and results. You can finalize with an analysis commentary.
The assignment does not require visualization but consider that for the 3D and your final project you will need to have 4 columns of numeric values, for instance, time-based, or ordered data (itemNumber, dewey, volume, etc.) These will be used for horizontal, vertical, depth positions, and the 4th to define color density of the positioned item.
Once you have all the material - click on "POST REPLY" to this link and add your info to complete the assignment as a pdf.
--------------------------------------------------------------
Case Study of a Topic: The Sci-fi Movie Blade Runner
Rodger Liu assembled a case study a few years ago that can be used as an example: https://www.mat.ucsb.edu/~g.legrady/aca ... Report.pdf
--------------------------------------------------------------
View previous student examples in the student forum: viewtopic.php?f=75&t=303#p2015 (2018) and viewtopic.php?f=77&t=313 (2019)
A few interesting ones from last year:
Chantal Nguyen was interested in the performance of cook books: viewtopic.php?f=77&t=313#p2127
Sandy Schoettler was interested in a statistical analysis: viewtopic.php?f=77&t=313#p2128
--------------------------------------------------------------
Tues - January 07, 2019 - Introduction to MySQL
Thur - January 09, 2019 - Review MySQL Assignment (this)
Tues - January 14, 2019 - MySQL discussion and examples
Thur - January 16, 2019 - Student presentation
DATA MINING / KNOWLEDGE DISCOVERY:
The assignment is to explore the Seattle Public Library database for which we have hourly checkouts (and returns) of books, cds, dvds, and other items from the Seattle Public Library (SPL) since January 1, 2006. Our approach is "Knowledge Discovery in Databases" or KDD or "Data mining". The goal is the discovery and extraction of patterns of interest to gain knowledge. On a broad level, the database provides a historical overview of cultural interests of the Seattle downtown community but it can also be seen as a sampling of North American cultural interests over a fifteen year period. For those interested in the structure of the database, there is much to explore in discovering how the data is organized, or anomalies in the system. Please review the discussion at: https://www.mat.ucsb.edu/~g.legrady/aca ... _notes.pdf
There are approx. 97 million datasets in the database representing 15 year's of activity. The data consists of multivariate data and is
classified in two main categories. The library uses the Dewey Decimal system by which to organize non-fiction items: https://en.wikipedia.org/wiki/List_of_D ... al_classes but over 50% of the items in the library collection are fiction, and therefore are coded as non-Dewey items which do not have numeric systematic classification. For instance, music, movies, seem to be distributed in both Dewey and non-Dewey ways of classification. The most popular Dewey items can be viewed hourly at: http://128.111.26.109/parsing/index.php ... &d=05&h=17 The top most popular items are comic books, cookbooks, health, travel books, etc.
--------------------------------------------------------------
MYSQL ASSIGNMENT - THE CONCEPT: In any database, there lies hidden knowledge. What does a database contain, and what can MySQL queries reveal? Your first assignment is to find something of interest based on your own cultural / knowledge interests. Here are some options:
1) Topics of Cultural Content:
. What is the performance of a topic, a title, - a media over time, or by volume, or by trends, or by news events?
. How does the collection change over time?
2) The Database Organizational Structure
. What is the classification approach that the library uses
. What anomalies, errors, outliers, illogical classification methods, etc. are hidden and can be revealed.
(All databases have these as its impossible to precisely classify all things within a structured system)
3) Data Analytics Query Methods
. Explore statistical methods or algorithms to retrieve or process data
--------------------------------------------------------------
GETTING THE DATA: Use the MySQL Workbench to write a query by which to retrieve the data from the SPL database. Here are the guidelines of how to run the query: https://www.mat.ucsb.edu/~g.legrady/aca ... LayOut.pdf
Use the spl_2016 database which gets updated daily. Become familiar with the metadata that describes each checkout: https://www.mat.ucsb.edu/~g.legrady/aca ... ataDef.pdf
Decide which metadata you need/want - these will become columns in the csv file. "Select *" from the inraw or outraw tables will get you 11 columns. They are:
Categorical:
bibNumber: Defined by Library of Congress - multiple copies of the same title have the same bibnumber
collcode: Collection type, name and physical branch location
itemtype: What media (book, cd, dvd, etc.)
callNumber: Multiple copies of same item will share same call number but have different barcodes and itemNumber
Date:
cout: checkout date/timestamp
cin: checkin date/timestamp
Numeric / Time Series:
ID: The numeric location in the database
itemNumber: The items' acquisition date
DeweyClass: "" (null) if not Dewey, otherwise https://en.wikipedia.org/wiki/Dewey_Dec ... sification
Text:
title: title of items
subj: Keywords
In the example below, the query will return 4 vertical columns consisting of all the different bibnumbers (17 have been found); their itemtype (book, dvd, etc.); item's titles; and the total count for each bibnumber sequenced from most to least.
SELECT
bibNumber, itemType, title, COUNT(bibNumber) AS Counts
FROM
spl_2016.inraw
WHERE
title LIKE '%Blade Runner%'
GROUP BY bibNumber , itemType , title
ORDER BY Counts DESC
--------------------------------------------------------------
DO THE ASSIGNMENT:
Do a MySQL query of the Seattle database "spl_2016" with the approach of "knowledge discovery". Provide a question of interest and describe the steps in your exploration. Provide MySQL queries and results. You can finalize with an analysis commentary.
The assignment does not require visualization but consider that for the 3D and your final project you will need to have 4 columns of numeric values, for instance, time-based, or ordered data (itemNumber, dewey, volume, etc.) These will be used for horizontal, vertical, depth positions, and the 4th to define color density of the positioned item.
Once you have all the material - click on "POST REPLY" to this link and add your info to complete the assignment as a pdf.
--------------------------------------------------------------
Case Study of a Topic: The Sci-fi Movie Blade Runner
Rodger Liu assembled a case study a few years ago that can be used as an example: https://www.mat.ucsb.edu/~g.legrady/aca ... Report.pdf
--------------------------------------------------------------
View previous student examples in the student forum: viewtopic.php?f=75&t=303#p2015 (2018) and viewtopic.php?f=77&t=313 (2019)
A few interesting ones from last year:
Chantal Nguyen was interested in the performance of cook books: viewtopic.php?f=77&t=313#p2127
Sandy Schoettler was interested in a statistical analysis: viewtopic.php?f=77&t=313#p2128
--------------------------------------------------------------