Proj 1 - MySQL & Knowledge Discovery in the SPL Database
Posted: Thu Dec 31, 2020 4:11 pm
Post a one-paragraph description of your SQL data search project in the student forum and attach a PDF that documents the work you did. The PDF should include the following:
. The 1 paragraph description
. Concept description
. MySQL Query
. Data results
. Discussion/Analysis of results
----------------------------------
SCHEDULE
Tues - January 05, 2021 - Course Overview & Introduction to MySQL
Thur - January 07, 2021 - Review MySQL examples
Tues - January 12, 2021 - Data Processing steps
Thur - January 14, 2021 - Student presentation
----------------------------------
Data Exploration & Knowledge Discovery through a large multivariate dataset.
The first assignment is to explore a large multivariate database with MySQL, with the intent to discover and extract patterns that may reveal something of interest. SQL is the standardized language used to access the database: https://www.mysqltutorial.org/what-is-mysql/
We have access to a unique database of checkouts of books, CDs, and DVDs, documented by the hour since January 1, 2006. It represents the aggregated cultural interests of downtown Seattle, and also reflects larger national interests. The database currently consists of over 98 million checkouts and returns.
----------------------------------
MYSQL ASSIGNMENT - THE CONCEPT: In any database, there lies hidden knowledge. What does a database contain, and what can MySQL queries reveal? Your first assignment is to find something of interest based on your own interests and skillsets. Here are some options:
1) Patterns, Probability & Prediction
. How do topics, titles, or media types perform over time, by volume, or by trend?
. What can be predicted based on previous performance?
. Can we predict how a sequence of data will change over time?
2) The Database Organizational Structure
. What anomalies, errors, outliers, illogical classification methods, etc. may be revealed in the organization and classification of how items are encoded?
(All databases have outliers, anomalies, and errors in the system, as it is impossible to precisely classify all things within a structured form)
3) Covid Situation
. 2020 has had an unprecedented impact on the database, as the library was closed for over 5 months between March and September. Nonetheless, items were circulating.
. Electronic books have been in the collection since 2009 but these are not recorded in the database we receive. Nonetheless they can be reviewed through other means:
4) Data Analytics Query Methods
. Explore statistical methods or algorithms to retrieve or process data
. Are there any machine-learning opportunities in analyzing the data?
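Options 1 and 3 both start from the same kind of query: count checkouts per time bucket and look at how the counts change. Here is a minimal sketch of that idea. It uses Python's built-in sqlite3 as a stand-in for the course MySQL server so it can run anywhere. The table name `checkouts`, the column names `title`, `itemType`, and `cout`, and the sample rows are all assumptions made up for illustration, not the real schema or real data.

```python
import sqlite3

# Stand-in for the course MySQL server: an in-memory SQLite database.
# Table name, column names, and rows are hypothetical, not the real SPL schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkouts ("
    "id INTEGER PRIMARY KEY, title TEXT, itemType TEXT, cout TEXT)"
)
sample_rows = [
    ("Book A", "book", "2020-01-15 10:05:00"),
    ("Book B", "book", "2020-01-20 14:30:00"),
    ("Disc C", "dvd", "2020-02-03 09:00:00"),
    ("Book A", "book", "2020-04-11 16:45:00"),
    ("Book B", "book", "2019-12-30 11:00:00"),
]
conn.executemany(
    "INSERT INTO checkouts (title, itemType, cout) VALUES (?, ?, ?)",
    sample_rows,
)

# Count checkouts per month in 2020. SUBSTR on the timestamp text is used
# because it behaves the same in SQLite and MySQL; on a MySQL DATETIME column
# one would more typically group by YEAR(cout) and MONTH(cout).
query = """
SELECT SUBSTR(cout, 1, 7) AS ym, COUNT(*) AS checkouts
FROM checkouts
WHERE cout >= '2020-01-01' AND cout < '2021-01-01'
GROUP BY ym
ORDER BY ym
"""
monthly_counts = conn.execute(query).fetchall()
for ym, n in monthly_counts:
    print(ym, n)
```

Months with no checkouts simply do not appear in the result, so a closure period would show up as missing or very small rows rather than as zeros.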
----------------------------------
The database consists of multivariate data. For each checkout there exists the following metadata:
Ordinal (In a numeric sequence)
ID: Assigned by the database to keep track of each entry
ItemNumber: Assigned by the library when an object enters the system
Dewey Classification (Dewey numeric): The item's Dewey classification, if it is recorded as a Dewey (non-fiction) item
Interval Scale (Time-Stamp)
Check-out/check-in by minute, hour, day, month, year
Categorical (Not necessarily numerically orderable)
BibNumber: Each title has a specific number; all copies of a title share the same number. Defined by the Library of Congress
Barcode: Each item has a unique number on its RFID sticker
CallNumber: Used to locate items on shelves. Ordinal if Dewey, otherwise categorical. Multiple copies of the same item may share the same call number but have different barcodes and ItemNumbers
CollCode: What the item is and where it is located: https://data.seattle.gov/Community/Libr ... /6vkj-f5xf
Semantic (Text-based)
Title: Each item has a title
ItemType: books, CDs, DVDs, music sheets, etc.
Subjects: Keywords (arbitrary labeling). These are located in a separate database:
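The distinction between a title (BibNumber) and a physical copy (Barcode, ItemNumber) matters when counting. The sketch below shows one way to separate the two, again using Python's sqlite3 as a stand-in for MySQL. The table layout, the lower-camel-case column names, and the rows are hypothetical and only mirror the field list above.

```python
import sqlite3

# Hypothetical table modeled on the field list above; not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE checkouts (
        id INTEGER PRIMARY KEY,
        itemNumber INTEGER,
        bibNumber INTEGER,
        barcode TEXT,
        title TEXT,
        itemType TEXT,
        cout TEXT
    )
    """
)
sample_rows = [
    (1001, 500, "B-0001", "Title One", "book", "2019-03-01 10:00:00"),
    (1002, 500, "B-0002", "Title One", "book", "2019-03-02 11:00:00"),
    (1001, 500, "B-0001", "Title One", "book", "2019-04-05 12:00:00"),
    (2001, 600, "B-0003", "Title Two", "dvd", "2019-03-03 13:00:00"),
]
conn.executemany(
    "INSERT INTO checkouts "
    "(itemNumber, bibNumber, barcode, title, itemType, cout) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    sample_rows,
)

# Group by title (bibNumber): COUNT(DISTINCT barcode) gives the number of
# physical copies seen, COUNT(*) gives the total number of checkouts.
query = """
SELECT bibNumber, title,
       COUNT(DISTINCT barcode) AS copies,
       COUNT(*) AS checkouts
FROM checkouts
GROUP BY bibNumber, title
ORDER BY checkouts DESC
"""
per_title = conn.execute(query).fetchall()
for row in per_title:
    print(row)
```

Grouping by bibNumber rather than by barcode avoids splitting a popular title across its copies.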
----------------------------------
The library uses the Dewey Decimal system to organize non-fiction items, but the majority of the items in the library collection do not have Dewey classification labels. For instance, music and movies seem to be distributed across both Dewey and non-Dewey classifications. The most popular Dewey categories tend to be comic books, cookbooks, health, travel books, etc. A daily insight into Dewey performance can be tracked at: http://128.111.26.109/parsing/index.php ... &d=02&h=12
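A quick way to check the Dewey versus non-Dewey split is to count rows with and without a Dewey value, then group the Dewey rows by their leading digit (the hundreds class). The sketch below assumes a hypothetical `deweyClass` text column that is NULL for non-Dewey items. How missing values are actually stored in the course database may differ, so check that first. Python's sqlite3 stands in for MySQL, and the rows are made up.

```python
import sqlite3

# Hypothetical table; deweyClass is assumed to be NULL for non-Dewey items.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkouts ("
    "id INTEGER PRIMARY KEY, title TEXT, deweyClass TEXT)"
)
sample_rows = [
    ("Cookbook X", "641.5"),
    ("Cookbook Y", "641.59"),
    ("Comic Z", "741.5"),
    ("Travel W", "917.97"),
    ("Feature film", None),
    ("Novel", None),
]
conn.executemany(
    "INSERT INTO checkouts (title, deweyClass) VALUES (?, ?)", sample_rows
)

# COUNT(column) skips NULLs, so the difference from COUNT(*) is the
# number of checkouts without a Dewey classification.
split_query = """
SELECT COUNT(deweyClass) AS with_dewey,
       COUNT(*) - COUNT(deweyClass) AS without_dewey
FROM checkouts
"""
with_dewey, without_dewey = conn.execute(split_query).fetchone()

# The first character of a Dewey number is its hundreds class
# (6 = technology incl. cooking, 7 = arts, 9 = history and geography).
class_query = """
SELECT SUBSTR(deweyClass, 1, 1) AS hundreds_digit, COUNT(*) AS checkouts
FROM checkouts
WHERE deweyClass IS NOT NULL
GROUP BY hundreds_digit
ORDER BY checkouts DESC, hundreds_digit
"""
by_class = conn.execute(class_query).fetchall()
print(with_dewey, without_dewey, by_class)
```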
---------------------------------
Label your Documents
Please label your documents, such as CSV files, with the name of your project or your own name so we can identify where they come from.
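One possible way to keep file names consistent is to build them from the project name and your name before exporting results. This is only a sketch: the helper `labeled_filename`, the naming pattern, and the sample values are hypothetical, not a required convention.

```python
import csv
import io
import re


def labeled_filename(project, author, suffix="csv"):
    """Build a file name from the project and author names."""

    def slug(text):
        # Lowercase, and replace runs of non-alphanumerics with underscores.
        return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")

    return f"{slug(project)}_{slug(author)}.{suffix}"


filename = labeled_filename("Covid DVD Trends", "Jane Doe")

# Write query results as CSV. An in-memory buffer is used here so the sketch
# has no side effects; swap it for open(filename, "w", newline="") to save.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["month", "checkouts"])
writer.writerow(["2020-01", 2])
csv_text = buffer.getvalue()
print(filename)
```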