wk8 - 11.15.22 Outliers

glegrady
Posts: 203
Joined: Wed Sep 22, 2010 12:26 pm

wk8 - 11.15.22 Outliers

Post by glegrady » Fri Sep 16, 2022 7:59 am

11.15.22 Outliers

For this assignment, we are exploring the process of identifying outliers. All databases have outliers: data that somehow did not fit into any category or was incorrectly classified. Your task, which I hope you will find creative, is to identify what may be outliers in the Seattle Library database.

Karl Yerkes, MAT lecturer, made a chart some years ago about errors in the ItemNumber sequencing: https://www.mat.ucsb.edu/~g.legrady/aca ... onCode.png

He states: itemNumber and bibNumber are auto-incrementing database keys in the SPL LIS. Whenever an item is added to the library, the item is assigned a new itemNumber by adding 1 to the last, largest known itemNumber (same with bibNumber for brand new titles). We can analyse these keys to get information about that system. In particular, we can estimate the rate of acquisition of new materials by determining the slope of the plot of check out time versus itemNumber (or bibNumber). We can estimate when big events happened by investigating the gaps in the data on this plot.
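A minimal sketch of the slope idea, not Karl's actual method: it assumes we have already pulled one (itemNumber, first checkout time) pair per item, and it uses the first checkout as a rough stand-in for the acquisition date. The function name, input shape, and sample values are all hypothetical.

```python
from datetime import datetime

def acquisition_rate(first_checkouts):
    """Least-squares slope of itemNumber against first-checkout time.

    first_checkouts: list of (itemNumber, datetime) pairs (assumed input shape).
    Returns the estimated number of new itemNumbers issued per day.
    """
    t0 = min(ts for _, ts in first_checkouts)
    # x = days since the earliest checkout, y = itemNumber
    xs = [(ts - t0).total_seconds() / 86400.0 for _, ts in first_checkouts]
    ys = [float(n) for n, _ in first_checkouts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up sample: 100 new keys every 10 days
sample = [
    (1000, datetime(2022, 1, 1)),
    (1100, datetime(2022, 1, 11)),
    (1200, datetime(2022, 1, 21)),
    (1300, datetime(2022, 1, 31)),
]
rate = acquisition_rate(sample)
print(f"{rate:.1f} items per day")
```

Fitting itemNumber as a function of time gives items per day directly; fitting the other way around (time versus itemNumber, as in the plot) gives days per item, which is the reciprocal when the points lie on a line.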

Question: What proportion of items have never been checked out? (i.e., Which are the loneliest items?) Because itemNumber is an auto-incrementing key at the SPL LIS, we only see certain keys (the ones that get checked out) and not others (the ones that never get checked out) in our database but we can estimate the percentage of “lonely” items.
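One rough way to estimate the "lonely" share, as a sketch only: it assumes every key between the smallest and largest itemNumber we see was actually issued to a real item, which may not hold if keys were skipped or items were withdrawn. The function name and sample numbers are hypothetical.

```python
def lonely_fraction(seen_item_numbers):
    """Estimate the fraction of itemNumbers that never appear in the checkout data.

    Assumption: all keys in [min, max] of the observed itemNumbers were issued.
    """
    seen = set(seen_item_numbers)
    lo, hi = min(seen), max(seen)
    total_keys = hi - lo + 1  # how many keys should exist in that range
    return 1.0 - len(seen) / total_keys

# Made-up sample: 6 of the 10 keys between 100 and 109 were checked out
observed = [100, 101, 103, 104, 107, 109, 101, 100]
print(f"{lonely_fraction(observed):.0%} of keys never checked out")
```

The estimate is only as good as the assumption about the key range, so it is best read as an upper bound on the share of lonely items.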

--

Look online for tips on how to best explore data that are outliers. For instance:
https://dataschool.com/how-to-teach-peo ... -with-sql/

--
Post your results here. I am traveling to a conference, so we will need to set up individual meeting times. I will be 9 hours ahead, so the earlier your time, the better for me.
George Legrady
legrady@mat.ucsb.edu

briannagriffin
Posts: 11
Joined: Fri Sep 23, 2022 10:04 am

Re: wk8 - 11.15.22 Outliers

Post by briannagriffin » Tue Nov 15, 2022 11:23 am

For this assignment, I queried for outliers in two different groups of data from the SPL database: checkouts of rock CDs and horror DVDs. It is valuable to detect outliers within a data set because these observations differ significantly from the majority. They have a heavy impact on statistics like the average and the standard deviation, which we commonly rely on to explain large sets of data. Also, detecting outliers can lead to finding anomalies or problems within the database that are important to catch. In my analysis, I found outliers by assuming normality of the data and looking for data that was more than three standard deviations from the mean in either direction. This led to interesting results and conclusions.
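For readers who want to try the three-standard-deviation rule outside SQL, here is a minimal sketch. It is not the query from the PDF; the function name, the dictionary of checkout counts, and the titles are made up for illustration.

```python
from statistics import mean, pstdev

def three_sigma_outliers(counts, k=3.0):
    """Return the entries whose count lies more than k standard deviations from the mean.

    counts: dict mapping a title to its checkout count (assumed input shape).
    """
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return {}  # all counts identical, nothing can be an outlier
    return {title: c for title, c in counts.items() if abs(c - mu) > k * sigma}

# Made-up sample: 19 ordinary titles and one very popular one
counts = {f"title_{i}": 10 for i in range(19)}
counts["hit_album"] = 500
print(three_sigma_outliers(counts))
```

Checkout counts are usually skewed rather than normal, so in practice the rule tends to flag only the high side; a log transform before applying it is one common adjustment.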

Here is the assignment PDF which includes my queries, analysis, and conclusion:
Week 8_ Outliers.pdf
(2.89 MiB) Downloaded 39 times
Here are the output CSV files:
rock_CD_outliers.csv
(3.54 KiB) Downloaded 35 times
outliers_horror_movies.csv
(1.06 KiB) Downloaded 30 times

ilianikiforov
Posts: 8
Joined: Tue Oct 04, 2022 10:24 am

Re: wk8 - 11.15.22 Outliers

Post by ilianikiforov » Tue Nov 15, 2022 3:03 pm

In this report, I focus on entries with incorrectly classified check-in times (earlier than check-out times). I explore overall yearly trends in those anomalies, use cross tabs to classify them by both check-in and check-out, identify the most extreme cases with the largest discrepancies, and investigate cases with both check-in time and check-out time classified incorrectly.
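A small sketch of the basic check described above, assuming each record carries a check-out and a check-in timestamp. The field names cout and cin, the function name, and the sample rows are hypothetical and are not taken from the attached report.

```python
from collections import Counter
from datetime import datetime

def checkin_before_checkout(records):
    """Return records whose check-in precedes check-out, largest discrepancy first."""
    bad = [r for r in records if r["cin"] < r["cout"]]
    return sorted(bad, key=lambda r: r["cout"] - r["cin"], reverse=True)

# Made-up sample rows
records = [
    {"itemNumber": 1, "cout": datetime(2010, 5, 1, 10), "cin": datetime(2010, 5, 8, 9)},
    {"itemNumber": 2, "cout": datetime(2010, 6, 1, 10), "cin": datetime(2010, 5, 31, 10)},
    {"itemNumber": 3, "cout": datetime(2012, 3, 4, 12), "cin": datetime(2011, 3, 4, 12)},
]
anomalies = checkin_before_checkout(records)

# Yearly trend: number of anomalies grouped by the year of check-out
per_year = Counter(r["cout"].year for r in anomalies)
for r in anomalies:
    print(r["itemNumber"], r["cout"] - r["cin"])
```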
Attachments
top_items_with_cin_cout_errors.csv
(548 Bytes) Downloaded 32 times
test_vs_normal_items.csv
(52 Bytes) Downloaded 33 times
cin_less_cout_years.csv
(176 Bytes) Downloaded 34 times
cin_less_cout_hours.csv
(560 Bytes) Downloaded 31 times
cin_less_cout_days.csv
(175 Bytes) Downloaded 30 times
Assignment 6.pdf
(205.87 KiB) Downloaded 38 times

shaokang
Posts: 8
Joined: Fri Sep 23, 2022 10:07 am

Re: wk8 - 11.15.22 Outliers

Post by shaokang » Wed Nov 16, 2022 11:54 pm

For this week's assignment, I tried to find outliers of different kinds.

* Using the standard deviation of checkout times to find the most and least popular items within the CD category.
* Using both purchase number and checkout times as indicators of popularity, and applying algorithms to find the outliers.
* Since itemNumber is auto-incremented when an item enters the library, this attribute should be consecutive. I want to find out whether the data follows this pattern. If not, what does the distribution look like? What proportion of items never appear in the database?
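The consecutiveness check in the last bullet could be sketched as follows. This is an illustration under the assumption that we have a plain list of the itemNumbers that appear in the database; it is not the code from the attached notebook, and the function name and sample values are made up.

```python
def find_gaps(item_numbers):
    """Return (first_missing, last_missing) ranges between consecutive observed itemNumbers."""
    ordered = sorted(set(item_numbers))
    return [(a + 1, b - 1) for a, b in zip(ordered, ordered[1:]) if b - a > 1]

# Made-up sample of observed itemNumbers
observed_items = [1, 2, 3, 7, 8, 12]
gaps = find_gaps(observed_items)
missing = sum(end - start + 1 for start, end in gaps)
print(gaps, missing)
```

The lengths of the gaps give the distribution asked about, and the total number of missing keys divided by the size of the key range gives the proportion of items that never appear.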

Some visualizations:
https://tva1.sinaimg.cn/large/008vxvgGg ... 0moacm.jpg
https://tva1.sinaimg.cn/large/008vxvgGg ... 0u00yc.jpg

The Python files have been zipped for uploading.
Week 08 Outliers.pdf
(602.85 KiB) Downloaded 37 times
analysis.ipynb.zip
(282.85 KiB) Downloaded 35 times
CD_popularity.csv
(2.86 MiB) Downloaded 33 times
CD_popularity_2D.csv
(3.02 MiB) Downloaded 30 times
bibNumber_itemNumber_dist.csv
(2.11 MiB) Downloaded 34 times

nataliadubon
Posts: 15
Joined: Tue Mar 29, 2022 3:30 pm

Re: wk8 - 11.15.22 Outliers

Post by nataliadubon » Fri Nov 18, 2022 12:03 pm

Abstract
This week's assignment calls for us to explore outliers in the Seattle Library database. For this project, I decided to conduct a statistical experiment that would allow me to search for outliers within a database and to test statistically whether or not a given outlier has a negative influence on the overall scope of the data and the regression model. My research relies heavily on statistical approaches that go beyond calculating the standard deviations of the dataset, but these are explained in simple terms throughout this paper to make the analysis behind these methods easier to follow.

All queries and CSV files are attached within the document.
Attachments
Week 8_ Finding Outliers in Data.pdf
(902.38 KiB) Downloaded 40 times
