Proj 3:Data Correlation

glegrady
Posts: 160
Joined: Wed Sep 22, 2010 12:26 pm

Proj 3:Data Correlation

Post by glegrady » Tue Jan 14, 2014 3:31 pm

The assignment is to correlate SPL data with an external dataset such as the NY Times, RottenTomatoes, Twitter, etc.
How interesting your project is will depend on the question you ask and on how you visualize the two datasets to show their similarities and differences. We are not looking for anything "fancy", but for something that addresses the question of how two datasets can be compared with each other.

Project Schedule:
2.11 Introduce project
2.13 NY Times and Twitter Demo
2.18 Your concept discussion in class
2.20 Project Presentation
George Legrady
legrady@mat.ucsb.edu

currier
Posts: 4
Joined: Tue Jan 14, 2014 11:50 am

Re: Proj 3:Data Correlation

Post by currier » Thu Feb 20, 2014 3:30 pm

Project 3: Correlation of two data sets (final)

Concept
The word “olympic”, like many words, can have multiple meanings. Natural language processing techniques such as sentiment analysis attempt to infer—through computation—meaning from text, but these techniques are complex to understand and apply properly. One relatively simple way to explore differences in how a term is used is to look at the words that accompany it and let the human brain infer the meanings. In this visualization I will use word clouds to visualize the terms that accompany the term “olympic” in both SPL and New York Times article titles. The visualization will enable one to explore how time and geography contribute to the context and use of the word.
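The counting step behind such a word cloud could look roughly like the following. This is only a minimal sketch, not the code in the attached zip: the function name `companion_terms`, the short stop-word list, and the sample titles are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "on", "at"}

def companion_terms(titles, target="olympic"):
    """Count the words that share a title with `target`."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z']+", title.lower())
        # Prefix match, so "olympic" and "olympics" both qualify.
        if not any(w.startswith(target) for w in words):
            continue
        for w in words:
            if w.startswith(target) or w in STOP_WORDS:
                continue
            counts[w] += 1
    return counts

# Made-up titles standing in for SPL or NYT records.
sample_titles = [
    "Olympic National Park: a hiking guide",
    "The Olympic Games in Vancouver",
    "Hiking the Olympic Peninsula",
    "Seattle coffee shops",
]
counts = companion_terms(sample_titles)
```

The resulting counts can be used directly as word sizes, computed once per source and per time period so the clouds can be compared.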
Attachments
currier_proj3_final_revised.pdf
(523.49 KiB) Downloaded 324 times
currier_proj3_images.zip
(262.31 KiB) Downloaded 191 times
currier_proj3_code.zip
(613.64 KiB) Downloaded 190 times
Last edited by currier on Thu Feb 27, 2014 11:02 pm, edited 2 times in total.

grant.mckenzie
Posts: 4
Joined: Tue Jan 14, 2014 11:47 am

Re: Proj 3:Data Correlation

Post by grant.mckenzie » Thu Feb 20, 2014 4:52 pm

With the increase in online sources of information, the role of libraries is changing. Sites like Wikipedia (ranked 6th in terms of web traffic) offer a plethora of information through crowd-sourced means. What is of interest is how exploration of subject matter through physical media differs from exploration through online content. While material “check-outs” from the Seattle Public Library are a biased subset of the use of physical media, they still offer insight into subject-matter interest.

For this project I propose to explore the correlation between visits to Wikipedia subject pages and SPL media check-out data related to that same subject. Comparing these “visits” and “check-outs” visually over time will allow one to see trends in the data as well as visually see any correlation that exists. Given that the Seattle Public Library is located in Seattle, I thought it might be interesting to look at a number of major businesses that call Seattle home.

Data

The companies/topics I chose to correlate Wikipedia and SPL data on are:
  • Microsoft
  • Boeing
  • Starbucks
  • The University of Washington
SPL
Extraction of data from the Seattle Public Library dataset involved one query per company. Each query looked for the stemmed company name as a keyword in the Title or the Subject of the check-out record. Additionally, the data was restricted to the years 2010 to 2013 inclusive. An example of such a query is:

SELECT d, count(*) AS cnt
FROM
(SELECT unix_timestamp(cout) AS d
FROM spl2.outraw
WHERE (title LIKE '%starbucks%' OR subj LIKE '%starbucks%')
AND year(cout) >= 2010 AND year(cout) < 2014) a
GROUP BY d
ORDER BY d ASC;

Wikipedia
The site “http://stats.grok.se” offers access to daily page-visit data for Wikipedia entries. For example, the URL “http://stats.grok.se/json/en/201212/starbucks” returns a JSON object of daily visits to the Starbucks Wikipedia page for the month of December 2012. A script was written to download and parse all daily visits from January 1, 2010 to December 31, 2013. A CSV file was produced for all subjects of interest.
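The download-and-parse step could be sketched as below. This is not the script from the attached zip. It assumes the JSON object carries a "daily_views" mapping of date to count, and the helper names and sample numbers are hypothetical. Fetching is left out so the sketch runs offline.

```python
import csv
import io
import json

def month_urls(subject, first_year=2010, last_year=2013):
    """Build one URL per month, following the URL pattern quoted above."""
    return [
        f"http://stats.grok.se/json/en/{year}{month:02d}/{subject}"
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]

def daily_rows(payload):
    """Flatten one monthly JSON payload into sorted (date, views) rows.

    Assumption: the payload has a "daily_views" mapping of date -> count.
    """
    views = json.loads(payload).get("daily_views", {})
    return sorted(views.items())

def to_csv(rows):
    """Write (date, views) rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "views"])
    writer.writerows(rows)
    return out.getvalue()

# Made-up payload for illustration; the view counts are not real data.
sample = (
    '{"title": "starbucks", "month": "201212", '
    '"daily_views": {"2012-12-02": 4100, "2012-12-01": 3900}}'
)
urls = month_urls("starbucks")
rows = daily_rows(sample)
csv_text = to_csv(rows)
```

In a full script, each URL from `month_urls` would be fetched and its rows appended to a single CSV file per subject.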

Doodle

This original doodle was designed using Countries as the subjects of interest. This has since changed to Companies based in Seattle, WA.

Design Decisions

I decided to represent the data as a bidirectional wave, or StreamGraph. The X-axis of the graph is time, labeled by year but built on daily Wikipedia page visits (first set) and daily SPL check-outs (second set). The Y-axis shows the volume of either page visits (first set) or Seattle Public Library media check-outs (second set). These volumes were collected daily but visually aggregated into monthly buckets. It is important to note that the underlying values for the Seattle Public Library are a full 1000 times smaller than those for Wikipedia. Scaling the two sets to approximately the same size allows the user to compare the proportions of one set to the other. Without this size adjustment, the user would not be able to see any of the Seattle Public Library check-out data. In addition to the StreamGraph, two pie charts on the right show the total percentages of page visits and media check-outs.
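The monthly bucketing and the size adjustment could be sketched as follows. This is an illustration under assumptions, not the project code: the factor here is derived by matching the peak months of the two series, whereas the actual visualization may simply use a fixed constant, and all numbers are made up.

```python
from collections import defaultdict
from datetime import date

def monthly_totals(daily):
    """Sum (date, count) pairs into {(year, month): total} buckets."""
    totals = defaultdict(int)
    for day, count in daily:
        totals[(day.year, day.month)] += count
    return dict(totals)

def scale_to_match(reference, other):
    """Factor that raises the peak of `other` to the peak of `reference`."""
    return max(reference.values()) / max(other.values())

# Made-up numbers for illustration only.
wiki_daily = [
    (date(2012, 12, 1), 4000),
    (date(2012, 12, 2), 6000),
    (date(2013, 1, 1), 5000),
]
spl_daily = [
    (date(2012, 12, 1), 4),
    (date(2012, 12, 2), 6),
    (date(2013, 1, 1), 5),
]

wiki_monthly = monthly_totals(wiki_daily)
spl_monthly = monthly_totals(spl_daily)
factor = scale_to_match(wiki_monthly, spl_monthly)
spl_scaled = {month: total * factor for month, total in spl_monthly.items()}
```

Because every SPL bucket is multiplied by the same factor, the shape of the SPL curve is preserved and only its height changes.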
Attachments
McKenzie_Project3.zip
Project Code
(25.24 KiB) Downloaded 195 times
McKenzie_proj3.pdf
Assignment Document
(480.86 KiB) Downloaded 205 times
McKenzie_proj3_screen.png
Screen shot of file product
Last edited by grant.mckenzie on Wed Feb 26, 2014 10:27 am, edited 2 times in total.

milrober
Posts: 4
Joined: Tue Jan 14, 2014 11:44 am

Re: Proj 3:Data Correlation

Post by milrober » Thu Feb 20, 2014 6:02 pm

I want to explore features of a song that may contribute to its popularity in Seattle. I will use the Echo Nest’s API to query the currently “hottest,” that is, most popular, songs. The Echo Nest has a database of over 35.3 million songs, along with certain features of each song such as tonality, tempo, and valence. After getting the most popular songs from the Echo Nest, I will query the Seattle Public Library (SPL) for the number of checkouts for albums containing those songs. The visualization will compare the popularity of an album in SPL to the features computed by the Echo Nest.

-Rob
Attachments
RobMiller_CorrelationFinal.docx
(16.37 KiB) Downloaded 185 times
Robert_Scan.PDF.zip
(2.02 MiB) Downloaded 196 times
RobMiller_CorrelationDoodle.docx
(104.62 KiB) Downloaded 186 times
Last edited by milrober on Tue Feb 25, 2014 3:06 pm, edited 1 time in total.

hellobuaazl
Posts: 4
Joined: Tue Jan 14, 2014 11:54 am

Re: Proj 3:Data Correlation

Post by hellobuaazl » Thu Feb 20, 2014 6:15 pm

I want to correlate the number of books checked out from the Seattle Public Library with the number of New York Times articles containing the keywords “China” and “Japan”, for the years 2005 to 2011.
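One simple way to put a number on such a correlation is the Pearson coefficient between the two yearly series. The sketch below is an assumption about how it could be done, not the method in the attached report, and the yearly counts are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)

# Made-up yearly counts for 2005..2011, for illustration only.
spl_checkouts = [120, 135, 150, 210, 190, 205, 230]
nyt_articles = [900, 950, 1100, 1600, 1400, 1500, 1700]
r = pearson(spl_checkouts, nyt_articles)
```

A value near 1 means the two series rise and fall together, a value near -1 means they move in opposite directions, and a value near 0 means no linear relationship.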
Attachments
correlation.zip
(6.81 KiB) Downloaded 200 times
Japan.csv
(1.02 KiB) Downloaded 201 times
China.csv
(1 KiB) Downloaded 209 times
Correlation report.pdf
(116.74 KiB) Downloaded 213 times
Last edited by hellobuaazl on Tue Feb 25, 2014 3:37 pm, edited 1 time in total.

mohithingorani
Posts: 5
Joined: Tue Jan 14, 2014 11:46 am

Re: Proj 3:Data Correlation

Post by mohithingorani » Thu Feb 20, 2014 8:59 pm

Concept:
Correlating data to me means making sense of two data sets at the same time, and understanding the dynamics between the two sets. For this assignment I am exploring the trends in the emerging fields of “Data & Big Data”. I will be using Article Search from the New York Times API and looking for the words ‘Data’ and a subset of it ‘Big Data’. In the Seattle Public Library, I will be searching for books with the titles containing the words ‘Data’ & ‘Big Data’. Very interesting trends and patterns emerge.

I will be looking at data from 2011 to 2013, a three-year period.
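The yearly NYT counts could be gathered along these lines. This is a sketch under assumptions, not the code in the attached sketch: it uses the Article Search v2 endpoint with its `q`, `begin_date`, `end_date` and `api-key` parameters and reads the total from `response.meta.hits`. Wrapping the term in quotes to ask for the phrase is an assumption, and `YOUR_KEY` and the sample response are placeholders.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def search_url(term, year, api_key):
    """Build an Article Search URL covering one calendar year."""
    params = {
        "q": f'"{term}"',              # quoted to request the phrase (assumption)
        "begin_date": f"{year}0101",   # dates are given as YYYYMMDD
        "end_date": f"{year}1231",
        "api-key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"

def hit_count(payload):
    """Read the total number of matching articles from a response body."""
    return json.loads(payload)["response"]["meta"]["hits"]

urls = {year: search_url("big data", year, "YOUR_KEY") for year in (2011, 2012, 2013)}

# Made-up response body for illustration; 42 is not a real result.
sample_response = '{"response": {"meta": {"hits": 42}, "docs": []}}'
hits = hit_count(sample_response)
```

One request per term and year is enough here, since only the hit count is needed and not the articles themselves.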

Labels:
I am going for a Swiss poster design style: featuring bold fonts and a minimalistic design. The data is arranged vertically (like a time line) instead of the usual horizontal flow. I will be using a combination of simple bars & lines for the project. The mouse position will highlight the number of books/ articles that the bar represents for both SPL & NYT.

Color Scheme:
Red & White
Attachments
application.macosx.zip
working sketch
(9.66 MiB) Downloaded 191 times
data.csv
(392 Bytes) Downloaded 183 times
bigdata.csv
(73 Bytes) Downloaded 186 times
Assignment 3final.pdf
Final Submission
(1.15 MiB) Downloaded 189 times
Assignment 3.pdf
Proposal
(296.44 KiB) Downloaded 189 times
Last edited by mohithingorani on Tue Feb 25, 2014 3:18 pm, edited 2 times in total.

m_uppal
Posts: 4
Joined: Tue Jan 14, 2014 11:43 am

Re: Proj 3:Data Correlation

Post by m_uppal » Sun Feb 23, 2014 2:16 pm

Visualizing and correlating events in Ukraine: what really describes the chaos happening in Ukraine? The three data sets used here are the New York Times, Twitter, and SPL.
Results from the Twitter and NY Times APIs are compared with the Seattle Public Library database. I will be looking at data from 2010 onward (since that's when the president was elected).
The idea is to depict the periods before and after the president, Viktor F. Yanukovych, fled Ukraine, giving power back to the people.
The pdf, the csv files, and the source code are attached.
Attachments
correlation1.zip
(6.09 MiB) Downloaded 177 times
Assign3-Correlation[final].pdf
(154.84 KiB) Downloaded 323 times
ukraine.csv
(2.51 KiB) Downloaded 185 times
Last edited by m_uppal on Tue Feb 25, 2014 4:10 pm, edited 3 times in total.

songgaogeo
Posts: 4
Joined: Tue Jan 14, 2014 11:48 am

Re: Proj 3:Data Correlation

Post by songgaogeo » Tue Feb 25, 2014 3:14 pm

Concept
The Presidential Election is one of the most important political events in the United States. The presidential debates present the parties’ visions during the campaign. The major issues debated were the economy and jobs, the federal budget deficit, taxation and spending, the future of social security, medicare, healthcare reform, education, social issues, immigration, and foreign policy. The debates are very valuable during the public voting process. It is interesting to explore the trends in check-out behavior related to “election” or “debate” in the Seattle Public Library (SPL) and the articles about these topics in the New York Times (NYT) during the election periods, as well as to identify whether there is a correlation pattern between the two public media.
Visual Design and Color
I will use a comparison-mode visual design for the correlation project, which means every single visual element should contain the paired information from the two media (SPL vs. NYT), such as the bar charts, the frequency display and the relative weight. For the color design, I am going to use the German flag color scheme of Red, Yellow and Black.
Attachments
SongGao_MAT259_HW3.pdf
(469.03 KiB) Downloaded 289 times
DataCorrelation.zip
(37.15 KiB) Downloaded 166 times
Last edited by songgaogeo on Tue Feb 25, 2014 5:43 pm, edited 2 times in total.

laks316
Posts: 3
Joined: Tue Jan 14, 2014 11:52 am

Re: Proj 3:Data Correlation

Post by laks316 » Tue Feb 25, 2014 4:41 pm

MAT 259, Assignment 3, Data Correlation
Lakshman Nataraj

History of Best-sellers vs Total number of checkouts

Concept:

In Assignment 2, I observed the checkouts per month for some of the top manga between 2006 and 2011 in a 2D spatial map. In this data correlation assignment, I will use almost the same list of manga and count the total number of checkouts for every manga in the list from the Seattle Public Library. I will compare this with the total number of best-seller appearances for every manga in the NY Times best-seller history list. The manga considered are:

1. Naruto

2. Bleach

3. One Piece

4. Fairy Tail

5. Dengeki Daisy

6. Skip Beat

7. Vampire Knight

8. Black Bird

9. Soul Eater

10. Rosario-Vampire

MySQL Query:

A sample query to count the total number of checkouts for a manga ('naruto') looks like:

SELECT count(*) FROM spl2.inraw
where (title like "%naruto%" and itemtype="acbk")
and deweyClass = "741.5952";
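Running the same query for every manga in the list could be sketched as below. This is only an illustration: `sqlite3` with an in-memory table stands in for the MySQL server, the table is called `inraw` without the `spl2.` prefix, and the rows are made up. With a MySQL driver the placeholder syntax would differ.

```python
import sqlite3

# Shortened list for the demo; the full list has ten titles.
MANGA = ["naruto", "bleach", "one piece"]

# In-memory stand-in for the spl2.inraw table, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inraw (title TEXT, itemtype TEXT, deweyClass TEXT)")
conn.executemany(
    "INSERT INTO inraw VALUES (?, ?, ?)",
    [
        ("Naruto. Vol. 1", "acbk", "741.5952"),
        ("Naruto. Vol. 2", "acbk", "741.5952"),
        ("Bleach. Vol. 1", "acbk", "741.5952"),
        ("Naruto the movie", "acdvd", "741.5952"),  # not a book, so excluded
    ],
)

QUERY = """
SELECT count(*) FROM inraw
WHERE title LIKE ? AND itemtype = ? AND deweyClass = ?
"""

def checkout_counts(connection, titles):
    """Run the parameterized count query once per title."""
    counts = {}
    for name in titles:
        params = (f"%{name}%", "acbk", "741.5952")
        counts[name] = connection.execute(QUERY, params).fetchone()[0]
    return counts

counts = checkout_counts(conn, MANGA)
```

Passing the title as a parameter rather than pasting it into the SQL string avoids quoting mistakes when the list of titles changes.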

NYTimes Query:

A sample query to get the best-seller history for a manga ('naruto') looks like:

http://api.nytimes.com/svc/books/v2/lis ... o&api-key={}

Doodle: (in pdf)

Challenges:

One of the main challenges was understanding the NY Times API.

Sample Visualization (in pdf)

We see that there is a stunning correlation between the manga checkouts from the Seattle Public Library and the number of manga in the NY Times best-seller history.

Since this is a simple visualization, I plan to add extra features such as mouse hovering and switching the correlation diagram (to histograms or scatter plots, for example) on a key press.
Attachments

assign-3-doodle-lakshman-nataraj.pdf
pdf of doodle
(1.01 MiB) Downloaded 154 times
