
Proj 3:Data Correlation

Posted: Tue Jan 14, 2014 3:31 pm
by glegrady
The assignment is to correlate SPL data with an external dataset such as the NY Times, RottenTomatoes, or Twitter.
How interesting your project is will depend on the question you ask and on how you visualize the two datasets to show their similarities and differences. We are not looking for anything "fancy", but for something that addresses the question of how two datasets can be compared with each other.

Project Schedule:
2.11 Introduce project
2.13 NY Times and Twitter Demo
2.18 Your concept discussion in class
2.20 Project Presentation

Re: Proj 3:Data Correlation

Posted: Thu Feb 20, 2014 3:30 pm
by currier
Project 3: Correlation of two data sets (final)

Concept
The word “olympic”, like many words, can have multiple meanings. Natural language processing techniques such as sentiment analysis attempt to infer meaning from text through computation, but these techniques are complex to understand and apply properly. One relatively simple way to explore differences in how a term is used is to look at the words that accompany it and let the human brain infer the meanings. In this visualization I will use word clouds to visualize the terms that accompany the term “olympic” in both SPL titles and New York Times article titles. The visualization will enable one to explore how time and geography contribute to the context and use of the word.
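As a rough illustration of the counting step behind such a word cloud, here is a minimal Python sketch. It is not the author's code: the function name, the tiny stop-word list, the prefix match on "olympic", and the sample titles are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on", "at"}


def cooccurring_terms(titles, target="olympic"):
    """Count the words that appear in titles mentioning `target`."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z']+", title.lower())
        # Prefix match, so "olympics" also counts as a mention (an assumption).
        if not any(w.startswith(target) for w in words):
            continue
        for w in words:
            if w.startswith(target) or w in STOP_WORDS:
                continue
            counts[w] += 1
    return counts


# Made-up titles, for illustration only.
titles = [
    "Olympic National Park: a hiking guide",
    "The Winter Olympics in Sochi",
    "Hiking the Olympic Peninsula",
    "Cooking for beginners",
]
counts = cooccurring_terms(titles)
```

The resulting counts could then drive the font size of each word in the cloud, computed separately for SPL titles and NYT titles.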

Re: Proj 3:Data Correlation

Posted: Thu Feb 20, 2014 4:52 pm
by grant.mckenzie
With the increase in online sources of information, the role of libraries is changing. Sites like Wikipedia (ranked 6th in terms of web traffic) offer a plethora of information through crowd-sourced means. What is of interest is how exploration of subject matter through physical media differs from that of online content. While material “check-outs” from the Seattle Public Library are a biased subset of the use of physical media, they still offer insight into subject-matter interest.

For this project I propose to explore the correlation between visits to Wikipedia subject pages and SPL media check-out data related to that same subject. Comparing these “visits” and “check-outs” visually over time will allow one to see trends in the data as well as visually see any correlation that exists. Given that the Seattle Public Library is located in Seattle, I thought it might be interesting to look at a number of major businesses that call Seattle home.

Data

The companies/topics I chose to correlate Wikipedia and SPL data on are
  • Microsoft
  • Boeing
  • Starbucks
  • The University of Washington
SPL
Extraction of data from the Seattle Public Library dataset involved one query per company. Each query looked for the stemmed company name keyword in the Title or the Subject of the check-out record. Additionally, the data was restricted to 2010 through 2013 inclusive. An example of such a query is:

SELECT d, count(*) AS cnt
FROM
(SELECT unix_timestamp(cout) AS d
FROM spl2.outraw
WHERE (title LIKE '%starbucks%' OR subj LIKE '%starbucks%')
AND year(cout) >= 2010 AND year(cout) < 2014) a
GROUP BY d
ORDER BY d ASC;
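Since there is one query per company, the query text can be generated rather than copied by hand. The Python sketch below is only an illustration under assumptions: the function name is hypothetical, the `%s` placeholders follow the style used by common MySQL drivers, and it groups on `DATE(cout)` to get one row per calendar day, which differs from the `unix_timestamp(cout)` grouping in the query above.

```python
def checkout_query(keyword, first_year=2010, last_year=2013):
    """Return (sql, params) for daily checkout counts matching `keyword`."""
    pattern = f"%{keyword.lower()}%"
    sql = (
        "SELECT d, COUNT(*) AS cnt "
        "FROM (SELECT DATE(cout) AS d "  # one bucket per calendar day (assumption)
        "FROM spl2.outraw "
        "WHERE (title LIKE %s OR subj LIKE %s) "
        "AND YEAR(cout) BETWEEN %s AND %s) a "
        "GROUP BY d ORDER BY d ASC"
    )
    # Values are passed separately so the driver handles quoting.
    return sql, (pattern, pattern, first_year, last_year)


companies = ["microsoft", "boeing", "starbucks", "university of washington"]
queries = {name: checkout_query(name) for name in companies}
```

Each `(sql, params)` pair would then be handed to a database cursor's `execute` call.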

Wikipedia
The site: “http://stats.grok.se” offers access to daily site visit data for Wikipedia entries. For example the URL “http://stats.grok.se/json/en/201212/starbucks” returns a JSON object of daily visits to the Starbucks Wikipedia page for the month of December 2012. A script was written to download and parse all daily visits from January 1, 2010 to December 31, 2013. A CSV file was produced for all subjects of interest.
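The download script itself is not included in this post. A minimal Python sketch of the idea might look like the following; the function names are hypothetical, the `daily_views` key is an assumption about the JSON layout, and the sample payload uses made-up numbers so the example runs without network access.

```python
import csv
import io
import json


def month_urls(title, first_year=2010, last_year=2013):
    """Build one stats.grok.se URL per month in the study period."""
    return [
        f"http://stats.grok.se/json/en/{year}{month:02d}/{title}"
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


def parse_daily_views(payload):
    """Return sorted (date, views) pairs from one month's JSON text."""
    data = json.loads(payload)
    views = data["daily_views"]  # assumed key name
    return sorted(views.items())


def to_csv(rows):
    """Serialize (date, views) rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["date", "views"])
    writer.writerows(rows)
    return buf.getvalue()


# Made-up payload in the assumed shape, for illustration only.
sample = '{"daily_views": {"2012-12-02": 1200, "2012-12-01": 1100}, "title": "starbucks"}'
rows = parse_daily_views(sample)
urls = month_urls("starbucks")
```

Fetching each URL, parsing it, and concatenating the rows would give one CSV per subject.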

Doodle

The original doodle was designed using countries as the subjects of interest. This has since changed to companies based in Seattle, WA.

Design Decisions

I decided to represent the data as a bidirectional wave, or StreamGraph. The X-axis of the graph is time, labeled by year but built on daily Wikipedia site visits (first set) and daily SPL checkouts (second set). The Y-axis shows the volume of either page visits (first set) or Seattle Public Library media check-outs (second set). These volumes were collected daily but are visually aggregated into monthly buckets. It is important to note that the values presented for the Seattle Public Library are a full 1000 times smaller than those presented for Wikipedia. Forcing the two sets to the same approximate size allows the user to compare the proportions of one set to the other. If this size adjustment were not done, the user would not be able to see any of the Seattle Public Library checkout data. In addition to the StreamGraph, two pie charts on the right show the percentage breakdown of total page visits and of total media checkouts.
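The monthly bucketing and the size adjustment can be sketched in a few lines of Python. This is an illustration, not the project's source: the function names are hypothetical, the daily values are made up, and the factor of 1000 is taken from the description above.

```python
from collections import defaultdict


def monthly_totals(daily):
    """Sum ("YYYY-MM-DD", count) pairs into "YYYY-MM" buckets."""
    totals = defaultdict(int)
    for day, count in daily:
        totals[day[:7]] += count  # first 7 characters are "YYYY-MM"
    return dict(sorted(totals.items()))


def scale(series, factor):
    """Multiply every monthly value so both sets are drawn at similar size."""
    return {month: value * factor for month, value in series.items()}


# Made-up daily values, for illustration only.
wiki_daily = [("2012-12-01", 1100), ("2012-12-02", 1200), ("2013-01-01", 900)]
spl_daily = [("2012-12-01", 1), ("2012-12-02", 2), ("2013-01-01", 1)]

wiki_monthly = monthly_totals(wiki_daily)
spl_scaled = scale(monthly_totals(spl_daily), 1000)
```

Because the scaling changes the absolute values, the two halves of the graph are comparable in shape and proportion only, not in raw counts.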

Re: Proj 3:Data Correlation

Posted: Thu Feb 20, 2014 6:02 pm
by milrober
I want to explore features of a song that may contribute to its popularity in Seattle. I will use the Echo Nest’s API to query the currently “hottest” (that is, most popular) songs. The Echo Nest has a database of over 35.3 million songs and, along with those songs, certain features of each song such as tonality, tempo, and valence. After getting the most popular songs from the Echo Nest, I will query the Seattle Public Library (SPL) for the number of checkouts for albums containing those songs. The visualization will compare the popularity of an album in SPL to the features computed by the Echo Nest.

-Rob

Re: Proj 3:Data Correlation

Posted: Thu Feb 20, 2014 6:15 pm
by hellobuaazl
I want to correlate data on books checked out from the Seattle Public Library with data on New York Times articles containing the keywords “China” and “Japan”, from 2005 to 2011.

Re: Proj 3:Data Correlation

Posted: Thu Feb 20, 2014 8:59 pm
by mohithingorani
Concept:
To me, correlating data means making sense of two data sets at the same time and understanding the dynamics between them. For this assignment I am exploring trends in the emerging fields of “Data & Big Data”. I will be using Article Search from the New York Times API and looking for the word ‘Data’ and its subset ‘Big Data’. In the Seattle Public Library, I will be searching for books with titles containing the words ‘Data’ & ‘Big Data’. Very interesting trends and patterns emerge.

I will be looking into data from (2011-2013): a 3-year period.
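For anyone building a similar request, here is a hedged Python sketch of one Article Search call per year. The function names are hypothetical; the endpoint, the `q`, `begin_date`, `end_date`, and `api-key` parameters, and the `response.meta.hits` field reflect my understanding of the v2 Article Search API and should be checked against the current NYT documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint for version 2 of the Article Search API.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def article_search_url(term, year, api_key):
    """Build a request URL for articles matching `term` in one year."""
    params = {
        "q": f'"{term}"',  # quoted so "Big Data" is searched as a phrase
        "begin_date": f"{year}0101",
        "end_date": f"{year}1231",
        "api-key": api_key,
    }
    return BASE_URL + "?" + urlencode(params)


def hit_count(response):
    """Read the total number of matches from a decoded JSON response."""
    return response["response"]["meta"]["hits"]


# One URL per year of the 2011-2013 period; "YOUR_KEY" is a placeholder.
urls = [article_search_url("Big Data", year, "YOUR_KEY") for year in (2011, 2012, 2013)]
```

The yearly hit counts could then be drawn next to the yearly SPL title counts.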

Labels:
I am going for a Swiss poster design style, featuring bold fonts and a minimalistic design. The data is arranged vertically (like a timeline) instead of the usual horizontal flow. I will be using a combination of simple bars & lines for the project. Hovering with the mouse will highlight the number of books/articles that a bar represents for both SPL & NYT.

Color Scheme:
Red & White

Re: Proj 3:Data Correlation

Posted: Sun Feb 23, 2014 2:16 pm
by m_uppal
Visualizing and correlating events in Ukraine: what really describes the chaos happening in Ukraine? The three data sets used here are the New York Times, Twitter, and SPL.
Data from the Twitter and NY Times APIs is compared with the Seattle Public Library database. I will be looking at data from 2010 onward (since that's when the president was elected).
The idea is to depict the periods before and after the president, Viktor F. Yanukovych, fled Ukraine and power was given back to the people.
Attached pdf and csv files and the source code.

Re: Proj 3:Data Correlation

Posted: Tue Feb 25, 2014 3:14 pm
by songgaogeo
Concept
The Presidential Election is one of the most important political events in the United States. The presidential debates present the parties’ visions during the campaign. The major issues debated were the economy and jobs, the federal budget deficit, taxation and spending, the future of social security, medicare, healthcare reform, education, social issues, immigration, and foreign policy. The debates are very valuable during the public voting process. It is interesting to explore the trends in check-out behavior related to “election” or “debate” in the Seattle Public Library (SPL) and the articles about these topics in the New York Times (NYT) during the election periods, as well as to identify whether there is a correlation pattern between the two public media.
Visual Design and Color
I will use a comparison-mode visual design for the correlation project, which means every visual element should contain the paired information from the two media (SPL vs. NYT), such as the bar charts, the frequency display, and the relative weight. For the color design, I am going to use the German flag color scheme of red, yellow, and black.
SongGao_MAT259_HW3.pdf
DataCorrelation.zip

Re: Proj 3:Data Correlation

Posted: Tue Feb 25, 2014 4:41 pm
by laks316
MAT 259, Assignment 3, Data Correlation
Lakshman Nataraj

History of Best-sellers vs Total number of checkouts

Concept:

In Assignment 2, I observed the checkouts per month for some of the top manga between 2006 and 2011 in a 2D spatial map. In this data correlation assignment, I will use almost the same list of manga and count the total number of checkouts for every manga in the list from the Seattle Public Library. I will compare this with the total number of best-seller entries for every manga in the NY Times best-seller history list. The manga considered are:

1. Naruto

2. Bleach

3. One Piece

4. Fairy Tail

5. Dengeki Daisy

6. Skip Beat

7. Vampire Knight

8. Black Bird

9. Soul Eater

10. Rosario-Vampire

MySQL Query:

A sample query to count the total number of checkouts for a manga ('naruto') looks like:

SELECT count(*) FROM spl2.inraw
WHERE (title LIKE "%naruto%" AND itemtype = "acbk")
AND deweyClass = "741.5952";

NYTimes Query:

A sample query to get the history of best-sellers for a manga ('naruto') looks like:

http://api.nytimes.com/svc/books/v2/lis ... o&api-key={}

Doodle: (in pdf)

Challenges:

One of the main challenges was understanding the NY Times API.

Sample Visualization (in pdf)

We see a striking correlation between the manga checkouts from the Seattle Public Library and the number of manga entries in the NY Times best-seller history.
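To put a number on such a relationship, a Pearson correlation coefficient could be computed over the per-manga totals. The Python sketch below is only an illustration: the function is a textbook implementation, and the two short lists are made-up values, not the results from this project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


# Made-up totals, one entry per title, for illustration only.
checkouts = [10, 20, 30, 40]
bestsellers = [1, 2, 3, 4]
r = pearson(checkouts, bestsellers)
```

A value near 1 indicates that titles with more checkouts also tend to have more best-seller entries; with only ten titles, the coefficient should be read cautiously.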

Since this is a simple visualization, I plan to add extra features such as mouse hovering and switching the correlation diagram (for example between histograms and scatter plots) on a key press.