Proj 4 - Correlate with External Data

Post Reply
Posts: 160
Joined: Wed Sep 22, 2010 12:26 pm

Proj 4 - Correlate with External Data

Post by glegrady » Mon Feb 18, 2013 6:32 pm

The 4th project is to take the 2D spatial map and to correlate with an external data source such as Google Correlate, Twitter, Facebook, NY Times. This assignment builds on the 2D map which you can extend, and then possibly bring in Basak Alper's research question of how to compare two sets of data within one image.
George Legrady

Posts: 5
Joined: Fri Jan 11, 2013 10:37 am

Re: Proj 4 - Correlate with External Data

Post by ytang » Tue Feb 26, 2013 1:08 pm

Yeu-Shuan Tang
(58.97 KiB) Downloaded 544 times
(7.77 KiB) Downloaded 559 times
Screen Shot 2013-03-26 at 5.11.31 PM.png
Screen Shot 2013-02-24 at 11.25.01 PM.png
Screen Shot 2013-02-24 at 11.25.10 PM.png
Screen Shot 2013-02-24 at 11.21.00 PM.png
Last edited by ytang on Tue Mar 26, 2013 5:34 pm, edited 4 times in total.

Posts: 5
Joined: Tue Jan 15, 2013 1:22 pm

Re: Proj 4 - Correlating with External Data

Post by ramon.rovirosa » Tue Feb 26, 2013 1:30 pm

Ranking Cuisines: NY Times vs. Seattle Library
Ramon Rovirosa

This is a visualization that ranks different food cuisines by popularity at two different sources. The popularity of Cookbooks in the Seattle Public Library Ranked from 1 most popular to 6 least popular and similarly articles in the New York Times are also ranked, and compared in the visualization.

New York Times is the top square for each category, and the Seattle Public Library the bottom Square.

SPL Query:

SELECT year(cout) as Days, month(cout) as Months, cout as Dates, itemtype,
SUM(case When subj like '%chinese%' and subj like '%food%' Then 1 Else 0 End) as chinese,
SUM(case When subj like '%vegetarian%' and subj like '%food%' Then 1 Else 0 End) as vegetarian,
SUM(case When subj like '%spanish%' and subj like '%food%' Then 1 Else 0 End) as spanish,
SUM(case When subj like '%Italian%' and subj like '%food%' Then 1 Else 0 End) as italian,
SUM(case When subj like '%french%' and subj like '%food%' Then 1 Else 0 End) as french,
SUM(case When subj like '%dessert%' and subj like '%food%' Then 1 Else 0 End) as dessert

FROM inraw where year(cout)= ('2007') or year(cout)= ('2008') or year(cout)= ('2009') or year(cout)= ('2010') or year(cout)= ('2011') or year(cout)=('2012')
GROUP BY month(cout), year(cout);

NY TImes Query:

String[] words = {"Dessert+Food", "Spanish+Food", "Chinese+Food", "French+Food", "Vegetarian+Food", "Italian+Food"};
color[] colors = {#FF0000, #00FF00, #0000FF, #FF3300, #FF9900, #000000};

int barSize = 25;
int startY = 80;

String start = "20070101";
String end = "20111231";

String[] cuisines = new String[6];

for (int i = 0; i < words.length; i++) {
for(Integer _year = 2007; _year <=2011; _year++){
for(Integer _month = 1; _month<=12; _month++){
String strt = _month <= 9 ? Integer.toString(_year)+"0"+Integer.toString(_month)+"01" : Integer.toString(_year)+Integer.toString(_month)+"01";
String _end = _month <= 9 ? Integer.toString(_year)+"0"+Integer.toString(_month)+"31" : Integer.toString(_year)+Integer.toString(_month)+"31";
println(strt + " "+ _end);
//int freq = 0;
int freq = getArticleKeywordCount( words, strt,_end);
nyFoodArticles[_year-2007][_month] = freq;
println( nyFoodArticles[_year-2007][_month] );
cuisines += (_year == 2011 && _month == 12) ? nyFoodArticles[_year-2007][_month] : nyFoodArticles[_year-2007][_month] + "," ;
//out.println( nyFoodArticles[_year-2007][_month] );
//rect(0, startY + (barSize * i), freq/5, barSize);
saveStrings("nyFoodCount.txt", cuisines);



Overall, this project was an improvement on my previous visualization where I was able to more adequately show a change in food ranking by using color/brightness instead of physical position ranking more efficiently.
(7.32 KiB) Downloaded 619 times
Last edited by ramon.rovirosa on Wed Mar 20, 2013 12:04 am, edited 7 times in total.

Posts: 6
Joined: Fri Jan 11, 2013 10:32 am

Re: Proj 4 - Correlate with External Data

Post by anastasiya » Wed Feb 27, 2013 5:44 pm

Here is my correlation with the NY Times articles API. I did a search for articles containing the word 'flowers' from 2006 to 2012 and correlated the results with SPL book checkouts that contain the word 'flowers'.

Libraries used:


Query Used:
select year(cout), month(cout), count(*) from inraw where title like '%flowers%' and itemtype like "%bk%" and cout > '2006-01-01' and cout < '2012-01-01' group by year(cout), month(cout) order by year(cout), month(cout);

this fetches the checkout month and year of all the books containing the word 'flowers' from 2006 to 2012.

Processing file is attached.

There is some interactivity in the file, when the user mouses over to the right, the SPL data fades out and when they mouse over to the left the NYT data fades out.

The checkout frequency was represented visually by flowers. The red flowers represent NYT data and the blue flowers the SPL data.

Overall the SPL dataset shows greater mention of the word 'flowers' in book titles. We see that in both datasets the mention of 'flowers' increases in the summer months.

[The extension tiff has been deactivated and can no longer be displayed.]

Lazareva_Anastasiya_HW4 .zip
(84.82 KiB) Downloaded 562 times
Last edited by anastasiya on Thu Mar 07, 2013 12:17 pm, edited 1 time in total.

Posts: 5
Joined: Fri Jan 11, 2013 10:32 am

Re: Proj 4 - Correlate with External Data

Post by bkang » Wed Feb 27, 2013 6:18 pm

The code of my project with two screen shots are attached below.

A. Project Introduction:
This project has been extended from the previous dataset(transaction records of the books having the top 3 social medias(Facebook, Twitter and LinkedIn) on their title) in order to visualize the correlation-based implication between NYTimes and SPL data sets.

B. Instruction to How to Use.
The distribution of transaction records are visualized in histogram on the lower side of the horizontal axis. At the same time, number of articles from New York Times having the same three keywords are visualized on the upper side of the axis. You can hover your mouse cursor on the axis to see the spotlighted portion of the entire plot. At the area below of the graph, a disc diagram shows number of records of each category in different color, size and arc length. Each element represents as follows:

- Upper hemisphere : New York Times dataset.
- Lower hemisphere : Seattle Public Library dataset.
- Size of the disc : Total checkouts (NYT + SPL)
- Three colors : Facebook, Twitter and LinkedIn groups.
- Arc length : ratio of the group in terms of the number of records in each dataset.

In this visualization, data sets are locally stored in csv file format due to security and accessibility reasons.
Spotlight animation is also added as an additional interactivity with mouse hovering. This function helps users focus on a specific time slot since this visualization is focused on comparison task.

C. The query used to fetch SPL dataset:

Code: Select all

SELECT DATE_FORMAT(o,'%Y,%m') as coutmonth, count(*) 
FROM activity, `title` 
WHERE activity.bib = title.bib 
     AND LOWER(title) like '%__KEYWORD__%' 
     AND year(o) > 2007 
     AND year(o) < 2013 
GROUP BY coutmonth
D. Implemented Visualization
- Attached below.

E. Libraries Used.
- NYTimes - The Article Search API

- Processing Libraries
(json, bezierSql libraries)

F. Result and Analysis
As can be seen in the attached screen shots, Facebook keeps decreasing in both SPL checkouts and NYTimes articles in terms of the number of items. However, the statistics of Twitter in NYTimes article shows continuous increase and this, perhaps, can be considered as a unique impact of Twitter on specific fields such as politics or business. A number of blogs, news columns and scientific papers have claimed that Twitter is being utilized as an important medium for propagating/disseminating political opinions or campaigns and commercial advertisements in addition to general communications between individual users. This unique phenomena of Twitter distinguishes itself from the other social media.

G. Controls
[1-3] highlight individual data set by color
[0] turn off highlighting
final code
(57.91 KiB) Downloaded 570 times
Screen Shot 2013-03-26 at 3.04.31 PM.png
Last edited by bkang on Tue Mar 26, 2013 4:22 pm, edited 5 times in total.

Posts: 6
Joined: Tue Jan 15, 2013 1:21 pm

Re: Proj 4 - Correlate with External Data

Post by saeedm » Thu Feb 28, 2013 12:21 pm


The question I aimed to answer with this project was: How well rated are the most checked out movies at the SPL? To answer this question, I needed to find an external database that had reviews of all major movies. is such a website, and even has poster images and trailers of most movies, all in an easy to use API. It turned out to be a great tool, and even guided the design of my project towards prominently displaying the movies posters.


item.bib as 'bib',
title.title as 'Title',
kind.kind as 'Kind',
count(activity.o) as 'Checkouts'
activity, title, item, kind
activity.bib = title.bib and
activity.bib = item.bib and
item.item = kind.item and
kind.kind in ('acdvd', 'jcdvd') and
year(activity.o) = 2012
count(activity.o) desc

Note: this query is manually run for each year between 2006 and 2012, and the results concatenated. There may be a way of combining this into one query.

Query Explanation

Since I wanted to grab the most checked out movies from the SPL, I limit the 'kind' to DVDs, and count the number of checkouts per bib number. The result is a sorted list of the most popular movies for the provided year. Of the available movie item data on the SPL database, the only field useful for correlating with an external database is the title string. This eventually causes some issues when querying RottenTomatoes, since some movies share the same title, and sometimes the title string from the SPL is slightly incorrect (e.g. missing 'The' in 'The Black Dahlia').

Correlating with External Data

Since all I had to work with was the title string, I was limited to performing a simple search query on Rotten Tomatoes to return the detailed movie information pertaining to a movie title. The search query isn't perfect (it can't be when supplied with just a title), so sometimes multiple movie objects were returned for one movie title. Also, the results seemed to be ordered by relevance, with more recent/popular movies on top. This can be a problem, for example when the original movie in a series is queried (e.g. "Iron Man"), but the sequel is more popular so appears first in the results list ("Iron Man 3" from 2013 appears first). This was solved by looking through the results from a RottenTomatoes search query and finding the movie with the most similar title based on a string comparison. The algorithm I used for checking how similar two strings are is called the Levenshtein Distance Algorithm. It returns 0 if two strings are identical, 1 if there is one character different or missing, 2 if two characters different, etc. By picking the movie with the lowest Levenshtein distance I accurately chose the matching movie information for each title, including poster information and rating.

The RottenTomatoes API does not support batch queries, so rather than querying each of the 35+ movie titles for movie information each time the code is run, I query the JSON results to a file, and upon startup attempt to load the movie information from that file to avoid querying.

Results and Analysis

Since I had the movie posters to work with, I decided to make them the prominent feature in my project. I organized the movie posters in a grid; by year from left to right (2006 on left, 2012 on right) and checkout popularity from top to bottom (most popular on top). I used a similar approach used by the TA Yoon for displaying overlaid rating information with varying transparency when the user moves the mouse left and right. It's easy to spot movies that have ranked in the top 5 for repeated years. It's also interesting to see that while the vast majority of these movies were rated 'Certified Fresh' on RottenTomatoes (rated very highly), some received a 'Rotten' rating (rated very poorly). Very few were in between.
Screenshot 1
Screenshot 2
Screenshot 3
(220.68 KiB) Downloaded 546 times
Last edited by saeedm on Tue Mar 26, 2013 6:28 pm, edited 2 times in total.

Posts: 24
Joined: Fri Oct 12, 2012 9:27 am

Re: Proj 4 - Correlate with External Data

Post by giovanni » Wed Mar 06, 2013 7:06 pm

project 4

In this project I'm comparing the data from SPL and NYT using the keyword: Europe
upgraded version

Project description ... 5bvf0/edit#


[The extension rar has been deactivated and can no longer be displayed.]

Last edited by giovanni on Wed Mar 27, 2013 11:32 am, edited 1 time in total.

Posts: 5
Joined: Tue Jan 22, 2013 4:49 pm

Re: Proj 4 - Correlate with External Data

Post by scottbcesar » Mon Mar 25, 2013 10:19 pm

(Sorry, the project doodle was lost)

For this project i wanted to experiment with the visualization created in my last project, extending it to visualize two competing sets of data; to that end, I scraped a data set from google's search trends similar to the set I got from Seattle public library, and made the same calculations; then to display the data in an interesting and potentially useful way, I averaged the two data sets together; using mouse position on the screen to control the rate at which they were combined:

code folder: ... sp=sharing

images: ... sp=sharing

Posts: 5
Joined: Fri Jan 11, 2013 10:38 am

SPL Top 10 Albums in 2012

Post by andrewwaranis » Fri Mar 29, 2013 12:20 am

SPL Top 10 Albums in 2012

My initial question was "What were the 'Top N' album checkouts per month at the Seattle Public Library over the course of the dataset?" It eventually became "What were the top 10 album per checkouts for each month in 2012?" The results are displayed as proportionally sized album art for each month that reveals album info when clicked (as in the Project 4 image). The latter portion was enabled by correlating with's product listings. The initial plan to correlate against another monthly rating (as in the Project 3 image) have been postponed, but the visual concept can be seen by pressing the 'b' button.

This query gives the top albums checked out per month in 2012.

SELECT count(*) AS count, title, resp AS artist
FROM title, activity, extras, kind, `subject`
title.bib = extras.bib AND
title.bib = activity.bib AND
YEAR(o) = '2012' AND
MONTH(o) = 'mm' AND
activity.item = kind.item AND
substr(kind,3) = "cd" AND
title.bib = `subject`.bib AND
`subject` LIKE '%music%'
GROUP BY title.bib

There's a way to script this, but I didn't have time to. Instead I manually incremented 'mm' in the query from 1 to 12 to cover each month. Using a feature in Sequel Pro, I manually copied the results to JSON and saved them with the name 'mm-2012.json' for future processing.

Next I ran 'gather_data.rb', which contains a set of functions I wrote to combine the JSON files and augment them with data from Amazon. In addition, it downloads the required images.

Project 3.png
Project 4.png

Left/Right Arrows - Move between months.
Spacebar - See overview of months. Returns to month view.
'B' - See the "stubbed" "2D visualization".

Click album art to see their info and a QR code to their Amazon page.

The PDE (+ required Java libraries) and my data processing script is attached as a zip file.
PDE + data processing script
(2.39 MiB) Downloaded 479 times

Post Reply