Proj 5: Data Correlation / Final Project

glegrady
Posts: 203
Joined: Wed Sep 22, 2010 12:26 pm

Proj 5: Data Correlation / Final Project

Post by glegrady » Wed Jan 21, 2015 11:58 pm

This is the last assignment in the course. Students can choose from the following options:

1) Do a data correlation between SPL data and another dataset accessed through a JSON API
2) Go deep into a JSON API without correlation
3) Correlation between 2 APIs
4) SPL data analyzed and visualized in an innovative way
5) Innovative exploration of data query and visualization without another dataset
6) Whichever option you choose, there has to be knowledge discovery through the visualization

Evaluation Criteria

Any combination of the following:
1) Innovative query or topic
2) Innovative exploration of visualization
3) Innovative algorithm implementation with one or more of 1) and 2)
4) Stay away from bar graphs

Schedule

2.24 Data Correlation introduced
2.26 Basic Demo
3.3 Discussion
3.6 Work-in-progress
3.10 Dead week: Individual meetings after course evaluation
3.12 Individual meetings
3.17 Final project presentations
3.21 Completion of assignments and html descriptions with images and code for vislab.mat.ucsb.edu documentation
George Legrady
legrady@mat.ucsb.edu

a.lazareva
Posts: 5
Joined: Sat Jan 10, 2015 11:29 am

Re: Proj 5: Data Correlation

Post by a.lazareva » Sat Feb 28, 2015 1:55 pm

Data Visualization on GitHub

I decided to use the GitHub API for this project to collect data on repositories that are related to data visualization. I'm using the following API https://developer.github.com/v3/.

The code I wrote calls the API to collect the following data for a list of keywords related to data visualization:
- repository name
- repository size
- number of watchers
- owner
- owner's profile picture
- language

Then I collect statistics for the repository using the repository and owner name. The following stats are collected:
- code frequency: returns a weekly aggregate of the number of additions and deletions pushed to a repository.
- commit activity: returns the last year of commit activity grouped by week.
- punch card: each array contains the day number, hour number, and number of commits.

All this gets saved as a JSON file for later use in the visualization.
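
A minimal sketch of this collection step, using the org.json library and the documented GitHub v3 search and stats endpoints (keyword list, pagination, and authentication are omitted here), could look roughly like this:

Code: Select all

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;

import org.json.JSONArray;
import org.json.JSONObject;

public class GitHubFetcher {

	// Reads the full response body of a GET request as a string.
	static String get(String url) throws Exception {
		BufferedReader in = new BufferedReader(
				new InputStreamReader(new URL(url).openStream(), Charset.forName("UTF-8")));
		StringBuilder sb = new StringBuilder();
		String line;
		while ((line = in.readLine()) != null) sb.append(line);
		in.close();
		return sb.toString();
	}

	public static void main(String[] args) throws Exception {
		// Search repositories for one keyword; repeat for each visualization-related keyword.
		JSONObject search = new JSONObject(
				get("https://api.github.com/search/repositories?q=data+visualization&per_page=10"));
		JSONArray items = search.getJSONArray("items");

		for (int i = 0; i < items.length(); i++) {
			JSONObject repo = items.getJSONObject(i);
			String owner = repo.getJSONObject("owner").getString("login");
			String avatar = repo.getJSONObject("owner").getString("avatar_url"); // owner's profile picture
			String name = repo.getString("name");
			System.out.println(name + " | " + repo.getInt("size") + " | "
					+ repo.getInt("watchers_count") + " | " + owner + " | "
					+ repo.optString("language", "n/a") + " | " + avatar);

			// Per-repository statistics (these endpoints may return 202 while GitHub computes them).
			String base = "https://api.github.com/repos/" + owner + "/" + name;
			String codeFrequency = get(base + "/stats/code_frequency");
			String commitActivity = get(base + "/stats/commit_activity");
			String punchCard = get(base + "/stats/punch_card");
			// ...append these to the JSON file used by the visualization.
		}
	}
}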

I think this data could be used for an interesting data visualization. I'm personally interested in who the top contributors are, what languages they use, and how much activity is happening in their repositories.
My final Data visualization is an interactive dashboard that uses the GitHub data I got from the GitHub API along with geo coordinate data from http://www.datasciencetoolkit.org/ API.

The visualization has the following features:
-legend with keywords lets users select the topics they are interested in
-repository scatter plot that shows size/watcher count information (you can scroll through the entire dataset using arrows); if a repository is clicked, repository details appear, including the user picture, some info about the repository, and a time card visualization showing when the user is most active.
-a rotating globe showing user locations, the bars coming out of the globe represent the number of repositories associated with the keyword.
-language window: shows how many repositories use a language for the selected set of topics
-average user activity: shows a stream graph of average commit activities over the last year.
screenshot.jpg
screenshot
https://dl.dropboxusercontent.com/u/43769469/GHVis.zip
Attachments
GitHubAPI.zip
code
(116 KiB) Downloaded 975 times
Last edited by a.lazareva on Mon Mar 16, 2015 12:28 pm, edited 5 times in total.

fabian.offert
Posts: 5
Joined: Sat Jan 10, 2015 11:32 am

Re: Proj 5: Data Correlation

Post by fabian.offert » Sun Mar 01, 2015 1:42 pm

"How 11 Year Olds Get Girls" or Advice in Times of the Internet

For my project, I will not work with the New York Times API but will instead create a visualization correlating the Seattle Public Library data with a quite interesting historical data source I came across.

Concept

In 1998, in an effort to regulate Internet pornography, the so-called Child Online Protection Act (COPA) was passed (http://en.wikipedia.org/wiki/Child_Onli ... ection_Act). In 2006, the Department of Justice, trying to enforce the newly created law, demanded access to the search logs of all major Internet search engines, including Google (Gonzales vs. Google, 2006: http://news.findlaw.com/nytimes/docs/go ... 1806m.html). Google took legal action, which eventually (2009) led to the act being struck down as unconstitutional after a long journey through the courts. Several other search engines complied and delivered the requested "multi-stage random sample of one million URL’s" from their databases, and a computer file with "the text of each search string entered onto the search engine over a one-week period (absent any information identifying the person who entered such query)."

One search engine though, AOL, went a step further and decided to not only make its data available, but to make it publicly available. So, according to Wikipedia, "on August 4, 2006, AOL Research, headed by Dr. Abdur Chowdhury, released a compressed text file on one of its websites containing twenty million search keywords for over 650,000 users over a 3-month period, intended for research purposes. AOL deleted the search data on their site by the 7th, but not before it had been mirrored and distributed on the Internet" (http://en.wikipedia.org/wiki/AOL_search_data_leak). This is the text of the readme file that came with the data from AOL:
500k User Session Collection
----------------------------------------------
This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY.
Any application of this collection for commercial purposes is STRICTLY PROHIBITED.

Brief description:

This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.

The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.

The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
AnonID - an anonymous user ID number.
Query - the query issued by the user, case shifted with
most punctuation removed.
QueryTime - the time at which the query was submitted for search.
ItemRank - if the user clicked on a search result, the rank of the
item on which they clicked is listed.
ClickURL - if the user clicked on a search result, the domain portion of
the URL in the clicked result is listed.

Each line in the data represents one of two types of events:
1. A query that was NOT followed by the user clicking on a result item.
2. A click through on an item in the result list returned from a query.
In the first case (query only) there is data in only the first three columns/fields -- namely AnonID, Query, and QueryTime (see above).
In the second case (click through), there is data in all five columns. For click through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next "page" or results for some query, this appears as a subsequent identical query with a later time stamp.

CAVEAT EMPTOR -- SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.

Basic Collection Statistics
Dates:
01 March, 2006 - 31 May, 2006

Normalized queries:
36,389,567 lines of data
21,011,340 instances of new queries (w/ or w/o click-through)
7,887,022 requests for "next page" of results
19,442,629 user click-through events
16,946,938 queries w/o user click-through
10,154,742 unique (normalized) queries
657,426 unique user ID's


Please reference the following publication when using this collection:

G. Pass, A. Chowdhury, C. Torgeson, "A Picture of Search" The First
International Conference on Scalable Information Systems, Hong Kong, June,
2006.

Copyright (2006) AOL
The ensuing public outrage, which AOL somehow had not anticipated, mainly focused on the badly implemented anonymization of the data: while users were assigned random IDs, their search histories still revealed so much information that they became identifiable again. Coincidentally, it was the New York Times that first publicly revealed - with her consent - the identity of user no. 4417749 on August 9, 2006 (http://www.nytimes.com/2006/08/09/techn ... d=all&_r=0). At AOL, the people responsible for the leak were let go and a written apology was published (http://news.cnet.com/AOL-apologizes-for ... 02793.html).

For the rest of the year 2006, lots of people worked with the AOL data, identifying, amongst others, a potential murderer (https://plentyoffish.wordpress.com/2006 ... mit-murder). Here is a little gem I found, a love poem by user 2708:
'2708', 'how to drive someone crazy', '2006-03-03 17:09:27'
'2708', 'how to say goodbye hurtfully', '2006-03-05 20:45:09'
'2708', 'how to send junk mail to someone else', '2006-03-18 16:22:04'
'2708', 'how to permantlly delete information from you', '2006-03-19 14:35:57'
'2708', 'how to send email anonymously', '2006-03-21 19:43:35'
'2708', 'how to humiliate someone', '2006-03-03 07:47:13'
'2708', 'how to get revenge on an old lover', '2006-03-03 17:15:12'
'2708', 'how to report child neglect in the state of n', '2006-03-11 17:01:51'
'2708', 'how to send things anounmously', '2006-03-19 12:47:27'
'2708', 'how to permanently delete information from yo', '2006-03-19 14:36:13'
'2708', 'how to stop loving someone', '2006-03-21 23:16:51'
'2708', 'how humiliate someone', '2006-03-03 16:56:05'
'2708', 'how to really make someone hurt for the pain ', '2006-03-03 17:22:50'
'2708', 'how to get back on an ex lover', '2006-03-18 15:53:45'
'2708', 'how to send things anonymously', '2006-03-19 12:47:33'
'2708', 'how to send alot of junk mail to someonne wit', '2006-03-21 19:41:29'
'2708', 'how to make someone misreable', '2006-03-03 17:05:12'
'2708', 'how to make an old lover suffer', '2006-03-05 15:02:13'
'2708', 'how to move on from a broken heart', '2006-03-18 16:01:28'
'2708', 'how to ruin someone's credit', '2006-03-19 13:19:47'
A class-action lawsuit, filed in September 2006, however, put an end to this exploration. Where possible, data sources were removed; the few sources left were killed by time (both on web servers and in file-sharing networks).

With the help of the Wayback Machine (https://archive.org/web/web.php), however, I was able to retrieve a copy of the data, put it into a MySQL database and started to explore it. I was immediately captivated by the strange poetry of the search queries themselves and the users portrayed by them. This is why, for my data correlation project, I will correlate the AOL search log data with the Seattle Public Library data from the same time period, March 1 - May 31, 2006. As a working concept, I will look at "advice" given by the Internet and the books in the library by simply querying both data sources for the word "how" - hence the working title of my project.
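
A minimal sketch of that twin query, assuming the AOL log was imported into a table with the columns from the readme above; the SPL table and column names below are placeholders, not the real schema:

Code: Select all

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HowQueries {
	public static void main(String[] args) throws Exception {
		// Hypothetical local MySQL database holding both the imported AOL log and the SPL data.
		Connection con = DriverManager.getConnection(
				"jdbc:mysql://localhost/aolspl", "user", "password");

		// "Advice" asked of the Internet: AOL queries starting with "how", March 1 - May 31, 2006.
		PreparedStatement aol = con.prepareStatement(
				"SELECT AnonID, Query, QueryTime FROM aol_queries "
				+ "WHERE Query LIKE 'how %' "
				+ "AND QueryTime BETWEEN '2006-03-01' AND '2006-05-31' "
				+ "ORDER BY AnonID, QueryTime");
		ResultSet rs = aol.executeQuery();
		while (rs.next()) {
			System.out.println(rs.getInt(1) + " | " + rs.getString(2) + " | " + rs.getString(3));
		}

		// "Advice" offered by the library: SPL checkouts of titles starting with "how"
		// in the same period (table and column names are placeholders).
		PreparedStatement spl = con.prepareStatement(
				"SELECT title, checkOut FROM spl_checkouts "
				+ "WHERE title LIKE 'how %' "
				+ "AND checkOut BETWEEN '2006-03-01' AND '2006-05-31'");
		ResultSet rs2 = spl.executeQuery();
		while (rs2.next()) {
			System.out.println(rs2.getString(1) + " | " + rs2.getString(2));
		}

		con.close();
	}
}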

Visualization

For the visualization, I am currently working in the style of the last assignment (the format is DIN A4). Right now, all I have is a very rough draft that is missing a lot of features but at least has all the data. I spent way too much time implementing "3D picking" to reduce the amount of information on the screen. Anyway, here is a first look at the sketch (including a wonderful poem by user 670808 and the corresponding book titles from the Seattle Public Library):

Last edited by fabian.offert on Thu Mar 05, 2015 3:04 am, edited 1 time in total.

james_schaffer
Posts: 5
Joined: Sat Jan 10, 2015 11:34 am

What does reddit think of "Fifty Shades of Grey"?

Post by james_schaffer » Mon Mar 02, 2015 5:42 pm

final_rev_ss1.png
UPDATE!

For the update, the main change was the introduction of the most frequently used keywords as a starry background. Stopwords were removed and keywords are drawn from a distribution based on the contents of the posts.
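
A minimal sketch of that kind of frequency-weighted keyword sampling (the stopword list is abbreviated and the names are illustrative, not the code actually used in the sketch):

Code: Select all

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class KeywordSampler {

	static final Set<String> STOPWORDS = new HashSet<String>(
			Arrays.asList("the", "a", "and", "of", "to", "in", "is", "it", "that", "was"));

	// Counts non-stopword tokens across all post bodies.
	static Map<String, Integer> countWords(List<String> posts) {
		Map<String, Integer> counts = new HashMap<String, Integer>();
		for (String post : posts) {
			for (String w : post.toLowerCase().split("[^a-z]+")) {
				if (w.length() == 0 || STOPWORDS.contains(w)) continue;
				Integer c = counts.get(w);
				counts.put(w, c == null ? 1 : c + 1);
			}
		}
		return counts;
	}

	// Draws one keyword with probability proportional to its frequency.
	static String sample(Map<String, Integer> counts, Random rng) {
		int total = 0;
		for (int c : counts.values()) total += c;
		int r = rng.nextInt(total);
		for (Map.Entry<String, Integer> e : counts.entrySet()) {
			r -= e.getValue();
			if (r < 0) return e.getKey();
		}
		return null; // unreachable when counts is non-empty
	}

	public static void main(String[] args) {
		List<String> posts = Arrays.asList(
				"I loved the book and the movie",
				"The movie was worse than the book");
		Map<String, Integer> counts = countWords(posts);
		Random rng = new Random();
		for (int i = 0; i < 5; i++) {
			System.out.println(sample(counts, rng)); // e.g. "movie", "book", "loved", ...
		}
	}
}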

I also added features to sort all posts/subreddits by sentiment for a more informative visualization.
final_rev_ss2.png
Finally, the visuals and motions of the reddestial bodies were exhaustively tweaked to create smooth and hypnotic visuals. I was wondering why Processing ran so slowly in 2D; it turns out that OpenGL is not enabled by default, and enabling it results in more than a 5x speedup. Unfortunately, OpenGL is not so great at rendering 2D text, so the text coordinates had to be rounded with BASELINE centering. I also included a toggle for additive image blending (which was not feasible before the hardware acceleration), but I'm not 100% sure I like the way the visualization looks with it enabled. Maybe I can get some input?
final_rev_ss3.png
By the time you read this, the new version should be available at the link that was originally provided.
(END UPDATE)
reddit1.png
In keeping with the books theme of this class, I decided to see what reddit was saying about Fifty Shades of Grey. Due to the recent movie release, there have been a handful of highly entertaining news stories and incidents, so I thought it might be interesting to visualize the sentiment of posts from different subreddits.

For those that don't know, reddit is a large collection of forums, each with its own topic, that constitute an anonymous social network. Users that visit the site are greeted on the front page with a list of 'currently hot' posts/topics. Reddit's prime feature is the ability to upvote or downvote other users' posts (this is like adding a 'dislike' button on Facebook), so content, credibility, and moderation are effectively crowdsourced. Unfortunately, this causes a lot of cat pics to rise to the front page.

Some references for those that are interested:
Reddit API: http://www.reddit.com/dev/api#GET_controversial
Some info on Reddit rankings: http://amix.dk/blog/post/19588

First, I accessed Reddit's API and collected every post that mentioned the book "Fifty Shades of Grey." The following Java code accomplished this.

Code: Select all

package redditjson;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.Charset;

import org.json.JSONException;
import org.json.JSONObject;

public class RedditPager {

	public static void main( String args[] ) {
		int iteration = 0;
		String after = "";
		
		try {
			PrintWriter writer = new PrintWriter( "reddit.json" );
			String redditSearch = "harry+potter"; // search keyword for Reddit's search endpoint (swap in the Fifty Shades query for this project's data set)
			
			while ( iteration < 100 ) {
				JSONObject nextRedditPage = null;
				try {
					String query = "http://www.reddit.com/search.json?q=" + redditSearch + "&limit=100&sort=relevance&t=all" + after;
					System.out.println( "Trying: " + query );
					nextRedditPage = readJsonFromUrl( query );
					Thread.sleep(5000);
					
					String afterString = nextRedditPage.getJSONObject( "data" ).getString( "after" );
					after = "&after=" + afterString;
					writer.println( nextRedditPage.toString() );
					writer.flush();
					iteration += 1;

					
				} catch (InterruptedException e) {
//					e.printStackTrace();
				} catch (IOException e) {
//					e.printStackTrace();
				} catch (JSONException e) {
					if ( nextRedditPage != null )
						writer.println( nextRedditPage.toString() );
					writer.flush();
					System.out.println( "Done." );
//					e.printStackTrace();
					break;
					
				}
			}
			
		} catch (FileNotFoundException e1) {
			// TODO Auto-generated catch block
			e1.printStackTrace();
		}
	}
	
	// Fetches the given URL and parses the response body as a JSON object.
	public static JSONObject readJsonFromUrl(String url) throws IOException, JSONException {
		InputStream is = new URL(url).openStream();
		try {
			BufferedReader rd = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
			String jsonText = readAll(rd);
			return new JSONObject(jsonText);
		} finally {
			is.close();
		}
	}

	// Reads the entire character stream into a single string.
	private static String readAll(Reader rd) throws IOException {
		StringBuilder sb = new StringBuilder();
		int cp;
		while ((cp = rd.read()) != -1) {
			sb.append((char) cp);
		}
		return sb.toString();
	}
	
}
Reddit's results have to be 'paged through', which makes the JSON access slightly nontrivial (Reddit will only return 100 results at a time, and requires you to pass the 'after' parameter from the previous page to get the next page).

Next, I wanted to use a machine-learning classifier to determine the sentiment of each post. Unfortunately, offline sentiment classifiers cost some money, so I had to use a rate-limited API, found at http://text-processing.com/docs/sentiment.html. The analyzer returns a classification for each body of text, along with three parameters that correspond to the 'quantity' of negative, positive, and neutral text. The following Java code automated the sentiment analysis process:

Code: Select all

package redditjson;

import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.Charset;

import org.json.*;

public class RedditJSONParser {
	
	public static void main( String args[] ) {

		try {
			BufferedReader f = new BufferedReader( new FileReader( "fifty.json" ) );
			String line = null;
			while ( (line = f.readLine()) != null ) {
				JSONObject listing = new JSONObject( line );
				JSONObject listingData = listing.getJSONObject( "data" );
				JSONArray topics = listingData.getJSONArray( "children" );
				for ( int j = 0; j < topics.length(); j++ ) {
					JSONObject nextTopic = topics.getJSONObject( j );
					JSONObject nextTopicData = nextTopic.getJSONObject( "data" );
					String nextID = nextTopicData.getString( "id" );
					int nextScore = nextTopicData.getInt( "score" );
					int nextUps = nextTopicData.getInt( "ups" );
					int nextDowns = nextTopicData.getInt( "downs" );
					String nextSubreddit = nextTopicData.getString( "subreddit" );
					String nextTitle = nextTopicData.getString( "title" );
					String nextSelfText = nextTopicData.getString( "selftext" );
					int numComments = nextTopicData.getInt( "num_comments" );
					
					if ( !nextSelfText.equals( "" ) ) {
						
						try {
							JSONObject sentiment = new JSONObject( getSentimentString( nextSelfText ) );
							String sentimentLabel = sentiment.getString( "label" );
							JSONObject p = sentiment.getJSONObject( "probability" );
							double neg = p.getDouble( "neg" );
							double neutral = p.getDouble( "neutral" );
							double pos = p.getDouble( "pos" );
							System.out.println( 
									nextID + "|||" + 
									nextSubreddit + "|||" + 
									nextSelfText.replaceAll( "\n", "   " ) + "|||" +
									nextTitle + "|||" + 
									nextScore + "|||" + 
									nextUps + "|||" + 
									nextDowns + "|||" + 
									numComments + "|||" + sentimentLabel + "|||" + neg + "|||" +	neutral + "|||" + pos);
						}
						catch ( IOException e ) {
							continue;
						}
						
					
						Thread.sleep( 3000 );
						
					}
					
					
				}
				
			}
			f.close();
			
		} catch (Exception e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
	
	}
	
	public static String getSentimentString(String data) throws IOException, JSONException {
		String urlParameters  = "text=" + URLEncoder.encode( data, "UTF-8" ); // form-encode the post text so spaces and '&' don't break the request
		byte[] postData       = urlParameters.getBytes( Charset.forName( "UTF-8" ));
		int    postDataLength = postData.length;
		String request	= "http://text-processing.com/api/sentiment/";
		URL    url = new URL( request );
		HttpURLConnection cox = (HttpURLConnection) url.openConnection();           
		cox.setDoOutput( true );
		cox.setDoInput ( true );
		cox.setInstanceFollowRedirects( false );
		cox.setRequestMethod( "POST" );
		cox.setRequestProperty( "Content-Type", "application/x-www-form-urlencoded"); 
		cox.setRequestProperty( "charset", "utf-8");
		cox.setRequestProperty( "Content-Length", Integer.toString( postDataLength ));
		cox.setUseCaches( false );
		try( DataOutputStream wr = new DataOutputStream( cox.getOutputStream())) {
		   wr.write( postData );
		}
 
		BufferedReader in = new BufferedReader(new InputStreamReader(cox.getInputStream()));
		String inputLine;
		StringBuffer response = new StringBuffer();
 
		while ((inputLine = in.readLine()) != null) {
			response.append(inputLine);
		}
		in.close();
		
		//print result
		return response.toString();
	}
	
}
Next, I created a 'reddit galaxy' visualization, where each solar system is a subreddit and each planet is a post. To make the visualization useful, interaction was required, especially drilldown. Each subreddit and post is color-coded to indicate sentiment, with positive sentiment appearing green, negative sentiment appearing red, and mixed sentiment appearing yellow. Clicking a subreddit will zoom in to show its posts, and clicking a post will show the original text. Press any key to zoom out one level.
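
A minimal sketch of one way such a sentiment-to-color mapping could work, blending red and green from the classifier's neg/pos probabilities (not necessarily the exact scheme used in the visualization):

Code: Select all

public class SentimentColor {

	// Maps negative/positive probabilities to an RGB triple:
	// mostly negative -> red, mostly positive -> green, evenly mixed -> yellow.
	static int[] sentimentColor(double neg, double pos) {
		double total = neg + pos;
		double t = (total > 0) ? pos / total : 0.5; // 0 = all negative, 1 = all positive
		int r = (int) Math.round(255 * Math.min(1.0, 2.0 * (1.0 - t)));
		int g = (int) Math.round(255 * Math.min(1.0, 2.0 * t));
		return new int[] { r, g, 0 };
	}

	public static void main(String[] args) {
		// Example probabilities like those returned by the text-processing.com classifier.
		int[] negative = sentimentColor(0.8, 0.1);
		int[] positive = sentimentColor(0.1, 0.8);
		int[] mixed    = sentimentColor(0.4, 0.4);
		System.out.println(negative[0] + "," + negative[1] + "," + negative[2]); // 255,57,0 (red)
		System.out.println(positive[0] + "," + positive[1] + "," + positive[2]); // 57,255,0 (green)
		System.out.println(mixed[0] + "," + mixed[1] + "," + mixed[2]);          // 255,255,0 (yellow)
	}
}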
reddit2.png
reddit3.png
Everything is included in the following link http://128.111.28.122/controversy_v5.zip
Last edited by james_schaffer on Mon Mar 16, 2015 11:44 pm, edited 3 times in total.

menzer
Posts: 5
Joined: Sat Jan 10, 2015 11:30 am

Visualization of Data Correlation, Yelp & Foursquare example

Post by menzer » Mon Mar 02, 2015 11:55 pm

How to best visualize Data Correlation in human-readable form?
A case study on Yelp and Foursquare user activity data.


Idea:
- explore visual techniques for showing correlation
- in a statistics context, scatterplots are used most frequently
- in a MAT259 context, vertically or horizontally aligned bar charts seem to be the method of choice
- which other methods can be used? For example, comparing distributions (violin plots), scatterplot matrices, etc.
- can they be used to show the same types of correlation? How do they compare in general?
- what about circular regression?
- are there any types of correlation that cannot be well represented with current means of visualization? For example, spurious correlations? And is the scatterplot always useful, let alone intuitive?

Data set:
Foursquare is a local search and discovery mobile app, similar to Yelp. The main difference is that check-ins are constrained by the current GPS location of the mobile device, and it is more of a social network.

1.) Interesting question: how does Foursquare check-in activity at the Seattle Public Library location compare to library activity as represented in the SPL data set? --> After connecting to the API and pulling JSON data from it, it turns out that there were only 730 check-ins by about 220 users in the location's history. Unfortunately, the temporal distribution of these check-ins is only visible to the venue manager. Resolution: move on to the next interesting question.
2.) Interesting question: how does Foursquare activity in a certain area compare to Yelp activity? The query will be specialized to microbrew pubs in the San Francisco area. The selection of microbrew pubs was based on the returns when searching for "brew" and "brewery" on the Foursquare API, and a subsequent manual filtering of coffee shops and other stores that were associated with the search word "brew". Subsequently, the same 18 brewpubs were queried from the Yelp API.
--------------------------------------------------------------------
3.) Since the initial query on brewpubs was quite limited and only yielded a small data set that had to be matched manually, I automated the matching process and iterated over 11-16 different locations and 11 different venue categories in San Francisco. Matching between the two social networks was done by writing a parser in Matlab (a sketch of the matching idea appears below). In total, this yielded a data set with 1028 entries of combined Yelp and Foursquare data. One can now compare user activity across different categories and locations using the violin plots (Figs. 3 and 4), which convey information much more clearly than a three-dimensional scatterplot for this data set (Fig. 2). The color-coded violin plots reveal differences between categories, and at the same time it is easy to draw comparisons between the two different data sets from the Foursquare and Yelp APIs.
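
The actual matching was done with the Matlab parser mentioned above; purely as an illustration of the idea, a rough Java sketch that pairs venues from the two APIs by normalized name and geographic proximity (the Venue fields are hypothetical) could look like this:

Code: Select all

import java.util.ArrayList;
import java.util.List;

public class VenueMatcher {

	// A simplified venue record as it might be extracted from either API.
	static class Venue {
		String name;
		double lat, lon;
		Venue(String name, double lat, double lon) {
			this.name = name; this.lat = lat; this.lon = lon;
		}
	}

	// Lowercase and strip punctuation/common suffixes so "Joe's Brewpub Inc." ~ "joes brewpub".
	static String normalize(String name) {
		return name.toLowerCase().replaceAll("[^a-z0-9 ]", "").replaceAll("\\b(inc|llc|the)\\b", "").trim();
	}

	// Rough distance in km between two lat/lon pairs (equirectangular approximation).
	static double distanceKm(Venue a, Venue b) {
		double kmPerDegLat = 111.0;
		double kmPerDegLon = 111.0 * Math.cos(Math.toRadians((a.lat + b.lat) / 2));
		double dLat = (a.lat - b.lat) * kmPerDegLat;
		double dLon = (a.lon - b.lon) * kmPerDegLon;
		return Math.sqrt(dLat * dLat + dLon * dLon);
	}

	public static void main(String[] args) {
		List<Venue> foursquare = new ArrayList<Venue>();
		foursquare.add(new Venue("21st Amendment Brewery", 37.7825, -122.3925));
		List<Venue> yelp = new ArrayList<Venue>();
		yelp.add(new Venue("21st Amendment Brewery & Restaurant", 37.7824, -122.3927));

		// A match = similar name prefix and venues within roughly 200 m of each other.
		for (Venue f : foursquare) {
			for (Venue y : yelp) {
				boolean nameMatch = normalize(y.name).startsWith(normalize(f.name))
						|| normalize(f.name).startsWith(normalize(y.name));
				if (nameMatch && distanceKm(f, y) < 0.2) {
					System.out.println("Matched: " + f.name + "  <->  " + y.name);
				}
			}
		}
	}
}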

Foursquare: number of check-ins, users, and tips for locations around a geographic latitude/longitude coordinate pair.
Yelp: number of user reviews ("review_count") and star rating as a proxy for user activity, because Yelp is rarely used for check-ins and more for reviewing places.
screenie_yelp_foursquare_correlation.png
Fig.1: Initial plot on the microbrew comparison (small data set of 18 breweries), using a simple, yet utterly complicated statistical tool called a scatterplot matrix, applied to three user activity variables from the Foursquare local search app and the Yelp app. The user activity variables were expected to be correlated within each of the two services, for example between Foursquare check-in counts and the number of tips. The more interesting emerging pattern is the high correlation between activity on Foursquare and activity on Yelp. Furthermore, the star rating is negatively correlated with the count variables, meaning that a location is likely to have a lower star rating the more reviews or check-ins it receives.
Fsq_Yelp_Location.png
Fig.2: Scatterplot of the entire data set labeled by location. Problems of scale and practical color highlighting become apparent. Also, comparing point clouds across dimensions in one and the same panel seems to be cumbersome.
fsq_yelp_violins_by_loc_2.png
Fig.3: Violin plots for 11 different locations. Interesting observation: Chinatown in San Francisco and Berkeley Downtown seem to have the highest user activity and the correspondence between data from the two APIs is very high.
fsq_yelp_violins_by_cat_2.png
Fig.4: Violin plots for 11 different categories. Interesting observation: activity in social places such as restaurants and pubs is much higher than in places where people run errands. The most abundant restaurant category in the Bay Area is "Italian", and user activity follows that trend, too.
Tools: I used Matlab to query the Foursquare API, Python to query the Yelp API, R for the scatterplot matrix (package ggplot2), and Matlab for the violin plots (package distributionPlot).
Attachments
OM_hw5YelpFoursquare.zip
Code: Matlab, python and R code that was written for this assignment.
(8.66 KiB) Downloaded 946 times
Final_joint_Fsq_Yelp_cat11_loc11.csv
Final Data Table containing 1028 entries on foursquare AND yelp user activity for venues in San Francisco spanning 11 different categories and 16 locations.
(104.53 KiB) Downloaded 964 times
Brewpubs_Yelp_Foursquare_comparison_w_header.csv
Limited/Obsolete data table containing user activity data from both yelp and foursquare, applied to the local search domain of 18 local brewpubs in the San Francisco area. (used for initial analysis on Fig.1)
(995 Bytes) Downloaded 948 times
Last edited by menzer on Tue Mar 17, 2015 9:40 pm, edited 6 times in total.

intae
Posts: 19
Joined: Tue Oct 14, 2014 10:56 am

Re: Proj 5: Data Correlation

Post by intae » Tue Mar 03, 2015 1:34 am

Forgotten Wars

Throughout human history, conflicts between two or more countries have continued with hardly any time of peace. I searched for three major wars of the last century, "Pacific War", "Korean War", and "Vietnam War", and got results for each year from 1930 on.

Interestingly, the graph of "hits" for each war showed a different pattern; this revealed how each war was anticipated before it broke out, and how it kept affecting coverage for quite a long time after it ended.

I linked this idea to the Seattle library database and did a brief survey of which books have the wars in their titles. Then I broke each title into single words and found the words most frequently used in the titles.

I'd like to show the titles and how interest changes over time: for example, based on when each book was published, arrange the words by year and see how the war is described along the timeline.
Screen Shot 2015-03-03 at 1.32.49 AM.png
//Update
Screen Shot 2015-03-08 at 11.07.32 PM.png
Visualizing three wars (Pacific War, Korean War, Vietnam War) from the NYT API was interesting, but I ran into communication problems with the NYT API at the beginning of the last class; it seems this happened because I used prohibited keywords such as "Iraqi War" or "Gulf War". After that, I decided to narrow my topic down to the Pacific War only, and connected the API data onto the map.

I read Pacific War history from different resources such as books, DVDs, and website posts. The three-year war showed me how fast the US armed forces conquered many islands, while some were bypassed for tactical reasons.

Compared to the European theater, the Pacific War is not well known to the public, but a famous movie series introduced these forgotten stories, and they were able to get attention from people. I used the keyword "Pacific War" from 1942 to 1945, month by month, then used the counts for the diameters of ellipses whose centers are located at the major campaigns of each month.

After this, I'm going to use the library check-out data; it is not clear right now exactly how, but I'm considering matching the titles of the books to each campaign.

For the basic map interface in Processing, I adopted "Unfolding Maps" (http://unfoldingmaps.org/), which was developed for visualizing geographic data. I used several tutorials from the website.
1024px-US_landings.jpg
"US landings" by General MacArthur's General Staff - MacArthur, Douglas (1994) [1950] Reports of General MacArthur, Vol. 1, Center of Military History, pp. p. 432 Retrieved on 24 February 2009.. Licensed under Public Domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File: ... ndings.jpg


In the past, paper maps were used for war data; I can see how people strove to show many details in a small map, but as you can see in the picture, it is really hard to trace each campaign. Thus, I converted this data into Processing.

//Second updates
1. Added the strength of US troops and Japanese troops, and their casualties
2. Located the headquarters of the Imperial Japanese Area Armies by year (from 1942 to 1945)
Screen Shot 2015-03-17 at 1.24.43 AM.png
Based on Wikipedia, I collected numbers for troop strength and casualties, then drew graphs. As you can see, the outer ellipse stands for the entire army strength engaged in that war, while the small ellipse inside shows casualties, including killed, missing, and POWs.

http://en.wikipedia.org/wiki/Armies_of_ ... anese_Army
During the Pacific War, the Imperial Japanese Army had 22 Area Armies, each commanded by a lieutenant general; this is comparable to one of the six geographical commands in the US Army today. I placed each Area Army at its headquarters; this shows how Imperial Japan expanded its territory during the war.
Interestingly, when it became clear that Japan was losing the war, the Imperial Army organized new Area Armies to defend against a US attack on mainland Japan.

A fault and drawback of the project: I couldn't apply the exact numbers for the Area Armies to the ellipse graph, because the organization is so complicated.

///Further Research
Unfolding Maps offers a nice map interface, but I faced several problems drawing lines and shapes on the map, because to draw a line on the map each point has to be specified by its latitude and longitude, not by x,y points in the background (see the sketch below).
Since the organizations are so complicated, it is almost impossible to get the composition data of the Area Armies; if I get more detailed data, I could map the strength of the armies on the map. This would clarify how the war changed over the years.
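
A minimal Processing-style sketch of that conversion, assuming the standard Unfolding Maps classes, where map.getScreenPosition() turns a Location into pixel coordinates that line() and ellipse() can use (the campaign locations are just examples):

Code: Select all

// Processing sketch (.pde) using the Unfolding Maps library.
import de.fhpotsdam.unfolding.UnfoldingMap;
import de.fhpotsdam.unfolding.geo.Location;
import de.fhpotsdam.unfolding.utils.MapUtils;
import de.fhpotsdam.unfolding.utils.ScreenPosition;

UnfoldingMap map;
Location guadalcanal = new Location(-9.45f, 160.05f);  // approximate campaign locations
Location tarawa      = new Location(1.42f, 172.98f);

void setup() {
  size(800, 600, P2D);
  map = new UnfoldingMap(this);
  map.zoomAndPanTo(4, new Location(0, 160));
  MapUtils.createDefaultEventDispatcher(this, map);
}

void draw() {
  map.draw();
  // Convert lat/lon to screen coordinates every frame, so drawings follow pans and zooms.
  ScreenPosition a = map.getScreenPosition(guadalcanal);
  ScreenPosition b = map.getScreenPosition(tarawa);
  stroke(200, 0, 0);
  line(a.x, a.y, b.x, b.y);
  ellipse(a.x, a.y, 20, 20);  // diameter could be scaled by the monthly NYT hit count
}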
Attachments
datahomework1_5_5.zip
(782.89 KiB) Downloaded 934 times
Last edited by intae on Tue Mar 17, 2015 1:07 am, edited 6 times in total.

nataly_moreno
Posts: 5
Joined: Sat Jan 10, 2015 11:31 am

Re: Proj 5: Data Correlation

Post by nataly_moreno » Tue Mar 03, 2015 11:35 am

Bird Migration Patterns Visualization
Bird Visualization Project
Zipped Project: BirdVisualization.zip

I am using bird data given to me by Stephen Pope, a former MAT Faculty member. The data consists of frequency vectors of where birds have been spotted around the United States. Because the data is not publicly available to my knowledge, I am going to refrain from posting the data and I will keep the data private.

I am currently working on a 2D version that shows the migration patterns of birds over time on a map. Due to the bulk of the data, I will choose to show 6 different species, one at a time.

Unfolding and ControlP5 were used; links to download the libraries are below:
Unfolding: http://unfoldingmaps.org
ControlP5: http://www.sojamo.de/libraries/controlP5/#installation

The Unfolding library is used to create the map. I worked off of the MarkerSelectionApp example that came with the library. Since the data I have is only for the United States and is per-state data, I had to constrain the map to the US area. I was able to get the rollover interaction with the shape of each state using data that I obtained from a geojson file that I found at:

http://eric.clst.org/wupl/Stuff/gz_2010 ... _500k.json

However, the center coordinates of the states were sometimes incorrect or missing, and the coordinates were also sometimes incorrect on Google, requiring further tweaking of the coordinates. I hard-coded every center coordinate into a HashMap using a mix and approximation of the coordinates I got from the geojson and Google searches.

Initial Results with Original Data:
IncorrectCenters.png
State centers are incorrect and Alaska is currently being rolled over with the mouse.
Results After Tweaking and Adding a HashMap
CorrectCenters.png
State centers are correct.
Animating the Data Over Time
Zipped Project: BirdVisualization 2.zip

Animating the data required making an entire new data structure to store the data. I used a HashMap and custom class in order to store the data in a way that made the animation straightforward. The data structure keeps track of which bird, what time of year, and how many were spotted at a particular state.
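
A minimal sketch of what such a lookup structure could look like; the class and field names here are hypothetical, not the exact ones used in the project:

Code: Select all

import java.util.HashMap;

public class SightingIndex {

	// species -> week of year -> state -> number of birds spotted
	HashMap<String, HashMap<Integer, HashMap<String, Integer>>> counts =
			new HashMap<String, HashMap<Integer, HashMap<String, Integer>>>();

	void add(String species, int week, String state, int n) {
		HashMap<Integer, HashMap<String, Integer>> weeks = counts.get(species);
		if (weeks == null) { weeks = new HashMap<Integer, HashMap<String, Integer>>(); counts.put(species, weeks); }
		HashMap<String, Integer> states = weeks.get(week);
		if (states == null) { states = new HashMap<String, Integer>(); weeks.put(week, states); }
		Integer old = states.get(state);
		states.put(state, old == null ? n : old + n);
	}

	// How many of the given species were seen in a state during a week (0 if none recorded).
	int get(String species, int week, String state) {
		HashMap<Integer, HashMap<String, Integer>> weeks = counts.get(species);
		if (weeks == null) return 0;
		HashMap<String, Integer> states = weeks.get(week);
		if (states == null) return 0;
		Integer n = states.get(state);
		return n == null ? 0 : n;
	}

	public static void main(String[] args) {
		SightingIndex index = new SightingIndex();
		index.add("American Robin", 12, "Washington", 34);
		// The draw loop can then size a circle for the current bird, week, and state:
		int n = index.get("American Robin", 12, "Washington");
		float radius = (float) Math.sqrt(n) * 4; // sqrt keeps circle area roughly proportional to count
		System.out.println(n + " sightings -> radius " + radius);
	}
}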

The code displays the name of the bird migration that is currently being shown as well as the time of year. Circles appear and fade out for every week of the year, showing where and how many birds were spotted in each state that year. The frequency with which the birds were seen decides the radius of the circles that appear on the states.

I added ControlP5 in order to allow the user to choose one bird species' data at a time, because too many birds at once would overwhelm the screen.

I found a code sample on how to use the Google Images API and modified it to fit my needs. The code can be found here:

http://www.openprocessing.org/sketch/132752

The window now displays a photo of the bird whose migration pattern is being displayed. This will give the user a sense of what he/she is looking at and makes the animation more useful.

Animation + Description + ControlP5 + Google Images API
AnimationAndPhoto.png
The title describes the animation being displayed and a photo of the bird is displayed. Alaska is red because it is currently being rolled over with the mouse, and the red circles on the map represent the frequency vectors of the current bird in those states.
Before the Updates
Images of the project before the next update.
lost1.png
Bigger circles, different color scheme, animated timeline
lost2.png
Bigger circles, different color scheme, animated timeline

Updated Color Scheme and Suggested Modifications
Zipped Project: Bird Visualization 3.zip

I made changes based on feedback. This new iteration uses the Color Picker in addition to all the previous libraries and information from the links above.

The link to the ColorPicker is here:
ColorPicker: http://tristen.ca/hcl-picker/#/hlc/100/1/958034/77AED5

Modifications are as follows:
1. Zoom into the USA excluding Alaska and Hawaii
2. Change the layout due to the map changes
3. Add Perlin Noise to the circles that appear on the states
4. Make the circles bigger and grow as the week progresses so as to take up the entire state
5. Add 2 Color Schemes with 3+ colors, press space bar to change color
6. Add a color bar to go with the new color scheme(s)
7. Remove the animated timeline

Here are some images of the changes:
SuggChanges1.png
SuggChanges2.png
SuggChanges3.png
Bird Migration Data Chart
Bird Data Chart Project
Zipped Project: BirdDataChart.zip

The data chart shows all the data that was in the animated version in one screen. The data is ordered alphabetically by state along the y-axis, by week along the x-axis, and for every week the birds appear in alphabetical order per weekly cell.

The right hand side shows the photo of each bird in alphabetical order; their photo is being obtained using Google Images API, and they are also sorted in the order in which they appear on the chart.

There are no extra libraries used in this project. However Google Images API and the Color Picker were used.
ColorPicker: http://tristen.ca/hcl-picker/#/hlc/100/1/958034/77AED5

Here are three versions of the same data, one has the gray outline border for each cell, one has it in black, and the last one does not have a stroke at all. Black boxes represent weeks in which there was no data for that bird species.
GrayBordered.png
Bird data chart with a gray border.
BlackBordered.png
Bird data chart with a black border.
noStroke.png
Bird data chart with no stroke.
Updated with Suggested Modifications
Zipped Project: BirdBarGraph 2.zip

The project is the same as above, except that now pressing the space bar will sort by bird species along the x-axis; that is, the 48 weeks of data appear for one bird together, then the next, and so on. Screenshots showing the changes are below.
BirdSort_GrayStroke.png
Sort by species, gray stroke.
BirdSort_BlackStroke.png
Sort by species, black stroke.
BirdSort_NoStroke.png
Sort by species, no stroke.
Attachments
BirdBarGraph 2.zip
Bird data chart with two sorts, change by pressing the space bar.
(753.19 KiB) Downloaded 843 times
BirdVisualization 3.zip
Suggested changes have been made. birdData.csv is excluded.
(749.18 KiB) Downloaded 850 times
BirdDataChart.zip
Bird data chart code. birdData.csv is excluded.
(752.18 KiB) Downloaded 860 times
BirdVisualization 2.zip
Updated Version: Code contains everything up until the image where the photo of a bird first appears. birdData.csv is excluded.
(745.01 KiB) Downloaded 854 times
BirdVisualization.zip
Code currently displays the correct state centers. birdData.csv is excluded.
(740.24 KiB) Downloaded 854 times
Last edited by nataly_moreno on Sun Mar 22, 2015 10:05 pm, edited 26 times in total.

nedda.amini
Posts: 5
Joined: Tue Jan 14, 2014 11:55 am

Re: Proj 5: Data Correlation

Post by nedda.amini » Thu Mar 05, 2015 4:15 am

In November of 2010, WikiLeaks, in coordination with the Guardian, the NYT, and other news affiliates, leaked a series of cable communications between various US embassies and the US headquarters. These logs, now 5 years old, are still available on the internet, though in most cases the messages themselves have been redacted. Using a redacted database, I wanted to look at the number of messages sent from various embassies, and to see which embassy has the most traffic. I also wanted to highlight the various tags that were given to these data points, and see if there was a correlation between the locations and the messages being sent.
The data visualization I have implemented allows the top 11 most used tags to be highlighted within the visualization. The width of each line going towards the center corresponds to the number of cables that were sent from that embassy (see the sketch below). I wanted to make my image slightly convoluted, which is why I didn't create a particular node for the US headquarters, but instead let all the lines converge together, like a knotted web.
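
A minimal Processing-style sketch of that width mapping; the embassy positions and cable counts below are made up for illustration:

Code: Select all

// Processing sketch: line width toward the center encodes the number of cables per embassy.
int maxCables = 8000;

void setup() {
  size(800, 800);
  noLoop();
}

void draw() {
  background(255);
  float cx = width / 2.0;
  float cy = height / 2.0;
  // Hypothetical embassies: x, y position on the layout and cable count.
  float[][] embassies = {
    {120, 150, 6500},
    {650, 220, 2400},
    {400, 700,  900}
  };
  for (float[] e : embassies) {
    // map() scales the cable count into a 1-12 px stroke weight.
    strokeWeight(map(e[2], 0, maxCables, 1, 12));
    stroke(30, 30, 30, 160);
    line(e[0], e[1], cx, cy);
  }
}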
The database is available here:
http://www.theguardian.com/news/datablo ... -data#data

The previous implementation is still attached. I was unable to take screenshots of my current visualization because of the resolution of my laptop.
Attachments
Nedda Amini Vis 5.zip
Processing Code plus csv
(74.81 KiB) Downloaded 842 times
Screen Shot 2015-03-05 at 4.13.23 AM.png
Image of previous vis
Last edited by nedda.amini on Sun Mar 22, 2015 3:51 pm, edited 1 time in total.

jmd
Posts: 5
Joined: Sat Jan 10, 2015 11:26 am

Re: Proj 5: Data Correlation

Post by jmd » Thu Mar 05, 2015 8:51 am

[ ░ ░ ░ ░ ░ ░ no title yet ░ ░ ░ ░ ░ ░ ]

I’ve always seen this period of time at MAT as an opportunity to make a pause -at least in the beginning- and redirect my efforts in a more conscious way. By pause I don’t mean stopping work at all, but the forging of a necessary recognition of one’s own context. This is: knowing more than vaguely the history of the field, its key players, crucial works and then -perhaps, aided by intuition- being able to make a contribution within the field.

This project (using the NYTimes API) will search for mentions of key topics and people since the previous century, and will place all of them on a timeline for further analysis.

It is both a visualization and a study of the history of our - not that young by now - "new media/electronic art" field - - > a map of our current territory.

Hopefully (if time permits) I would like to explore this data understood as sound (rather than visualized), and then explore other ways of representing it without the image.
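
A minimal sketch of the kind of query such a timeline could be built from, using the NYTimes Article Search API (v2) with the org.json library; the search term, date range, and API key below are placeholders:

Code: Select all

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.Charset;

import org.json.JSONArray;
import org.json.JSONObject;

public class MentionTimeline {
	public static void main(String[] args) throws Exception {
		String term = "Nam June Paik";   // example key figure; one query per topic or person
		String key = "YOUR_NYT_API_KEY"; // placeholder
		String url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q="
				+ URLEncoder.encode(term, "UTF-8")
				+ "&begin_date=19600101&end_date=19991231&api-key=" + key;

		BufferedReader in = new BufferedReader(
				new InputStreamReader(new URL(url).openStream(), Charset.forName("UTF-8")));
		StringBuilder sb = new StringBuilder();
		String line;
		while ((line = in.readLine()) != null) sb.append(line);
		in.close();

		// Collect publication dates and headlines for the timeline.
		JSONArray docs = new JSONObject(sb.toString())
				.getJSONObject("response").getJSONArray("docs");
		for (int i = 0; i < docs.length(); i++) {
			JSONObject doc = docs.getJSONObject(i);
			System.out.println(doc.getString("pub_date") + " | "
					+ doc.getJSONObject("headline").getString("main"));
		}
	}
}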
IMG_8445.JPG
Attachments
NYTimesAPI_01.zip
(2.42 KiB) Downloaded 876 times

kurtoon
Posts: 5
Joined: Sat Jan 10, 2015 11:28 am

Re: Proj 5: Data Correlation

Post by kurtoon » Thu Mar 05, 2015 10:47 am

Natural Rejection
(title pending)

-work in progress-

For this project I will attempt to visualize a tree of extinct species. I am primarily interested in exploring interesting and uncommon visualization techniques. One thing I wanted to avoid was data reflecting "frequency" of something as I've used this for all my previous course projects. I had poked around the NYT API and was not satisfied with some of the results I was getting, so I began searching for an API that provided more scientific information. In my searching, I became attracted to Charles Darwin's simple "tree of life" diagram from his notebook:
598px-Darwins_first_tree.jpg
There are several resources online providing phylogenetic information, but I decided to go with "The Tree Of Life Web Project," a communal effort to provide an open-access tree of life. What attracted me to this data was the metadata associated with each species, namely a tag identifying the species as extinct. While the source is not queried through an API, the totality of the database is a 200 MB XML file with over 100,000 nodes arranged in a hierarchy, so I feel the complexity and potential for discovery is comparable. Documentation for how to interact with the database is provided here:
http://tolweb.org/tree/home.pages/downloadtree.html

So far, I have written a program that burrows into these XML files and generates a branching tree structure reflecting the structure of the data. At this point the visualization is far from finished and the angles of the branches are random.
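
A minimal sketch of how one might stream through a file that large with StAX instead of loading it all into memory; the element and attribute names (NODE, EXTINCT) follow my understanding of the Tree of Life export format and should be checked against the actual file:

Code: Select all

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class TolWalker {
	public static void main(String[] args) throws Exception {
		XMLInputFactory factory = XMLInputFactory.newInstance();
		XMLStreamReader xml = factory.createXMLStreamReader(
				new FileInputStream("treeoflife.xml")); // placeholder file name

		int depth = 0;        // current level in the hierarchy, useful for placing a branch
		int extinctCount = 0;
		while (xml.hasNext()) {
			int event = xml.next();
			if (event == XMLStreamConstants.START_ELEMENT && "NODE".equals(xml.getLocalName())) {
				depth++;
				String extinct = xml.getAttributeValue(null, "EXTINCT");
				if (extinct != null && !"0".equals(extinct)) {
					extinctCount++;
				}
			} else if (event == XMLStreamConstants.END_ELEMENT && "NODE".equals(xml.getLocalName())) {
				depth--;
			}
		}
		xml.close();
		System.out.println("Nodes flagged as extinct: " + extinctCount);
	}
}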
Screen Shot 2015-03-05 at 10.09.02 AM.png
Aesthetically, my goal is to implement the beautiful and useful "Depth dependent halos" described in this IEEE paper:
http://www.cs.rug.nl/~isenberg/personal ... 09_DDH.pdf
Screen Shot 2015-03-05 at 10.41.28 AM.png

Post Reply