Proj 5: Data Correlation / Final Project

dimberman
Posts: 6
Joined: Sat Jan 10, 2015 11:28 am

The Financials of Fear

Post by dimberman » Tue Mar 10, 2015 8:23 am

For my final project, I have decided to go in a slightly different direction from what was originally assigned. I wanted a visual that conveys simplicity and beauty and doesn't overwhelm the user. While there are certainly merits to 3D visuals for things like clustering, I wanted to create something similar to what I would actually show my boss at a job. For these reasons I've decided to create a 2D visual in the style of Edward Tufte.

minard_lg.gif
An example of a Tufte visualization.


For my project, The Financials of Fear, I want to see how terrorist attacks affect the gold and stock markets. Since gold is often considered a "doomsday commodity" (one that would maintain its value when governments are weak), I want to see if there is a correlation between people's concern about their safety and the price of gold.


To do this, I will use the NYT and Bloomberg APIs. I will search for terms like "terror" and "attack", and based on whether those terms show up in multiple newspaper headlines at the same time, I will attempt to guess whether there was a terrorist attack in the day before the papers were published. Plotted against this will be a chart of the price of gold that day, the idea being that days with a high number of mentions of terror and a high gold price will be the most noticeable.
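Once the daily mention counts and gold prices are lined up by date, the correlation itself is a standard Pearson computation. A minimal sketch (made-up class and parameter names, not the project code):

```java
public class FearCorrelation {
    // Pearson correlation between two equal-length daily series,
    // e.g. "terror" headline counts vs. gold closing prices.
    // Returns a value in [-1, 1]; near +1 means the series rise together.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx += (x[i] - mx) * (x[i] - mx);
            vy += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }
}
```

Days would first need to be aligned (papers and gold fixes don't share every date), which is the harder part in practice.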

boyan
Posts: 5
Joined: Tue Jan 14, 2014 11:48 am

Re: Proj 5: Data Correlation / Final Project

Post by boyan » Sun Mar 15, 2015 3:27 pm

For the final project, I decided to use the Human Development Index data. The United Nations Development Program provides some APIs (http://hdr.undp.org/en/data/api) to query the dataset. There are 187 countries and regions altogether, with HDI values across 9 years.

Originally, I wanted to map these countries and regions onto a world map using the Unfolding map library, but after exploring the library for a while, I found it not very customizable, so I decided to reuse the 3D visualization I did last time. I found the coordinates of the capital city of every country and geolocated them on the globe. Points represent the HDI values: the larger the point, the higher the HDI. The points are also categorized into 4 groups based on their HDI values.

The second part of this visualization is the HDI trend. Based on the HDI trend data for each country and region, I calculated the distance matrix using standardized Euclidean distance. I was originally going to visualize the similarity (dissimilarity) matrix, but a 187×187 matrix is too large to put on the canvas. So I used multidimensional scaling (MDS) (http://en.wikipedia.org/wiki/Multidimensional_scaling) to transform the data into lower dimensions and then visualized it. In this visualization, the x and y coordinates do not necessarily mean anything, but countries or regions that are similar in terms of their human development are close to each other in this 2D space.
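For reference, the standardized Euclidean distance divides each dimension by its standard deviation across all countries before taking the usual Euclidean norm. A minimal sketch (illustrative names, not the attached code):

```java
public class HdiDistance {
    // Standardized Euclidean distance between two HDI trend vectors a and b;
    // sd[k] is the standard deviation of year k computed across all countries.
    public static double distance(double[] a, double[] b, double[] sd) {
        double sum = 0;
        for (int k = 0; k < a.length; k++) {
            double d = (a[k] - b[k]) / sd[k];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

With all standard deviations equal to 1 this reduces to the ordinary Euclidean distance.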

I also visualized the HDI trend for each country and region.
Attachments
MAT259_PROJ5_BYAN.zip
(292.54 KiB) Downloaded 778 times
Screen Shot 2015-03-15 at 4.25.48 PM.png
Screen Shot 2015-03-15 at 4.24.56 PM.png
Screen Shot 2015-03-15 at 4.24.10 PM.png
Screen Shot 2015-03-15 at 2.40.22 PM.png
Last edited by boyan on Mon Mar 16, 2015 5:40 pm, edited 1 time in total.

donghaoren
Posts: 5
Joined: Sat Jan 10, 2015 11:33 am

Re: Proj 5: Data Correlation / Final Project

Post by donghaoren » Mon Mar 16, 2015 12:53 pm

Volumetric Visualization with Point Cloud

This project is based on the previous assignment. In the volume, the X-Y plane represents all books in the library, and Z represents time, monthly from 2006 to 2014. The color at (x, y, z) encodes the "checkout density" at book location (x, y) and time z. The mapping from books to (x, y) is done by a 2-layer RBM model.
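The "checkout density" itself is just a count per cell of the (x, y, time) grid. A minimal sketch of the binning step (hypothetical names; the RBM layout that produces the x, y coordinates is a separate step not shown here):

```java
public class CheckoutVolume {
    // Accumulate checkout events into an nx-by-ny-by-nz grid; each cell's
    // count is the "checkout density" later mapped to a color in the volume.
    // Each event is {x, y, z} with 0 <= x < nx, 0 <= y < ny, 0 <= z < nz.
    public static int[][][] bin(int[][] events, int nx, int ny, int nz) {
        int[][][] grid = new int[nx][ny][nz];
        for (int[] e : events) {
            grid[e[0]][e[1]][e[2]]++;
        }
        return grid;
    }
}
```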

The yellow dots represent best sellers from the NYTimes API. The location of each dot is given by the same algorithm as the books, so one can compare the NYTimes data with the books. In the visualization, there is obviously a cluster where most of the NYTimes best sellers are located, as well as most of the SPL checkouts. However, there are also many other clusters with no NYTimes books but a lot of activity from the SPL. The reason is twofold. First, there might be some differences between the two datasets; for example, some kinds of books are not likely to become best sellers. Second, although the layout algorithm uses the same input keywords for the NYTimes books and the SPL books, for the SPL books we extracted keywords from Title and Subject, while for NYTimes best sellers we used Title and Description, so there might be some differences from this side as well.
finalscreenshot.png
Screenshot
Attachments
Assignment_Final.zip
Processing Code
(1.52 MiB) Downloaded 761 times

chicobrando
Posts: 5
Joined: Sat Jan 10, 2015 11:25 am

Matching the Fifa World Cup

Post by chicobrando » Mon Mar 16, 2015 10:36 pm

Soccerfield1.jpg
The Fifa World Cup, organized in Brazil between June 12th and July 13th of 2014, was a source of protest and dissent. While the government promised that the event would give international projection to the country, bringing recognition and tourists, the opposition accused the administration of corruption and bad management, which led to the construction of overpriced stadiums. Protesters also argued that the money spent on the event would have been better invested in schools, hospitals and other, more pressing needs of the low-income population.

Query
The project intends to test the government's argument and see if the money spent on the World Cup was somehow worth it. Two data sets are used: (1) the number of times "Brazil" turned up in article searches through the New York Times API, indicating whether the country really got more international projection in the media during the World Cup; (2) the number of checkouts of titles with the word "Brazil" or "Portuguese", to indicate whether the growing attention in the press also stimulated potential tourists to look for travel guides and other books about the country and the language.

The query on the Seattle Public Library took 49.281 sec., using the following code:

Code:

SELECT 
    MONTH(checkOut),
    YEAR(checkOut),
    COUNT(IF(TITLE LIKE '%Brazil%',
        1,
        NULL)) AS 'Brazil',
    COUNT(IF(TITLE LIKE '%Portuguese%',
        1,
        NULL)) AS 'Portuguese'
FROM
    _rawXmlDataCheckOuts
WHERE
    DATE(checkOut) > '2013-09-01'
        AND DATE(checkOut) < '2014-07-31'
GROUP BY MONTH(checkOut) , YEAR(checkOut)
ORDER BY YEAR(checkOut) , MONTH(checkOut)
The two data sets (SPL and NYT API) present a correlation of 0.87, which is considered very significant in Political Science. The highest values occur in June and July 2014, the months when the World Cup was held. In May 2014, the Seattle Public Library checked out 208 books with titles containing "Brazil" and 26 with "Portuguese". These numbers jumped to 250 in June and 254 in July in the case of "Brazil", and to 52 in June and 49 in July for "Portuguese". The New York Times also published more articles about "Brazil" over time: 436 in March, 664 in April, 811 in May, 2068 in June, and 1127 in July.

Design
The data is displayed on a soccer field with the same proportions as the official dimensions set by Fifa. Instead of bars, each month is represented by an ellipse, as if the months were soccer players preparing for a match. The shape also resembles a coin in the game "Matching Pennies", in which a player tries to match a coin, facing "heads" or "tails", to the coin of the adversary.

As one soccer team has 11 players, the data set uses 11 months - from September 2013 to July 2014. The months are arranged like a soccer team. September is represented by the goalkeepers, who wear jerseys with the number 1. Each "team" of data has four Defenders (October to January) and four Midfielders (February to May). Finally, there are two Forwards to represent the months of June and July, when the World Cup was held. Coincidentally, the Forwards have the jerseys with the highest numbers, and these months have the highest values.

As the data sets have different ranges of values and represent different measures (books and articles), a scale was used to match the highest values in each one: 303 maps to 100 in the Seattle Public Library data set, and 2068 maps to 100 in the New York Times data set. These maximum values are represented by an ellipse with a diameter of 50 pixels.
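The rescaling described above can be sketched as a simple linear map (illustrative code with made-up names, not the attached sketch):

```java
public class EllipseScale {
    // Rescale a raw value so the data set's maximum maps to 100.
    public static float toScale(float value, float datasetMax) {
        return value / datasetMax * 100f;
    }

    // Convert the 0-100 scale to an ellipse diameter in pixels,
    // assuming the maximum value is drawn 50 px wide.
    public static float toDiameter(float value, float datasetMax) {
        return toScale(value, datasetMax) * 0.5f;
    }
}
```

Because each data set is normalized against its own maximum, the two "teams" become visually comparable even though 2068 articles and 303 checkouts are on very different scales.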

The numbers for each month are presented beside each "team" - as if they were the list of players. There is also mouse interaction to display the information for each cell (value, data set, month). To prevent the cursor from covering up the data itself, the values are displayed in the center of the field.

The "team" of the Seattle Public Library wears a yellow jersey and the New York Times "team" a blue one. The colors were chosen based on the flag of Brazil (green, yellow and blue). Unfortunately, the exact colors of the flag weren't very good for visualizing the data, so the solution was to use a darker green for the soccer field. The other colors are based on the actual colors of the jerseys worn by the Brazilian soccer team.

New version

After I uploaded the first version, I decided to introduce another feature. Pressing the keys '1' and '2' moves the "teams" to the other side of the field. This function is particularly interesting for three reasons:
(1) changing sides makes a better simulation of the game "Matching Pennies"
(2) an actual soccer game is divided into two halves of 45 minutes, with the teams changing the side they defend in each half; changing sides makes a better simulation of a soccer game
(3) by pressing the keys quickly and repeatedly, the "player" can better compare each month, visualizing the difference in size between the data sets

Version 3.0

Now the data of both "players" can be seen in the center of the field, and a line connects the two ellipses of each month. The result looks like one player kicked the ball to the adversary, or like both players decided to run to the center of the field. The line has the same colors as the team jerseys. I also tested a white version of the line, but the result was confusing.
Attachments
Soccerfield.zip
(817.09 KiB) Downloaded 760 times
Soccerfield5.jpg
Soccerfieldb.zip
(557.5 KiB) Downloaded 769 times
Soccerfield4.jpg
Soccerfield.zip
(433.88 KiB) Downloaded 787 times
FlagBrazil.jpg
Soccerfield3.jpg
Soccerfield2.jpg
Last edited by chicobrando on Wed Mar 18, 2015 8:24 pm, edited 3 times in total.

rodgerljl
Posts: 5
Joined: Sat Jan 10, 2015 11:29 am

Re: Proj 5: Data Correlation / Final Project

Post by rodgerljl » Tue Mar 17, 2015 1:45 am

What's Behind?

This project works with live data from Instagram, an online mobile photo and video sharing platform. Every day, thousands of images are uploaded to the website, and each image is tagged with several keywords associated with it.

In this project, I'm particularly curious about the wealth of information behind a keyword. For example, when people upload images tagged with "China", who are those people? Where are they? And what kind of message do they want to send?
4.jpg
As shown in the draft above, each exploration begins with a keyword. With the TagSearch API, the most recently uploaded images containing the keyword are retrieved. For each image, we get its geolocation, all its other tags, and the user ID. With all the tags of all the images, we can sort those tags by popularity using the TagCount APIs, which tells us what people were really talking about while they were talking about China or any other keyword. Also, with the User API, we get user information such as the number of posts, the number of accounts the user follows, and the number of followers.
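The tag-popularity step boils down to a frequency count over all tags of all returned images. A minimal sketch (made-up names; the actual project gets the counts from the Instagram APIs):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TagPopularity {
    // Count how often each tag occurs across the returned images and
    // return up to 'limit' tags, most popular first.
    public static List<String> topTags(List<List<String>> imageTags, int limit) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (List<String> tags : imageTags) {
            for (String t : tags) {
                Integer c = counts.get(t);
                counts.put(t, c == null ? 1 : c + 1);
            }
        }
        List<String> sorted = new ArrayList<String>(counts.keySet());
        sorted.sort((a, b) -> counts.get(b) - counts.get(a));
        return sorted.subList(0, Math.min(limit, sorted.size()));
    }
}
```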
3.jpeg
2.jpeg
1.jpeg
The final layout looks like the images above. All the images, including those for the initial search keyword, are placed on the map at their geolocations. The 30 most popular keywords among all the images are placed on the left side and connect to the images that contain them. On the right side, user information is shown when the mouse moves over an image. The two rectangular shapes indicate the number of followers of the user (right) and the number of people the user follows (left). The circular shape tells how many posts the user has made. Of the two images, the upper one is a larger version of the image on the map, while the other is the profile image of the user. You can also zoom in on the map to see more details of a particular area, and start a new search by entering a new keyword. It's always live data.
InsMap_Local.zip
(1.04 MiB) Downloaded 796 times

dimberman
Posts: 6
Joined: Sat Jan 10, 2015 11:28 am

Re: Proj 5: Data Correlation / Final Project

Post by dimberman » Tue Mar 17, 2015 7:33 am

For my project, I wanted to create a stock market visualization that actually expresses the chaos of the financial markets and their susceptibility to world events. I used the Quandl API to gather information on gold, Goldman Sachs, Lockheed Martin, the S&P 500, and crude oil. In doing so I was able to create a visualization that shows these stocks as a living organism. I wanted to convey a sense of health in each stock as time passed, so I made stocks that are doing badly look as if they are rotting.

I found multiple fascinating correlations while doing this project, particularly that there is a fairly strong relationship between gold and stocks. While I was not surprised that every time the S&P 500 tanked gold skyrocketed, I was surprised to find that when the stock market recovered, gold drifted back down to its lows very slowly.


Code:

https://www.quandl.com/api/v1/datasets/YAHOO/INDEX_GSPC.csv?trim_start=2000-01-01&trim_end=2015-03-01
https://www.quandl.com/api/v1/datasets/YAHOO/LMT.csv?trim_start=2000-01-01&trim_end=2015-03-01
https://www.quandl.com/api/v1/datasets/WGC/GOLD_DAILY_USD.csv?trim_start=2000-01-01&trim_end=2015-03-01
Through Quandl I was able to download the data as CSVs, which I then merged via a Processing program.


Code to merge data

Code:

void mergeData() { // function header restored; the original name isn't shown in the post
  exxonTable = loadTable("emt.csv");
  lockheedTable = loadTable("lmt.csv");
  spTable = loadTable("spt.csv");
  gsTable = loadTable("gst.csv");
  goldTable = loadTable("gold.csv");

  int numTotalRows = Math.max(lockheedTable.getRowCount(), exxonTable.getRowCount());
  PrintWriter output = createWriter("merged.csv"); 

  indexMap = new HashMap<String, List<Integer>>(numTotalRows);
  allDates=new HashSet<String>(numTotalRows);
  addTableToHash(lockheedTable);
  addTableToHash(exxonTable);
  addTableToHash(spTable);
  addTableToHash(gsTable);
  addTableToHash(goldTable);


  Set<String> keys = indexMap.keySet();
  String[] dates = indexMap.keySet().toArray(new String[numTotalRows]);
  HashMap<String, Float> lockheedMap = new HashMap<String, Float>();
  HashMap<String, Float> exxonMap = new HashMap<String, Float>();
  HashMap<String, Float> sanPMap = new HashMap<String, Float>();
  HashMap<String, Float> gsMap = new HashMap<String, Float>();
  HashMap<String, Float> goldMap = new HashMap<String, Float>();


  dateSort(dates);
  for (int i = 0; i < lockheedTable.getRowCount (); i++) {
    lockheedMap.put(lockheedTable.getString(i, 0), lockheedTable.getFloat(i, 1));
  }

  for (int i = 0; i < exxonTable.getRowCount (); i++) {
    exxonMap.put(exxonTable.getString(i, 0), exxonTable.getFloat(i, 1));
  }

  for (int i = 0; i < spTable.getRowCount (); i++) {
    sanPMap.put(spTable.getString(i, 0), spTable.getFloat(i, 1));
  } 

  for (int i = 0; i < gsTable.getRowCount (); i++) {
    gsMap.put(gsTable.getString(i, 0), gsTable.getFloat(i, 1));
  } 

  for (int i = 0; i < goldTable.getRowCount (); i++) {
    goldMap.put(goldTable.getString(i, 0), goldTable.getFloat(i, 1));
  } 


  output.println("Date, lockheed, exxon, S&P, goldman, gold");
  for (int i = 0; i < dates.length; i++) {
    float lVal = -1;
    float eVal = -1;
    float sVal = -1;
    float gVal = -1;
    float auVal = -1;


    if (lockheedMap.get(dates[i])!=null) {
      lVal = lockheedMap.get(dates[i]);
    } 

    if (exxonMap.get(dates[i])!=null) {
      eVal = exxonMap.get(dates[i]);
    } 
    if (sanPMap.get(dates[i])!=null) {
      sVal = sanPMap.get(dates[i]);
    } 

    if (gsMap.get(dates[i])!=null) {
      gVal = gsMap.get(dates[i]);
    } 

    if (goldMap.get(dates[i])!=null) {
      auVal = goldMap.get(dates[i]);
    } 
    output.println(dates[i] +", " + lVal +", " +eVal + ", " + sVal + ", " + gVal + ", " + auVal);
  }
  output.flush();
  println("done");
}




void addTableToHash(Table table) {
  // start at row 1 to skip the header row; record each row index under its date key
  for (int i = 1; i < table.getRowCount (); i++) {
    List<Integer> date = indexMap.get(table.getString(i, 0));
    if (date == null) {
      date = new ArrayList<Integer>();
      indexMap.put(table.getString(i, 0), date);
    }
    date.add(i);
  }
}

void addTableToIndexMap(Table table) {
  //    for (int i = 1; i < table.getRowCount (); i++) {
  //    List<Integer> date = indexMap.get(table.getString(i, 0));
  //    if (date == null) {
  //      date = new ArrayList<Integer>();
  //      date.add(i);
  //      indexMap.put(table.getString(i, 0), table.getFloat(i,1));
  //    } else {
  //      date.add(i);
  //      indexMap.put(table.getString(i, 0), table.getFloat(i,1));
  //    }
  //  }
}

// bubble sort with early exit: each pass remembers the position of the last swap
String[] dateSort(String [] dates) {
  int len = dates.length;
  while (len!=0) {
    int nlen = 0;
    for (int i = 1; i <= len - 1; i++) {
      if (compare(dates[i-1], dates[i]) ==1) {
        String tmp = dates[i-1];
        dates[i-1] = dates[i];
        dates[i]=tmp;
        nlen = i;
      }
    }
    len = nlen;
  }
  return dates;
}

// compare two date strings of the form month/day/year:
// returns 1 if a is later than b, -1 if earlier, 0 if equal
int compare(String a, String b) {
  String[] aParse = a.split("/");
  String[] bParse = b.split("/");
  if (Integer.parseInt(aParse[2])>Integer.parseInt(bParse[2])) return 1;
  else if (Integer.parseInt(aParse[2])==Integer.parseInt(bParse[2])) {
    if (Integer.parseInt(aParse[0])>Integer.parseInt(bParse[0])) return 1; 
    else if (Integer.parseInt(aParse[0])==Integer.parseInt(bParse[0])) {
      if (Integer.parseInt(aParse[1])>Integer.parseInt(bParse[1])) return 1;
    }
  }
  if (Integer.parseInt(aParse[2])<Integer.parseInt(bParse[2])) return -1;
  else if (Integer.parseInt(aParse[2])==Integer.parseInt(bParse[2])) {
    if (Integer.parseInt(aParse[0])<Integer.parseInt(bParse[0])) return -1; 
    else if (Integer.parseInt(aParse[0])==Integer.parseInt(bParse[0])) {
      if (Integer.parseInt(aParse[1])<Integer.parseInt(bParse[1])) return -1;
    }
  }


  return 0;
}




In my initial versions, I found that without any form of scaling the stocks appeared quite rigid. I also found that I had to scrub the data to handle erroneous values.

non-scaled and non-scrubbed
Screenshot 2015-03-15 21.38.08.png

However, once I scaled the values, I was able to see much richer correlations.
Screenshot 2015-03-16 18.45.59.png



Finally, by adding notifications of major world events and stock ticker details, I was able to make the data more understandable by giving it context.
Screenshot 2015-03-17 08.29.14.png
Screenshot 2015-03-17 08.29.38.png
finance_of_fear2.zip
(72.77 KiB) Downloaded 779 times
data_merge.zip
(248.63 KiB) Downloaded 779 times

brocknoah
Posts: 5
Joined: Sat Jan 10, 2015 11:36 am

Re: Proj 5: Data Correlation / Final Project

Post by brocknoah » Sun Mar 22, 2015 10:38 pm

For this final project I wanted to track the trends of Seattle's professional sports. Seattle currently has an MLB team and an NFL team; its NBA team, the SuperSonics, left after the 2008 season. I wanted to compare a general sport, the league, and a team to the number of mentions in articles written by NYT authors.

I constrained the search to athletic sports, Dewey class 796. Query times ranged from about 80 seconds on the low end to 190 seconds on the high end, with the league searches taking the longest. When I first started this project I saved the SQL queries and then switched focus to other classes before coming back to explore the data, so I didn't go back and examine the queries to check for noise. I have since updated the search queries: I specified spaces around the associations' abbreviations and included the option for plurals or possessives, i.e. "NFLs".

Each team has a specific color, while the general sport, the league, and the NYT data each keep a consistent color.

SQL Example

Code:

SELECT spl3._rawXmlDataCheckOuts.title,
spl3.subject.subject,
spl3._rawXmlDataCheckOuts.itemType, COUNT(spl3._rawXmlDataCheckOuts.checkOut) as checkouts,
sum(CASE WHEN year(spl3._rawXmlDataCheckOuts.checkOut) = 2006 THEN 1 ELSE 0 END) as y2006,
sum(CASE WHEN year(spl3._rawXmlDataCheckOuts.checkOut) = 2007 THEN 1 ELSE 0 END) as y2007,
sum(CASE WHEN year(spl3._rawXmlDataCheckOuts.checkOut) = 2008 THEN 1 ELSE 0 END) as y2008,
sum(CASE WHEN year(spl3._rawXmlDataCheckOuts.checkOut) = 2009 THEN 1 ELSE 0 END) as y2009,
sum(CASE WHEN year(spl3._rawXmlDataCheckOuts.checkOut) = 2010 THEN 1 ELSE 0 END) as y2010,
sum(CASE WHEN year(spl3._rawXmlDataCheckOuts.checkOut) = 2011 THEN 1 ELSE 0 END) as y2011,
sum(CASE WHEN year(spl3._rawXmlDataCheckOuts.checkOut) = 2012 THEN 1 ELSE 0 END) as y2012,
sum(CASE WHEN year(spl3._rawXmlDataCheckOuts.checkOut) = 2013 THEN 1 ELSE 0 END) as y2013
FROM spl3._rawXmlDataCheckOuts
INNER JOIN spl3.subject
ON spl3._rawXmlDataCheckOuts.bibNumber = spl3.subject.bibNumber
WHERE (spl3._rawXmlDataCheckOuts.title LIKE "% mlb %" OR spl3._rawXmlDataCheckOuts.title LIKE "% mlbs %"
OR spl3._rawXmlDataCheckOuts.title LIKE "mlbs %" OR spl3._rawXmlDataCheckOuts.title LIKE "mlb %" OR spl3._rawXmlDataCheckOuts.title LIKE "% mlb"
OR spl3._rawXmlDataCheckOuts.title LIKE "%Major League%" OR spl3._rawXmlDataCheckOuts.title LIKE "%National League%" OR spl3._rawXmlDataCheckOuts.title LIKE "%professional%baseball%")
AND floor(spl3._rawXmlDataCheckOuts.deweyClass) = 796
GROUP BY spl3._rawXmlDataCheckOuts.title
ORDER BY checkouts DESC
Attachments
p5.zip
(91.34 KiB) Downloaded 678 times
0881.jpg
3049.jpg

matzewagner
Posts: 5
Joined: Sat Jan 10, 2015 11:35 am

Re: Proj 5: Data Correlation / Final Project

Post by matzewagner » Mon Mar 23, 2015 10:01 pm

Music Journalism Timeline

Coming from a music background, I had a natural motivation to make my final data visualization about music and sound. In this sense, my final project is a visualization as well as a sonification. There are several excellent music APIs out there, which I tried out - perhaps most notably http://developer.echonest.com/, which sports a wide array of search criteria for music and music metadata. Unfortunately, all of the music APIs with solid databases that I encountered offered only queries about the present moment, not about how the data evolved over time. For this reason, I decided to go with the New York Times API, which, as it turns out, proved quite satisfactory for my research. The JSON query is included in the Processing sketch, but it was performed separately, since it took quite some time; I saved the JSON files in the data folder, from which the program reads.

I enquired about the development of specific music genres since WWII in the following ways:
- frequency of total number of occurrences per year per genre
- specific articles, including:
  - word length
  - publication date
  - content (of the lead paragraph)
After I discovered that the New York Times API would provide sufficient data, I started with the sonification part by plotting the total number of occurrences per year for the genres pop, rock, country, hiphop, techno, electronic, jazz, classical and experimental. At first, I browsed different genres rather arbitrarily, but I settled on these based on the data they hold. A histogram in the lower right corner of the screen scrubs through an adjustable segment of time. The occurrences of each genre are mapped to the frequency parameter of sine oscillators, which I implemented via the Minim library: the bigger the number of occurrences, the higher the frequency of the respective oscillator. Apart from the time range, the scrubbing speed and the zoom can be adjusted, and each genre can be toggled on and off independently.
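The count-to-frequency mapping is a plain linear interpolation. A minimal sketch (the frequency bounds here are illustrative, not the values used in the sketch):

```java
public class GenreTone {
    // Map a genre's yearly occurrence count onto a frequency range in Hz:
    // zero occurrences gives loHz, the maximum observed count gives hiHz.
    public static float toFrequency(int count, int maxCount, float loHz, float hiHz) {
        return loHz + (hiHz - loHz) * count / (float) maxCount;
    }
}
```

The resulting value would be fed to the oscillator each time the scrubber advances a year.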

For the main part of the window, I displayed every article returned from the database as a cube. Depending on its genre, each cube has one of 9 colors, mapped evenly over the HSV color range. The size of a cube is determined by the word count of its content. The cubes, which start out in random positions, can be organized by category, publication date and word count. The outline of each cube lights up when it matches the category and year of the histogram scrubber. This way, it is possible to get different perspectives on how the writing about each genre changes over time.

Lastly, since I gathered a significant amount of content, I did not want to leave it untouched. The user can type a word into a field near the upper right corner of the screen. The program checks how many times the word appears in each article, adds the occurrences together, and displays the total on the screen. Every cube object stores its own number of occurrences. This number is then used to generate a gravitational pull between all the objects in which the word appeared, which makes it possible to observe how certain words relate to genre and time, and vice versa. The user can activate the gravity via a toggle, which is off by default.
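The per-article count behind the search field amounts to a repeated case-insensitive substring search. A minimal sketch (made-up names, not the attached code):

```java
public class WordCounter {
    // Count case-insensitive occurrences of 'word' in 'text' by
    // repeatedly advancing past each match.
    public static int occurrences(String text, String word) {
        String lower = text.toLowerCase();
        String w = word.toLowerCase();
        int count = 0;
        int idx = lower.indexOf(w);
        while (idx >= 0) {
            count++;
            idx = lower.indexOf(w, idx + w.length());
        }
        return count;
    }
}
```

Each cube would store `occurrences(leadParagraph, query)` for the current query and use it as the mass for the gravitational pull.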

WARNING!!! Due to the significant computational load, this might slow down or even freeze the program, depending on the machine it is running on.
1.jpg
ordered by category
2.jpg
ordered by word count
3.jpg
ordered by publication date
4.jpg
9.jpg
Processing Sketch
matthias_final_project.zip
(627.88 KiB) Downloaded 739 times

Post Reply