## Proj 4 - Student Defined Final Project

Posts: 141
Joined: Wed Sep 22, 2010 12:26 pm

### Proj 4 - Student Defined Final Project

Final Project Schedule
Feb 26 - Introduction to JSON (to get data from APIs) and Minim (sound in Processing - http://code.compartmental.net/minim/)
Feb 28 - Lab and Individual Meetings
Mar 05 - Final project individual discussions and lab
Mar 07 - Lab and explanation of project documentation template
Mar 12 - Final Project Class Presentation
Mar 14 - Completion of documentation to be posted at vislab.mat.ucsb.edu

------------------------------------------------------------------------------------------------------
Project Definition
For the final project we are interested in the problem of how to represent multi-dimensional multivariate data in three-dimensional interactive space.

Your first task is to identify and select your data. This can be a continuation of the Seattle library data, or acquisition of data from other sources. Data can also be correlated between multiple sources. Visualization software to be used is Processing.

We are looking for granular detail – meaning there should be a significant density of data to be visualized in 3D space. Each data point's x, y, z position should be directly defined by the data's values.

The project should reveal an understanding of how to use spatial relationships, color coding, interaction methods, and all the features of visual language basics covered in the previous demos.

Some Links Shown in Class on February 26

Frequency Pattern Mining Paper

Karl Yerkes Notes & Code for FP Tree Algorithm
--

A broad range of data resources
http://www.researchpipeline.com/mediawi ... =Main_Page

Google Correlate (finds similar statistical trends)

Doing a MySQL search and then correlate in the Google Correlate
--

A 3D form floating in 3D space
http://esamultimedia.esa.int/images/Sci ... 701b_H.jpg

------------------------------------------------------------------------------------------------------
Data Acquisition
The project is similar to the 3D assignment except you are free to choose your own data. This is an opportunity to explore JSON as a way to get data from various sources such as:

New York Times Book Reviews: https://developer.nytimes.com/docs/book ... 1/overview
New York Times Movie Reviews: https://developer.nytimes.com/docs/movi ... 1/overview
iTunes API: https://affiliate.itunes.apple.com/reso ... earch-api/
Instagram API: https://www.instagram.com/developer/
Behance API: https://www.behance.net/dev
Yelp API: https://www.yelp.com/developers
The Open Movie API: http://www.omdbapi.com/
--
Museum of Modern Art, NYC: https://github.com/MuseumofModernArt/collection
San Francisco Museum of Art: https://www.sfmoma.org/read/why-build-a ... ollection/
Metropolitan Museum of Art: https://www.metmuseum.org/blogs/now-at- ... ection-api
Whitney Museum of Art: https://api.whitney.org/uploads/generic ... r_2013.pdf
Smithsonian Institution, Washington: https://smithsonian.github.io/api-docs/#/

Please review the JSON demo at the course syllabus Wk 8: https://www.mat.ucsb.edu/~g.legrady/aca ... 9w259.html
and JSONObject in Processing at https://processing.org/reference/JSONObject.html
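As a minimal illustration of the parsing step (shown here in Python with a made-up response string, not real API output; Processing's JSONObject offers equivalent getString()/getInt() accessors):

```python
import json

# A hypothetical API response (structure invented for illustration)
response = '{"title": "Example Movie", "year": 2019, "ratings": [7.5, 8.0]}'

data = json.loads(response)      # parse the JSON string into a dict
title = data["title"]            # access fields by key
avg_rating = sum(data["ratings"]) / len(data["ratings"])
```

Once the fields are extracted, each record can be mapped to a position, size, or color in the sketch.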
------------------------------------------------------------------------------------------------------
Evaluation
Innovation in content: your query question and outcomes. How original, engaging, and unusual your query or your approach to it is, and how interesting the data is. The data has to be multivariate and granular (meaning a lot of data) so that we can see patterns forming in it.

Innovation in design/ form: The design needs to go beyond our demos. Areas of exploration are in how you use space, form, colors, data organization, timing, interaction, coherence, elegance, etc. Do not use bar graphs :)

Computation: The third evaluation is the computational component. Special consideration will be given to unusual, elegant expression utilizing functions, algorithms, etc., that you can introduce to the class.

------------------------------------------------------------------------------------------------------

chantalnguyen
Posts: 4
Joined: Thu Jan 10, 2019 10:51 am

### Re: Proj 4 - Student Defined Final Project

Concept:
Continuing along the food theme I've followed for my projects in this class, I'm interested in visualizing what people are interested in cooking by mining online recipe databases. Luckily, the dirty work has already been done for me: I found data already scraped from recipe-hosting websites, posted here: https://archive.org/download/recipes-en-201706/. I specifically used the recipes taken from Epicurious.com, a popular website that primarily publishes its own recipes as well as those from some US food magazines such as Bon Appetit.

The data is in JSON format and contains, for each recipe, the title, description, ingredients, instructions, rating out of 5, number of reviews, and % of reviewers who said they would make the recipe again, among other information.

I used tf-idf to vectorize the recipes' ingredient lists, SVD as a preliminary dimensionality reduction step to reduce to 50D (otherwise t-SNE alone would take too long), and t-SNE to reduce dimensionality to 2D. I plotted the data using the t-SNE coordinates in the x-y plane and the recipes' publication dates as the z-coordinates. Each recipe is displayed as a line, where the length of the line represents the number of reviewers and the thickness of the line represents its rating. I also performed a k-means clustering to cluster the recipes into 15 clusters; the color of the line represents its cluster label and the transparency represents the % of reviewers who said they would make the recipe again. Hovering over each line will show the corresponding recipe title, date, rating, # of reviewers, and % who would make it again.
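The tf-idf step of a pipeline like the one above can be sketched in a few lines of pure Python (toy ingredient lists; real pipelines typically use a library implementation with different idf smoothing, and the SVD/t-SNE steps are omitted here):

```python
import math
from collections import Counter

def tfidf(docs):
    """tf-idf vectors for tokenized documents: tf * log(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                  # document frequency per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return vectors

# Toy ingredient lists standing in for real recipes
recipes = [["flour", "sugar", "butter"],
           ["flour", "water", "yeast"],
           ["chocolate", "sugar", "butter"]]
vecs = tfidf(recipes)
```

Terms that appear in many recipes ("flour") get low weights, while distinctive terms ("yeast") get high ones, which is what makes the downstream clustering pick up on characteristic ingredients.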

Screenshots:
I wasn't a huge fan of the rectangular-edged lines and preferred the look of rounded edges, but I didn't know how to achieve this in 3D - I tried to approximate it by plotting circular dots close together in a row, but this makes the interaction extremely slow due to the large number of objects that need to be rendered. Hence I kept the lines.
There is a very large density of recipes published around 2004, which - while not the year the site was founded - could be when the online database in its current form began being indexed.
As expected, the older recipes have more reviews. The recipe that has the most reviews is "Double Chocolate Layer Cake".

Discussion:
The k-means clustering managed to extract clusters of recipes that are intuitively similar -- one cluster has a lot of baked goods, one has a lot of chocolate desserts, one has mostly Asian recipes -- this isn't too surprising as I'm clustering based on ingredient lists, so there would be a lot of overlap in the ingredients of recipes in a cluster. Some of the clusters don't seem to have a strong theme that jumps out immediately.
I might also need to refine the stopwords I used in doing the tf-idf vectorization to potentially extract more salient clustering. The stopwords I used included common English stopwords ("and", "the", etc.) plus those related to measurement quantities ("cups", "tablespoons", etc.). I did not exclude words related to the preparation of ingredients ("diced", "minced", etc.), but I would be interested in seeing how the results change with these words excluded.
Initially, I was interested in visualizing recipes taken from a user-submitted recipe database (which Epicurious is largely not) such as Allrecipes.com, which to my knowledge is the most popular food website in the US. While data from Allrecipes was included in the archive linked above, the recipes did not include publication date (though they did include other interesting information such as the cooking time). Part of the reason I wanted to use data that included date information is to see how interest in foods/cuisines changes over time, but it is difficult to see that from the current visualization. I might add some sort of slider that allows one to filter out recipes older than a specified date. [Update: have now added a rudimentary slider, but will need to fix the label such that it shows a mapping to the actual year]
I am also interested in examining interest in different recipes as a function of geographical location -- it may be possible to extract this from Allrecipes as some users - but likely not a majority - include their location as part of their profile.

The Processing code and the Python code I used to process the data are included.
Attachments
MAT259_Proj4_CN.zip
Last edited by chantalnguyen on Tue Mar 12, 2019 3:07 pm, edited 4 times in total.

sarahwells
Posts: 4
Joined: Thu Jan 10, 2019 10:52 am

### Re: Proj 4 - Student Defined Final Project

Concept:
I explored the relationship between health data and life expectancy in the US by county. I use 2015 health data on a variety of factors, as well as life expectancy from the CDC calculated on data from 2010-2015. For each county in the data set, I use a Kohonen self-organizing map (SOM) for dimensionality reduction on selected health factors, determining coordinates for the x and y dimensions. Life expectancy then determines the z position and a color scale from red to green, indicating poor to good life expectancy. You can select which factors to display and organize, and you can also rerun the algorithm with the currently selected factors.

Data:
County Health Rankings from: http://www.countyhealthrankings.org/exp ... -2010-2016
2015 County Health Rankings Data - v3.xls
County Life Expectancy from: https://www.cdc.gov/nchs/nvss/usaleep/u ... expectancy
Calculated from years 2010-2015:
Life_Expectancy.csv

Screenshots and Analysis:
Screenshot 1: Shows organization based on % Unemployment, % Children in Poverty, % Severe Housing Problems. This also shows the ability to hover over to display the County, State, selected factors, and life expectancy.
Screenshot 2: Keeping the data trained using % Unemployment, % Children in Poverty, % Severe Housing Problems, we additionally select Graduation Rate. This reorganizes the data using the additional Graduation Rate factor. The similarity in overall shape indicates not much difference between this and using only the initial three factors. One can infer a positive relationship between these factors.
Screenshot 3: Similar to screenshot 2, we keep the data trained from % Unemployment, % Children in Poverty, % Severe Housing Problems and add % Smokers. We see the data becomes a little more chaotic; this may mean a slightly weaker relationship between % Smokers and the first three factors than between Graduation Rate and those factors.
Screenshot 4: I chose some environmental/physical factors (% Smokers, Average Daily PM2.5 Air Quality, % Pop in Drinking Water Violation, % Severe Housing Problems); this screen displays the overall organization.
Screenshots 5-7: Using the above factor choices, I identify a clump of similar x-y coordinates indicating similar environmental/physical factors and similarly low life expectancy. Hovering over a few, we see respective values for % Smokers, Average Daily PM2.5 Air Quality, % Pop in Drinking Water Violation, % Severe Housing Problems, Life Expectancy:
Russell, Kentucky: 31.0, 13.4, 0.0, 15.0, 73.9
Grayson, Kentucky: 29.0, 13.8, 0.0, 15.0, 74.1
Starke, Indiana: 33.0, 13.2, 0.0, 16.0, 74.6
We might conclude that the high number of smokers, poor air quality, and housing problems indicate low life expectancy.
Screenshots 8-11: Using the same factor choices again, I identify a clump of similar x-y coordinates indicating similar environmental/physical factors and similarly high life expectancy. Hovering over a few, we see respective values for % Smokers, Average Daily PM2.5 Air Quality, % Pop in Drinking Water Violation, % Severe Housing Problems, Life Expectancy:
Chelan, Washington: 13.0, 11.2, 0.0, 16.0, 80.1
Columbia, New York: 14.0, 11.0, 0.0, 16.0, 80.9
Putnam, New York: 10.0, 10.9, 0.0, 20.0, 81.8
We might conclude that the much lower number of smokers compared to the previous sample contributes to higher life expectancy.
Additional analysis: In implementing the SOM, I found efficiency to be a very limiting factor. One reason the program takes so long is that allowing for variety in the spatial locations data can be sorted into requires many nodes, and the more nodes, the longer the for-loop in the algorithm. This was a difficulty particularly since I wanted to be able to interact with the visualization: the algorithm needs to be rerun every time a factor is selected or removed. I capped the iterations at 25 since, through trial and error, I found the data to be reasonably sorted at that point with a reasonable waiting period.

Further Improvements: As mentioned, time and efficiency are a concern when implementing the algorithm. To improve this, perhaps I could limit the data, find a way to load all the possibilities beforehand, or try to maximize efficiency in the coded algorithm, perhaps moving away from a double for-loop. Another interesting option would be to limit results by state, or to add latitude/longitude factors.
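The cost pattern described above is visible in a minimal pure-Python SOM sketch (a generic illustration of the algorithm, not this project's code; grid size, learning rate, and iteration count are arbitrary). The per-sample search for the best-matching unit loops over every node, so more nodes means a slower inner loop:

```python
import math, random

def train_som(data, grid_w=4, grid_h=4, iters=25, lr0=0.5):
    """Minimal self-organizing map. data: equal-length feature vectors,
    assumed normalized to [0, 1]."""
    random.seed(0)
    dim = len(data[0])
    # one weight vector per grid node
    nodes = [[random.random() for _ in range(dim)]
             for _ in range(grid_w * grid_h)]
    radius0 = max(grid_w, grid_h) / 2
    for t in range(iters):
        lr = lr0 * (1 - t / iters)               # decaying learning rate
        radius = max(1.0, radius0 * (1 - t / iters))
        for x in data:
            # best-matching unit: the costly loop over all nodes
            bmu = min(range(len(nodes)),
                      key=lambda i: sum((nodes[i][d] - x[d]) ** 2
                                        for d in range(dim)))
            bx, by = bmu % grid_w, bmu // grid_w
            for i, w in enumerate(nodes):
                gx, gy = i % grid_w, i // grid_w
                dist = math.hypot(gx - bx, gy - by)
                if dist <= radius:               # neighborhood update
                    influence = math.exp(-dist * dist / (2 * radius * radius))
                    for d in range(dim):
                        w[d] += lr * influence * (x[d] - w[d])
    return nodes

nodes = train_som([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
```

Precomputing node coordinates or vectorizing the BMU search are the usual ways to speed this up without changing the result.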
Attachments
Sarah_Wells_Final.zip
Life_Expectancy.csv
2015 County Health Rankings Data - v3.xls
Last edited by sarahwells on Thu Mar 14, 2019 8:19 pm, edited 3 times in total.

meilinshi
Posts: 4
Joined: Thu Jan 10, 2019 10:57 am

### Re: Proj 4 - Student Defined Final Project

Motivation & Introduction
Hate crimes, hate speech, and hate groups have caught my attention for quite a while, so for this project I'm doing a visualization of all the hate groups across the US in 2018. Data is available from the Southern Poverty Law Center, which publishes an annual census of hate groups operating within the United States: https://www.splcenter.org/hate-map. I borrow the title for this project from the Black Mirror episode "Hated in the Nation" (S3E6).

Concept
To visualize hate groups categorized by ideology, and map out their location across the US.
At the bottom are the 1000 largest cities in the US, mapped out by geographic coordinates. On top are 15 hate group ideologies, with x and y positions determined by the last entry (group) within each distinct ideology, and the z position and dot size determined by the count of hate groups (i.e., bigger dots sit on top of smaller ones).
**The original hate group data does not include geographic coordinates, so I match the city and state names against the 1000-cities demographic data to get each location. For records without a city name, or whose city does not fall within the 1000 cities, I use state coordinates instead.
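The city-then-state fallback described above can be sketched as a dictionary lookup (the coordinates below are approximate and purely illustrative):

```python
# Hypothetical lookup tables; the real project matched against the
# 1000-cities demographic file and state coordinates.
city_coords = {("Montgomery", "Alabama"): (32.38, -86.31)}
state_coords = {"Alabama": (32.80, -86.79)}

def locate(city, state):
    """Return city coordinates, falling back to the state centroid."""
    return city_coords.get((city, state), state_coords.get(state))
```

A record like "Smallville, Alabama" that isn't among the 1000 cities then lands at the state-level coordinate instead of being dropped.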

Screenshot & Analysis
The overview of all hate groups. Each line represents a single hate group. The color of the line indicates ideology, connecting the hate group's location at the bottom with the ideology dot on top.
I also include the description for each ideology when making single selection.
It is interesting to see that some of the groups show a spatial context; for example, the Neo-Confederate groups only occur in the southwest corner of the country.

Future Improvements
The lines are connected in a double for-loop, which makes it really slow. The dots and the cities come from two separate datasets and are mapped in two different coordinate systems, so I have to save the absolute coordinates of each point and connect them in a separate function. I need to find a better algorithm to draw the lines.
Attachments
meilin_final.zip
Last edited by meilinshi on Sun Mar 24, 2019 9:49 pm, edited 4 times in total.

wilsonmui
Posts: 5
Joined: Mon Apr 16, 2018 10:21 am

### Re: Proj 4 - Student Defined Final Project

Concept Description

The idea for this project was to create a visualization that could display the variety of common housecats found throughout California. I was hoping that if some cats were more prominent in certain regions, it could be easily spotted here.

The cats are merged to create an idea of what the average cat for the area would look like. K-means clustering is also used to group up the cats based on location and density. This method of clustering would help group up areas that were closer. Hopefully, it would group cats that appeared more similar as well.

As the layers move up, the number of clusters is further reduced and the cats are grouped together even more. The topmost layer is a composite of all the cats found.

Python was used to fetch each image using the Petfinder API. OpenCV was used with Python to process the images and make them more usable. The faces are extracted to prevent merging other features together.

Query

http://api.petfinder.com/pet.find?key=2 ... ormat=json

Around 1000 queries were made, and hundreds of images were received. This took around 2.5 hours, including the time to process the images. Data pre-processing was done by searching for a cat face, cropping it out, and then resizing the image. Two Python scripts are used to gather the data and perform the pre-processing. Part of the pre-processing involves clustering the cats using the k-means algorithm, where the coordinates are the features.
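The coordinate-based k-means step could look like this minimal Lloyd's-algorithm sketch (pure Python, not the project's actual script; the toy points are invented):

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm on 2D coordinates.
    Initial centroids are simply the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):
            if members:                          # recompute centroid as mean
                centroids[i] = [sum(c) / len(members)
                                for c in zip(*members)]
    return centroids, clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(pts, 2)
```

With lat/lon pairs as the points, each cluster gives a region whose member cat images can then be merged into one composite face.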
Conclusion

All the cats found share some similar features, and it is hard to tell if region or location is a factor in appearance. One common feature is the white vertical patch on their face.
wilson_final_proj.zip
Last edited by wilsonmui on Sun Mar 24, 2019 3:44 am, edited 1 time in total.

jiaheng
Posts: 4
Joined: Wed Jan 16, 2019 10:17 am

### Re: Proj 4 - Student Defined Final Project

The dataset I will be using is the NCAA basketball on Kaggle https://www.kaggle.com/ncaa/ncaa-basketball
I'm thinking about using the self-organizing map algorithm to visualize the relationships among universities based on their win rates, match histories, and other stats.

The form will be a network graph, where each university is a node, and each edge represents a correlation between two universities (match history, who won and who lost, etc.).

Update:
I found that the NCAA basketball dataset isn't downloadable, which means I can only work with the data in Kaggle's provided programming environment. Thus, I've decided to use another dataset, "Trending YouTube Video Statistics", as my final project data source. This dataset is provided as a CSV file and contains the daily trending YouTube videos from many different countries. I will be focusing on videos in the U.S. Here is the link to the dataset: https://www.kaggle.com/datasnaek/youtube-new

Concept Description
For this project, I want to explore the trending videos on YouTube. As we know, each day YouTube has a list of its trending videos. The list can go up to 200 videos. I want to explore the categories of these videos, as well as how they change over time. I found a good dataset on Kaggle, which contains all the trending videos from 2017/11/14 to 2018/6/14, with their view count, likes count, dislikes count, etc.

Design
My design is a 3D cube space, where each axis represents a different aspect of the data. The x-axis shows the likes/dislikes ratio; the y-axis shows the number of views, where I used a log() function to make the distribution smoother; the z-axis shows the trending date, from the earliest to the latest. I picked these inputs because I don't want the visuals to look too cluttered. Below are some screenshots.
I've also integrated with the YouTube Data API to get the channel information when the user clicks on each point.
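The log() smoothing on the view axis might look like this (the axis range and log base here are my assumptions, not necessarily the project's):

```python
import math

def views_to_y(views, y_min=0, y_max=500, max_log=9):
    """Map a view count onto a screen axis using log10, so a video with
    10**max_log views lands at y_max. Ranges are illustrative only."""
    return y_min + (math.log10(max(views, 1)) / max_log) * (y_max - y_min)
```

Without the log, a single billion-view music video would push every ordinary video into a thin band at the bottom of the axis.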

Final Results & Analysis
There is a lot of information in the graph. We can see some longer ribbons on the right side of the screen, and most of these are music videos. I suspect this is because MVs usually last longer on the trending list, especially when they first come out. Another interesting thing is that a lot of the channels have disappeared: when the API searches for the channel name, it comes back with no results. It seems that some of last year's hot channels are declining or disappearing.

Future Improvements
I'm still trying to find a better way to visualize the information, especially the channel information. I want to highlight all the trending videos of a channel, but I couldn't find a good way to visualize it.
source.zip
What I found on how to disable drag when you click on the slider
The idea is to disable the drag handler for the camera when the mouse enters the ControlP5 slider area, and re-enable it when the mouse leaves. I found two callbacks in the ControlP5 slider source code that are perfect for this. Here are the steps.
First, when you initialize the camera, you will need to save the drag handler in a global variable.

```java
PeasyDragHandler singleDragHandler = cam.getRotateDragHandler();
```

Then, when you set up the slider, attach the two callbacks:

```java
cp5.addSlider("sliderValue")
  ... // your code to set up cp5
  .onEnter(new CallbackListener() {
    public void controlEvent(CallbackEvent theEvent) {
      cam.setLeftDragHandler(null);              // disable camera drag
    }
  })
  .onLeave(new CallbackListener() {
    public void controlEvent(CallbackEvent theEvent) {
      cam.setLeftDragHandler(singleDragHandler); // restore camera drag
    }
  });
```
Installing
To run the app, you will need to install three libraries, ControlP5, PeasyCam, and HTTP Requests for Processing. You can install them using Processing's Library Manager.

Open Processing, and go to Sketch -> Import Library -> Add Library, and search for these three libraries. Click Install to install them.

Obtain an API Key
The app uses the YouTube Data API v3. I've included my key in the program, but there's no guarantee of how long the key will remain valid, so it is recommended that you get your own API key from Google. The link for getting an API key is here: https://developers.google.com/youtube/v ... ng-started
Last edited by jiaheng on Thu Mar 14, 2019 10:09 am, edited 4 times in total.

yichenli
Posts: 7
Joined: Mon Apr 16, 2018 10:23 am

### Re: Proj 4 - Student Defined Final Project

Concept
The Bpi at the Pompidou Center collects data on news browsing, translation, and video watching activity on its public computers. It is located near several other museums in the 3rd arrondissement of Paris. According to its 2018 visitor report, 47% of its visitors came from the suburbs, 48% came from Paris, and 14% came from Seine-Saint-Denis. Of all visitors, 24% are of foreign nationality, 12% have dual nationality, and 64% have French nationality; 55% of its visitors speak only French at home.
Looking at its visitor report, I assumed that many of the visitors must be immigrants or speak another language. Therefore, I chose to visualize news browsing activity around foreign news during the week of 8/14 to 8/21, 2018.
Query
There was no database query for this project.
The first step was to delete records of domains such as "lefigaro" and "lemonde", which are based in France. However, since the data itself only provided the timestamp, session_id, domain, url, and title of each record, several web scrapers written in Python were used to infer the country where most users of each news website are based, the distance of that country from France, the UTC offset (roughly corresponding to timezone) of that country, and the country in which the website is hosted. (Sending such large quantities of requests to websites is not encouraged.)
The UTC offset and distance data were chosen because someone who is far from home internationally often encounters a distance from home not only spatially but also temporally (an illusory one created by how sunlight travels across the globe). For example, an immigrant could be calling their loved ones at 3am, which would be daytime for them.
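The scrapers above infer each site's country from page content. A much cruder alternative, shown purely for illustration, is a top-level-domain lookup (the mapping below is a tiny hypothetical sample, and many news sites use generic TLDs where this fails):

```python
from urllib.parse import urlparse

# Tiny illustrative mapping; nothing like the real scrapers' coverage
TLD_COUNTRY = {".dz": "Algeria", ".cn": "China", ".ru": "Russia"}

def guess_country(url):
    """Guess a news site's country from its top-level domain, else None."""
    host = urlparse(url).netloc
    for tld, country in TLD_COUNTRY.items():
        if host.endswith(tld):
            return country
    return None
```

The fallback to None is exactly why content-based scraping was needed: .com and .net hosts carry no country signal at all.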
Preliminary Sketches
Initially, I wanted to make a foggy environment in which each news article's distance from the viewer corresponds to its country's distance from France. Due to lack of coding skills, a more clear-cut design was used.
Final Result
In the final version, news browsing records are represented by circles, with colors determined by the word length of their respective countries.
Hovering the mouse over a circle shows its country, website host country, title, and columns. Some titles are missing in the original data; clicking several links shows that they are often not written in the Latin alphabet, so these titles are shown as "title not written in the Latin alphabet".
Estimated travel time by plane (from France) was also added, with a cruising speed of 780 km/h and a take-off/landing time of 0.5 h per flight, to make the distance seem more personal. It is a flawed way to visualize travel given that some of the countries are not connected to France by land, and realistically, not everyone would have the privilege of taking a plane.
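The travel-time estimate described above reduces to a one-line formula; as a sketch:

```python
def flight_hours(distance_km, cruise_kmh=780, overhead_h=0.5):
    """Estimated flight time from France: cruise time at 780 km/h plus
    a fixed 0.5 h take-off/landing overhead, as described above."""
    return distance_km / cruise_kmh + overhead_h
```

So a 780 km hop reads as 1.5 hours, and a country 7800 km away reads as 10.5 hours, which is the kind of personal-scale number the visualization displays.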
Another metric was added to represent the latitude of the web server hosting each news website. Despite the fact that some of the countries these news articles discuss are in the Southern Hemisphere, all but two of the news articles visited (see below, Angola) were hosted on servers in the Northern Hemisphere.
source.zip
Last edited by yichenli on Wed Mar 20, 2019 5:21 pm, edited 3 times in total.

yokoebata
Posts: 4
Joined: Thu Jan 10, 2019 11:00 am

### Re: Proj 4 - Student Defined Final Project

I will be using Behance's API https://www.behance.net/dev/api/endpoints/, for my final visualization project.

Utilizing the API, I will create a visualization based on the values of appreciation (quantified in the Behance API), location, and creative fields (e.g., graphic design, animation, fashion, etc.).

My data visualization is not visually where I want it to be and it is not complete; I will keep working on it until class time.
Attachments
behanceX.zip
Last edited by yokoebata on Tue Mar 12, 2019 2:30 pm, edited 5 times in total.

suburtner
Posts: 4
Joined: Thu Jan 10, 2019 10:58 am

### Re: Proj 4 - Student Defined Final Project

Examining Segregation through Travel Activity in Los Angeles & Orange County

I am using the California Household Travel Survey data for the Los Angeles and Orange counties. Data is available here: http://www.dot.ca.gov/hq/tpp/offices/om ... /chts.html

This project takes each person's travel activity, represented as connected lines, and puts them within a 3D layer of others with the same race.

My final goal is a key press that generates a "summary network" of all individuals in one network, which would allow you to compare how the networks differ in geographic space. This would show how segregation is revealed through travel activity.

Update: Tuesday, March 12th

Query

The "query" I create first involves getting my data into travel activity sequences (with a place, coordinates, and start and end times). I then take all of the travel activity for the individuals in the LA and Orange County regions with the top 50 numbers of trips by race. Thus, I "query" for location (LA & Orange County), race, and top 50 number of trips taken.
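The top-50 selection step could be sketched like this (the record fields and sample values are hypothetical; the real survey data has many more fields and required manual cleaning in R):

```python
# Hypothetical records: (person_id, race, n_trips)
people = [("a", "white", 12), ("b", "white", 30), ("c", "asian", 25),
          ("d", "white", 8)]

def top_n_by_trips(records, race, n):
    """Select the n individuals of a given race with the most trips."""
    subset = [r for r in records if r[1] == race]
    return sorted(subset, key=lambda r: r[2], reverse=True)[:n]
```

Running this once per race group yields the per-race layers that the visualization stacks in 3D.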

It is unclear how long the processing time is since I had to manually clean the data in R.

Results
Analysis
In class, I'd like to discuss some of the differences in the size and diameter of the networks. There are some expected results, such as the Bike network being small, but also unexpected ones, such as the Monday travel activity network being so small. The racial networks tell an interesting story as well, particularly in "where" people of different races are more likely to go in LA & Orange County.

Update 2: March 14th
I was able to implement creating and drawing convex hulls around the networks, and this has been added to the code. You can access the convex hulls of the networks by pressing 'h.' The third dimension of each network is determined by the hour of arrival.
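A standard way to compute the hulls drawn by the 'h' key is Andrew's monotone chain (shown here as a generic sketch; the project's own implementation may differ):

```python
def convex_hull(points):
    """Andrew's monotone-chain convex hull. Returns hull vertices in
    counter-clockwise order for a list of (x, y) tuples."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        # z-component of (a - o) x (b - o); <= 0 means clockwise turn
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# A square with one interior point: the interior point is discarded
hull = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
```

Feeding each network's projected node positions through this gives the outline polygon to draw around it.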
Attachments
MultiLayered_TravelActivity_Networks.zip
Last edited by suburtner on Thu Mar 14, 2019 1:33 pm, edited 3 times in total.

aschoettler
Posts: 4
Joined: Thu Jan 10, 2019 11:03 am

### Re: Proj 4 - Student Defined Final Project

In this project I made a visualization of lightning strike data from 2012 over a map of the US on a globe.

This project used a shapefile library to import data from a shapefile, a format used in cartography.
This allowed me to show a map of the states of the US, and I was able to transfer that to the surface of a sphere using spherical coordinates.
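Transferring map points to a sphere, as described above, uses the standard latitude/longitude-to-Cartesian conversion (axis orientation conventions vary by graphics environment; this sketch assumes z points toward the north pole):

```python
import math

def latlon_to_xyz(lat_deg, lon_deg, radius=1.0):
    """Convert latitude/longitude in degrees to 3D coordinates on a
    sphere of the given radius."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    x = radius * math.cos(lat) * math.cos(lon)
    y = radius * math.cos(lat) * math.sin(lon)
    z = radius * math.sin(lat)
    return x, y, z
```

Each shapefile vertex (and each lightning strike) can be pushed through this function so state outlines and data share one globe surface.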

The project also incorporates a heightmap, which uses a Delaunay mesh computation to produce the triangles necessary to draw the heightmap surface.
Attachments
weather0.zip