Proj 2 - 3D Visualization

glegrady
Posts: 203
Joined: Wed Sep 22, 2010 12:26 pm

Proj 2 - 3D Visualization

Post by glegrady » Thu Dec 31, 2020 4:33 pm

Proj 2 - 3D Visualization

The 3D visualization project consists of visualizing downloaded MySQL multi-dimensional data within a 3D space using the Java-based Processing language: https://processing.org/

------------------------------------------------------------------
SCHEDULE
1.19 Visual Language Overview, Intro to Processing
1.21 PeasyCam | 3D Processing demo
1.26 3D labeling, InfoGraph (introducing camera perspective and log() for data)
1.28 Control P5 | 3D Treemap | Individual Meetings
2.02 Lab & Individual Meetings
2.04 3D Student project presentations (may extend to next class)

------------------------------------------------------------------
PROCESSING LIBRARIES
PeasyCam is the Processing library that provides 3D spatialization and mouse interaction: http://mrfeinberg.com/peasycam/
Control P5: http://www.sojamo.de/libraries/controlP5/ to add buttons if needed
Color Sampler: http://tristen.ca/hcl-picker/#/hlc/6/1/A7E57C/2C4321

Some Processing functions for 3D:
P3D: https://processing.org/tutorials/p3d/
The translate(), pushMatrix(), and popMatrix() functions are introduced. Information about push, pop, and translation can be found at: https://www.processing.org/tutorials/transform2d/
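A minimal illustration of the push/pop idea, in plain Java rather than the Processing API (the class and method names here are just for demonstration): translate() accumulates an offset, pushMatrix() saves the current state, and popMatrix() restores it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Plain-Java sketch of how translate()/pushMatrix()/popMatrix() behave:
// translate() composes an offset, pushMatrix() saves the current offset,
// popMatrix() restores the last saved one.
public class TransformStack {
    private double tx = 0, ty = 0;                 // current translation
    private final Deque<double[]> stack = new ArrayDeque<>();

    public void translate(double dx, double dy) { tx += dx; ty += dy; }
    public void pushMatrix() { stack.push(new double[]{tx, ty}); }
    public void popMatrix() { double[] s = stack.pop(); tx = s[0]; ty = s[1]; }

    // Where a point drawn at (x, y) would actually land on screen.
    public double[] apply(double x, double y) { return new double[]{x + tx, y + ty}; }

    public static void main(String[] args) {
        TransformStack t = new TransformStack();
        t.translate(100, 50);
        t.pushMatrix();
        t.translate(10, 10);                       // nested transform
        System.out.println(t.apply(0, 0)[0]);      // 110.0
        t.popMatrix();                             // back to the saved offset
        System.out.println(t.apply(0, 0)[0]);      // 100.0
    }
}
```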

------------------------------------------------------------------
The MySQL Query
As with the previous assignment, you determine your topic for the data content, but reconsider the query so that it provides multiple columns in your CSV file, as 3D space requires the following:
x value - for the horizontal location
y value - for the vertical location
z value - for the depth location
c value - for the color value
s value - for any other value such as the scale of the data
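As a sketch of how these five channels might be filled from one CSV row, using a linear remap like Processing's map() and log() for a count-like value. The column names and ranges below are hypothetical, not a prescribed scheme.

```java
// Hedged sketch: mapping one CSV row (hypothetical columns) onto the
// visual channels listed above, with a linear map() like Processing's.
public class ChannelMap {
    // Linear re-mapping, equivalent to Processing's map() function.
    static double map(double v, double inMin, double inMax, double outMin, double outMax) {
        return outMin + (outMax - outMin) * (v - inMin) / (inMax - inMin);
    }

    public static void main(String[] args) {
        // Hypothetical row: year, deweyClass, checkout count
        double year = 2012, dewey = 510, count = 340;
        double x = map(year, 2006, 2020, -300, 300);        // horizontal: time
        double y = map(dewey, 0, 1000, -300, 300);          // vertical: Dewey class
        double z = map(Math.log(count + 1), 0, 10, 0, 200); // depth: log of counts
        System.out.println(x + " " + y + " " + z);
    }
}
```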

------------------------------------------------------------------
Some previous 3D Projects to review
http://vislab.mat.ucsb.edu/2020/p2/Lu_Ye/index.html, Lu Ye: "Movie Genre of the SPL"
--
http://vislab.mat.ucsb.edu/2020/p2/Evge ... nyNoi.html, Evgeny Noi: "Slow Readers"
--
http://vislab.mat.ucsb.edu/2020/p2/Guan ... index.html, Guanyu Chen: "True vs Prediction"
--
http://vislab.mat.ucsb.edu/2020/p2/Erin_Woo/index.html, Erin Woo: "Trends in Parapsychology & the Occult at SPL"
--
http://vislab.mat.ucsb.edu/2017/p3/Kimberly/index.html, Kim Schlesinger: "Capital Bike Flow"
--
http://vislab.mat.ucsb.edu/2018/p2/Junx ... index.html, Junxiang Yao: "Star Wars Nebula"
--
http://vislab.mat.ucsb.edu/2017/p2/MertToka/index.html, Mert Toka: "Word Temperatures"
--
http://vislab.mat.ucsb.edu/2017/p2/Hann ... index.html, Hannah Wolfe: "Lost & Forgotten Books",
used word2vec: https://www.tensorflow.org/tutorials/text/word2vec
------------------------------------------------------------------
EVALUATION FEEDBACK
Your project will be reviewed according to these criteria. You can redo the project throughout the length of the course.

What are the conditions for a good grade?
1) An interesting MySQL query
2) A working, interactive visualization in 3D in the Java-based Processing environment
3) Data to determine the shape of the visual form: let the metadata values determine where and how the data is organized within the 3D space, rather than imposing a predetermined form
4) Visual coherence: the visualization should follow standard design rules. Consider space, the function of color, and clean fonts (Arial, Helvetica, Futura, etc.). Review examples at the course website: https://www.mat.ucsb.edu/~g.legrady/aca ... ences.html

CONTENT INNOVATION: your query question and outcomes. How original, engaging, or unusual your query or your approach to it may be, and how interesting the data may be. The data has to be multivariate and granular (meaning a lot of data) so that we can see patterns forming in the data.

DESIGN: The design can build on our demos but hopefully go beyond. Areas of exploration are in how you use space, form, colors, data organization, timing, interaction, coherence, direction, etc.

COMPUTATION: The third evaluation is the computational component. First of all, the code needs to work. Special consideration will be given to unusual, elegant expression, utilizing functions, algorithms, etc., that you can introduce to the class.

------------------------------------------------------------------
Label your Documents
Please make sure to label your documents, such as CSV files, with the name of your project or your own name, so we can identify where they come from.

This is a lot to cover in the short time we have. Take one step at a time!
Last edited by glegrady on Thu Dec 31, 2020 6:26 pm, edited 1 time in total.
George Legrady
legrady@mat.ucsb.edu

lfloegelshetty
Posts: 4
Joined: Thu Jan 07, 2021 3:02 pm

Re: Proj 2 - 3D Visualization

Post by lfloegelshetty » Fri Feb 05, 2021 4:29 pm

My concept: I am interested in seeing the shared common interests of those who use the SPL and wanted to base my visualization on that. For my data, I have decided to look at the top 2000 books checked out each year from the SPL. I am focused more specifically on the Dewey classification and year, as those will be the deciding factors for how the data will be grouped. I have only attached a small snippet of the data I am collecting, as the computation has been taking a long time and I need to add some more conditional statements to get the data I want.

Query:
SELECT
    YEAR(cout) AS years,
    deweyClass,
    title,
    COUNT(bibNumber) AS Counts
FROM
    spl_2016.outraw
WHERE
    (deweyClass >= 000 AND deweyClass < 1000)
    AND deweyClass != ''
    AND deweyClass != ' '
    AND itemType = 'acbk'
    AND YEAR(cout) = 2006
GROUP BY deweyClass, title, YEAR(cout)
ORDER BY Counts DESC
LIMIT 2000;

My idea: I wanted to display the data in the form of constellations going along the lines of wanting to see the beautiful connections that form from everyone represented as a data point. My design is based on the attached constellation image.

There will be a 3D sphere made up of the data points collected from the query. Around the edges will be the corresponding years, i.e. 2006, 2007, 2008. The sphere will be divided into fourteen segments: the years 2006 to 2012 will be represented on the top half and the years 2013 to 2019 on the lower half. Each segment, representing one year, will make up 1/14 of the sphere. Each segment will contain the 2000 data points pulled from the query, placed randomly within the segment but in a way that avoids complete uniformity or clustering. The user will have the ability to make any year's segment of the sphere visible or invisible, to better see the data formations in each segment.

Within the segments themselves, lines connecting data points that fall under the same Dewey classes will form the 3D constellations. The constellations will be color-coded by Dewey class; for example, items under the science category in the year 2006 will form their own constellation. Even though segments may share items in the same Dewey classes, each segment will serve as a separation for the constellations: 2006 will have its own constellation made up of items under the science category and 2007 will have its own, but constellations of the same Dewey class will share a color so the user can easily discern all of the different constellations. The user can also hover over each data point to see the item it represents. I have provided visuals below that demonstrate what I have described in a more concrete manner, along with the basis of the data I have already created in Processing.
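A rough sketch, in plain Java, of how the fourteen wedges and the random placement could work. All names and constants here are my own assumptions, not the author's code.

```java
import java.util.Random;

// Hedged sketch of the segment layout described above: 14 years, each
// owning 1/14 of the sphere (a longitude wedge), with points placed at a
// random longitude/latitude inside their year's wedge.
public class YearSegments {
    static final int SEGMENTS = 14;              // years 2006..2019

    // Which wedge a given year falls into (0..13).
    static int segmentIndex(int year) { return year - 2006; }

    // [startAngle, endAngle) of the wedge, in radians around the y axis.
    static double[] wedgeBounds(int seg) {
        double w = 2 * Math.PI / SEGMENTS;
        return new double[]{seg * w, (seg + 1) * w};
    }

    // Random point on a sphere of radius r, constrained to the wedge.
    static double[] randomPoint(int seg, double r, Random rng) {
        double[] b = wedgeBounds(seg);
        double lon = b[0] + rng.nextDouble() * (b[1] - b[0]);
        // acos of a uniform value avoids clustering at the poles
        double lat = Math.acos(2 * rng.nextDouble() - 1) - Math.PI / 2;
        double x = r * Math.cos(lat) * Math.cos(lon);
        double y = r * Math.sin(lat);
        double z = r * Math.cos(lat) * Math.sin(lon);
        return new double[]{x, y, z};
    }

    public static void main(String[] args) {
        System.out.println(segmentIndex(2012));  // 6
        double[] b = wedgeBounds(0);
        System.out.println(b[1] - b[0]);         // ~0.4488 rad per wedge
    }
}
```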

My goal: In the end, what I wanted to achieve was to create a small universe with seemingly random points that define it but are actually related with one another to create these beautiful 3D line art.

Concept design is in the attached drawing.

top view: The gaps between the segments are from an error in my math in how wide each segment should be. I am going to change the lines inside the sphere to be more visible and make the line around the sphere wider. That line is where the years will display and I want to color code segments of the line to provide the user a better idea of where one segment ends and another begins.

side view: There is a current bias for the points to appear around the z axis but that is due to me not randomizing the occurrence of points around the x and y axis to equalize the overall bias. There are points within the sphere to allow for 3D line formations.

Points to Consider:
1. How I will form the line connections between all the points: do I want to connect points by closest neighbors, or should I form sub-connections among points of the same Dewey classes to make the connections even more closely related?

2. Should I leave the constellations constrained by the segments they are in or should I allow the user to be able to pull apart the constellation they want to closer examine from the sphere?
Attachments
2006top.csv
(119.97 KiB) Downloaded 99 times
unnamed-2.png
unnamed-1.png
unnamed.png
unnamed.jpg
Last edited by lfloegelshetty on Thu Mar 18, 2021 12:13 pm, edited 6 times in total.

wsheppard
Posts: 3
Joined: Thu Jan 07, 2021 3:09 pm

Re: Proj 2 - 3D Visualization

Post by wsheppard » Mon Feb 08, 2021 8:34 am

This reply details my progress on Project 2.

Concept. I wanted to explore trends in music CD checkouts over time. This seemed like an interesting question because different music genres gain prevalence in American culture at different times, and also because the ways people listen to music have changed dramatically since 2006, with CDs giving way to other media such as iTunes, Spotify, Pandora, Soundcloud, YouTube, etc.

I've chosen to visualize the number of CD checkouts each day from January 2006 to February 2021 that correspond to 10 different music genres. I wanted the visualization to emulate the circular shape of CDs, so I chose to represent the time axis as an upward spiral (I may put an option to flatten the spirals into disjoint circles later, one for each year). The checkout data for each genre is represented by a bar of a certain color whose length along the radial direction is proportional to the number of checkouts. These radial lines were intended to look like the rainbow colors you see on the back of a CD, but they happily also look like the moving equalizer bars you see on some soundsystems. Either way, the data is meant to evoke the imagery of listening to music.
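A minimal sketch of the spiral mapping described above, assuming one full turn per year; the radius and pitch constants are arbitrary choices of mine, not values from the project.

```java
// Hedged sketch of the upward time spiral: day index i since Jan 2006
// maps to an angle (one turn per year) and a height. A genre's bar would
// then extend radially outward from the resulting point.
public class TimeSpiral {
    static final double RADIUS = 150;            // base radius of the spiral
    static final double PITCH  = 40;             // vertical rise per full year
    static final double DAYS_PER_TURN = 365.25;

    // (x, y, z) of day index i on the spiral; y is "up".
    static double[] spiralPoint(int dayIndex) {
        double turns = dayIndex / DAYS_PER_TURN;
        double a = 2 * Math.PI * turns;
        return new double[]{RADIUS * Math.cos(a), PITCH * turns, RADIUS * Math.sin(a)};
    }

    public static void main(String[] args) {
        double[] p0 = spiralPoint(0);
        System.out.println(p0[0] + " " + p0[1]); // 150.0 0.0
    }
}
```

Flattening the spiral into disjoint circles, as mentioned above, would amount to snapping the y value to the year number times the pitch.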

Query.
SELECT
DATE(cout) AS date, COUNT(IF(spl_2016.subject.subject LIKE "%jazz%", 1, NULL)) AS 'jazz',
COUNT(IF(spl_2016.subject.subject LIKE "%rock%", 1, NULL)) AS 'rock',
COUNT(IF(spl_2016.subject.subject LIKE "%pop%", 1, NULL)) AS 'pop',
COUNT(IF(spl_2016.subject.subject LIKE "%country%", 1, NULL)) AS 'country',
COUNT(IF(spl_2016.subject.subject LIKE "%folk%", 1, NULL)) AS 'folk',
COUNT(IF(spl_2016.subject.subject LIKE "%electronic%", 1, NULL)) AS 'electronic',
COUNT(IF(spl_2016.subject.subject LIKE "%soul%", 1, NULL)) AS 'soul',
COUNT(IF(spl_2016.subject.subject LIKE "%blues%", 1, NULL)) AS 'blues',
COUNT(IF(spl_2016.subject.subject LIKE "%rap%", 1, NULL)) AS 'rap',
COUNT(IF(spl_2016.subject.subject LIKE "%musical%", 1, NULL)) AS 'musical'
FROM
spl_2016.subject,
spl_2016.outraw
WHERE
spl_2016.outraw.bibNumber = spl_2016.subject.bibNumber
AND itemtype LIKE "%cd"
AND deweyClass LIKE "78%"
GROUP BY date
ORDER BY date ASC

The data and some sketches are included below. Some insights so far are that CD checkouts have indeed gone down appreciably since 2006. The data also reflect a large spike in checkouts just before the Covid-19 pandemic in March 2020. Perhaps unsurprisingly, rock and pop music have been consistently the most popular genres among those I'm exploring by a wide margin. This may suggest that plotting on a log scale is more appropriate for ease of viewing.
Attachments
sheppard3dprojectCD.csv
Data
(227.23 KiB) Downloaded 90 times
sheppard3dprojectCD.zip
3d project zip file
(90.04 KiB) Downloaded 94 times
cd sketch.jpg
Sketch of concept
20210208_102717.jpg
20210208_102518.jpg
20210208_102446.jpg
Last edited by wsheppard on Sat Feb 13, 2021 2:03 pm, edited 2 times in total.

ingmar_sturm
Posts: 3
Joined: Thu Jan 07, 2021 3:10 pm

Re: Proj 2 - 3D Visualization

Post by ingmar_sturm » Mon Feb 08, 2021 12:53 pm

Concept Title

Transitions -- How do Dewey classifications of library items change over their lifespan?

Concept Description

How are library items classified? The Seattle Public Library (SPL) database contains the Dewey classification of an item each time it is checked out. The database therefore makes it possible to see how items are classified at different points in time, and lets us answer the following interesting questions:
  • When were they first classified? (Some items were checked out without a Dewey classification, or with an NA classification.)
  • What was their initial classification?
  • When did the classification change?
  • Are there any patterns regarding the direction of change, i.e. did many books from one particular Dewey class change to a particular other Dewey class?
As these questions show, I am interested in the how and when of these changes, not whether such a change occurred at all. I therefore retrieve only records from the database whose Dewey classification has changed.

MySQL Query

SELECT *  # select all columns
FROM spl_2016.outraw AS a  # alias the whole table as "a"
WHERE EXISTS (
    # the subquery below serves only to filter "a"
    SELECT 1
    # this could really be anything (e.g. SELECT *) because we just need the matching rows
    FROM (SELECT itemNumber,
                 COUNT(DISTINCT deweyClass) AS deweyct  # counts the Dewey classes per item
          FROM spl_2016.outraw
          WHERE deweyClass REGEXP '^[78].*'  # matches Dewey classes starting with 7 or 8
          GROUP BY itemNumber
          HAVING deweyct > 1) AS b
    # keep only items with more than one Dewey class, and save this filter dataset as "b"
    WHERE a.itemNumber = b.itemNumber
    # here is where the magic happens: we filter "a" by the item numbers that occur in "b",
    # returning all columns but only for items with more than one Dewey class
)
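The same filter can be stated in a few lines of plain Java, which may make the EXISTS logic easier to follow. This is a hedged analogue of the query, not the project's code; the record layout is invented.

```java
import java.util.*;

// Hedged Java analogue of the SQL filter above: given (itemNumber,
// deweyClass) pairs, keep only item numbers seen with more than one
// distinct Dewey class.
public class DeweyChangers {
    static Set<String> changers(List<String[]> records) {  // [itemNumber, deweyClass]
        Map<String, Set<String>> classes = new HashMap<>();
        for (String[] r : records)
            classes.computeIfAbsent(r[0], k -> new HashSet<>()).add(r[1]);
        Set<String> out = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : classes.entrySet())
            if (e.getValue().size() > 1) out.add(e.getKey());  // > 1 distinct class
        return out;
    }

    public static void main(String[] args) {
        List<String[]> recs = Arrays.asList(
            new String[]{"i1", "781"}, new String[]{"i1", "782"},  // changed
            new String[]{"i2", "810"}, new String[]{"i2", "810"}); // unchanged
        System.out.println(changers(recs));  // [i1]
    }
}
```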

Sketches

Click on images to enlarge
7k_items_dewclass_change_b.jpg
book_sample3.jpg
These plots show a non-random subset of my query results. It is clear that at certain points in time, many books with a similar class were first checked out (presumably they were purchased shortly before) and that they change their Dewey class at certain points in time.
sketch1.jpg
This is my first idea: putting books into a cube and trace them over time with both the z-axis and colors representing Dewey classes.
image (1).png
alluvial_plot.png
This alluvial plot is an alternative representation. My idea is to have each cluster of nodes (the thing that looks like a wall) represent a point in time and within this point in time (let's say one year), have each node (the things that look like bricks of the wall) be an individual Dewey class. The ribbons trace which classes lead to which.

Preliminary Insights

The query yields about 500,000 check-out records, suggesting that a large number of items have changed their Dewey class between 2006 and 2021. It looks like some books start out with an NA Dewey class, suggesting they might not yet have been classified. A large number of books change their Dewey class during a short time interval, suggesting that some event prompted a change affecting many items at once. Most books change to a nearby Dewey class, although some are re-classified to very different categories. Interestingly, some items change their barcode, which frequently happens at a different point in time than their Dewey class change. I'm still struggling to understand whether there is a recognizable pattern.

Final Visualization
scrn-0176.jpg
scrn-0097.jpg
scrn-0003.jpg
Attachments
itemnums_cube_v4.zip
(1.48 MiB) Downloaded 89 times
scrn-0003.jpg
major_changers.csv
(4.8 MiB) Downloaded 89 times
deweyClasses.csv
(664 Bytes) Downloaded 75 times
Last edited by ingmar_sturm on Fri Mar 19, 2021 9:37 am, edited 1 time in total.

colette_lee
Posts: 3
Joined: Thu Jan 07, 2021 3:07 pm

Re: Proj 2 - 3D Visualization

Post by colette_lee » Mon Feb 08, 2021 7:57 pm

This is my progress so far on Project 2: 3D Visualization

Concept: I wanted to explore checkouts of items related to astrology over time. I wanted to connect the checkouts to the checkins of these items along a spiral time "axis" with curved lines, to give off the feeling of a birth chart. I am visualizing every checkout of an item whose title includes 'astrology' as a Bezier curve, where one end of the curve is the date it was checked out and the other is the date it was checked in. The control point of the curve is determined by the hour it was checked out, so the later in the day the item was checked out, the further the control point is from the endpoint, resulting in a steeper curve.
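A small sketch of this curve construction, assuming a quadratic Bezier with a single control point whose height scales with the checkout hour. The names and the lift scale are illustrative, not taken from the project.

```java
// Hedged sketch: endpoints at the checkout and checkin dates, control
// point lifted in proportion to the checkout hour (later = steeper).
public class CheckoutCurve {
    // Height of the control point above the chord, for hour 0..23.
    static double controlLift(int checkoutHour, double maxLift) {
        return maxLift * checkoutHour / 23.0;
    }

    // Quadratic Bezier value at parameter t in [0, 1].
    static double bezier(double p0, double c, double p1, double t) {
        double u = 1 - t;
        return u * u * p0 + 2 * u * t * c + t * t * p1;
    }

    public static void main(String[] args) {
        double lift = controlLift(23, 100);          // latest checkout: full lift
        System.out.println(bezier(0, lift, 0, 0.5)); // 50.0: apex of the curve
    }
}
```

Note that Processing's built-in bezier() is cubic (two control points); a quadratic control point can be converted to two cubic ones, or curveVertex()/bezierVertex() used directly.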

Query
SELECT
itemnumber,
bibNumber,
DATE(cout) AS coutDate,
DATE(cin) AS cinDate,
TIME(cout) AS coutTime,
TIME(cin) AS cinTime,
title
FROM
spl_2016.inraw
WHERE
title LIKE '%astrology%'
ORDER BY itemNumber
LIMIT 30000;

The data is attached below. Each color represents a distinct item. The query sorts the checkouts by itemNumber, and each bibNumber is mapped to a different hue, so the H value in HSB color for each curve indicates when the item was acquired by the library. Some insights so far: these checkouts are more popular from approximately 2006-2010. From 2011-2020, checkouts are less common and also have shorter durations, which I can tell because the endpoints of the curves are close together. Because of the variation in the number of checkouts per year, I decided to do a polynomial regression fit using https://arachnoid.com/polysolve/ on the number of checkouts per year, and use the resulting polynomial to determine the size of the time spiral at each point in time.
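Once polysolve returns coefficients, evaluating the fitted polynomial at a time t gives the spiral radius there. A minimal Horner-style evaluation; the coefficients below are placeholders, not the actual fit.

```java
// Hedged sketch: spiral radius at time t as a fitted polynomial,
// evaluated with Horner's method. coeffs[i] multiplies t^i.
public class SpiralRadius {
    static double poly(double[] coeffs, double t) {
        double r = 0;
        for (int i = coeffs.length - 1; i >= 0; i--) r = r * t + coeffs[i];
        return r;
    }

    public static void main(String[] args) {
        double[] c = {100, -2, 0.05};        // hypothetical fit, not the real one
        System.out.println(poly(c, 10));     // 100 - 20 + 5 = 85.0
    }
}
```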
astrologyall.csv
(1002.04 KiB) Downloaded 91 times
Screen Shot 2021-02-08 at 7.52.15 PM.png
Screen Shot 2021-02-08 at 7.52.33 PM.png
Screen Shot 2021-02-08 at 7.51.57 PM.png
astrology.zip
(122.87 KiB) Downloaded 83 times
Last edited by colette_lee on Fri Feb 12, 2021 2:43 pm, edited 1 time in total.

zhuowei
Posts: 3
Joined: Thu Jan 07, 2021 3:00 pm

Re: Proj 2 - 3D Visualization

Post by zhuowei » Tue Feb 09, 2021 9:14 am

Concept:
The goal is to show checkout duration and checkout counts for the different Dewey classes and subclasses over the past 15 years. I want the product to show the pattern of popularity and checkout duration within each year, as well as yearly or seasonal trends in checkout duration and checkout counts for the different Dewey classes and subclasses. The idea is to use the z axis as a time axis, with polygons showing the information for each year or month, and line plots along the time axis showing the trend over years or months. In the design:
Vertices to center length of polygon: checkout duration of each class or subclass.
Color fill of polygon: the total number of checkouts in that year or month.
Value of line plot: checkout duration of each class or subclass.
Color of line plot: the total number of checkouts of each class or subclass.

To keep the design clean and easy to extract information, I want to use interaction to give users the ability to show or fold the information. When one wants to see the monthly trend, they can click on the polygon of each year and it will unfold and show the monthly data. When one wants to see the subclass information, they can click on each class to show the information for subclasses.

SQL Code 1:

Code: Select all

SELECT 
    SUM(CASE
        WHEN deweyClass > 0 AND deweyClass < 100 THEN TIMESTAMPDIFF(HOUR, cout, cin) / 24.0
        ELSE 0
    END) / SUM(CASE
        WHEN deweyClass > 0 AND deweyClass < 100 THEN 1
        ELSE 0
    END) AS 'dewey_000',
    SUM(CASE
        WHEN deweyClass > 100 AND deweyClass < 200 THEN TIMESTAMPDIFF(HOUR, cout, cin) / 24.0
        ELSE 0
    END) / SUM(CASE
        WHEN deweyClass > 100 AND deweyClass < 200 THEN 1
        ELSE 0
    END) AS 'dewey_100',
    SUM(CASE
        WHEN deweyClass > 200 AND deweyClass < 300 THEN TIMESTAMPDIFF(HOUR, cout, cin) / 24.0
        ELSE 0
    END) / SUM(CASE
        WHEN deweyClass > 200 AND deweyClass < 300 THEN 1
        ELSE 0
    END) AS 'dewey_200',
    SUM(CASE
        WHEN deweyClass > 300 AND deweyClass < 400 THEN TIMESTAMPDIFF(HOUR, cout, cin) / 24.0
        ELSE 0
    END) / SUM(CASE
        WHEN deweyClass > 300 AND deweyClass < 400 THEN 1
        ELSE 0
    END) AS 'dewey_300',
    SUM(CASE
        WHEN deweyClass > 400 AND deweyClass < 500 THEN TIMESTAMPDIFF(HOUR, cout, cin) / 24.0
        ELSE 0
    END) / SUM(CASE
        WHEN deweyClass > 400 AND deweyClass < 500 THEN 1
        ELSE 0
    END) AS 'dewey_400',
    SUM(CASE
        WHEN deweyClass > 500 AND deweyClass < 600 THEN TIMESTAMPDIFF(HOUR, cout, cin) / 24.0
        ELSE 0
    END) / SUM(CASE
        WHEN deweyClass > 500 AND deweyClass < 600 THEN 1
        ELSE 0
    END) AS 'dewey_500',
    SUM(CASE
        WHEN deweyClass > 600 AND deweyClass < 700 THEN TIMESTAMPDIFF(HOUR, cout, cin) / 24.0
        ELSE 0
    END) / SUM(CASE
        WHEN deweyClass > 600 AND deweyClass < 700 THEN 1
        ELSE 0
    END) AS 'dewey_600',
    SUM(CASE
        WHEN deweyClass > 700 AND deweyClass < 800 THEN TIMESTAMPDIFF(HOUR, cout, cin) / 24.0
        ELSE 0
    END) / SUM(CASE
        WHEN deweyClass > 700 AND deweyClass < 800 THEN 1
        ELSE 0
    END) AS 'dewey_700',
    SUM(CASE
        WHEN deweyClass > 800 AND deweyClass < 900 THEN TIMESTAMPDIFF(HOUR, cout, cin) / 24.0
        ELSE 0
    END) / SUM(CASE
        WHEN deweyClass > 800 AND deweyClass < 900 THEN 1
        ELSE 0
    END) AS 'dewey_800',
    SUM(CASE
        WHEN deweyClass > 900 AND deweyClass < 1000 THEN TIMESTAMPDIFF(HOUR, cout, cin) / 24.0
        ELSE 0
    END) / SUM(CASE
        WHEN deweyClass > 900 AND deweyClass < 1000 THEN 1
        ELSE 0
    END) AS 'dewey_900'
FROM
    spl_2016.inraw
WHERE
    itemtype LIKE '%bk'
        AND YEAR(cout) >= '2006'
        AND YEAR(cout) <= '2020'
GROUP BY YEAR(cout), MONTH(cout)
ORDER BY YEAR(cout), MONTH(cout) ASC;
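Each SUM(CASE)/SUM(CASE) pair above computes a conditional average: total loan-days within a Dewey bucket divided by the number of checkouts in that bucket. A tiny Java analogue of that pattern, with invented sample values:

```java
// Hedged mirror of the SUM(CASE)/SUM(CASE) pattern: the ratio of
// conditional sums is the average checkout duration (days) for one bucket.
public class ConditionalAverage {
    // records: {deweyClass, durationDays}
    static double avgDuration(double[][] records, double lo, double hi) {
        double sum = 0, n = 0;
        for (double[] r : records) {
            if (r[0] > lo && r[0] < hi) { sum += r[1]; n += 1; }  // same bounds style as the query
        }
        return sum / n;
    }

    public static void main(String[] args) {
        double[][] recs = {{510, 14}, {520, 21}, {815, 7}};   // invented sample
        System.out.println(avgDuration(recs, 500, 600));      // (14 + 21) / 2 = 17.5
    }
}
```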

SQL Code 2:

Code: Select all

SELECT 
    SUM(CASE
        WHEN deweyClass > 000 AND deweyClass < 100 THEN 1
        ELSE 0
    END) AS 'dewey_000',
    SUM(CASE
        WHEN deweyClass > 100 AND deweyClass < 200 THEN 1
        ELSE 0
    END) AS 'dewey_100',
   SUM(CASE
        WHEN deweyClass > 200 AND deweyClass < 300 THEN 1
        ELSE 0
    END) AS 'dewey_200',
   SUM(CASE
        WHEN deweyClass > 300 AND deweyClass < 400 THEN 1
        ELSE 0
    END) AS 'dewey_300',
    SUM(CASE
        WHEN deweyClass > 400 AND deweyClass < 500 THEN 1
        ELSE 0
    END) AS 'dewey_400',
    SUM(CASE
        WHEN deweyClass > 500 AND deweyClass < 600 THEN 1
        ELSE 0
    END) AS 'dewey_500',
    SUM(CASE
        WHEN deweyClass > 600 AND deweyClass < 700 THEN 1
        ELSE 0
    END) AS 'dewey_600',
    SUM(CASE
        WHEN deweyClass > 700 AND deweyClass < 800 THEN 1
        ELSE 0
    END) AS 'dewey_700',
    SUM(CASE
        WHEN deweyClass > 800 AND deweyClass < 900 THEN 1
        ELSE 0
    END) AS 'dewey_800',
    SUM(CASE
        WHEN deweyClass > 900 AND deweyClass < 1000 THEN 1
        ELSE 0
    END) AS 'dewey_900'
FROM
    spl_2016.inraw
WHERE
    itemtype LIKE '%bk'
        AND YEAR(cout) >= '2006'
        AND YEAR(cout) <= '2020'
GROUP BY YEAR(cout), MONTH(cout)
ORDER BY YEAR(cout), MONTH(cout) ASC;


csv files in attachments
Sketches and screen shots in attachments.

Insights:
1. The total number of checkouts went up in the first four years but started to go down after 2009.
2. The checkout duration slowly went down from 2006, with a big drop in 2011. The checkout duration in 2020 was more than triple that of the previous year. This is probably caused by a policy change due to the COVID situation.
3. In general, the average checkout duration is longer than the library's policy allows. This is a little weird and I don't have a good explanation for it. Maybe it took a few extra days for a returned item to get checked in?
4. The Dewey class with the shortest checkout durations is arts and recreation. The class with the longest duration changed from literature to language starting in 2011.
5. The most popular Dewey class is arts and recreation, and the least popular is language.
6. Because of the COVID situation, there were close to zero checkouts from April to July 2020.
More to add...

I have attached the code for the current version of the project. I am working on adding the labeling and descriptive data.
Attachments
proj_2.zip
(99.42 KiB) Downloaded 47 times
csv_data.zip
(40.71 KiB) Downloaded 61 times
sketches.png
screenshot3.jpg
screenshot2.jpg
screenshot1.jpg
Last edited by zhuowei on Fri Feb 12, 2021 12:12 am, edited 2 times in total.

richardjiang
Posts: 3
Joined: Thu Jan 07, 2021 3:05 pm

Re: Proj 2 - 3D Visualization

Post by richardjiang » Tue Feb 09, 2021 9:37 am

Concept

The use and popularity of certain words change over time as culture evolves. In this project, I wanted to explore this evolution using data from the SPL. While, from one perspective, we could better map these by looking at the popularity of words in titles as a function of publication date, this does not capture the direct response from the audience about which words are most attractive and it does not utilize the benefits of the SPL dataset well.

The topic I settled on was to visualize the number of checkouts of each word in a particular year. Initially, the hope here was to see how some words would naturally lose popularity over the years. The data is mapped into 3D in the following way:

1. Each word is given a 2D coordinate using Word2Vec and a dimensionality reduction technique called UMAP. Using this, we can algorithmically determine some sort of clusters where similar words are close to each other
2. The 3rd coordinate would be dependent on the relative popularity of that word in a particular year
3. Each year is located within its own reference frame.
4. Absolute frequency of the word among the entire 'active' corpus would be encoded by a color
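A hedged sketch of this mapping for one word within one year's reference frame; the field names, the height scale, and the normalization are my assumptions, not the project's code.

```java
// Hedged sketch: a word's precomputed 2D embedding (Word2Vec + UMAP)
// gives (x, z), its within-year popularity rank gives height, and
// absolute frequency is normalized for the color channel.
public class WordPoint {
    static double[] place(double embedX, double embedY, int rank, int corpusSize,
                          long freq, long maxFreq) {
        double y = 200.0 * (corpusSize - rank) / corpusSize;  // popular words sit higher
        double hue = (double) freq / maxFreq;                 // 0..1, mapped to a gradient later
        return new double[]{embedX, y, embedY, hue};
    }

    public static void main(String[] args) {
        // Hypothetical word: embedding (0.3, -0.7), rank 1 of 100, freq 500 of 1000 max
        double[] p = place(0.3, -0.7, 1, 100, 500, 1000);
        System.out.println(p[1] + " " + p[3]);  // 198.0 0.5
    }
}
```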

The interactive component allows the user to:
1. Specify the word/words to track over the years - in which case a line will connect the location through time (if it exists in that year)
2. Select a window of popularity to view i.e. top 100, 20 - 50, 100 - 200 which will scale all of the colors/computations into that particular corpus
3. Select color gradients/design elements

While it would be great to represent every single word at the same time, and it would be technically possible, this not only becomes an incredible computational burden due to the number of words to render, but also adds a lot of noise, making the visualization difficult to interpret and explore.

Query

The query is relatively simple but with a few post-processing steps, which will be attached to this entry.

Code: Select all

SELECT 
    YEAR(s.cout) as year, s.title, COUNT(*) as checkouts
FROM
    spl_2016.outraw s
WHERE
		s.itemtype LIKE "%bk"
        AND s.title != '' 
        AND s.collcode NOT LIKE "%comic"
        AND s.callNumber NOT LIKE "CHINESE%" 
        AND s.callNumber NOT LIKE "JAPANESE%"
        AND s.callNumber NOT LIKE "SPANISH%"
        AND s.callNumber NOT LIKE "KOREAN%"
        AND s.callNumber NOT LIKE "VIETNAM%"
        AND s.callNumber NOT LIKE "FRENCH%"
        AND s.callNumber NOT LIKE "GERMAN%"
        AND s.callNumber NOT LIKE "ITALIAN%"
        AND s.callNumber NOT LIKE "RUSSIAN%"
        AND s.callNumber NOT LIKE "ARABIC%"
        AND s.callNumber NOT LIKE "SWEDISH%"
        AND s.callNumber NOT LIKE "PORTUGU%"
        AND s.cout > NOW() - INTERVAL 13 YEAR
        AND s.cout < NOW() - INTERVAL 1 YEAR
GROUP BY YEAR(s.cout), s.title
The data is quite large, so a reduced dataset is attached, but the query runs relatively quickly (<15 minutes). Altogether it produces approximately 2.9M rows.
Attachments
word_popularity_over_time.zip
(13.74 MiB) Downloaded 40 times
screen-0126.png
screen-0327.png
screen-0916.png
screen-1370.png
Last edited by richardjiang on Tue Feb 16, 2021 10:51 am, edited 2 times in total.

ashleybruce
Posts: 11
Joined: Thu Jan 07, 2021 2:59 pm

Re: Project 2: Correlation of "Water Keywords" with Rainfall in Seattle

Post by ashleybruce » Tue Feb 09, 2021 5:37 pm

If one were to try and associate a city in the US with rain, Seattle would be one of the first cities that comes to mind.

Concept: I wanted to see whether there is a relationship between the number of checkouts of "water"-related items and the amount of rainfall that occurred on a given day.

First I did a series of queries to gather all the entries with the keywords I was looking for. These keywords were: rain, water, river, lake, ocean, and sea. This list could have been expanded to cover a much broader set of words, but since each query was already returning such a huge dataset, I decided to keep it to these 6 keywords.

The code below shows my query to the database, where YEAR is the year I was gathering data for (2007-2019). I ran each year separately for two reasons: first, the number of entries returned was so large that I decided to break it up year by year; second, I wanted each year as its own CSV file.

Code: Select all

select *
from spl_2016.outraw
where (
	LOWER(title) like '% rain %'
	or LOWER(title) like '%water%'
	or LOWER(title) like '%river%'
	or LOWER(title) like '% lake %'
	or LOWER(title) like '%ocean%'
	or LOWER(title) like '%sea %'
	)
	and year(cout) = YEAR
I then needed to gather precipitation data that occurred in Seattle over the dates that I wanted to look at. Through the National Centers for Environmental Information (NOAA), I was able to put in an order for precipitation data in Seattle over the years 2007 to 2019. A processed CSV with the precipitation data was then emailed to me. The CSV is included in this post.

Cleaning the Data: After obtaining all the data, I initially wanted to include every data point from my query so I could show titles on hover. When I tried to populate my Processing sketch with this data, the results were overwhelming. I realized I needed to condense the data to better visualize the results. Instead of treating each book as a point, I decided to look at the daily checkout totals for each keyword from the query. To do this, I wrote a short Python script that took each CSV obtained from my query, counted the daily totals, and wrote the results to another CSV. The results are included in the folder "CondensedData".

Visualization Sketch: With the new condensed data, I had a better idea in how I wanted to visualize it.
Doodles.png
My badly drawn graph
I wanted to use concentric circles to represent time, with each year being represented by a circle. Contained within the circle would be the data. The distance from the center to the ring of the circle would be relative checkout numbers and the height would be precipitation. Once all the data was inside, I would implement the convex-hull algorithm to connect the outer edge points, giving each year a unique "raindrop" look.
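The convex-hull step can be done with Andrew's monotone chain; here is a self-contained sketch of that standard algorithm (this is my illustration, not the project's own code).

```java
import java.util.*;

// Andrew's monotone chain convex hull: finds the outer boundary of a
// point set, as used above to give each year its "raindrop" outline.
public class Hull {
    static double cross(double[] o, double[] a, double[] b) {
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0]);
    }

    static List<double[]> convexHull(List<double[]> pts) {
        List<double[]> p = new ArrayList<>(pts);
        p.sort((u, v) -> u[0] != v[0] ? Double.compare(u[0], v[0])
                                      : Double.compare(u[1], v[1]));
        List<double[]> hull = new ArrayList<>();
        for (int pass = 0; pass < 2; pass++) {           // lower hull, then upper
            int start = hull.size();
            for (double[] q : p) {
                // pop points that would make a non-left turn
                while (hull.size() >= start + 2 &&
                       cross(hull.get(hull.size()-2), hull.get(hull.size()-1), q) <= 0)
                    hull.remove(hull.size()-1);
                hull.add(q);
            }
            hull.remove(hull.size()-1);                  // last point repeats
            Collections.reverse(p);                      // walk back for the upper hull
        }
        return hull;
    }

    public static void main(String[] args) {
        List<double[]> pts = Arrays.asList(
            new double[]{0,0}, new double[]{2,0}, new double[]{2,2},
            new double[]{0,2}, new double[]{1,1});       // interior point dropped
        System.out.println(convexHull(pts).size());      // 4
    }
}
```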

Output: The following pictures are after implementing this idea in Processing.
Isolated Year.png
One isolated year
CheckoutColors.png
Data colored according to relative checkout numbers
NoCH.png
No Convex Hull algorithm on data
Results: As partially expected, there is no noticeable correlation between checkouts of books with "water" keywords and the amount of precipitation that occurred in Seattle that day.
Attachments
AshleyBruceProject.zip
Processing project files
(123.38 KiB) Downloaded 50 times
CondensedData.zip
Data used in Processing files
(295.4 KiB) Downloaded 56 times
Precipitation.csv
CSV File obtained through NOAA
(64.3 KiB) Downloaded 45 times

lfloegelshetty
Posts: 4
Joined: Thu Jan 07, 2021 3:02 pm

Re: Proj 2 - 3D Visualization

Post by lfloegelshetty » Mon Feb 15, 2021 11:00 am

My idea for this project was based on seeing the interesting shapes the data would make, somewhat replicating constellations.

Concept: I wanted to look at the most popular non fiction books at the SPL from the years 2006 to 2019 and create connections between them based on their dewey classes to see the resulting shapes that would come from it.

I used the following query to get the top 2000 books of each year, repeating the search for every year used in the visualization. I chose 2000 data points per year because that was the highest my laptop could handle before performance suffered, and I did not want too much clutter from a high number of data points; the number of data points can be increased by changing the n variable in the code.

Code: Select all

SELECT
YEAR(cout) AS years,
deweyClass,
title,
COUNT(bibNumber) AS Counts
FROM
spl_2016.outraw
WHERE
(deweyClass BETWEEN 000 and 999) 
AND deweyClass != ''
AND deweyClass != ' '
AND itemType = 'acbk'
AND YEAR(cout) = 2006
GROUP BY deweyClass,title,YEAR(cout)
ORDER BY Counts DESC
limit 2000;
Using the data points, I randomly placed them into fourteen segments based on the year they belong to, creating a 3D spherical shape. I am working on placing data points based on astronomical positions to provide an even more interesting visual, but that will come a bit later. Each data point is color-coded by the Dewey class it represents, and a color table shows the Dewey color assignments. The points in each segment are then connected if they are in the same Dewey class, with the lines taking the same color as the points they connect. The order of the connections is entirely random, which results in every iteration of the visualization forming different constellations.

For the user interface, the user is able to perform the following functions:
1. choose which years and segments to see, allowing for easier comparison and closer looks into each year's formed constellations
2. the 0 to 9 keys let the user see each Dewey class
3. R lets the user see the rotation animation of the sphere
4. L lets the user see the labels of the data points and the titles they represent when hovering over them with the mouse
5. D lets the user turn off the data points to see only the constellations

Result: Although there are definitely improvements to be made (and being made) in the spacing of the segments and the clustering of points around the outer rings of the sphere, I really like the visuals the data has created, and I found it interesting how vastly more popular some subjects are than others, creating much denser constellations.
Screen Shot 2021-02-15 at 10.53.26 AM.png
Screen Shot 2021-02-15 at 10.53.58 AM.png
Screen Shot 2021-02-15 at 10.56.24 AM.png
Screen Shot 2021-02-15 at 10.56.52 AM.png
starsatnight.zip
(721.36 KiB) Downloaded 46 times
