Proj 3 - 3D Visualization

glegrady
Posts: 203
Joined: Wed Sep 22, 2010 12:26 pm

Proj 3 - 3D Visualization

Post by glegrady » Thu Jan 03, 2019 1:27 pm

In this assignment, we are exploring interactive 3D volumetric visualization. Use the full capabilities of 3D, where the x,y,z location and the color or scale of each positioned voxel represent data.

Query: Your query will need a minimum of 4 columns: the x,y,z location, plus a value to color or scale the pixel/voxel. Your MySQL query should be more complex than previous ones and cross the full spectrum of the database. As usual, you are expected to come up with an interesting question that will give interesting results! Your query should use MySQL syntax that goes beyond what was presented in the demos, and the query should go for granular detail, so mapping the main Dewey categories will not be sufficient, nor fulfill the expectation of an interesting query. "Change over time", correlating data, or spatially distributing data based on algorithms can be interesting to explore. The important thing is that the x,y,z location where the data is placed should be determined by the data.

Feb 05: Introduce the 3D assignment
Feb 07: Label and other demos
Feb 12: Discussion of spatial algorithms
Feb 14: Lab, work-in-progress
Feb 19: Project is Due

-----------------------
Previous Student Work Examples:

Mert Toka (2017): Relationship of weather data and checkouts based on weather words: http://vislab.mat.ucsb.edu/2017/p2/MertToka/index.html

Hannah Wolfe (2017): Lost and Forgotten Books in Visualization http://vislab.mat.ucsb.edu/2017/p2/Hann ... index.html

-----------------------
Libraries:
PeasyCam library: 3D spatial navigation and interactivity are introduced using PeasyCam: http://mrfeinberg.com/peasycam/

ControlP5 is a GUI and controller library that includes functions like Sliders, Buttons, Toggles, Knobs, Textfields, RadioButtons, etc. http://www.sojamo.de/libraries/controlP5/

OpenGL (Open Graphics Library), a cross-platform graphics interface, is used in this demo. The translate(), pushMatrix(), and popMatrix() functions are introduced. Information about push, pop, and translation can be found at:
https://www.processing.org/tutorials/transform2d/
http://processing.org/tutorials/p3d
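
For reference, here is a minimal Processing sketch (not from the demos; the data array is hypothetical) showing how PeasyCam and translate()/pushMatrix()/popMatrix() combine to place data-driven voxels in 3D:

Code: Select all

import peasy.*;

PeasyCam cam;
// Hypothetical data rows: x, y, z, value (replace with values from your query)
float[][] data = { {50, 20, 80, 3}, {-60, 40, -30, 7}, {10, -90, 40, 5} };

void setup() {
  size(800, 600, P3D);
  cam = new PeasyCam(this, 400);   // orbiting camera, 400 units from the origin
}

void draw() {
  background(0);
  for (float[] d : data) {
    pushMatrix();                  // save the current transform
    translate(d[0], d[1], d[2]);   // data-driven x,y,z position
    fill(map(d[3], 0, 10, 50, 255), 150, 200);
    box(d[3] * 2);                 // scale the voxel by the data value
    popMatrix();                   // restore the transform for the next voxel
  }
}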

-----------------------
The EVALUATION CRITERIA will be the following:

Innovation in content: your query question and outcomes. How original, engaging, or unusual your query or your approach to it may be, and how interesting the data may be. The data has to be multivariate and granular (meaning a lot of data) so that we can see patterns forming in the data.

Innovation in design/form: The design can build on our demos but hopefully go beyond them. Areas of exploration include how you use space, form, color, data organization, timing, interaction, coherence, direction, etc.

Computation: The third evaluation is the computational component. First of all, the code needs to work. Special consideration will be given to unusual, elegant expression utilizing functions, algorithms, etc. that you can introduce to the class.
George Legrady
legrady@mat.ucsb.edu

meilinshi
Posts: 4
Joined: Thu Jan 10, 2019 10:57 am

Re: Proj 3 - 3D Visualization

Post by meilinshi » Tue Feb 19, 2019 2:21 pm

Concept Description
I’m interested in the checkouts within Dewey Class 400 – Language. For this project, I focused on the 10 languages listed below and wanted to see whether there is a relative popularity in language learning at the Seattle Public Library. To measure popularity, I used total checkout numbers, average keeping time, and all language-related checkout items from 2006 to 2018.
language.png

MySQL Queries & Processing Time
Query #1 On total checkouts (27.002 sec)

Code: Select all

SELECT 
    YEAR(cout) AS years,
    MONTH(cout) AS months,
    COUNT(IF(deweyClass >= 420 AND deweyClass < 430,
        1,
        NULL)) AS 'English',
    COUNT(IF(deweyClass >= 430 AND deweyClass < 440,
        1,
        NULL)) AS 'German & related',
    COUNT(IF(deweyClass >= 440 AND deweyClass < 450,
        1,
        NULL)) AS 'French & related',
    COUNT(IF(deweyClass >= 450 AND deweyClass < 460,
        1,
        NULL)) AS 'Italian & related',
    COUNT(IF(deweyClass >= 460 AND deweyClass < 470,
        1,
        NULL)) AS 'Spanish & related',
    COUNT(IF(deweyClass >= 491.7
            AND deweyClass < 491.8,
        1,
        NULL)) AS 'Russian',
    COUNT(IF(deweyClass >= 492.7
            AND deweyClass < 492.8,
        1,
        NULL)) AS 'Arabic',
    COUNT(IF(deweyClass >= 495.1
            AND deweyClass < 495.2,
        1,
        NULL)) AS 'Chinese',
    COUNT(IF(deweyClass >= 495.6
            AND deweyClass < 495.7,
        1,
        NULL)) AS 'Japanese',
    COUNT(IF(deweyClass >= 495.7
            AND deweyClass < 495.8,
        1,
        NULL)) AS 'Korean'
FROM
    spl_2016.outraw
WHERE
    deweyClass >= 420 AND deweyClass < 495.8
        AND YEAR(cout) BETWEEN 2006 AND 2018
GROUP BY MONTH(cout),YEAR(cout)
ORDER BY YEAR(cout), MONTH(cout);
Query #2 On average keeping time of items (first half: 39.168 sec, second half: 65.520 sec)

Code: Select all

SELECT 
    class, years, months, AVG(TIMESTAMPDIFF(DAY, cout, cin)) AS AVG_TIME
FROM
    (SELECT 
        SUBSTRING(deweyClass, 1, 5) AS class,
        #SUBSTRING(deweyClass, 1, 2) AS class, #for deweyClass 420 to 470
            YEAR(cout) AS years, MONTH(cout) AS months,
            cin,
            cout,
            TIMESTAMPDIFF(DAY, cout, cin)
    FROM
        spl_2016.inraw
    WHERE
        YEAR(cout) BETWEEN 2006 AND 2018
            AND TIMESTAMPDIFF(DAY, cout, cin) > 0
  #    AND deweyClass >= 420 AND deweyClass < 470
           AND (deweyClass >= 491.7
           AND deweyClass < 491.8
             OR deweyClass >= 492.7
             AND deweyClass < 492.8
           OR deweyClass >= 495.1
           AND deweyClass < 495.2
           OR deweyClass >= 495.6
           AND deweyClass < 495.7
           OR deweyClass >= 495.7
           AND deweyClass < 495.8)
    GROUP BY years , months , class , cin , cout) AS aTable
GROUP BY class , years , months
Query #3 On all checkout items (60.197 sec)

Code: Select all

SELECT 
    title,
    itemtype,
    deweyClass,
    COUNT(title) AS Counts,
    COUNT(IF(YEAR(cout) = 2006, 1, NULL)) AS '2006',
    COUNT(IF(YEAR(cout) = 2007, 1, NULL)) AS '2007',
    COUNT(IF(YEAR(cout) = 2008, 1, NULL)) AS '2008',
    COUNT(IF(YEAR(cout) = 2009, 1, NULL)) AS '2009',
    COUNT(IF(YEAR(cout) = 2010, 1, NULL)) AS '2010',
    COUNT(IF(YEAR(cout) = 2011, 1, NULL)) AS '2011',
    COUNT(IF(YEAR(cout) = 2012, 1, NULL)) AS '2012',
    COUNT(IF(YEAR(cout) = 2013, 1, NULL)) AS '2013',
    COUNT(IF(YEAR(cout) = 2014, 1, NULL)) AS '2014',
    COUNT(IF(YEAR(cout) = 2015, 1, NULL)) AS '2015',
    COUNT(IF(YEAR(cout) = 2016, 1, NULL)) AS '2016',
    COUNT(IF(YEAR(cout) = 2017, 1, NULL)) AS '2017',
    COUNT(IF(YEAR(cout) = 2018, 1, NULL)) AS '2018'
FROM
    spl_2016.outraw
WHERE
    YEAR(cout) BETWEEN 2006 AND 2018
        AND (deweyClass >= 420 AND deweyClass < 470)
        OR (deweyClass >= 491.7
        AND deweyClass < 491.8)
        OR (deweyClass >= 492.7
        AND deweyClass < 492.8)
        OR (deweyClass >= 495.1
        AND deweyClass < 495.2)
        OR (deweyClass >= 495.6
        AND deweyClass < 495.8)
GROUP BY title , itemtype , deweyClass
ORDER BY COUNT(title) DESC
Screenshots & Analysis
For the 3D histogram, I put time on the x-axis, languages on the y-axis, and checkout counts on the z-axis. Because the checkout numbers for English and Spanish are far too high (they went out of the screen), I used a log scale to show the numbers. As the screenshot shows, the variation within each language looks small because of the log scale, but we can still see the variation across languages.
hist.png
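As a rough sketch of the log scaling described above (the function name and ranges are illustrative, not taken from the submitted code), a raw count can be compressed before it is mapped to a bar height:

Code: Select all

// Compress a raw checkout count onto a bar height using a log scale.
// maxCount is the largest count in the data set (assumed to be known).
float logHeight(int count, int maxCount, float maxBarHeight) {
  if (count <= 0) return 0;
  // log(count + 1) keeps small counts visible while taming the large ones
  return map(log(count + 1), 0, log(maxCount + 1), 0, maxBarHeight);
}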
Keeping time is shown as a line graph, with time on the x-axis, languages on the y-axis, and keeping time (in days) on the z-axis.
One thing to notice is the peak around January and February 2018, and the pattern is the same for all languages.
I found from my checkouts query that the January and February 2018 data were missing. In the keeping time query they are not, and the numbers are large (50-70 days) compared to the nearby months (10-30 days). I guess this is either because of a library closure or a problem with the library system at that time.
time.png
time.gif
I also added a point view of all checkout items within these languages, mapping their checkout counts to the size of the point. All items are grouped by language or item type.
overview_points.gif
The legend buttons can be clicked to see the checkout items within each language or each item type.
German.png
cd.png
pick.gif

Future Improvements
I would like to upgrade my buttons to the ControlP5 checkbox, so that I can use activate, deactivate, and the other built-in functions. I tried it, but it interfered with the two graphs I already had (for some reason, the GUI() drawn on top of the original layer makes all the text blurry). As it stands, the buttons don't work perfectly: I have to click reset every time to make a single selection (though this also enables multiple selections, which is kind of a bonus). I would also like to add a timeline slider to show the variation over time.
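
One possible way to keep a ControlP5 GUI crisp on top of a PeasyCam scene, sketched here from the two libraries' documented HUD and auto-draw features rather than from the submitted code, is to disable ControlP5's auto-draw and render it inside PeasyCam's HUD:

Code: Select all

import peasy.*;
import controlP5.*;

PeasyCam cam;
ControlP5 cp5;

void setup() {
  size(800, 600, P3D);
  cam = new PeasyCam(this, 400);
  cp5 = new ControlP5(this);
  cp5.addCheckBox("languages").setPosition(10, 10).addItem("German", 0).addItem("Korean", 1);
  cp5.setAutoDraw(false);          // we will draw the GUI manually
}

void draw() {
  background(0);
  box(100);                        // 3D content controlled by PeasyCam
  cam.beginHUD();                  // switch to 2D screen space
  hint(DISABLE_DEPTH_TEST);        // keep the GUI from being hidden by geometry
  cp5.draw();
  hint(ENABLE_DEPTH_TEST);
  cam.endHUD();
}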
Attachments
meilin_3dvisualization.zip
(155.25 KiB) Downloaded 161 times
Last edited by meilinshi on Thu Feb 21, 2019 12:32 pm, edited 6 times in total.

aschoettler
Posts: 4
Joined: Thu Jan 10, 2019 11:03 am

Sandy Schoettler - Bar Chart with lots of data

Post by aschoettler » Tue Feb 19, 2019 2:27 pm

In this project I show a bar chart of checkout information for a given item type (book or video),
grouped by day of the week & week of the year. I also compare this data to similar data from previous years.


Here is a method I found online to make text face the screen:

Code: Select all

// At the top of the file, for example after "PeasyCam cam"
float[] rotations = new float[3];
// then inside draw() or after any pushMatrix(), add this:
rotations = cam.getRotations();
rotateX(rotations[0]);
rotateY(rotations[1]);
rotateZ(rotations[2]);
// Now any text() will face the screen
I used colors from this library by downloading their preview images:
https://github.com/d3/d3-scale-chromatic
Attachments
checkoutBarChart2 Final.zip
(14.26 MiB) Downloaded 151 times
preview.png
checkoutBarChart1.zip
(11.52 MiB) Downloaded 147 times
Last edited by aschoettler on Sun Mar 10, 2019 1:17 pm, edited 5 times in total.

jiaheng
Posts: 4
Joined: Wed Jan 16, 2019 10:17 am

Re: Proj 3 - 3D Visualization

Post by jiaheng » Tue Feb 19, 2019 2:50 pm

Project 3 - 3D Visualization
Jiaheng Tang

Concept Description
For this project, I decided to explore the most popular checked-out items of each year from 2006 to 2018. I picked the top 50 most popular items of each year, as well as the top 50 of the total checkouts.
MySQL Queries
I first constructed a SQL query to get the top 1000 most checked-out items; below is the query I used.

Code: Select all

SELECT COUNT(itemNumber) AS Counts, title, itemNumber, cout, itemType, deweyClass
FROM spl_2016.inraw
WHERE YEAR(cout) BETWEEN 2006 AND 2018
GROUP BY itemNumber,title, cout, itemType, deweyClass
ORDER BY Counts DESC
LIMIT 1000
This query took a really long time, about 45 minutes.
I then ran a query to get the most popular checkout items of each year.

Code: Select all

SELECT COUNT(itemNumber) AS Counts, title, itemNumber, cout, itemType, deweyClass
FROM spl_2016.inraw
WHERE YEAR(cout)=2018
GROUP BY itemNumber,title, cout, itemType, deweyClass
ORDER BY Counts DESC
LIMIT 50
I just changed the `YEAR(cout)` to each year.
But I realized that I should get every checkout record of the top 50 most popular items to make it more interesting. Therefore I wrote a Python script to read the item numbers from the CSV file I got back and dynamically construct the SQL query I needed. Here is the Python script:

Code: Select all

import csv
import sys


def main(filename: str):
    with open(filename, 'r') as fin:
        with open(f'cout_{filename[:-4]}.sql', 'w+') as fout:
            year = filename[6:10]
            csv_reader = csv.DictReader(fin, delimiter=',')
            item_num = ''
            for row in csv_reader:
                item_num += f" OR itemNumber={row['itemNumber']}"
            item_num = item_num[4:]
            fout.write(
                f'SELECT title, itemNumber, cout, cin, itemType, deweyClass\nFROM spl_2016.inraw\nWHERE ({item_num}) AND YEAR(cout)={year}\nGROUP BY itemNumber, title, cout, cin, itemType, deweyClass\n')


if __name__ == '__main__':
    main(str(sys.argv[1]))
Here is one of the queries it produced:

Code: Select all

SELECT title, itemNumber, cout, cin, itemType, deweyClass
FROM spl_2016.inraw
WHERE (itemNumber=6168318 OR itemNumber=4814851 OR itemNumber=937827 OR itemNumber=5446948 OR itemNumber=5456901 OR itemNumber=5299584 OR itemNumber=3702279 OR itemNumber=3287481 OR itemNumber=4181717 OR itemNumber=4501511 OR itemNumber=2498093 OR itemNumber=4041489 OR itemNumber=2566498 OR itemNumber=4795932 OR itemNumber=5524467 OR itemNumber=4692088 OR itemNumber=4955520 OR itemNumber=5099578 OR itemNumber=4247243 OR itemNumber=4424525 OR itemNumber=3177807 OR itemNumber=5317788 OR itemNumber=5248119 OR itemNumber=5082517 OR itemNumber=5119956 OR itemNumber=2817060 OR itemNumber=5032075 OR itemNumber=5140938 OR itemNumber=4568479 OR itemNumber=3864614 OR itemNumber=5464418 OR itemNumber=3760081 OR itemNumber=4910008 OR itemNumber=4876121 OR itemNumber=2559433 OR itemNumber=5445782 OR itemNumber=4990761 OR itemNumber=5517891 OR itemNumber=3213417 OR itemNumber=4441172 OR itemNumber=3572168 OR itemNumber=3494842 OR itemNumber=4281289 OR itemNumber=1360002 OR itemNumber=4784028 OR itemNumber=4758313 OR itemNumber=3601773 OR itemNumber=5505648 OR itemNumber=5007385 OR itemNumber=5285182) AND YEAR(cout)=2018
GROUP BY itemNumber, title, cout, cin, itemType, deweyClass
Design
I started out with the idea of using a tunnel-like shape to represent time passing by, plotting each record as a line from the checkout date to the check-in date. Here's my first draft.
final7.png
wip1.png
wip2.png
I used different colors for the Dewey classes. However, I wasn't using the x- and y-axes for any information; for each record, I just generated a random point inside the circle.

I decided to draw the checkout records as arcs around the circle. The circle spans 0 to 360 degrees, and the angle is mapped to the month, from January to December. I also used brighter colors for the different Dewey classes. Below are some screenshots.
final1.png
final2.png
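A minimal sketch of this month-to-angle mapping (function and variable names are illustrative, not from the submitted code):

Code: Select all

// Map a checkout date to an angle around the circle:
// January starts at 0 degrees, December at 330 degrees.
float checkoutAngle(int month, int dayOfMonth) {
  float monthAngle = map(month, 1, 13, 0, TWO_PI);           // each month spans 30 degrees
  float dayOffset  = (dayOfMonth - 1) / 31.0 * (TWO_PI / 12.0);
  return monthAngle + dayOffset;
}

// Example: position on a circle of radius r for June 15th
// float a = checkoutAngle(6, 15);
// point(r * cos(a), r * sin(a));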
I also added features like toggles for Dewey classes and year, and a dropdown list for selecting the most popular titles of each year.
final3.png
final4.png
final5.png
final6.png
Final Results & Analysis
From the visualization, we can clearly see that checkouts dropped year by year, which matches the conclusion some of my peers drew. Also, a large portion of the popular checkouts have no Dewey class associated with them. I also think PeasyCam and ControlP5 are two very good libraries; they saved me a lot of time when developing this application.

Future Improvements
For future improvements, I want to add hover text on each record. I'm also considering adding a presentation mode that moves the camera slowly to show each year's titles.

Update * 2/19/2019
I made some updates to add the hover text using a slider, since I found it rather hard to check if the mouse is on an arc. Here are some screenshots.
update1.png
update2.png
The full list of updates:
  • made the text clearer using `textMode()`
  • added a slider to let the user see the titles of the checkouts
popular_items.zip
(8.03 MiB) Downloaded 167 times
Attachments
wip3.png
Last edited by jiaheng on Tue Feb 19, 2019 10:00 pm, edited 1 time in total.

yokoebata
Posts: 4
Joined: Thu Jan 10, 2019 11:00 am

Re: Proj 3 - 3D Visualization

Post by yokoebata » Tue Feb 19, 2019 3:33 pm

Concept
For this 3D visualization project, I wanted to explore the prominence of feng shui at the SPL over the decade 2008-2018, as well as the variety of nonfiction feng shui Dewey Classes that have been checked out. My intention was to create a visualization environment infused with feng shui practices, but I first wanted to see what the data would express.

MySQL Query
This query created the master data I needed to capture all my axis information as well as the title. I also collected Unix timestamps in hopes of using them for mapping purposes, but ended up not using them because I decided to base my x, y, and z axes on segments of the date (year, time, and month, respectively).

Code: Select all

SELECT title as Title, YEAR(cout) as year, TIME(cout) as time, MONTH(cout) as month, SUBSTRING(deweyClass, 1, 3) as dewey, UNIX_TIMESTAMP(DATE(cout)) as Unix, Counts  FROM (
SELECT 
		cout,
    bibNumber,
    deweyClass,
    title,
    COUNT(bibNumber) AS Counts
FROM
    spl_2016.outraw
WHERE
		YEAR(cout) BETWEEN 2008 AND 2018
        AND title LIKE '%feng shui%' AND (deweyClass IS NOT NULL) AND deweyClass NOT LIKE ' ' 
        AND deweyClass NOT LIKE ''
GROUP BY cout, deweyClass ASC, bibNumber , title
ORDER BY DATE(cout) ASC, deweyClass ASC, bibNumber , title) as tbl	
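Since the x, y, and z axes are based on year, time, and month respectively, here is a minimal sketch of that axis mapping in Processing (the ranges and function names are illustrative, not from the submitted code):

Code: Select all

// Year -> x, time of day -> y, month -> z. The "time" column is assumed to be
// convertible to seconds since midnight; ranges are illustrative.
float axisX(int year)           { return map(year, 2008, 2018, -250, 250); }
float axisY(float secondsOfDay) { return map(secondsOfDay, 0, 24 * 3600, -250, 250); }
float axisZ(int month)          { return map(month, 1, 12, -250, 250); }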
Visualizations
An Odd Happy Accident in part of the progression:
Screen Shot 2019-02-19 at 3.08.04 PM.png
First Designs:
Screen Shot 2019-02-19 at 3.09.01 PM.png
Screen Shot 2019-02-19 at 3.09.56 PM.png
Screen Shot 2019-02-19 at 3.11.53 PM.png
Final Product:
Screen Shot 2019-02-21 at 2.07.42 PM.png
Screen Shot 2019-02-21 at 2.08.16 PM.png
Seeing the circular aspect of one of the first visualizations I made (by accident), I hoped to preserve this style, since Google image searches for feng shui almost always lead to something radial:
Screen Shot 2019-02-19 at 3.16.26 PM.png
Unfortunately, due to skillset limitations I was unable to achieve this look.

I color-coded the various Dewey Classes that the feng shui checkout items fall under, choosing a palette of earth tones, as the philosophy of feng shui parallels beliefs about nature.

The base was colored navy blue to suggest a blueprint holding up the structures of the book items.

Results & Analysis
The main finding is that feng shui items are classified dominantly under Dewey Class 133: Specific topics in Parapsychology and Occultism. All other topics make up minor shares of the feng shui items (for all nonfiction). Another observation is the assortment of Dewey Classes present in the earlier half of the decade versus the lack of variety in the second half. The second half of the decade is mainly composed (within the minor deweyClasses) of Housekeeping, Landscape Architecture, and Interior Decoration. As an overall trend, there also seems to be a significant drop in feng shui items in the last months of 2018.

Future Implementations
I would love to recreate this visualization by wrapping the data in a radial platform (like a cylinder) to give an effect similar to the feng shui diagrams shown in the Google image results. I would also like to use ControlP5 GUIs to create a switch panel that shows and isolates the display of individual deweyClasses in the data, so the different classes can be observed more closely (which I had some trouble displaying).
Attachments
fengShui_3Dvis.zip
(100.1 KiB) Downloaded 67 times
Last edited by yokoebata on Thu Feb 21, 2019 3:57 pm, edited 3 times in total.

wilsonmui
Posts: 5
Joined: Mon Apr 16, 2018 10:21 am

Re: Proj 3 - 3D Visualization

Post by wilsonmui » Tue Feb 19, 2019 3:41 pm

Concept Description

The idea behind this project was to create a visualization showing the checkout time for the large Dewey categories. Multiple categories can be displayed alongside each other for comparison. Data from all book and DVD checkouts between January 2018 and December 2018 is used; the code can easily be modified for longer queries. Each category has its own color. Books are shown as spheres and DVDs as cubes. The speed of each shape corresponds to how quickly the item is returned. Changes in the checkout duration of specified categories over time can be seen, and differences in checkout volume between categories are noticeable as well.

Entering 000, 100, ..., 900 will load and display all the books and DVDs checked out in the 12-month period belonging to that Dewey class. Multiple categories can be entered for comparison.
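
A small sketch of how the return time could drive a shape's speed (the function name and ranges are hypothetical, not taken from the submitted code):

Code: Select all

// Map the checkout duration (days between cout and cin) to a travel speed:
// quickly returned items move fast, long checkouts crawl.
float shapePosition(float durationDays, float elapsedFrames) {
  float speed = map(constrain(durationDays, 1, 60), 1, 60, 8.0, 0.5);
  return elapsedFrames * speed;   // distance travelled along the motion axis so far
}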

MySQL Queries & Processing Time

execution time: 44.61 sec

Code: Select all

SELECT 
    DATE(cout) AS 'cout',
    DATE(cin) AS 'cin',
    itemtype,
    deweyClass As 'class',
    title 
FROM
    spl_2016.inraw
WHERE
	YEAR(cout) = 2018
    AND MONTH(cout) between 1 and 12
    AND (itemType = "acbk" or itemType = "acdvd")
	AND deweyClass != ''
ORDER BY date(cout) DESC
Screenshots
Screen Shot 2019-02-19 at 4.46.35 PM.png
Screen Shot 2019-02-19 at 4.51.21 PM.png
Screen Shot 2019-02-19 at 4.52.12 PM.png
Book or DVD titles can be seen if zoomed in. If there are interesting outliers, this could help to identify them.

Conclusion

DVDs are rarely seen due to their low popularity in comparison to books. Each category seems to have some outliers that are returned very quickly.
Books related to the social sciences seem to be popular and are also returned relatively quickly. Books in the technology category appear to be returned quickly as well. Patterns are not easily detected, however; showing more data at a given moment might better reveal patterns. Anomalies are also more apparent than desirable.
Attachments
wilson_proj_3.zip
(19.01 MiB) Downloaded 76 times
Last edited by wilsonmui on Thu Feb 21, 2019 3:48 pm, edited 4 times in total.

chantalnguyen
Posts: 4
Joined: Thu Jan 10, 2019 10:51 am

Re: Proj 3 - 3D Visualization

Post by chantalnguyen » Tue Feb 19, 2019 3:46 pm

Concept description
I obtain the most common words found in the titles of Dewey 641 (food & drink) media and use t-SNE for dimensionality reduction in order to explore the data in 3-D space. I also generate a network that links words co-occurring in the same title and extract its community structure.

MySQL queries
I queried the titles of all Dewey 641 media that were checked out between 2006 and 2018. I noticed that there was a text-encoding problem with the database: special characters were encoded incorrectly, so I would get, for instance, 'caf‚' instead of 'café'. I was unable to find a full solution, but the following code mitigates it somewhat (the words with special characters are still wrong, but they aren't as gnarly).

Code: Select all

SELECT 
    CONVERT( CAST(CONVERT( title USING LATIN1) AS BINARY) USING UTF8) AS fixedTitle,
    COUNT(bibNumber) AS Counts
FROM
    spl_2016.inraw
WHERE
    deweyClass >= 641 AND deweyClass < 642
        AND YEAR(cout) BETWEEN 2006 AND 2018
GROUP BY fixedTitle
ORDER BY Counts DESC
LIMIT 20000
The query took 32 seconds.

Next, I processed the titles using word2vec and extracted coordinates in 3-dimensional space using t-SNE. I also generated a network by defining words as nodes with edges between two words that occur in the same title. The weight of an edge represents how many times those two words have occurred in the same title. The code is attached as a Jupyter notebook, though the corresponding python code is here:

Code: Select all

#!/usr/bin/env python
# coding: utf-8

# MAT 259: 3D Visualization
# 
# Chantal Nguyen
# 
# Instructor: George Legrady
# 
# 
# The following code reads in a list of titles of all items at the Seattle Public Library in Dewey category 641 (food & drink). Words are extracted and processed using word2vec, and dimensionality reduction is performed using t-SNE to facilitate visualization in 3D space.
# 
# Adapted from Andy Patel (https://labsblog.f-secure.com/2018/01/30/nlp-analysis-of-tweets-using-word2vec-and-t-sne/)



from sklearn.manifold import TSNE
from collections import Counter
from collections import OrderedDict
from six.moves import cPickle
import gensim.models.word2vec as w2v
import numpy as np
from scipy import sparse
import os
import sys
import io
import re
import json
import multiprocessing


# Define functions that can either load existing data or execute a function to acquire data.



def try_load_or_process(filename, processor_fn, function_arg):
    load_fn = None
    save_fn = None
    if filename.endswith("json"):
        load_fn = load_json
        save_fn = save_json
    else:
        load_fn = load_bin
        save_fn = save_bin
    if os.path.exists(filename):
        return load_fn(filename)
    else:
        ret = processor_fn(function_arg)
        save_fn(ret, filename)
        return ret
 
def print_progress(current, maximum):
    sys.stdout.write("\r")
    sys.stdout.flush()
    sys.stdout.write(str(current) + "/" + str(maximum))
    sys.stdout.flush()
 
def save_bin(item, filename):
    with open(filename, "wb") as f:
        cPickle.dump(item, f)
 
def load_bin(filename):
    if os.path.exists(filename):
        with open(filename, "rb") as f:
            return cPickle.load(f)
 
def save_json(variable, filename):
    with io.open(filename, "w", encoding="utf-8") as f:
        f.write(json.dumps(variable, indent=4, ensure_ascii=False))
 
def load_json(filename):
    ret = None
    if os.path.exists(filename):
        try:
            with io.open(filename, "r", encoding="utf-8") as f:
                ret = json.load(f)
        except:
            pass
    return ret


# Load the raw data:



def process_raw_data(input_file):
    lines = []
    print("Loading raw data from: " + input_file)
    if os.path.exists(input_file):
        with io.open(input_file, 'r', encoding="utf-8") as f:
            lines = f.readlines()
    num_lines = len(lines)
    ret = []
    for count, text in enumerate(lines):
        if count % 50 == 0:
            print_progress(count, num_lines)
        text = u''.join(x for x in text)
        text = text.strip()
        if text not in ret:
            ret.append(text)
    return ret


# Tokenize titles (sentences):



def tokenize_sentences(sentences):
    ret = []
    max_s = len(sentences)
    print("Got " + str(max_s) + " sentences.")
    for count, s in enumerate(sentences):
        tokens = []
        words = re.split(r'(\s+)', s)
        if len(words) > 0:
            for w in words:
                if w is not None:
                    w = w.strip()
                    w = w.lower()
                    if w.isspace() or w == "\n" or w == "\r":
                        w = None
                    if w is not None and len(w) < 1:
                        w = None
                    if w is not None:
                        tokens.append(w)
        if len(tokens) > 0:
            ret.append(tokens)
        if count % 50 == 0:
            print_progress(count, max_s)
    return ret


# Clean titles using list of common stopwords in addition to self-defined list ("cookbook", etc)



def clean_sentences(tokens):
    all_stopwords = load_json("stopwords-iso.json")
    extra_stopwords = ["cook", "cookbook", "book", "cooking", "recipe", "recipes", "food", "foods"]
    stopwords = []
    if all_stopwords is not None:
        stopwords += all_stopwords["en"] + all_stopwords["fr"] + all_stopwords["es"] + all_stopwords["de"] + all_stopwords["it"]
        stopwords += extra_stopwords
    ret = []
    max_s = len(tokens)
    for count, sentence in enumerate(tokens):
        if count % 50 == 0:
            print_progress(count, max_s)
        cleaned = []
        for token in sentence:
            if len(token) > 0:
                if stopwords is not None:
                    for s in stopwords:
                        if token == s:
                            token = None
                if token is not None:
                    if re.search("^[0-9\.\-\s\/]+$", token):
                        token = None
                if token is not None:
                    cleaned.append(token)
        if len(cleaned) > 0:
            ret.append(cleaned)
    return ret


# Get word frequencies:



def get_word_frequencies(corpus):
    frequencies = Counter()
    for sentence in corpus:
        for word in sentence:
            frequencies[word] += 1
    freq = frequencies.most_common()
    return freq


# Get word2vec representations:    



def get_word2vec(sentences):
    num_workers = multiprocessing.cpu_count()
    num_features = 200
    epoch_count = 10
    sentence_count = len(sentences)
    w2v_file = os.path.join(save_dir, "word_vectors.w2v")
    word2vec = None
    if os.path.exists(w2v_file):
        print("w2v model loaded from " + w2v_file)
        word2vec = w2v.Word2Vec.load(w2v_file)
    else:
        word2vec = w2v.Word2Vec(sg=1,
                            seed=1,
                            workers=num_workers,
                            size=num_features,
                            min_count=3,
                            window=5,
                            sample=0)

        print("Building vocab...")
        word2vec.build_vocab(sentences)
        print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
        print("Training...")
        word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
        print("Saving model...")
        word2vec.save(w2v_file)
    return word2vec


# Get word associations:



def most_similar(input_word, num_similar):
    sim = word2vec.wv.most_similar(input_word, topn=num_similar)
    output = []
    found = []
    for item in sim:
        w, n = item
        found.append(w)
    output = [input_word, found]
    return output




def test_word2vec(test_words):
    output = []
    associations = OrderedDict()
    test_items = test_words
    for count, word in enumerate(test_items):
        if word not in associations:
            associations[word] = []
        similar = most_similar(word, num_similar)
        output.append(similar)
        for s in similar[1]:
            if s not in associations[word]:
                associations[word].append(s)
    filename = os.path.join(save_dir, "word2vec_test.json")
    save_json(output, filename)
    filename = os.path.join(save_dir, "associations.json")
    save_json(associations, filename)
    filename = os.path.join(save_dir, "associations.csv")
    handle = io.open(filename, "w", encoding="utf-8")
    handle.write(u"Source,Target\n")
    for w, sim in associations.items():
        for s in sim:
            handle.write(w + u"," + s + u"\n")
    return output


# Run t-SNE with 3 output dimensions:



def calculate_t_sne():
    vocab = word2vec.wv.vocab.keys()
    vocab_len = len(vocab)
    arr = np.empty((0, dim0), dtype='f')
    labels = []
    vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
    labels_file = os.path.join(save_dir, "labels.json")
    if os.path.exists(vectors_file) and os.path.exists(labels_file):
        print("Loading pre-saved vectors from disk")
        arr = load_bin(vectors_file)
        labels = load_json(labels_file)
    else:
        print("Creating an array of vectors for each word in the vocab")
        for count, word in enumerate(vocab):
            if count % 50 == 0:
                print_progress(count, vocab_len)
            w_vec = word2vec[word]
            labels.append(word)
            arr = np.append(arr, np.array([w_vec]), axis=0)
        save_bin(arr, vectors_file)
        save_json(labels, labels_file)

    x_coords = None
    y_coords = None
    z_coords = None
    x_c_filename = os.path.join(save_dir, "x_coords.npy")
    y_c_filename = os.path.join(save_dir, "y_coords.npy")
    z_c_filename = os.path.join(save_dir, "z_coords.npy")
    if os.path.exists(x_c_filename) and os.path.exists(y_c_filename) and os.path.exists(z_c_filename):
        print("Reading pre-calculated coords from disk")
        x_coords = load_bin(x_c_filename)
        y_coords = load_bin(y_c_filename)
        z_coords = load_bin(z_c_filename)
    else:
        print("Computing T-SNE for array of length: " + str(len(arr)))
        tsne = TSNE(n_components=3, random_state=1, verbose=1)
        np.set_printoptions(suppress=True)
        Y = tsne.fit_transform(arr)
        x_coords = Y[:, 0]
        y_coords = Y[:, 1]
        z_coords = Y[:, 2]
        print("Saving coords.")
        save_bin(x_coords, x_c_filename)
        save_bin(y_coords, y_c_filename)
        save_bin(z_coords, z_c_filename)
    return x_coords, y_coords, z_coords, labels, arr


# Run the code:



input_dir = ""
save_dir = ""
# if not os.path.exists(save_dir):
#     os.makedirs(save_dir)

print("Preprocessing raw data")
raw_input_file = os.path.join(input_dir, "items.csv")
filename = os.path.join(save_dir, "data.json")
processed = try_load_or_process(filename, process_raw_data, raw_input_file)
print("Unique sentences: " + str(len(processed)))

print("Tokenizing sentences")
filename = os.path.join(save_dir, "tokens.json")
tokens = try_load_or_process(filename, tokenize_sentences, processed)
print('\n')

print("Cleaning tokens")
filename = os.path.join(save_dir, "cleaned.json")
cleaned = try_load_or_process(filename, clean_sentences, tokens)
print('\n')

print("Getting word frequencies")
filename = os.path.join(save_dir, "frequencies.json")
frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
vocab_size = len(frequencies)
print('\n')
print("Unique words: " + str(vocab_size))




print("Instantiating word2vec model")
word2vec = get_word2vec(cleaned)
vocab = list(word2vec.wv.vocab.keys())
vocab_len = len(vocab)
print("word2vec vocab contains " + str(vocab_len) + " items.")
dim0 = word2vec.wv[vocab[0]].shape[0]
print("word2vec items have " + str(dim0) + " features.")


print("Calculating T-SNE for word2vec model")
x_coords, y_coords, z_coords, labels, arr = calculate_t_sne()




with io.open(os.path.join(save_dir, 'labels.csv'), mode='w', encoding='utf-8') as f:
    for row in labels:
        f.write(row+'\n')
        
coords = np.transpose(np.vstack((x_coords, y_coords, z_coords)))  
with open(os.path.join(save_dir,'coords.csv'), 'w') as f:
    for x, y, z in coords:
        f.write('%f, %f, %f\n' % (x,y,z))        




freq = dict(frequencies)
reordered_freqs = []
for word in labels:
    reordered_freqs.append(freq[word])

with open(os.path.join(save_dir,'frequencies.csv'), 'w') as f:
    for num in reordered_freqs:
        f.write('%f\n' % num)
        
num_similar = 20
test_words = []
for item in labels:
    test_words.append(item)
associations = test_word2vec(test_words)


# Create word network: each node is a different word in the vocab, and an edge connects two words that are found in the same title. The network is represented by a sparse adjacency matrix where the indices are in the same order as the t-SNE labels.



labeldict = OrderedDict()
count = 0
for word in labels:
    labeldict[word] = count
    count = count + 1




adjacency_matrix = np.zeros((len(labels),len(labels)))
    
for token in cleaned:
    for i in range(0,len(token)-1):
        for j in range(i+1,len(token)):
            if token[i] in vocab and token[j] in vocab:
                adjacency_matrix[labeldict[token[i]]][labeldict[token[j]]] += 1
                adjacency_matrix[labeldict[token[j]]][labeldict[token[i]]] += 1




sAdj = sparse.csr_matrix(adjacency_matrix)




# save as mtx file
from scipy import io
io.mmwrite('adjacency.mtx', sAdj, field='integer', symmetry='general')
# ignoring first 3 lines, resave as csv file with comma delimiters instead of whitespace
with open('adjacency.mtx', 'r') as f, open('adjacency.csv', 'w') as f2:
    next(f)
    next(f)
    next(f)
    for row in f:
        f2.write(re.sub('\s+',',',row.strip())+'\n')

I also obtained the community structure of the word network in Matlab. The code requires the GenLouvain package, available at http://netwiki.amath.unc.edu/GenLouvain/GenLouvain. The community detection algorithm determines clusters (communities) of nodes that are relatively densely connected to each other and sparsely connected to the rest of the network.

Code: Select all

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% MAT 259 Project 3                                             %
% Determine community structure for word network                %
% Author: Chantal Nguyen                                        %
% Supervisor: George Legrady                                    %
%                                                               %
% Reads in csv file containing sparse adjacency matrix          %
% Outputs csv file containing community number for each word    %
% Requires GenLouvain package available at                      %
% http://netwiki.amath.unc.edu/GenLouvain/GenLouvain            %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% read in csv file containing 3 columns: source node, target node, edge
% weight
sparse_adj = csvread('~/Documents/classes/MAT259/proj3/data/adjacency.csv');

% convert to graph object
A = zeros(max(sparse_adj(:,1)));
for i = 1:length(sparse_adj)
    A(sparse_adj(i,1),sparse_adj(i,2)) = sparse_adj(i,3);
end
G = graph(A);
A = adjacency(G,'weighted');

% do the community detection
resp = 1.2;
C = modularity(A,resp);
[S,Q,n_it]=iterated_genlouvain(C);

% write to csv file
csvwrite('~/Documents/classes/MAT259/proj3/data/communities.csv',S);
Screenshots and final results
I plotted each word, represented as a circle, using the coordinates obtained in t-SNE. The size of the circle is proportional to the frequency of the word in the body of titles. I also showed the network edges, but the network is pretty dense, so by default they are hidden; pressing keys 1-9 or Q-P will show the edges where the corresponding key is an edge weight threshold (e.g. pressing 3 shows all edges with weight greater than or equal to 3, and Q-P represent threshold values of 10-19). Pressing 0 will hide the edges. Hovering over a point will bring up the label along with the 20 most similar words. Pressing C will toggle the color scheme between one in which the colors are proportional to the respective frequencies of the words, and one in which identical colors indicate the same community. The user can also type in a word to highlight the corresponding node using the search bar in the lower right corner.
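A minimal sketch of the key-to-threshold mapping described above (assuming a global threshold checked when the edges are drawn; this is not the submitted code):

Code: Select all

// Keys 1-9 set thresholds 1-9, the top letter row Q-P sets 10-19, and 0 hides all edges.
int edgeThreshold = Integer.MAX_VALUE;   // edges hidden by default
String letterRow = "qwertyuiop";         // q = 10, w = 11, ... p = 19

void keyPressed() {
  if (key == '0') {
    edgeThreshold = Integer.MAX_VALUE;   // hide the edges again
  } else if (key >= '1' && key <= '9') {
    edgeThreshold = key - '0';
  } else {
    int idx = letterRow.indexOf(Character.toLowerCase(key));
    if (idx >= 0) edgeThreshold = 10 + idx;
  }
}
// In draw(), render only the edges whose weight >= edgeThreshold.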
Screen Shot 2019-02-19 at 3.34.58 PM.png
Screen Shot 2019-02-19 at 3.35.12 PM.png
Screen Shot 2019-02-19 at 3.35.56 PM.png
Screen Shot 2019-02-19 at 3.40.51 PM.png
Screen Shot 2019-02-19 at 3.41.53 PM.png
MAT259_Proj3_CN.zip
(12.4 MiB) Downloaded 89 times
Last edited by chantalnguyen on Thu Feb 28, 2019 3:52 pm, edited 4 times in total.

anaghauppal
Posts: 3
Joined: Thu Jan 10, 2019 10:50 am

Re: Proj 3 - 3D Visualization

Post by anaghauppal » Wed Feb 20, 2019 10:46 pm

I was interested in the long-standing debate between science and religion and wanted to see whether we could detect the noted resurgence of interest in religious belief through transactions (inraw) on these topics at the Seattle Public Library.
For this, I found the China_Japan demo to be a very flexible piece of code and particularly useful for my needs.

I used the following SQL queries to grab all titles related to these two subjects.

Code: Select all

RELIGION 

select YEAR(cout), TRUNCATE(FLOOR(`deweyClass`), -1) AS Dewey, count(*)
from spl_2016.inraw
where YEAR(cout) > '2005' and deweyClass > '0' and title like '%God%'
or title like '%god%' or title like '%religion%' or title like '%religious%'
group by
year (cout),  TRUNCATE(FLOOR(`deweyClass`), -1)
order by
year (cout), TRUNCATE(FLOOR(`deweyClass`), -1)

and

Code: Select all

SCIENCE 

select YEAR(cout), TRUNCATE(FLOOR(`deweyClass`), -1) AS Dewey, count(*)
from spl_2016.inraw
where YEAR(cout) > '2005' and deweyClass > '0'
and title like '%science%' or title like '%scientific%' 
group by
year (cout),  TRUNCATE(FLOOR(`deweyClass`), -1)
order by
year (cout), TRUNCATE(FLOOR(`deweyClass`), -1)
The work led to a 3D visualization that looked something like this.

Image

Perhaps some future work could be a simple sentiment analysis that recategorizes both the religion and the science titles into positive and negative and visualizes them, one on top and one on the bottom. Alternatively, if we could produce a scale of sentiment, it could replace the Dewey scale, which doesn't give us a lot of information.
Attachments
Screen Shot 2019-02-20 at 10.43.08 PM.png
Proj2AU.zip
(126.33 KiB) Downloaded 72 times

suburtner
Posts: 4
Joined: Thu Jan 10, 2019 10:58 am

Re: Proj 3 - 3D Visualization

Post by suburtner » Thu Feb 21, 2019 12:28 pm

Concept description

This project creates a multi-layered network where the layers represent different map collections at the Seattle Public Library. Within each collection and across collections (layers), nodes represent titles, and an edge is drawn between titles that have the same subject. The goal of the visualization is to communicate which titles are grouped together by subject within and across collections.
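
A rough sketch of how such shared-subject edges could be built from the query rows (the data structures and function names are hypothetical, not the submitted code):

Code: Select all

import java.util.*;

// Each query row gives a title and one of its subjects.
// Group titles by subject, then connect every pair within a group.
HashMap<String, ArrayList<String>> titlesBySubject = new HashMap<String, ArrayList<String>>();
ArrayList<String[]> edges = new ArrayList<String[]>();   // pairs of connected titles

void addRow(String title, String subject) {
  if (!titlesBySubject.containsKey(subject)) {
    titlesBySubject.put(subject, new ArrayList<String>());
  }
  titlesBySubject.get(subject).add(title);
}

void buildEdges() {
  for (ArrayList<String> group : titlesBySubject.values()) {
    for (int i = 0; i < group.size() - 1; i++) {
      for (int j = i + 1; j < group.size(); j++) {
        edges.add(new String[] { group.get(i), group.get(j) });
      }
    }
  }
}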

MySQL queries

Code: Select all

USE spl_2016;

SELECT DISTINCT title, subject, collectionCode
FROM title, subject, itemType, itemToBib, collectionCode
WHERE itemType.itemtype LIKE '%map' AND
    collectionCode IN ("camap", "canf", "camapr") AND
    subject.subject != "" AND
    collectionCode.itemNumber = itemType.itemNumber AND
    itemType.itemNumber = itemToBib.itemNumber AND
    itemToBib.bibNumber = subject.bibNumber AND
    subject.bibNumber = title.bibNumber
ORDER BY collectionCode ASC;
Processing time

13.9 s

Sketches and work-in-progress screenshots
assignment3_sketch.JPG
The above image shows my initial sketch of the idea. I had thought about looking at different types of relationships within the data, but there wasn't quite enough within the Dewey classifications to visualize them. I eventually settled on looking at titles with shared subjects, separated by collection.

Final results & analysis

The images below show the mostly final product. The "camap" collection has the largest number of titles, and they seem to share far more subjects within the collection than across collections. The "camapr" collection has the fewest titles, and interestingly enough, hardly any share a subject. This may also be a consequence of subjects being recorded differently despite being semantically the same, which is why it might be interesting to perform word embeddings on the subjects (and titles) and then try to reconnect them.

In future work, I would like to add a highlighting capability where mousing over a node shows its title. This seems entirely doable, but currently all titles show up once any node is moused over, so I need to remedy this code. Furthermore, while this type of visualization shows connections between collections, the top and bottom layers do not show connected relations (insofar as a connection is absent from the middle layer). With additional work, I would like to make the connections, layers, and text labels more dynamic.
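
One possible fix for the hover issue, sketched here with Processing's screenX()/screenY() rather than taken from the submitted code, is to project each node to screen space and keep only the closest hit to the mouse:

Code: Select all

// Project each node to screen space and remember only the one closest to the mouse.
// Call testHover() for every node while its transform is active; reset bestDist
// and hoveredNode at the start of draw(), then label only hoveredNode afterwards.
int hoveredNode = -1;
float bestDist = 15;   // pixel radius that counts as a hit

void testHover(int nodeIndex, float x, float y, float z) {
  float sx = screenX(x, y, z);
  float sy = screenY(x, y, z);
  float d = dist(mouseX, mouseY, sx, sy);
  if (d < bestDist) {
    bestDist = d;
    hoveredNode = nodeIndex;
  }
}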

REVISION: March 4th, 2019

I have modified this assignment in the following ways:
  • changed interconnected edges to have gradient color
  • include functionality to hover over a node to see its title
  • reduced layer distance
  • rotated all layers for more optimal initial position
as3_full.png
as3_inter.png
as3_title.png
Attachments
MultiLayered_Networks.zip
(14.35 KiB) Downloaded 61 times
Last edited by suburtner on Mon Mar 04, 2019 3:12 pm, edited 5 times in total.

sarahwells
Posts: 4
Joined: Thu Jan 10, 2019 10:52 am

Re: Proj 3 - 3D Visualization

Post by sarahwells » Fri Feb 22, 2019 7:23 pm

Sarah Wells - 3D Visualization:
Concept Description: Exploration of the Sciences: Dewey Class 500. I was interested in SPL patrons' interest in science-related books (Dewey 500-599) over time. In particular, I hypothesized a decrease among adults, who now get information from the internet, but perhaps less of a decrease among children checking out materials. So I explored all of the Dewey 500 data in the inraw table from 2006 to 2019, pulling itemType, count, and checkout-time data.

Queries:
Query 1: Collect all the titles in this category and analyze the count, average checkout time, and type (adult, child, book, video) grouped by each month from 2006 to 2019.

Code: Select all

SELECT 
    left(deweyClass, 3) as deweyC,
    DATE_FORMAT(cout, '%Y-%m') AS yearmonth,
    title,
    itemType,
    COUNT(*) AS count,
    AVG(TIMESTAMPDIFF(DAY, cout, cin)) AS avgcouttime
FROM
    inraw
WHERE
    deweyClass LIKE '5%'
        AND (itemType LIKE '%bk'
        OR itemType LIKE '%dvd'
        OR itemType LIKE '%vhs')
        AND YEAR(cout) > 2005
GROUP BY deweyC , yearmonth , title , itemType;
Query 2: Since Query 1 returned too much data to work with in a timely fashion, I narrowed it down by grouping on the first three digits of the Dewey class and no longer separating each title. This query collects the count, average checkout time, and type (adult, child, book, video) grouped by Dewey integer number and month instead of title and month.

Code: Select all

    SELECT 
    left(deweyClass, 3) as deweyC,
    DATE_FORMAT(cout, '%Y-%m') AS yearmonth,
    itemType,
    COUNT(*) AS count,
    AVG(TIMESTAMPDIFF(DAY, cout, cin)) AS avgcouttime
FROM
    inraw
WHERE
    deweyClass LIKE '5%'
        AND (itemType LIKE '%bk'
        OR itemType LIKE '%dvd'
        OR itemType LIKE '%vhs')
        AND YEAR(cout) > 2005
GROUP BY deweyC , yearmonth , itemType;
Results and Analysis:
I first plotted the results of Query 1, but the data set was too large and most of the counts were too small. The placement is determined by Dewey class, date, and count, and the size is correlated with the length of the checkout. This was hard to read because the results included every title, so I switched to Query 2 to build a better picture.
first.png
I plotted the results of Query 2 in a similar way: position by Dewey class, date, and total count for each Dewey integer from 500-599, by month from 2006 to 2019. Again the size is determined by the average checkout time, but to keep every point visible they are still relatively close in size. The color corresponds to the categories: Adult Books - pink, Children's Books - green, Adult Videos - red, Children's Videos - purple. One can also hover over a point to obtain its specific information.
gen.png
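A minimal sketch of this mapping from a Query 2 row to a positioned, sized point (the field names and ranges are illustrative, not from the submitted code):

Code: Select all

// Map one Query 2 row to a sphere: Dewey class -> x, month index -> y, count -> z,
// average keeping time -> radius.
void drawPoint(int dewey, int monthsSinceJan2006, int count, float avgDays, int maxCount) {
  float x = map(dewey, 500, 599, -300, 300);
  float y = map(monthsSinceJan2006, 0, 13 * 12, -300, 300);   // 2006-2019
  float z = map(count, 0, maxCount, 0, 400);
  float r = map(constrain(avgDays, 0, 60), 0, 60, 2, 10);     // size from avg checkout time
  pushMatrix();
  translate(x, y, z);
  sphere(r);
  popMatrix();
}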
The general trend shows that most categories peaked between 2008 and 2011. My hypothesis is that getting information on the sciences through the internet increased around 2010. Additionally, I added a transparent shape connecting the points to make trends easier to see.
trans1.png
Using this, we notice that, as hypothesized, Adult Books checkouts show a strong decrease over time, while we still see reasonably consistent usage of Children's Books.
transab.png
transcb.png
Examining videos, I find it interesting that although neither began with a large number of checkouts relative to books, Adult Videos and Children's Videos seemed to decrease in a proportionally similar manner.
transav.png
transcv.png
Another thing I noticed was a few gaps in the Dewey data showing very few checkouts. I examined a few and found they are the topics of topology and analysis, both higher-level mathematics topics. As a mathematics student, I am curious about the cause of this. Perhaps higher-level math is only accessible to mathematics scholars, whereas anyone with interest might have more success or curiosity with other scientific topics.
missingmath.png

Future Improvements: As I noted, one of the reasons I switched to a broader way of displaying the data is that my first query returned so much information that it took too long to process. I would like to see if there is some way to visualize or code this without losing the specificity of having specific titles as data points. Another issue I ran into was displaying the average checkout time: I correlated it with size, but since I had to fit all the data, the scale made it hard to see the effect of size differences of just a few pixels. I would like to see if there is a better way to convey this additional information.
Attachments
SarahWells3D500.zip
(9.72 MiB) Downloaded 71 times

Post Reply