2D Frequency Pattern Visualization

glegrady
Posts: 203
Joined: Wed Sep 22, 2010 12:26 pm

2D Frequency Pattern Visualization

Post by glegrady » Wed Jan 04, 2017 6:26 pm

2D Frequency Pattern Visualization

Assignment due:
Jan 24: Rough concept/sketch discussion
Jan 26: 1st version due

Description
2D MATRIX: Create a 2D visualization in Processing using results from a MySQL query, stored in a csv file. Each cell's vertical (Y) and horizontal (X) position and color value to be determined by the 3 csv columns.

SIZE/DATA: Length and Height of the matrix each represent a data value. The 3rd value is to be represented by coloring cells either using, color in RGB, or saturation, or brightness in HSB, or any other way where each cell has a x,y, location, and it is differentiated from the other cells through color, or scale or any other means.

COLOR CODING: 99% of science visualizations use the green – yellow- red color system. Experiment with/invent a different color scheme. Try http://tristen.ca/hcl-picker/#/hlc/6/1/21313E/EFEE69

LABELS and TEXTS: Use standard sans-serif fonts from the Swiss Graphic Design tradition for your texts and labels. Limit font sizes to a few. These include any of the following: Arial, Helvetica, Futura, Univers and related fonts.

CENTER VISUALIZATION: Your visualization should be at the center of your screen.
---
Your REPLY POST should include the following:

Concept description
--
Sketch if useful
Screen Shots of your project with descriptions
--
Add the MySQL query as a comment in a separate tab
Processing time
--
Please zip your processing code to include data folder for "ready to run". Add the zip file as an attachment.
--
Analysis
George Legrady
legrady@mat.ucsb.edu

griessbaum
Posts: 4
Joined: Wed Jan 11, 2017 10:40 am

Re: 2 2D Frequency Pattern Visualization

Post by griessbaum » Mon Jan 23, 2017 11:14 pm

I wanted to figure out how long (in terms of number of words( a good title, respective to the discipline (dewey class), is. I was hoping to see results such as: People prefer books on religion with a title length of more than x words, while people prefer books on science with more than y words.

To do so, I aggregated the data first by their dewey class (only the first 3 digits) and then sub-aggregated by the number of words in the title. For each of these sub-groups, I summed up the number of checkouts and the number of books that exist in this sub-group. I then defined the popularity of the subgroup as the ratio of number of checkouts to number of books (of this dewey class and group of title lenght).

The three dimensions I consequently extracted from the database are
a) the dewey class
b) the title length
c) the popularity

For each dewey category, I then plotted the popularity vs. the title length.

I encountered several issues with my initial idea:
1) No pattern whatsoever was visible. It basically was just a clatter of seemingly random lines.
2) There was too much data. This makes problem 1) worse, but also is a performance issue.
Considering that we have 1000 dewey classes (only first 3 digits) and that the title lengths may span from anywhere between 1 to 40 (in average, there were 14.25 distinct title lengths per dewey), we end up in the ten-thousands of rows.
3) Some dewey categories are obviously more demanded than others. Therefore the lines are quite far out scattered.

To overcome the issues, I implemented following changes:
1) I aggregated the data by the first 2, instead of the first 3 digits of the dewey class. This reduced the data to approx 3000 rows and consequently results in 100 distinct lines. I was hoping that due to the aggregation and therefore more data per group being available, we would also be able to see trends (instead of randomness) more clearly.
2) I extracted the main dewey class from the database (rather than computing it during runtime in processing) in the hope to use it for a better color-coding.
3) I implemented interactive filters.

What I still want to do:
1) Move the filtered class to foreground when filter is activated.
2) Add a legend.

To get the data, I used following query (contains some columns, that I do not actually use).

Code: Select all

SELECT 
    main.DeweyMain as Dewey, 
    main.Dewey as Dewey, 
    title_word_count, 
    checkouts, 
    number_of_titles, 
    checkouts_per_title,
    checkouts_per_dewey,
    checkouts/checkouts_per_dewey AS checkouts_share,
    checkouts/number_of_titles/checkouts_per_dewey AS checkouts_per_dewey_share,
    number_of_titles_per_dewey,
    number_of_titles/number_of_titles_per_dewey AS number_of_titles_share
FROM
	(SELECT 
		LEFT(deweyClass, 1) AS DeweyMain,
		LEFT(deweyClass, 2) AS Dewey,
		LENGTH(title) - LENGTH(REPLACE(title, ' ','')) + 1 AS title_word_count,
		COUNT(id) AS checkouts,  
		COUNT(DISTINCT(bibNumber)) AS number_of_titles,		
		COUNT(id) / COUNT(DISTINCT(bibNumber)) AS checkouts_per_title
	FROM
		spl_2016.inraw
	WHERE 
		length(deweyClass) > 0
	AND 
		cout > '2011-01-01'
	GROUP BY 
		Dewey, title_word_count
	) AS main
LEFT JOIN 
	(SELECT 
		LEFT(deweyClass, 2) AS Dewey,	
		COUNT(DISTINCT(bibNumber)) AS number_of_titles_per_dewey,
        COUNT(id) AS checkouts_per_dewey
	FROM
		spl_2016.inraw
	WHERE 
		cout > '2011-01-01'
	GROUP BY 
		Dewey
	) AS count
ON (count.Dewey=main.Dewey)
WHERE 
	title_word_count <= 40
ORDER BY 
	main.Dewey, title_word_count;
Which takes 281 seconds.

And the following processing code.

Code: Select all

Table table;
Table deweyMainNames;
float horzMargin = 40;
float vertMargin = 30;
boolean deweyFilter = false;
int deweyFilterValue = 0;
int rowCount;
boolean boldness = false;
float xMin = 1;
float xMax = 40;
float yMin = 0;
float yMax = (float)300;
float xStretch ;
float yStretch;


public void settings() {
    size(1000, 800); // setup the size of the window
    table = loadTable("query_first2digits.csv");
    deweyMainNames = loadTable("dewe_main.csv");
    rowCount = table.getRowCount();
    xStretch = (width-2*horzMargin)/(xMax-xMin);
    yStretch = (height-2*vertMargin)/(yMax-yMin);
    smooth(40);
}

public void lines() {
    int dewey;
    int deweyMain;
    int deweyPrev = 100;
    int wordCount;
    float checkoutsPerTitleShare;
    float x1 =0;
    float y1 =0;
    float x2 ;
    float y2 ;

    for (int row=0; row<rowCount; row ++){
        deweyMain = table.getInt(row, 0);
        if (deweyFilter == true && boldness == false && deweyMain != deweyFilterValue ){
            continue;
        }

        dewey = table.getInt(row, 1);
        wordCount = table.getInt(row, 2);
        checkoutsPerTitleShare = table.getFloat(row, 5);
        x2 = horzMargin + (wordCount-1) * xStretch;
        y2 = height - checkoutsPerTitleShare * yStretch - vertMargin;
        if (deweyPrev == dewey) {
            colorMode(HSB, 10, 100, 10);
            stroke(deweyMain, 50, dewey%10, 60);
            strokeWeight(3);
            if (boldness == true && deweyMain == deweyFilterValue){
                stroke(deweyMain, 50, 70, 100);
                strokeWeight(7);
            }

            line(x1, y1, x2, y2);
            //text(round(checkoutsPerTitleShare), width-horzMargin, y2);
        }
        x1 = x2;
        y1 = y2;
        deweyPrev = dewey;
    }
}
public void axis(){
    xAxis();
    yAxis();
}

public void yAxis(){
    text("checkouts", width-70, height-10);
    float y ;
    for (int checkouts=0; checkouts<=280; checkouts +=10){
        y = height - ((checkouts) * yStretch + vertMargin);
        text(checkouts, width-vertMargin, y);
    }
}

public void xAxis(){
    text("Title length", horzMargin, 20);
    float x;
    for (int wordCount=1; wordCount<=40; wordCount ++){
        x = (wordCount-1) * xStretch + horzMargin;
        text(wordCount, x, horzMargin);
    }
}

public void title(){
    if (deweyFilter){
        text("Category "+deweyFilterValue+"00 ("+deweyMainNames.getString(deweyFilterValue,1)+")", width -300, 100);
    }
}

public void legend(){
}

public void keyPressed() {
    int number;
    number = Integer.valueOf(key)-48;
    System.out.println(number);
    if (number >= 0 && number <=9){
        deweyFilter = true;
        deweyFilterValue = number;
        colorMode(RGB);
    }
    else if (key=='c') {
        deweyFilter = false;
    }
    else if (key=='b') {
        boldness = !boldness;
    }
}
public void draw(){
    colorMode(RGB, 100);
    background(0);
    lines();
    axis();
    title();
    legend();
} 
Trying to move away from a stereotypical histogram representation, I also implemented a heatmap for typical title lengths of each dewey category. The x-axis hereby represents the length of the title, the y-axis separates the space into the dewey categories, while the brightness of each pixel (rectangle) represents the popularity of the title length.

The same data / query as in the previous attempt can be used.

Code: Select all

Table table;
Table deweyMainNames;
float rightMargin = 300;
float leftMargin = 200;
float vertMargin = 70;
boolean deweyFilter = false;
int deweyFilterValue = 0;
int rowCount;
boolean boldness = false;
float xStretch ;
float yStretch;
float deltaX = 20;
float deltaY = 10;
float rectWidth;
float rectHeight; 


public void settings() {
    size(1400, 800); // setup the size of the window
    table = loadTable("query_main.csv");
    deweyMainNames = loadTable("dewey_main.csv");
    rowCount = table.getRowCount();    
    rectWidth = (width - (leftMargin + rightMargin))/ deltaX*0.95;  
    rectHeight = (height - 2 * vertMargin)/ deltaY*0.95;
    smooth();
}


public void rectangles() {  
  int deweyMain;
  int wordCount;
  float checkoutsPerTitleShare;
  float x ;
  float y ;
    
  xStretch = (width- (leftMargin +rightMargin))/(deltaX);
  yStretch = (height-2*vertMargin)/(deltaY);
  
  for (int row=0; row<rowCount; row ++){
    deweyMain = table.getInt(row, 0);
    wordCount = table.getInt(row, 2);
    
    checkoutsPerTitleShare = table.getFloat(row, 7)*2000;
    x = leftMargin + (wordCount-1) * xStretch;
    y = deweyMain * yStretch + vertMargin;    
    fill(checkoutsPerTitleShare);    
    rect(x, y, rectWidth, rectHeight);    
  }  
}


public void axis(){
    xAxis();
    yAxis();
}


public void yAxis(){
    fill(255);    
    float y ;
    for (int dewey=0; dewey<=9; dewey +=1){
        y = height - ((dewey) * yStretch + vertMargin) -rectHeight/2;        
        text(deweyMainNames.getString(dewey,1), vertMargin, y);
    }
}

public void xAxis(){
    fill(255);    
    text("Title length", leftMargin, 20);
    textAlign(LEFT);
    float x;
    for (int wordCount=1; wordCount<=20; wordCount ++){
        x = (wordCount-1) * xStretch + leftMargin;
        text(wordCount, x + rectWidth/2, vertMargin-10);
    }
}

public void title(){
    if (deweyFilter){
        text("Category "+deweyFilterValue+"00 ("+deweyMainNames.getString(deweyFilterValue,1)+")", width -300, 100);
    }
}


public void legend(){
  float y ;
  int colVal;
  for (int cell = 0; cell <=5; cell+=1){
    colVal = cell * 50;
    fill(colVal);
    stroke(255); 
    y = (cell * yStretch) + vertMargin +50 ;
    rect(width-rightMargin/2, y, rectWidth, rectHeight);
    fill(255);    
    textAlign(LEFT, CENTER);
    text(colVal, width-rightMargin/2-50, y+rectHeight/2);
    stroke(0);
  }
}


public void keyPressed() {
    int number;
    number = Integer.valueOf(key)-48;
    System.out.println(number);
    if (number >= 0 && number <=9){
        deweyFilter = true;
        deweyFilterValue = number;
        colorMode(RGB);
    }
    else if (key=='c') {
        deweyFilter = false;
    }
    else if (key=='b') {
        boldness = !boldness;
    }
}


public void draw(){    
    background(0);
    //lines();
    rectangles();
    axis();
    title();
    legend();
} 
Since I finally felt pretty unsatisfied with the overall result and its insights into the data, I made a final attempt in a different approach:
The popularity of a specific title length is one thing, but I wanted to further know what words a popular title would contain. I therefore queried for each dewey category (first two digits) the 5 most popular first, second, third, fourth and fifth word in the title. My hope was that if I build a graph and connect these words, I could come up with the perfect title for each dewey category.

Unfortunately, I was not able to write a query to get all the data in one request. Therefore the following query is iterated in a python script with changing parameters:

Code: Select all

SELECT
    count(id) AS checkouts, 
    LEFT(deweyClass, 2) AS deweySub,
    {wordNum},
    SUBSTRING_INDEX(SUBSTRING_INDEX(title, ' ', {wordNum}),' ', -1) as word        
FROM
	spl_2016.inraw
WHERE
    LEFT(deweyClass, 2) = '{dewey}'
GROUP BY
	word
ORDER BY
	checkouts DESC   
LIMIT 
	5
The query terminates in approximately 19 seconds.
Attachments
query_3.csv
(647 Bytes) Downloaded 233 times
HW3_2.zip
(91.78 KiB) Downloaded 222 times
HW3_1.zip
(1.99 MiB) Downloaded 208 times
screenshot1.png
snapshot.png
snapshot4.png
snapshot3.png
snapshot2.png
snapshot1.png
Last edited by griessbaum on Mon Jan 30, 2017 8:27 pm, edited 4 times in total.

sara.lafia
Posts: 5
Joined: Wed Jan 11, 2017 10:41 am

Re: 2 2D Frequency Pattern Visualization

Post by sara.lafia » Mon Jan 23, 2017 11:14 pm

Concept
The relationship between book checkouts over time are shown relative to their popularities over the course of a decade. An increasing number of books checked out, pertaining to each programming language, translates to an increasing popularity of the programming language. The three variables exposed in this visualization are:
  • 1. Length (time in months and years)
    2. Height (programming language)
    3. Depth (share of book checkouts)
export.png
2D Revised Project Screenshot
Screen Shot 2017-01-24 at 10.58.51 AM.png
Popularity of Programming Language index
Analysis
The brighter the brick in the tower of each language, the greater its share of checkouts. Thus, we see the rise and fall of each tower; brighter sections indicate periods of stronger popularity, while darker periods indicate declining popularity. This translates roughly to the chronological trends approximated by Google in their keyword search frequency of each programming language over the same decade.

Query

Code: Select all

SELECT 
    MONTH(t.checkOut) AS monthOut,
    YEAR(t.checkOut) AS yearOut,
    SUM(i.title LIKE '%java%') AS Java,
    SUM(i.title LIKE '%python%') AS Python,
    SUM(i.title LIKE '%php%') AS PHP,
    SUM(i.title LIKE '%javascript%') AS JavaScript,
    SUM(i.title LIKE '%perl%') AS Perl,
    SUM(i.title LIKE '%ruby%') AS Ruby,
    SUM((i.title LIKE '%java%') + (i.title LIKE '%python%') + (i.title LIKE '%php%') + (i.title LIKE '%javascript%') + (i.title LIKE '%perl%') + (i.title LIKE '%ruby%')) AS totalPYPL,
    SUM(d.deweyClass = 005) AS monthly005Total
FROM
    (spl_2016.transactions AS t
    INNER JOIN spl_2016.title AS i ON t.bibNumber = i.bibNumber)
        INNER JOIN
    spl_2016.deweyClass AS d ON (t.bibNumber = d.bibNumber)
WHERE
    (d.deweyClass = 005)
        AND (YEAR(t.checkOut) BETWEEN 2006 AND 2016)
GROUP BY yearOut , monthOut
ORDER BY yearOut , monthOut
Processing Time
The processing time was 4.001 seconds for this query.

CSV and Processing code
The query, data, and code are attached as a zip file.
SaraLafia_2DVer2.zip
2D Revised Project Download
(121.26 KiB) Downloaded 232 times
-----UPDATE-----
Additional functionality including toggle view for "x-ray" (grayscale key interaction) and mouse-click interactivity added. Normalization colors adjusted to HSB and contrast increased per review suggestion. Screenshot from Processing in-line (commented out) rather than from print screen. Query added to Processing code as separate tab. All updates have been added to code and zipped.
Last edited by sara.lafia on Tue Jan 31, 2017 7:48 pm, edited 3 times in total.

merttoka
Posts: 21
Joined: Wed Jan 11, 2017 10:42 am

Re: 2 2D Frequency Pattern Visualization

Post by merttoka » Tue Jan 24, 2017 9:14 am

Hand Sketch:
IMG_7691.JPG
Hand Sketch
Explanation:
brooksjaredc wrote:Harry Potter: The Book or the Movie?
After seeing @brooksjaredc 's idea, I wanted to explore the similar realm by applying the idea to multiple films and series. I am curious about the pattern when a film/tv adaptation released, i.e. when do people start reading the books, how long do the books maintain their popularity compared to adaptations, and etc.

EDIT:

Six titles have been selected for motion picture and printed medium comparison: Orange is the New Black, Game of Thrones, Divergent, Hunger Games, Maze Runner, and Twilight. The vertical axis denotes time, and different items constitute the horizontal axis. For each title, left half shows total number of checkouts for books, whereas the right side is for the total number of movies or tv shows. Each title is assigned a different color for a quick comparison, and in display mode (1), the opacity of bar denotes the number of checkouts in increasing fashion. The display mode (3) displays the same property using the widths of the bars.

Alternatively, anytime in the execution, the background color can be changed from dark to light and vice versa.
  • 35439.png
    39432.png
    Color display only (1)
  • 44152.png
    42045.png
    Color and width display (2)
  • 46742.png
    48271.png
    Width display only (3)
Keyboard Shortcuts:
  • 1 -- Color display mode
  • 2 -- Color and width display mode
  • 3 -- Width display mode
  • L -- Switches to log scaling
  • SPACE -- Toggles background brightness
Queries:

Code: Select all

SELECT 
    MONTH(c.checkOut) 'Month',
    YEAR(c.checkOut) 'Year',
    COUNT(*) '# Checkouts',
    i.itemType 'Type',
    t.title 'Title'
FROM
    title t
        JOIN
    itemToBib b ON t.bibNumber = b.bibNumber
        JOIN
    itemType i ON b.itemNumber = i.itemNumber
        JOIN
    transactions c ON c.itemNumber = b.itemNumber
WHERE
    t.title LIKE 'NAME_VAL%'
        AND YEAR(c.checkOut) > YEAR_VAL
GROUP BY MONTH(c.checkOut) , YEAR(c.checkOut) , i.itemType
HAVING Year > YEAR_VAL
Above query has two variables, which has picked differently for each title. YEAR_VAL is the year in which the first item is released (books), and NAME_VAL is the name of the title.

Each query took a different amount of time, in overall they covered 200-300 seconds range for each title.

Processing Code, Queries and CSV files:
MertToka_2DFrequency.zip
Processing Code
(281.5 KiB) Downloaded 229 times
Attachments
booksandmovies.xlsx
Some of the titles that I will explore
(9.1 KiB) Downloaded 217 times
Last edited by merttoka on Mon Feb 06, 2017 8:42 pm, edited 1 time in total.

wolfe
Posts: 5
Joined: Wed Jan 11, 2017 10:43 am

Re: 2 2D Frequency Pattern Visualization

Post by wolfe » Tue Jan 24, 2017 10:23 am

What fiction items are checked out once, or checked out once and never returned?
datavisualization1.png
Author name vs Time are the axis
Red represents young adult books
Blue represents genre books like "Mystery" and "Western"
Yellow represents fiction
Circles are books checked out that were never returned
Rectangles are books that were only checked out once with length representing the checkout length.

Code: Select all

SELECT 
    c, checkout, checkin, title, callNumber,
    CASE WHEN callNumber LIKE '%SCI-FIC%' THEN 1 ELSE 0 END AS sci,
    CASE WHEN callNumber LIKE '%MYSTERY%' THEN 1 ELSE 0 END AS mys,
    CASE WHEN callNumber LIKE '%WESTERN%' THEN 1 ELSE 0 END AS wes,
    CASE WHEN callNumber LIKE 'FIC%' or callNumber LIKE 'CAS FIC%' or callNumber LIKE 'CD FIC%' THEN 1 ELSE 0 END AS fic,
    CASE WHEN callNumber LIKE 'YA%' or callNumber LIKE 'CAS YA%' or callNumber LIKE 'CD YA%' THEN 1 ELSE 0 END AS YA,
    spl_2016.transactions.bibNumber, t.itemNumber
FROM
    (SELECT 
        COUNT(checkOut) AS c, spl_2016.callNumber.itemNumber
	FROM spl_2016.callNumber
	INNER JOIN spl_2016.transactions ON spl_2016.transactions.itemNumber = spl_2016.callNumber.itemNumber
	where callNumber LIKE '%SCI-FIC%' OR callNumber LIKE '%MYSTERY%' OR callNumber LIKE '%WESTERN%'
		or callNumber LIKE 'FIC%' or callNumber LIKE 'CAS FIC%' or callNumber LIKE 'CD FIC%'
        or callNumber LIKE 'YA%' or callNumber LIKE 'CAS YA%' or callNumber LIKE 'CD YA%'
	GROUP BY spl_2016.callNumber.itemNumber
	ORDER BY c) t
        INNER JOIN spl_2016.transactions ON spl_2016.transactions.itemNumber = t.itemNumber
        LEFT JOIN spl_2016.callNumber ON spl_2016.callNumber.itemNumber = spl_2016.transactions.itemNumber
        LEFT JOIN spl_2016.title ON spl_2016.transactions.bibNumber = spl_2016.title.bibNumber
        LEFT JOIN spl_2016.deweyClass ON spl_2016.transactions.bibNumber = spl_2016.deweyClass.bibNumber
WHERE
    c = 1 AND YEAR(checkout) > 2004 and deweyClass = ""
ORDER BY checkout; 
runtime duration: 684.374 sec, fetch: 0.170 sec, 18757 rows
output.csv
(1.72 MiB) Downloaded 231 times
--------------------------------UPDATE 1-25-2017--------------------------------------------------------
Question: What fiction items are checked out once, or checked out once and never returned?

Colored by "Genre Books", "Young Adult", and "Fiction"
DataVis2d.png
Colored by "Books", "Cassettes" and "CDs"
DataVis2Da.png
Only "genre" books, colored by "Sci-Fi", "Mystery" and "Western"
DataVis2Db.png
All items by authors with the last name starting with B.
DataVis2De.png
Layout/Design: Author name is the Y axis and time is the X axis. Circles are books checked out that were never returned and rectangles are books that were only checked out once with length representing the checkout length. I rounded the length to 3 weeks with the minimum length as 3 weeks so that the checked out items are visible. The circles diameter is also the length of 3 weeks. According to the library website, when you checkout a book you can have it for 3 weeks and renew it twice for up to 9 weeks total. The Y axis is grouped by the first two letters of the author's name in the main view, and by the author's name in the sub views. I chose the colors red, blue and yellow with an opacity of 50/255 for the views with multiple types and just black for the views with only one type of data. I chose these colors because they are very distinct, clean and mix well together. I chose yellow for the most popular type in the visualizations because it overpowers the screen less. I added a legend at the bottom of the visualization so that the viewer knows what the colors stand for.

Keyboard Interaction: Pressing the numbers 0-9 show 10 different views of the data.
0 charts the items in the categories of "genre", "fiction" and "young adult".
1 charts the items by item type, "book", "CD" and "casette"
2 charts the "genre" books in the categories "mystery", "western" and "science fiction"
3 charts the CDs and Casettes only because the books overpower the view which also includes books
4-9 charts "books", "science fiction", "mystery", "western", "fiction" and "young adult" separately
Pressing a-z filters the chart to the authors who's books start with the typed letter. Pressing the spacebar resets it to be able to see all the books.

Mouse Interaction: I added hover text so that the viewer can see the first two letters of the author's name, or the author's name for each row. This allows viewers to easily identify if books are by the same author.

Query/Code Design: I used the query described earlier. After the query was run I wrote a python script to extract information from the dataset. I did this because the query already took so long I didn't want to make it any more complicated and I didn't want to rerun it. Also, python has a lot of nice string manipulation functionality that mysql doesn't. I used the script to get the item type, year if available, and author's name from the call number. I also added a field for if it had been returned, and calculated checkout and checkin dates relative to 1/1/2005. A lot of this I could have done in sql but it was easier to add to the python script then wait 10+ minutes for the query to run.
I spent quite a bit of time trying to get a query that would find books that had been checked out multiple times and not returned and found that some items had been not returned many times. These inconsistencies in the data turned out to make this not a viable dataset. This does mean potentially that the books labeled as "stolen/lost" may have been returned, but either way they have only been checked out once and fit within the critera for this visualization.
I put the query, raw data, python script and final data in the zip file with the processing code.
HannahWolfe2DAssignment.zip
(3.96 MiB) Downloaded 214 times
Analysis:
Books checked out for years and returned later happened a lot before 2010, but not much after. Maybe these books were returned earlier but there was an issue with the checkin return system. Most of the books that were checked out once and returned later after 2010 were checked out for less than a month. There were no books checked out and not returned before 2006. There are a lot of books that have been checked out recently and not returned, most likely because they are being read. There were a lot more books checked out only once than tapes or CDs. For the genres there were a lot more Mysteries, and science-fiction books checked out only once than westerns. This might show overall lack of popularity for westerns or there may have a smaller western collection. Also for the western books there is only one book that was checked out for over a year. Comparing books on tape to CD it looks like people stopped checking out tapes in 2009 though CDs are increasing in popularity. Certain authors and books have been checked out multiple times only once, like Salinger's "Catcher in the Rye", Jane Austen, Beatty's "Sellout", Chabon's "Moonglow" or "Best American Short Stories". Sometimes the author is not actually the author like for "Best American Short Stories" the call number has "Best" instead of the author's name. There are also a ton of copies of "Books on Bikes" and "Open Air" that have no author name, though those are the only ones labeled FICTION without an author in the call number. Also weirdly a ton of fiction books which the first letter of the authors name started with the letter B were checked out and returned in the beginning of 2013.
Last edited by wolfe on Wed Jan 25, 2017 10:41 pm, edited 2 times in total.

freeberg
Posts: 5
Joined: Wed Jan 11, 2017 10:39 am

Re: 2 2D Frequency Pattern Visualization

Post by freeberg » Tue Jan 24, 2017 11:05 am

Analysis of Jason Bourne Movies

Primary Question: Does the release of a new Jason Bourne movie prompt the library's patrons to check out the previous titles?
Secondary Questions: How often are the patrons checking out these items? For how long do they keep them checked out?

Layout: Each move title is given a "lane" on the Y-axis. Time is plotted along the X-axis. The width of each rectangle in the lane corresponds to the amount of time that item was checked out. I used the bottom margin to inform the user on the interactivity, and to print the selected date (see "Interactivity").

Interactivity: Pressing 'R' will plot the release dates of the newer Jason Bourne movies on the visualization. The up and down arrow keys adjust the transparency of the visualization. When the mouse is within the main portion of the visualization, the corresponding month and year is printed in the bottom right portion of the legend.

Processing Code: This is the main script of processing code, there are some helper and interactivity functions that are not implemented in this code chunk. They are available in the attached zip folder.

Code: Select all

/*
  January 26, 2016
  Jason Freeberg
  MAT 259A - Winter 2017
  
  Description:
  Visualizing usage of the library's Jason Bourne DVD movies, and investigating
  a correspondance with the releases of newer movies in the series. 

  GitHub Repo: 
  https://github.com/JasonFreeberg/processing/tree/master/bourne_v2
*/

// Date and sound libraries
import java.util.*;
//import ddf.minim.*;

Table table;
int nRows, nCols;
float minTime, maxTime;
float[][] data;

float vMargin = 100;
float hMargin = 160;
int textColor = #EFEFEF;

boolean showReleases = false;

//Minim minim;
//AudioPlayer player;
PImage background;
long year;
PFont font;

// Some hard-coded arrays
String[] titles = {"Identity", "Supremacy", "Ultimatum", "Legacy"};
long[] years = {1136073600, 1167609600, 1199145600, 1230768000, 1262304000, 1293840000, 1325376000, 1356998400, 1388534400, 1420070400, 1451606400, 1483228800};
long[] releases = {1188432000, 1197331200, 1344297600, 1355184000, 1469750400, 1479168000};
String[] releaseNames = {"Ultimatum in Theaters", "Ultimatum on DVD", "Legacy in Theaters", "Legacy on DVD", "Jason Bourne in Theaters", "Jason Bourne on DVD"}; 
String[] monthNames = {"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"};
int[] colors = {#DB0353, #029797, #FCF003, #78E403};

void setup() {
  size(1280, 720);
  surface.setResizable(true);

  // Read Data
  table = loadTable("bourne.csv", "header");
  nRows = table.getRowCount();
  nCols = table.getColumnCount();

  // Data is set up: matrix[rows][columns] because I come from
  // an R and Python background
  data = new float[nRows][nCols];
  for (int col = 0; col < nCols; col++) {
    for (int row = 0; row < nRows; row++) {
      data[row][col] = table.getFloat(row, col);
    }
  }

  minTime = min(columnMin(data, 1), columnMin(data, 2));
  maxTime = max(columnMax(data, 1), columnMax(data, 2));
  
  PFont.list();
  font = createFont("SansSerif", 12);
  textFont(font);
  // Minim player class
  //minim = new Minim(this);
  //player = minim.loadFile("jcijb.aiff");
}

float row[];
float hPos, vPos, rectWidth;
int rectColor;
float lane;
float alpha = 5;

void draw() {
  // Set up canvas
  background(#22210B);
  surface.setResizable(true);
  
  // Title and legends
  writeLabels();

  // Y-axis labels
  for (int i = 0; i < titles.length; i++) {
    lane = (i + 1)*(height - 2*vMargin)/titles.length;
    vPos = vMargin + lane - (height - 2*vMargin)/(titles.length * 2);

    pushMatrix(); // Translate the axis
    translate(hMargin - 10, vPos);
    rotate(-HALF_PI);

    textAlign(CENTER); // Write text
    fill(textColor);
    text(titles[i], 0, 0);

    popMatrix(); // Reset axis
  }

  // X-axis labels and ticks
  for (int i = 0; i < years.length; i++) {
    year = years[i];
    hPos = map(year, minTime, maxTime, hMargin, width-hMargin);
    year = new Date((long)year*1000).getYear() + 1900;
    
    // Tick marks
    strokeWeight(1);
    stroke(textColor, 200);
    line(hPos, height - vMargin, hPos, vMargin);
    
    // Labels
    textAlign(CENTER);
    fill(textColor);
    text((int)year, hPos, height - vMargin + 15);
  }

  // Main visualization
  for (int index = 0; index < nRows; index++) {
    row = data[index];

    // Start at checkout time (hPos) end at checkin time (rectWidth)
    hPos = map(row[1], minTime, maxTime, hMargin, width-hMargin);
    rectWidth = map(row[2], minTime, maxTime, hMargin, width-hMargin) - hPos;
    lane = row[0]*(height - 2*vMargin)/titles.length;
    vPos = vMargin + lane;

    noStroke();
    fill(colors[(int)row[0]], alpha);
    rect(hPos, vPos, rectWidth, (height - 2*vMargin)/titles.length);
  }

  // Release Dates
  if (showReleases) plotReleases();
}
Insights: When the user presses 'R' and shows the release dates as vertical lines, the cause-and-effect is quite shocking. When The Bourne Legacy was released, many more patrons chose to rent The Bourne Supremacy. Likewise, the release of Jason Bourne prompted many checkouts of the first and third movies. We can also see that the Seattle Public Library wastes no time in purchasing copies of the DVD when they are released. The biggest insight I gained from the visualization is actually that the Jason Bourne film series is based of a lengthy novel series as well! As we can see in the visualization for The Bourne Ultimatum, there were checkouts of the title before the DVD release. This is because The Bourne Ultimatum is also the title of one of the novels!

Future Plans: Now that I know there is a novel series as well, I am going to incorporate that knowledge into the next iteration of the project. I will likely break up each movie's lane into halves--one for the the film and the second for the novel. Colored by different shades of the current colors.
Attachments
bourne_v2.zip
All materials
(1.58 MiB) Downloaded 159 times
withReleaseDates.png
With 'R' pressed
noReleaseDates.png
Without 'R' pressed
Screen Shot 2017-01-24 at 10.59.01 AM.png
version1
Last edited by freeberg on Thu Jan 26, 2017 9:44 am, edited 1 time in total.

jingyi_xiao
Posts: 5
Joined: Wed Jan 11, 2017 10:43 am

Re: 2 2D Frequency Pattern Visualization

Post by jingyi_xiao » Tue Jan 24, 2017 11:18 am

Question:Does title length in different Dewey classes differ?

I queried the average title length in Seattle Public Library and classified as different Dewey classes, divisions and sections. I want to see if there is anything interesting, like some Dewey classes tend to have longer title than others.

Query

Code: Select all

SELECT 
    a.dewey,
    AVG(CASE
        WHEN a.itemtype LIKE '%bk%' THEN a.length
        ELSE 0
    END) AS book
FROM
    (SELECT DISTINCT
        bibNumber,
            itemtype,
            FLOOR(deweyClass) AS dewey,
            LENGTH(title) AS length
    FROM
        spl_2016.inraw
    WHERE
        YEAR(cout) > 2005 AND YEAR(cout) < 2017
            AND FLOOR(deweyClass) != '') AS a
GROUP BY a.dewey;
Analysis
Title length in Dewey class 400(Language), 500(Science), 700(Art & Recreation) and 800(Literature) is relatively short than others. And when we look deeper into each Dewey class, some Dewey classes tend to have similar length, but some differ. For example, title length in 780(Music) is much shorter than others, 800(Literature, rhetoric) and 910(Geography & Travel) is relatively longer.

Overall, books related to language, art and literature tend to have shorter title. I think maybe because these themes are naturally more concise or need to be concise.
Attachments
Overview.png
Final result
Process2.jpg
Processing result
Last edited by jingyi_xiao on Thu Feb 16, 2017 11:31 pm, edited 1 time in total.

kschlesi
Posts: 6
Joined: Wed Jan 11, 2017 10:42 am

Re: 2 2D Frequency Pattern Visualization

Post by kschlesi » Tue Jan 24, 2017 11:45 am

Layout: Columns are countries, rows correspond to (year, main dewey class) combos. They are sorted by dewey class, and within class they are sorted by year. Grayscale (for now) intensity indicates number of checkouts of books with the name of the country or its possessive (e.g. "Ireland" or "Irish") in the title. This is a basic visualization of the data; I plan to make several changes that I am still working on.
early_data_vis_sketch.png
Next steps: I'd like to highlight the difference between dewey class and year information more cleanly by showing the "profile" of each country in each dewey "bin" over the decade. I plan to do this by changing the widths of the boxes. (Don't have a physical sketch right now but the code is in progress...) Width of box would indicate the popularity of the the country in that year, normalized by all years of that country in that dewey "bin."

I also want to include interactivity that allows switching between the 900s, 800s, etc. In addition, I will include mouseovers that give the names of dewey categories and possibly more information.

I have tried a few color schemes but didn't find anything that worked well yet. I will experiment with the color pickers we talked about today. I'd like the color (like the intensity right now) to indicate the overall number of checkouts, while the box length is normalized to within the specific category. There are many ways to normalize this -- for example, normalizing by total number of books with that country in title could be useful, since e.g. China has many more books than other countries, and so its numbers are probably elevated in some for that reason. Interactive options could use different normalization schemes.

One final issue is that I only searched for countries in the title, not the subject, to keep the query efficient. But searching in the subject provides more results and gives a better idea of the cultural interaction (as I discovered by trying one country at a time). Since the different country possibilities can be run in parallel queries, I might try using the idea that a previous post mentioned last week, to calculate smaller groups of countries separately and concatenate the resulting .csv files later.

Edit: code and data included below:
project1_sketch.zip
(17.5 KiB) Downloaded 170 times
Last edited by kschlesi on Wed Jan 25, 2017 9:47 pm, edited 1 time in total.

brooksjaredc
Posts: 5
Joined: Wed Jan 11, 2017 10:39 am

Re: 2 2D Frequency Pattern Visualization

Post by brooksjaredc » Wed Jan 25, 2017 7:45 pm

Harry Potter: the book or the movie? continued

I wanted to look into the effect of the movie releases of the Harry Potter films has on the checkout frequency of the books and the films. Do the release of the movies increase the checkout frequency of the books? Is there a delay on the interest from the time the film comes out? How does this change with the different books?

I made the following query in MySQL to group all the checked out items whose titles matched the Harry Potter title, and grouped them by year/month and item type.

Code: Select all

SELECT DATE_FORMAT(cout,'%Y/%m') AS yearmonth, 
sum( case when itemType LIKE '%bk%' then 1 else 0 end) as book, 
sum( case when itemType LIKE '%dvd%' then 1 else 0 end) as dvd,
sum( case when itemType='jccd' then 1 else 0 end) as cd,
sum( case when itemType='accd' then 1 else 0 end) as soundtrack
FROM spl_2016.inraw 
WHERE (title LIKE '%harry potter and the order of the phoenix%') 
AND (year(cout) > 2005)
GROUP BY yearmonth
ORDER BY yearmonth
I did the same query for the other books: the half blood prince, and the deathly hallows. Each query took about 30 seconds.

For the visualization, I made a grid for each book with dates on the x-axis and item type on the y-axis, and stacked each grid on top of each other, so they share the time axis. I kept the color schemes that we demo'd in class, and changed the normalization scheme to have weight values for each cell be relative to the maximum in its row. The screenshot below shows normalization on, and color scheme #3, which I thought looked the best.
Screen Shot 2017-01-24 at 4.11.10 PM.png
I added a legend on the right that shows that a solid black triangle marks the theatrical release of the film, and the empty triangle marks the DVD release of the film just a few months later. The bottom stack, the deathly hallows, has two of each since the movie was split into part 1 and 2.

Analysis:
From all this we can see that people like to read the books just as the movies are being released to theaters, with audio books interest being more spread out. The biggest interest in all three of the books happen to peak during the theatrical release of the 5th movie: order of the phoenix. We also see that interest in the DVDs peaks just few months after the DVD release. Furthermore, interest for all Harry Potter books items quickly decreases and then plateaus right after the theatrical release of the last movie.
project2.tar.gz
(24.81 KiB) Downloaded 173 times

ariellalgilmore
Posts: 7
Joined: Wed Jan 11, 2017 10:40 am

Re: 2 2D Frequency Pattern Visualization

Post by ariellalgilmore » Thu Jan 26, 2017 1:36 am

How has the increase of technology effected checkouts in the library?

My research has involved counting the checkouts from each dewey class in 100 blocks since 2005 till 2016. I always gathered the same information on item types being checked out. As technology has increased dramatically, did people checkout certain items that were not as accessible as just downloading from the internet? Did people stop getting videos and books because everything can be on any of your devices?

I created a query that got all the months from 2005 to 2016, separated each dewey class, and the different item types. This query took 292 seconds (about 5 min.) and effected 144 rows.

Code: Select all

SELECT
year(cout),
COUNT(CASE WHEN deweyClass >= 000 AND deweyClass<100 THEN 1 END) as
Info,
COUNT(CASE WHEN deweyClass >= 100 AND deweyClass<200 THEN 1 END) as
Phil_Psych,
COUNT(CASE WHEN deweyClass >= 200 AND deweyClass<300 THEN 1 END) as
Religion,
COUNT(CASE WHEN deweyClass >= 300 AND deweyClass<400 THEN 1 END) as
SS,
COUNT(CASE WHEN deweyClass >= 400 AND deweyClass<500 THEN 1 END) as
Lang,
COUNT(CASE WHEN deweyClass >= 500 AND deweyClass<600 THEN 1 END) as
Science,
COUNT(CASE WHEN deweyClass >= 600 AND deweyClass<700 THEN 1 END) as
Tech,
COUNT(CASE WHEN deweyClass >= 700 AND deweyClass<800 THEN 1 END) as
Arts,
COUNT(CASE WHEN deweyClass >= 800 AND deweyClass<900 THEN 1 END) as
Lit,
COUNT(CASE WHEN deweyClass >= 900 AND deweyClass<1000 THEN 1 END) as
History,
SUM(CASE WHEN itemtype = 'accas' OR itemtype = 'arcas' OR itemtype = 'bccas' OR itemtype = 'jccas' OR itemtype = 'jrcas' OR itemtype = 'accd' OR itemtype = 'arcd' OR itemtype = 'jccd'
OR itemtype = 'jrcd' OR itemtype = 'accdrom' OR itemtype = 'arcdrom' OR itemtype = 'bccdrom'
OR itemtype = 'drcdrom' OR itemtype = 'jccdrom' OR itemtype = 'acdisk' OR itemtype = 'ardisk' OR itemtype = 'jrdisk' OR itemtype = 'acdvd' OR itemtype = 'ardvd' OR itemtype = 'bcdvd'OR itemtype = 'jcdvd' OR itemtype = 'jrdvd' OR itemtype = 'xrcdrom' OR itemtype = 'ucflpdr' THEN 1 ELSE 0 END) AS Media,
SUM(CASE WHEN itemtype = 'acvhs' OR itemtype = 'alvhs' OR itemtype = 'bcvhs' OR itemtype = 'blvhs' OR itemtype = 'jcvhs' OR itemtype = 'jlvhs' OR itemtype = 'jrvhs' OR itemtype = 'xrvhs' OR itemtype = 'scmed' OR itemtype = 'acvid' THEN 1 ELSE 0 END) AS Video,
SUM(CASE WHEN itemtype = 'acbk' OR itemtype = 'arbk ' OR itemtype = 'bcbk' OR itemtype = 'drbk' OR itemtype = 'jcbk' OR itemtype = 'jrbk' OR itemtype = 'bccd' THEN 1 ELSE 0 END) AS Book
FROM spl_2016.inraw
WHERE year(cout) >= "2005" and year(cout) < 2017 
GROUP BY year(cout), month(cout)
ORDER BY year(cout);
The visualization part combines the dates, dewey class, and item types. Each column is a separate month in each year and the rows are the dewey classes. I began by mapping the greyscale of how the increase in checkouts in a certain dewey class was darker and vice versa. Because I separated the item types into three different categories: media, videos, and books, I mapped all these values and use them as the RGB values. Red representing media, green representing videos, and blue representing book. This created one color for each month showing the item type. The transparency is given by the amount of items being checked out each month compared to the other months. I also have the similar commands we did in the demo.

[The extension tif has been deactivated and can no longer be displayed.]

Analysis: The picture illustrates the color going from black to green to a light blue to a bright pink to a dark purple and then almost black again. The more transparent boxes mean that there was less amount of items being checked out during that time period. Around 2009 is when the most book were checked out, so it is clear why the transparency is the brightest. As it goes towards 2016, it almost becomes a dark blue/purple because less people are checking out media item types. At the beginning is the only time we see the green because that was the most videos checked out. I would like to create more interactions that can show better what this graph is representing. I was thinking of adding a key where when you pressed each color it described why that box is that certain color and transparency.
Project1_2.zip
(74.43 KiB) Downloaded 166 times

Post Reply