I wanted to figure out how long (in terms of number of words( a good title, respective to the discipline (dewey class), is. I was hoping to see results such as: People prefer books on religion with a title length of more than x words, while people prefer books on science with more than y words.
To do so, I aggregated the data first by their dewey class (only the first 3 digits) and then sub-aggregated by the number of words in the title. For each of these sub-groups, I summed up the number of checkouts and the number of books that exist in this sub-group. I then defined the popularity of the subgroup as the ratio of number of checkouts to number of books (of this dewey class and group of title lenght).
The three dimensions I consequently extracted from the database are
a) the dewey class
b) the title length
c) the popularity
For each dewey category, I then plotted the popularity vs. the title length.
I encountered several issues with my initial idea:
1) No pattern whatsoever was visible. It basically was just a clatter of seemingly random lines.
2) There was too much data. This makes problem 1) worse, but also is a performance issue.
Considering that we have 1000 dewey classes (only first 3 digits) and that the title lengths may span from anywhere between 1 to 40 (in average, there were 14.25 distinct title lengths per dewey), we end up in the ten-thousands of rows.
3) Some dewey categories are obviously more demanded than others. Therefore the lines are quite far out scattered.
To overcome the issues, I implemented following changes:
1) I aggregated the data by the first 2, instead of the first 3 digits of the dewey class. This reduced the data to approx 3000 rows and consequently results in 100 distinct lines. I was hoping that due to the aggregation and therefore more data per group being available, we would also be able to see trends (instead of randomness) more clearly.
2) I extracted the main dewey class from the database (rather than computing it during runtime in processing) in the hope to use it for a better color-coding.
3) I implemented interactive filters.
What I still want to do:
1) Move the filtered class to foreground when filter is activated.
2) Add a legend.
To get the data, I used following query (contains some columns, that I do not actually use).
Code: Select all
SELECT
main.DeweyMain as Dewey,
main.Dewey as Dewey,
title_word_count,
checkouts,
number_of_titles,
checkouts_per_title,
checkouts_per_dewey,
checkouts/checkouts_per_dewey AS checkouts_share,
checkouts/number_of_titles/checkouts_per_dewey AS checkouts_per_dewey_share,
number_of_titles_per_dewey,
number_of_titles/number_of_titles_per_dewey AS number_of_titles_share
FROM
(SELECT
LEFT(deweyClass, 1) AS DeweyMain,
LEFT(deweyClass, 2) AS Dewey,
LENGTH(title) - LENGTH(REPLACE(title, ' ','')) + 1 AS title_word_count,
COUNT(id) AS checkouts,
COUNT(DISTINCT(bibNumber)) AS number_of_titles,
COUNT(id) / COUNT(DISTINCT(bibNumber)) AS checkouts_per_title
FROM
spl_2016.inraw
WHERE
length(deweyClass) > 0
AND
cout > '2011-01-01'
GROUP BY
Dewey, title_word_count
) AS main
LEFT JOIN
(SELECT
LEFT(deweyClass, 2) AS Dewey,
COUNT(DISTINCT(bibNumber)) AS number_of_titles_per_dewey,
COUNT(id) AS checkouts_per_dewey
FROM
spl_2016.inraw
WHERE
cout > '2011-01-01'
GROUP BY
Dewey
) AS count
ON (count.Dewey=main.Dewey)
WHERE
title_word_count <= 40
ORDER BY
main.Dewey, title_word_count;
Which takes 281 seconds.
And the following processing code.
Code: Select all
Table table;
Table deweyMainNames;
float horzMargin = 40;
float vertMargin = 30;
boolean deweyFilter = false;
int deweyFilterValue = 0;
int rowCount;
boolean boldness = false;
float xMin = 1;
float xMax = 40;
float yMin = 0;
float yMax = (float)300;
float xStretch ;
float yStretch;
public void settings() {
size(1000, 800); // setup the size of the window
table = loadTable("query_first2digits.csv");
deweyMainNames = loadTable("dewe_main.csv");
rowCount = table.getRowCount();
xStretch = (width-2*horzMargin)/(xMax-xMin);
yStretch = (height-2*vertMargin)/(yMax-yMin);
smooth(40);
}
public void lines() {
int dewey;
int deweyMain;
int deweyPrev = 100;
int wordCount;
float checkoutsPerTitleShare;
float x1 =0;
float y1 =0;
float x2 ;
float y2 ;
for (int row=0; row<rowCount; row ++){
deweyMain = table.getInt(row, 0);
if (deweyFilter == true && boldness == false && deweyMain != deweyFilterValue ){
continue;
}
dewey = table.getInt(row, 1);
wordCount = table.getInt(row, 2);
checkoutsPerTitleShare = table.getFloat(row, 5);
x2 = horzMargin + (wordCount-1) * xStretch;
y2 = height - checkoutsPerTitleShare * yStretch - vertMargin;
if (deweyPrev == dewey) {
colorMode(HSB, 10, 100, 10);
stroke(deweyMain, 50, dewey%10, 60);
strokeWeight(3);
if (boldness == true && deweyMain == deweyFilterValue){
stroke(deweyMain, 50, 70, 100);
strokeWeight(7);
}
line(x1, y1, x2, y2);
//text(round(checkoutsPerTitleShare), width-horzMargin, y2);
}
x1 = x2;
y1 = y2;
deweyPrev = dewey;
}
}
public void axis(){
xAxis();
yAxis();
}
public void yAxis(){
text("checkouts", width-70, height-10);
float y ;
for (int checkouts=0; checkouts<=280; checkouts +=10){
y = height - ((checkouts) * yStretch + vertMargin);
text(checkouts, width-vertMargin, y);
}
}
public void xAxis(){
text("Title length", horzMargin, 20);
float x;
for (int wordCount=1; wordCount<=40; wordCount ++){
x = (wordCount-1) * xStretch + horzMargin;
text(wordCount, x, horzMargin);
}
}
public void title(){
if (deweyFilter){
text("Category "+deweyFilterValue+"00 ("+deweyMainNames.getString(deweyFilterValue,1)+")", width -300, 100);
}
}
public void legend(){
}
public void keyPressed() {
int number;
number = Integer.valueOf(key)-48;
System.out.println(number);
if (number >= 0 && number <=9){
deweyFilter = true;
deweyFilterValue = number;
colorMode(RGB);
}
else if (key=='c') {
deweyFilter = false;
}
else if (key=='b') {
boldness = !boldness;
}
}
public void draw(){
colorMode(RGB, 100);
background(0);
lines();
axis();
title();
legend();
}
Trying to move away from a stereotypical histogram representation, I also implemented a heatmap for typical title lengths of each dewey category. The x-axis hereby represents the length of the title, the y-axis separates the space into the dewey categories, while the brightness of each pixel (rectangle) represents the popularity of the title length.
The same data / query as in the previous attempt can be used.
Code: Select all
Table table;
Table deweyMainNames;
float rightMargin = 300;
float leftMargin = 200;
float vertMargin = 70;
boolean deweyFilter = false;
int deweyFilterValue = 0;
int rowCount;
boolean boldness = false;
float xStretch ;
float yStretch;
float deltaX = 20;
float deltaY = 10;
float rectWidth;
float rectHeight;
public void settings() {
size(1400, 800); // setup the size of the window
table = loadTable("query_main.csv");
deweyMainNames = loadTable("dewey_main.csv");
rowCount = table.getRowCount();
rectWidth = (width - (leftMargin + rightMargin))/ deltaX*0.95;
rectHeight = (height - 2 * vertMargin)/ deltaY*0.95;
smooth();
}
public void rectangles() {
int deweyMain;
int wordCount;
float checkoutsPerTitleShare;
float x ;
float y ;
xStretch = (width- (leftMargin +rightMargin))/(deltaX);
yStretch = (height-2*vertMargin)/(deltaY);
for (int row=0; row<rowCount; row ++){
deweyMain = table.getInt(row, 0);
wordCount = table.getInt(row, 2);
checkoutsPerTitleShare = table.getFloat(row, 7)*2000;
x = leftMargin + (wordCount-1) * xStretch;
y = deweyMain * yStretch + vertMargin;
fill(checkoutsPerTitleShare);
rect(x, y, rectWidth, rectHeight);
}
}
public void axis(){
xAxis();
yAxis();
}
public void yAxis(){
fill(255);
float y ;
for (int dewey=0; dewey<=9; dewey +=1){
y = height - ((dewey) * yStretch + vertMargin) -rectHeight/2;
text(deweyMainNames.getString(dewey,1), vertMargin, y);
}
}
public void xAxis(){
fill(255);
text("Title length", leftMargin, 20);
textAlign(LEFT);
float x;
for (int wordCount=1; wordCount<=20; wordCount ++){
x = (wordCount-1) * xStretch + leftMargin;
text(wordCount, x + rectWidth/2, vertMargin-10);
}
}
public void title(){
if (deweyFilter){
text("Category "+deweyFilterValue+"00 ("+deweyMainNames.getString(deweyFilterValue,1)+")", width -300, 100);
}
}
public void legend(){
float y ;
int colVal;
for (int cell = 0; cell <=5; cell+=1){
colVal = cell * 50;
fill(colVal);
stroke(255);
y = (cell * yStretch) + vertMargin +50 ;
rect(width-rightMargin/2, y, rectWidth, rectHeight);
fill(255);
textAlign(LEFT, CENTER);
text(colVal, width-rightMargin/2-50, y+rectHeight/2);
stroke(0);
}
}
public void keyPressed() {
int number;
number = Integer.valueOf(key)-48;
System.out.println(number);
if (number >= 0 && number <=9){
deweyFilter = true;
deweyFilterValue = number;
colorMode(RGB);
}
else if (key=='c') {
deweyFilter = false;
}
else if (key=='b') {
boldness = !boldness;
}
}
public void draw(){
background(0);
//lines();
rectangles();
axis();
title();
legend();
}
Since I finally felt pretty unsatisfied with the overall result and its insights into the data, I made a final attempt in a different approach:
The popularity of a specific title length is one thing, but I wanted to further know what words a popular title would contain. I therefore queried for each dewey category (first two digits) the 5 most popular first, second, third, fourth and fifth word in the title. My hope was that if I build a graph and connect these words, I could come up with the perfect title for each dewey category.
Unfortunately, I was not able to write a query to get all the data in one request. Therefore the following query is iterated in a python script with changing parameters:
Code: Select all
SELECT
count(id) AS checkouts,
LEFT(deweyClass, 2) AS deweySub,
{wordNum},
SUBSTRING_INDEX(SUBSTRING_INDEX(title, ' ', {wordNum}),' ', -1) as word
FROM
spl_2016.inraw
WHERE
LEFT(deweyClass, 2) = '{dewey}'
GROUP BY
word
ORDER BY
checkouts DESC
LIMIT
5
The query terminates in approximately 19 seconds.