Concept description
I obtain the most common words in the titles of Dewey 641 (food & drink) media and use t-SNE for dimensionality reduction to explore the data in 3-D space. I also generate a network linking words that co-occur in the same title and extract its community structure.
MySQL queries
I queried the titles of all Dewey 641 media that were checked out between 2006 and 2018. I noticed a text encoding problem with the database: special characters are stored incorrectly, so that, for instance, I would get 'caf‚' instead of 'café'. I was unable to find a full solution, but the following query mitigates the problem somewhat (words with special characters are still wrong, but they aren't as gnarly).
Code: Select all
SELECT
    CONVERT(CAST(CONVERT(title USING LATIN1) AS BINARY) USING UTF8) AS fixedTitle,
    COUNT(bibNumber) AS Counts
FROM
    spl_2016.inraw
WHERE
    deweyClass >= 641 AND deweyClass < 642
    AND YEAR(cout) BETWEEN 2006 AND 2018
GROUP BY fixedTitle
ORDER BY Counts DESC
LIMIT 20000
The query took 32 seconds.
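The same kind of repair can also be attempted after the fact in Python. The following is a minimal sketch of the re-encode/decode trick, not part of my actual pipeline, and the codec pair is an assumption on my part: 'caf‚' looks like cp850 bytes (where é is 0x82) being displayed through cp1252.
Code: Select all
# Hedged sketch: undo mojibake by re-encoding with the codec that produced the
# visible characters (assumed cp1252) and decoding with the intended one (assumed cp850).
def fix_mojibake(s, shown_as='cp1252', meant_as='cp850'):
    try:
        return s.encode(shown_as).decode(meant_as)
    except UnicodeError:
        return s  # leave unrecoverable strings unchanged

print(fix_mojibake('caf\u201a'))  # -> 'café' if the codec assumption holds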
Next, I processed the titles using word2vec and extracted coordinates in 3-dimensional space using t-SNE. I also generated a network in which words are nodes and an edge connects two words that occur in the same title; the weight of an edge counts how many titles the two words have occurred in together. The code is attached as a Jupyter notebook; the corresponding Python code follows:
Code: Select all
#!/usr/bin/env python
# coding: utf-8
# MAT 259: 3D Visualization
#
# Chantal Nguyen
#
# Instructor: George Legrady
#
#
# The following code reads in a list of titles of all items at the Seattle Public Library in Dewey category 641 (food & drink). Words are extracted and processed using word2vec, and dimensionality reduction is performed using t-SNE to facilitate visualization in 3D space.
#
# Adapted from Andy Patel (https://labsblog.f-secure.com/2018/01/30/nlp-analysis-of-tweets-using-word2vec-and-t-sne/)
from sklearn.manifold import TSNE
from collections import Counter
from collections import OrderedDict
from six.moves import cPickle
import gensim.models.word2vec as w2v
import numpy as np
from scipy import sparse
import os
import sys
import io
import re
import json
import multiprocessing
# Define a helper that either loads existing data from disk or executes a function to acquire the data:
def try_load_or_process(filename, processor_fn, function_arg):
    load_fn = None
    save_fn = None
    if filename.endswith("json"):
        load_fn = load_json
        save_fn = save_json
    else:
        load_fn = load_bin
        save_fn = save_bin
    if os.path.exists(filename):
        return load_fn(filename)
    else:
        ret = processor_fn(function_arg)
        save_fn(ret, filename)
        return ret
def print_progress(current, maximum):
    sys.stdout.write("\r")
    sys.stdout.flush()
    sys.stdout.write(str(current) + "/" + str(maximum))
    sys.stdout.flush()
def save_bin(item, filename):
    with open(filename, "wb") as f:
        cPickle.dump(item, f)

def load_bin(filename):
    if os.path.exists(filename):
        with open(filename, "rb") as f:
            return cPickle.load(f)

def save_json(variable, filename):
    with io.open(filename, "w", encoding="utf-8") as f:
        # json.dumps returns a str in Python 3; no unicode() wrapper needed
        f.write(json.dumps(variable, indent=4, ensure_ascii=False))

def load_json(filename):
    ret = None
    if os.path.exists(filename):
        try:
            with io.open(filename, "r", encoding="utf-8") as f:
                ret = json.load(f)
        except ValueError:
            pass
    return ret
# Load the raw data:
def process_raw_data(input_file):
    lines = []
    print("Loading raw data from: " + input_file)
    if os.path.exists(input_file):
        with io.open(input_file, 'r', encoding="utf-8") as f:
            lines = f.readlines()
    num_lines = len(lines)
    ret = []
    for count, text in enumerate(lines):
        if count % 50 == 0:
            print_progress(count, num_lines)
        text = text.strip()
        if text not in ret:  # keep only unique titles
            ret.append(text)
    return ret
# Tokenize titles (sentences):
def tokenize_sentences(sentences):
    ret = []
    max_s = len(sentences)
    print("Got " + str(max_s) + " sentences.")
    for count, s in enumerate(sentences):
        tokens = []
        for w in re.split(r'(\s+)', s):
            w = w.strip().lower()
            # skip empty and whitespace-only fragments
            if len(w) > 0 and not w.isspace():
                tokens.append(w)
        if len(tokens) > 0:
            ret.append(tokens)
        if count % 50 == 0:
            print_progress(count, max_s)
    return ret
# Clean titles using a list of common stopwords in five languages, plus a self-defined list ("cookbook", etc.):
def clean_sentences(tokens):
    all_stopwords = load_json("stopwords-iso.json")
    extra_stopwords = ["cook", "cookbook", "book", "cooking", "recipe", "recipes", "food", "foods"]
    stopwords = []
    if all_stopwords is not None:
        stopwords += all_stopwords["en"] + all_stopwords["fr"] + all_stopwords["es"] + all_stopwords["de"] + all_stopwords["it"]
    stopwords += extra_stopwords
    stopwords = set(stopwords)
    ret = []
    max_s = len(tokens)
    for count, sentence in enumerate(tokens):
        if count % 50 == 0:
            print_progress(count, max_s)
        cleaned = []
        for token in sentence:
            if len(token) == 0 or token in stopwords:
                continue
            # drop purely numeric tokens (years, fractions, etc.)
            if re.search(r"^[0-9\.\-\s\/]+$", token):
                continue
            cleaned.append(token)
        if len(cleaned) > 0:
            ret.append(cleaned)
    return ret
# Get word frequencies:
def get_word_frequencies(corpus):
    frequencies = Counter()
    for sentence in corpus:
        for word in sentence:
            frequencies[word] += 1
    return frequencies.most_common()
# Get word2vec representations (gensim 3.x API):
def get_word2vec(sentences):
    num_workers = multiprocessing.cpu_count()
    num_features = 200
    epoch_count = 10
    sentence_count = len(sentences)
    w2v_file = os.path.join(save_dir, "word_vectors.w2v")
    if os.path.exists(w2v_file):
        print("w2v model loaded from " + w2v_file)
        word2vec = w2v.Word2Vec.load(w2v_file)
    else:
        # skip-gram model with 200 features per word
        word2vec = w2v.Word2Vec(sg=1,
                                seed=1,
                                workers=num_workers,
                                size=num_features,
                                min_count=3,
                                window=5,
                                sample=0)
        print("Building vocab...")
        word2vec.build_vocab(sentences)
        print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
        print("Training...")
        word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
        print("Saving model...")
        word2vec.save(w2v_file)
    return word2vec
# Get word associations:
def most_similar(input_word, num_similar):
    sim = word2vec.wv.most_similar(input_word, topn=num_similar)
    found = [w for w, n in sim]
    return [input_word, found]

def test_word2vec(test_words):
    output = []
    associations = OrderedDict()
    for word in test_words:
        if word not in associations:
            associations[word] = []
        similar = most_similar(word, num_similar)
        output.append(similar)
        for s in similar[1]:
            if s not in associations[word]:
                associations[word].append(s)
    save_json(output, os.path.join(save_dir, "word2vec_test.json"))
    save_json(associations, os.path.join(save_dir, "associations.json"))
    filename = os.path.join(save_dir, "associations.csv")
    with io.open(filename, "w", encoding="utf-8") as handle:
        handle.write(u"Source,Target\n")
        for w, sim in associations.items():  # iteritems() is Python 2 only
            for s in sim:
                handle.write(w + u"," + s + u"\n")
    return output
# Run t-SNE with 3 output dimensions:
def calculate_t_sne():
    vocab = word2vec.wv.vocab.keys()
    vocab_len = len(vocab)
    arr = np.empty((0, dim0), dtype='f')
    labels = []
    vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
    labels_file = os.path.join(save_dir, "labels.json")
    if os.path.exists(vectors_file) and os.path.exists(labels_file):
        print("Loading pre-saved vectors from disk")
        arr = load_bin(vectors_file)
        labels = load_json(labels_file)
    else:
        print("Creating an array of vectors for each word in the vocab")
        for count, word in enumerate(vocab):
            if count % 50 == 0:
                print_progress(count, vocab_len)
            w_vec = word2vec.wv[word]  # index the KeyedVectors, not the model itself
            labels.append(word)
            arr = np.append(arr, np.array([w_vec]), axis=0)
        save_bin(arr, vectors_file)
        save_json(labels, labels_file)
    x_c_filename = os.path.join(save_dir, "x_coords.npy")
    y_c_filename = os.path.join(save_dir, "y_coords.npy")
    z_c_filename = os.path.join(save_dir, "z_coords.npy")
    if os.path.exists(x_c_filename) and os.path.exists(y_c_filename) and os.path.exists(z_c_filename):
        print("Reading pre-calculated coords from disk")
        x_coords = load_bin(x_c_filename)
        y_coords = load_bin(y_c_filename)
        z_coords = load_bin(z_c_filename)
    else:
        print("Computing T-SNE for array of length: " + str(len(arr)))
        tsne = TSNE(n_components=3, random_state=1, verbose=1)
        np.set_printoptions(suppress=True)
        Y = tsne.fit_transform(arr)
        x_coords = Y[:, 0]
        y_coords = Y[:, 1]
        z_coords = Y[:, 2]
        print("Saving coords.")
        save_bin(x_coords, x_c_filename)
        save_bin(y_coords, y_c_filename)
        save_bin(z_coords, z_c_filename)
    return x_coords, y_coords, z_coords, labels, arr
# Run the code:
input_dir = ""
save_dir = ""
# if not os.path.exists(save_dir):
#     os.makedirs(save_dir)
print("Preprocessing raw data")
raw_input_file = os.path.join(input_dir, "items.csv")
filename = os.path.join(save_dir, "data.json")
processed = try_load_or_process(filename, process_raw_data, raw_input_file)
print("Unique sentences: " + str(len(processed)))
print("Tokenizing sentences")
filename = os.path.join(save_dir, "tokens.json")
tokens = try_load_or_process(filename, tokenize_sentences, processed)
print('\n')
print("Cleaning tokens")
filename = os.path.join(save_dir, "cleaned.json")
cleaned = try_load_or_process(filename, clean_sentences, tokens)
print('\n')
print("Getting word frequencies")
filename = os.path.join(save_dir, "frequencies.json")
frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
vocab_size = len(frequencies)
print('\n')
print("Unique words: " + str(vocab_size))
print("Instantiating word2vec model")
word2vec = get_word2vec(cleaned)
vocab = list(word2vec.wv.vocab.keys())  # list() so the keys can be indexed in Python 3
vocab_len = len(vocab)
print("word2vec vocab contains " + str(vocab_len) + " items.")
dim0 = word2vec.wv[vocab[0]].shape[0]
print("word2vec items have " + str(dim0) + " features.")
print("Calculating T-SNE for word2vec model")
x_coords, y_coords, z_coords, labels, arr = calculate_t_sne()
# Save labels, 3-D coordinates, and frequencies (reordered to match the labels) as csv files:
with io.open(os.path.join(save_dir, 'labels.csv'), mode='w', encoding='utf-8') as f:
    for row in labels:
        f.write(row + '\n')
coords = np.transpose(np.vstack((x_coords, y_coords, z_coords)))
with open(os.path.join(save_dir, 'coords.csv'), 'w') as f:
    for x, y, z in coords:
        f.write('%f, %f, %f\n' % (x, y, z))
freq = dict(frequencies)
reordered_freqs = [freq[word] for word in labels]
with open(os.path.join(save_dir, 'frequencies.csv'), 'w') as f:
    for num in reordered_freqs:
        f.write('%f\n' % num)
num_similar = 20
test_words = list(labels)
associations = test_word2vec(test_words)
# Create the word network: each node is a distinct word in the vocab, and an edge
# connects two words that occur in the same title. The edge weight counts how many
# titles the pair co-occurs in. The network is stored as a sparse adjacency matrix
# whose indices are in the same order as the t-SNE labels.
labeldict = OrderedDict()
for count, word in enumerate(labels):
    labeldict[word] = count
vocab_set = set(vocab)  # set membership is much faster than a list scan
adjacency_matrix = np.zeros((len(labels), len(labels)))
for title in cleaned:
    for i in range(0, len(title) - 1):
        for j in range(i + 1, len(title)):
            if title[i] in vocab_set and title[j] in vocab_set:
                adjacency_matrix[labeldict[title[i]]][labeldict[title[j]]] += 1
                adjacency_matrix[labeldict[title[j]]][labeldict[title[i]]] += 1
sAdj = sparse.csr_matrix(adjacency_matrix)
# save as mtx file; alias scipy.io so it doesn't shadow the io module used above
from scipy import io as sio
sio.mmwrite('adjacency.mtx', sAdj, field='integer', symmetry='general')
# ignoring the first 3 header lines, resave as a csv file with comma delimiters instead of whitespace
with open('adjacency.mtx', 'r') as f, open('adjacency.csv', 'w') as f2:
    for _ in range(3):
        next(f)
    for row in f:
        f2.write(re.sub(r'\s+', ',', row.strip()) + '\n')
I also obtained the community structure of the word network in Matlab. The code requires the GenLouvain package, available at http://netwiki.amath.unc.edu/GenLouvain/GenLouvain. The community detection algorithm finds clusters (communities) of nodes that are densely connected to each other but sparsely connected to the rest of the network.
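Concretely, GenLouvain finds a partition that (locally) maximizes the modularity

Q = (1/2m) * sum_ij [ A_ij - gamma * k_i * k_j / (2m) ] * delta(c_i, c_j),

where A_ij is the weight of the edge between words i and j, k_i is the weighted degree of word i, m is the total edge weight, c_i is the community of word i, and gamma is the resolution parameter (resp in the code below); larger gamma yields more, smaller communities. I used gamma = 1.2.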
Code: Select all
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% MAT 259 Project 3 %
% Determine community structure for word network %
% Author: Chantal Nguyen %
% Supervisor: George Legrady %
% %
% Reads in csv file containing sparse adjacency matrix %
% Outputs csv file containing community number for each word %
% Requires GenLouvain package available at %
% http://netwiki.amath.unc.edu/GenLouvain/GenLouvain %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% read in csv file containing 3 columns: source node, target node, edge weight
sparse_adj = csvread('~/Documents/classes/MAT259/proj3/data/adjacency.csv');
% convert to full adjacency matrix, then to a weighted graph object
A = zeros(max(sparse_adj(:,1)));
for i = 1:length(sparse_adj)
    A(sparse_adj(i,1), sparse_adj(i,2)) = sparse_adj(i,3);
end
G = graph(A);
A = adjacency(G, 'weighted');
% do the community detection; resp is the resolution parameter gamma
resp = 1.2;
C = modularity(A, resp);
[S, Q, n_it] = iterated_genlouvain(C);
% write to csv file
csvwrite('~/Documents/classes/MAT259/proj3/data/communities.csv', S);
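For those who'd rather stay in Python, a rough equivalent can be sketched with networkx and the python-louvain package. Note that this runs plain Louvain rather than GenLouvain's iterated variant, so the communities won't exactly match mine; the package and function names below reflect python-louvain and networkx 2.x as I understand them, not code I actually used.
Code: Select all
# Hedged sketch: Louvain community detection on the saved adjacency matrix.
# Requires: pip install networkx python-louvain (networkx 2.x assumed).
import csv
import networkx as nx
import community as community_louvain  # python-louvain
from scipy.io import mmread

adj = mmread('adjacency.mtx').tocsr()      # sparse adjacency matrix saved earlier
G = nx.from_scipy_sparse_matrix(adj)       # weighted, undirected graph
partition = community_louvain.best_partition(G, weight='weight', resolution=1.2)

# write one community number per node, in node-index order (matching the t-SNE labels)
with open('communities.csv', 'w') as f:
    writer = csv.writer(f)
    for node in sorted(partition):
        writer.writerow([partition[node]])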
Screenshots and final results
I plotted each word as a circle positioned at its t-SNE coordinates, with the size of the circle proportional to the word's frequency in the body of titles. The network edges can also be shown, but the network is fairly dense, so they are hidden by default. Pressing keys 1-9 or Q-P shows all edges at or above a weight threshold: the number keys correspond to thresholds 1-9, and Q-P correspond to thresholds 10-19 (e.g., pressing 3 shows all edges with weight greater than or equal to 3). Pressing 0 hides the edges again. Hovering over a point brings up the word's label along with its 20 most similar words. Pressing C toggles between two color schemes: one in which color is mapped to word frequency, and one in which identical colors indicate words in the same community. The user can also type a word into the search bar in the lower right corner to highlight the corresponding node.
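In pseudocode terms, the edge-threshold key mapping amounts to the following (a hypothetical Python sketch of the mapping only, not my actual visualization code):
Code: Select all
# Hypothetical sketch of the key -> edge-weight threshold mapping described above.
ROW = "qwertyuiop"  # Q through P map to thresholds 10-19

def threshold_for_key(key):
    if key == '0':
        return None                          # hide all edges
    if key.isdigit():
        return int(key)                      # '1'-'9' -> thresholds 1-9
    if key.lower() in ROW:
        return 10 + ROW.index(key.lower())   # 'q'-'p' -> thresholds 10-19
    return None

# an edge is drawn when the threshold is not None and its weight >= threshold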