I started by acquiring historical weather data for Seattle, downloaded from the National Oceanic and Atmospheric Administration (NOAA) Surface Data Hourly Global dataset (https://www7.ncdc.noaa.gov/CDO/cdopoema ... olution=40). The dataset contains entries from April 2005 up to today. After removing unnecessary columns, I ended up with the following fields:
Code: Select all
DATE: Date of the entry,
TIME: Time of the entry,
DIR: Wind direction (0-360), 0 is North,
SPD: Wind speed in mph,
TEMPF: Temperature in °F,
TEMPC: Temperature in °C,
SLP: Sea level pressure
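For clarity, here is how one row of this cleaned-up file might be parsed with the standard library. The column order follows the list above, but the function, the assumed HH:MM time format, and the sample row are my own illustration, not part of the original pipeline:

```python
from datetime import datetime

def parse_weather_row(row):
    """Parse one cleaned NOAA row (DATE, TIME, DIR, SPD, TEMPF, TEMPC, SLP)
    into typed values. The exact file layout is an assumption."""
    return {
        'datetime': datetime.strptime("%s %s" % (row[0], row[1]), '%Y-%m-%d %H:%M'),
        'dir': int(row[2]),      # wind direction, 0-360, 0 = North
        'spd': float(row[3]),    # wind speed in mph
        'tempf': float(row[4]),  # temperature in Fahrenheit
        'tempc': float(row[5]),  # temperature in Celsius
        'slp': float(row[6]),    # sea level pressure
    }

# Made-up example row:
sample = ['2005-04-19', '13:00', '180', '7.0', '55.0', '12.8', '1013.2']
parsed = parse_weather_row(sample)
```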
Code: Select all
SELECT
    DATE(sth.checkOut) 'date',
    HOUR(sth.checkOut) 'hour',
    COUNT(*) 'count',
    GROUP_CONCAT(sth.subject SEPARATOR '\n') 'subjects'
FROM
    (SELECT
        s.bibNumber, s.subject, t.checkOut
    FROM
        spl_2016.transactions t
        JOIN spl_2016.subject s ON s.bibNumber = t.bibNumber
    WHERE
        t.checkOut >= '2005-04-19'
        AND t.checkOut <= '2017-01-01') AS sth
GROUP BY DATE(sth.checkOut), FLOOR(HOUR(sth.checkOut) / 6)
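One detail worth noting in the query: the GROUP BY uses FLOOR(HOUR(sth.checkOut) / 6), so checkouts are aggregated into four 6-hour windows per day rather than per individual hour. A minimal Python sketch of the same bucketing (illustration only, not part of the pipeline):

```python
from datetime import datetime

def six_hour_bucket(ts):
    """Mirror the SQL term FLOOR(HOUR(checkOut) / 6):
    bucket 0 = 00:00-05:59, bucket 1 = 06:00-11:59,
    bucket 2 = 12:00-17:59, bucket 3 = 18:00-23:59."""
    return ts.hour // 6

morning = six_hour_bucket(datetime(2005, 4, 19, 9, 30))   # falls in bucket 1
evening = six_hour_bucket(datetime(2005, 4, 19, 21, 0))   # falls in bucket 3
```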
Code: Select all
# UTILITIES
import csv
from datetime import datetime
# NLP STUFF
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.probability import FreqDist

lmtzr = WordNetLemmatizer()

class Item(object):
    # datetime, count, subjects
    def __init__(self, d, c, s):
        self.datetime = d
        self.count = c
        self.subjects = s

def getFrequencyList(sentence):
    # TOKEN EXTRACTION
    tokens = nltk.word_tokenize(sentence)
    # CONVERT TO LOWERCASE
    tokens = [w.lower() for w in tokens]
    # LEMMATIZATION
    lemmatized = []
    for token in tokens:
        lemmatized.append(lmtzr.lemmatize(token))
    # FREQUENCY
    fdist1 = FreqDist(lemmatized)
    # returns as [(',', 18713), ('the', 13721), ...]
    return fdist1.most_common(len(fdist1))

theArray = []
with open('words_05_17.csv', 'rt', encoding='utf-8') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    try:
        for row in csvreader:
            if row[0] != 'date':  # skip the header row
                print("%s %s:00" % (row[0], row[1]))
                dt = datetime.strptime("%s %s:00" % (row[0], row[1]), '%Y-%m-%d %H:00')
                subject = ''
                for x in getFrequencyList(row[3]):
                    subject += "%s %s#" % (x[0], x[1])
                theArray.append(Item(dt, int(row[2]), subject))
    except (UnicodeDecodeError, UnicodeEncodeError):
        print("wrong read...")

with open('nltk_05_17.csv', 'wt', encoding='utf-8') as csvfile:
    csvwriter = csv.writer(csvfile)
    for item in theArray:
        csvwriter.writerow([item.datetime, item.count, item.subjects])
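The subjects column written by the script above packs each (word, count) pair as "word count#". A helper like the following (my own sketch, not part of the original script) can unpack that string when reading nltk_05_17.csv back:

```python
def parse_subjects(packed):
    """Split a 'word count#word count#...' string, as written by the
    export script, back into (word, count) pairs."""
    pairs = []
    for chunk in packed.split('#'):
        if not chunk:
            continue  # trailing '#' leaves an empty chunk
        word, _, count = chunk.rpartition(' ')
        pairs.append((word, int(count)))
    return pairs

pairs = parse_subjects("the 13721#science 42#")
```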
Whenever the user types something on the keyboard, the system searches for the phrase within the keywords (regex is allowed: http://regexr.com). On the right-hand side of the GUI, the search results are listed; on the left side, the first word of the list is shown in detail. The search results are also highlighted in the word cloud. The Time Barcode shape in the detail panel marks the usage dates of the current word, and the red line corresponds to its average date. The barcode spans the full data range, from April 2005 on the left to January 2017 on the right.

In addition, the sliders at the bottom left filter what is drawn inside the cube. The left-most buttons focus the camera on a specific axis. The Invert Colors option switches the color scheme from dark to light, Auto Rotation rotates the camera automatically, and the Color Search option paints the words matching the current search term according to their average temperature values.

A couple of minutes after using the visualization, I discovered a pattern among the low-frequency items. Since these items appear only once in the dataset, their average temperature is simply the temperature at the moment they were used. As a result, the Date-Temperature plane clearly shows seasonality (winter-summer differences in temperature). The source code (requires the PeasyCam and ControlP5 libraries) is below:
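Before the Processing sketch itself, the low-frequency observation can be made concrete: a word checked out only once inherits the temperature of that single checkout, while a frequent word averages out toward the yearly mean. A quick illustration in Python (made-up temperatures, not values from the dataset):

```python
def average_temperature(temps):
    """Average temperature (°C) over all checkouts of a word."""
    return sum(temps) / len(temps)

# A one-off word keeps the temperature of its single checkout,
# so it lands exactly on that day's position in the Date-Temperature plane:
rare = average_temperature([2.0])
# A frequent word is pulled toward the middle of the temperature range:
common = average_temperature([2.0, 12.0, 25.0, 11.0])
```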