Dynamical Representation of Supply and Demand
Concept
For this project, I would like to map statistics from my dataset to the parameters of a physical system. The specific statistics are:
1. the current supply in the library of each "title" that I am considering
2. the time difference between when a specific book is checked into the library and when it is checked out again.
These values will be mapped respectively to:
1. The strength of a 1/r radial field.
2. The strength of a 1/r circulation field, which also sets the color of the particles.
These fields dictate the motion of many particles, which exist inside a box. Because the values of the current supply and time difference will change over time, the particles' motion is governed by a time-varying acceleration field.
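As a rough sketch (in Python rather than my actual Processing code) of how such an acceleration field might be evaluated; `radial_strength` and `circ_strength` are placeholder names for the mapped statistics:

```python
import numpy as np

def acceleration(pos, radial_strength, circ_strength, eps=1e-6):
    """Acceleration at 3D position `pos`: a 1/r radial pull toward the
    origin plus a 1/r circulation around the z-axis."""
    r = np.linalg.norm(pos) + eps            # avoid division by zero at the center
    radial = -radial_strength * pos / r**2   # magnitude ~ 1/r, directed inward
    # circulation: tangent to circles around the z-axis, magnitude ~ 1/r_xy
    r_xy = np.hypot(pos[0], pos[1]) + eps
    tangent = np.array([-pos[1], pos[0], 0.0]) / r_xy
    circ = circ_strength * tangent / r_xy
    return radial + circ
```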
To summarize, the behavior of the particles in relation to the supply and demand is roughly:
1. As supply goes up, the strength of the radial field increases and pulls the particles into the center, making the system more ordered.
2. As supply goes down, the radial strength decreases, so particles can explore a greater region of the cube. To emphasize this, I have added a spring-like noise field.
3. As demand goes up, circulation goes up, and the system will circulate faster. Note that I will be using exponentially smoothed time_diffs, rather than an average or instantaneous value.
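The exponential smoothing mentioned in point 3 could look like the following sketch (the `alpha` value here is an illustrative choice, not the one I actually use):

```python
import numpy as np

def exp_smooth(values, alpha=0.1):
    """Exponentially weighted moving average: each output blends the
    newest sample with the previous smoothed value."""
    smoothed = np.zeros_like(values, dtype='float64')
    smoothed[0] = values[0]
    for i in range(1, len(values)):
        smoothed[i] = alpha * values[i] + (1 - alpha) * smoothed[i - 1]
    return smoothed
```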
Work in Progress
SQL Query
Code: Select all
SELECT
    itemNumber, title, cout, cin
FROM
    spl_2016.inraw
WHERE
    (deweyClass < 5.121 AND deweyClass > 5.09) -- the other queries are identical except with ranges (6.451, 6.29), (4.671, 4.59), (5.134, 5.132)
ORDER BY itemNumber
Python Script
This Python script takes the CSVs generated from the SQL query and converts them into a time_diff and influx array for each title. The steps are:
1. shift the checkin column down one row relative to checkout
2. calculate the difference across each row
3. filter out negative time diffs and time diffs over 150 (to prevent outliers from significantly affecting the data)
4. convert checkout and checkin times to relative positions in an array
5. create the influx array by adding +/- 1 at the relative positions from step 4
6. "hash" the time_diffs from step 3, with the hash being the value from step 4, and average element-wise by the number of checkouts per day
7. along the way, also calculate the max supply for each title, as well as a global max supply and global max/min time_diffs
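A tiny worked example of steps 4 through 6, with made-up checkout/checkin days rather than data from the query:

```python
import numpy as np

number_of_days = 5
cout_days = [1, 1, 3]       # step 4: checkout times as array positions
cin_days = [2, 4]           # step 4: checkin times as array positions
diffs = [10.0, 20.0, 6.0]   # step 3: time diffs paired with cout_days

# step 5: influx = checkouts minus checkins per day
checkouts = np.zeros(number_of_days)
checkins = np.zeros(number_of_days)
for d in cout_days:
    checkouts[d] += 1
for d in cin_days:
    checkins[d] += 1
influx = checkouts - checkins

# step 6: bucket the diffs by checkout day, then average per day,
# guarding against division by zero on days with no checkouts
diff_sums = np.zeros(number_of_days)
for d, t in zip(cout_days, diffs):
    diff_sums[d] += t
avg_diffs = np.divide(diff_sums, checkouts,
                      out=np.zeros_like(diff_sums), where=checkouts != 0)
# influx is [0, 2, -1, 1, -1]; avg_diffs is [0, 15, 0, 6, 0]
```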
Code: Select all
import datetime
import pandas as pd
import numpy as np

path1 = '/home/benson/Dropbox/Code/Projects/Mat259_3D/raw_data/programming_languages.csv'
path2 = '/home/benson/Dropbox/Code/Projects/Mat259_3D/raw_data/networking.csv'
path3 = '/home/benson/Dropbox/Code/Projects/Mat259_3D/raw_data/machine_learning.csv'
path4 = '/home/benson/Dropbox/Code/Projects/Mat259_3D/raw_data/software.csv'
paths = [path1, path2, path3, path4]

def data_extractor(path, number_of_days=4500):
    number = number_of_days
    df = pd.read_csv(path)
    # shift checkins down one row so each row pairs a checkout with the
    # previous checkin; drop the first row of each item, whose shifted
    # checkin leaked in from the preceding item
    df['cin'] = df['cin'].shift(1)
    shifted_df = df.groupby('itemNumber').apply(lambda group: group.iloc[1:])
    ################################################################
    def stringToDatetime(string):
        return datetime.datetime.strptime(string, '%Y-%m-%d %H:%M:%S')
    shifted_df['cout_date'] = shifted_df['cout'].apply(stringToDatetime)
    shifted_df['cin_date'] = shifted_df['cin'].apply(stringToDatetime)
    shifted_df['time_diff_date'] = shifted_df['cout_date'] - shifted_df['cin_date']
    shifted_df['time_diff'] = shifted_df['time_diff_date'].apply(lambda date: date.days)
    ################################################################
    ## Creating a new dataframe for better readability
    filtered_df = shifted_df[shifted_df.time_diff > 0]
    cutoff = 150
    def frac_over(series):
        over_sum = 0
        for s in series:
            if s > cutoff:
                over_sum += 1
        total = len(series)
        return over_sum / total
    # print the fraction to make sure we aren't cutting off too much
    print("If this fraction is too high, increase the cutoff variable (current value = {}) in this script: {}".format(cutoff, frac_over(filtered_df.time_diff)))
    filtered_df = filtered_df[filtered_df.time_diff < cutoff]
    max_time_diff = filtered_df['time_diff'].max()
    print("Max time diff: {}".format(max_time_diff))
    min_time_diff = filtered_df['time_diff'].min()
    print("Min time diff: {}".format(min_time_diff))
    supply = len(list(filtered_df['itemNumber'].unique()))
    print("Max supply: {}".format(supply))
    ################################################################
    def base_diff(string):
        start = datetime.datetime(2006, 1, 1, 0, 0)
        end = datetime.datetime.strptime(string, '%Y-%m-%d %H:%M:%S')
        diff = end - start
        return diff.days
    filtered_df['cout_time'] = filtered_df['cout'].apply(base_diff)
    filtered_df['cin_time'] = filtered_df['cin'].apply(base_diff)
    # get max times to make sure they fit inside number_of_days
    print("latest checkout time: {}".format(filtered_df['cout_time'].max()))
    print("latest checkin time: {}".format(filtered_df['cin_time'].max()))
    ################################################################
    def checkout_array():
        a = np.zeros(number, dtype='float64')
        for time in filtered_df.cout_time:
            a[time] += 1
        return a
    def checkin_array():
        a = np.zeros(number, dtype='float64')
        for time in filtered_df.cin_time:
            a[time] += 1
        return a
    def timediffs_array():
        a = np.zeros(number, dtype='float64')
        # zip instead of enumerate: the dataframe's index is not positional
        # after the groupby, so filtered_df["time_diff"][i] would look up
        # the wrong row
        for time, diff in zip(filtered_df.cout_time, filtered_df.time_diff):
            a[time] += diff
        return a
    checkouts = checkout_array()
    checkins = checkin_array()
    time_diffs = timediffs_array()
    final_time_diffs = np.divide(time_diffs, checkouts,
                                 out=np.zeros_like(time_diffs), where=(checkouts != 0))
    # checking with my jupyter notebook
    print("Average checkouts per day: {}".format(checkouts.mean()))
    print("Unnormalized time diffs average: {}".format(time_diffs.mean()))
    print("Normalized time diffs average: {}".format(final_time_diffs.mean()))
    influx = checkouts - checkins
    return influx, final_time_diffs, supply, max_time_diff, min_time_diff

number_of_days = 4500
dataset = np.zeros((number_of_days, 2 * len(paths)), dtype='float64')

if __name__ == '__main__':
    global_max_time_diff = 0
    global_max_supply = 0
    global_min_time_diff = float('inf')  # was 0, which no positive min could beat
    for i, item in enumerate(paths):
        print("Logs for title {}".format(i))
        influx, final_time_diffs, supply, max_time_diff, min_time_diff = data_extractor(item, number_of_days)
        dataset[:, 2 * i] = influx
        dataset[:, 2 * i + 1] = final_time_diffs
        ################################################################
        if max_time_diff > global_max_time_diff:
            global_max_time_diff = max_time_diff
        if supply > global_max_supply:
            global_max_supply = supply
        if min_time_diff < global_min_time_diff:  # was compared against the max
            global_min_time_diff = min_time_diff
    ################################################################
    print("Change the global_max_supply variable in the Mat259_3D pde file to: {}".format(global_max_supply))
    print("Change the global_max_time_diff variable in the Mat259_3D pde file to: {}".format(global_max_time_diff))
    print("Change the global_min_time_diff variable in the Mat259_3D pde file to: {}".format(global_min_time_diff))
    np.savetxt("dataset.csv", dataset, delimiter=',')
Processing Code
v0.2: Initial system of particles constrained to a box
v0.3: changed from particles bouncing to a flowfield
v0.4: noise was too big, so I turned it down
v0.5: I changed from a velocity field to a flowfield
v0.6: now I have the 4 systems set up
v0.7: now each individual system is dependent on the data
v0.8: increased max speed from 10 to 40, so particles can better resist the field
v0.9: added labels, colors, buttons and a time lapse, but buttons are not functional yet.
v1.0: added background, framerate indicator, and changed colors and fonts of buttons. Buttons are now functional.
Analysis
Although things look somewhat nice, the ability to distinguish high demand with low supply (which would indicate that the library should perhaps buy more books in that genre) is lacking. One reason is my choice of dataset: because I chose categories rather than specific titles, the current supply stayed relatively stable and was always above 90 percent. Also, because I imposed a max speed constraint, accelerations could not accumulate, which flattened the behavior of the particles. This is somewhat mitigated by the color encoding.