"How 11 Year Olds Get Girls" or Advice in Times of the Internet
For my project, I will not work with the New York Times API. Instead, I will create a visualization that correlates the Seattle Public Library data with a rather interesting historical data source I came across.
Concept
In 1998, in an effort to regulate Internet pornography, the so-called Child Online Protection Act (COPA) was passed (http://en.wikipedia.org/wiki/Child_Onli ... ection_Act). In 2006, the Department of Justice, seeking to enforce the law, demanded access to the search logs of all major Internet search engines, including Google (Gonzales vs. Google, 2006: http://news.findlaw.com/nytimes/docs/go ... 1806m.html). Google took legal action, which eventually (2009), after a long journey through the courts, led to the act being struck down as unconstitutional. Several other search engines complied and delivered the requested "multi-stage random sample of one million URL’s" from their database, and a computer file with "the text of each search string entered onto the search engine over a one-week period (absent any information identifying the person who entered such query)."
One search engine, though, AOL, went a step further and decided not only to make its data available, but to make it publicly available. So, according to Wikipedia, "on August 4, 2006, AOL Research, headed by Dr. Abdur Chowdhury, released a compressed text file on one of its websites containing twenty million search keywords for over 650,000 users over a 3-month period, intended for research purposes. AOL deleted the search data on their site by the 7th, but not before it had been mirrored and distributed on the Internet" (http://en.wikipedia.org/wiki/AOL_search_data_leak). This is the text of the readme file that came with the data from AOL:
500k User Session Collection
----------------------------------------------
This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY.
Any application of this collection for commercial purposes is STRICTLY PROHIBITED.
Brief description:
This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.
The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.
The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
AnonID - an anonymous user ID number.
Query - the query issued by the user, case shifted with
most punctuation removed.
QueryTime - the time at which the query was submitted for search.
ItemRank - if the user clicked on a search result, the rank of the
item on which they clicked is listed.
ClickURL - if the user clicked on a search result, the domain portion of
the URL in the clicked result is listed.
Each line in the data represents one of two types of events:
1. A query that was NOT followed by the user clicking on a result item.
2. A click through on an item in the result list returned from a query.
In the first case (query only) there is data in only the first three columns/fields -- namely AnonID, Query, and QueryTime (see above).
In the second case (click through), there is data in all five columns. For click through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next "page" or results for some query, this appears as a subsequent identical query with a later time stamp.
CAVEAT EMPTOR -- SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.
Basic Collection Statistics
Dates:
01 March, 2006 - 31 May, 2006
Normalized queries:
36,389,567 lines of data
21,011,340 instances of new queries (w/ or w/o click-through)
7,887,022 requests for "next page" of results
19,442,629 user click-through events
16,946,938 queries w/o user click-through
10,154,742 unique (normalized) queries
657,426 unique user ID's
Please reference the following publication when using this collection:
G. Pass, A. Chowdhury, C. Torgeson, "A Picture of Search" The First
International Conference on Scalable Information Systems, Hong Kong, June,
2006.
Copyright (2006) AOL
The ensuing public outrage, which AOL had somehow not anticipated, focused mainly on the poorly implemented anonymization of the data: while users were assigned random IDs, their search histories still revealed so much information that they became identifiable again. Coincidentally, it was the New York Times that first publicly revealed the identity of a user, no. 4417749, with her consent, on August 9, 2006 (http://www.nytimes.com/2006/08/09/techn ... d=all&_r=0). At AOL, the people responsible for the leak were let go and a written apology was published (http://news.cnet.com/AOL-apologizes-for ... 02793.html).
For the rest of 2006, many people worked with the AOL data, identifying, among others, a potential murderer (https://plentyoffish.wordpress.com/2006 ... mit-murder). Here is a little gem I found, a love poem by user 2708:
'2708', 'how to drive someone crazy', '2006-03-03 17:09:27'
'2708', 'how to say goodbye hurtfully', '2006-03-05 20:45:09'
'2708', 'how to send junk mail to someone else', '2006-03-18 16:22:04'
'2708', 'how to permantlly delete information from you', '2006-03-19 14:35:57'
'2708', 'how to send email anonymously', '2006-03-21 19:43:35'
'2708', 'how to humiliate someone', '2006-03-03 07:47:13'
'2708', 'how to get revenge on an old lover', '2006-03-03 17:15:12'
'2708', 'how to report child neglect in the state of n', '2006-03-11 17:01:51'
'2708', 'how to send things anounmously', '2006-03-19 12:47:27'
'2708', 'how to permanently delete information from yo', '2006-03-19 14:36:13'
'2708', 'how to stop loving someone', '2006-03-21 23:16:51'
'2708', 'how humiliate someone', '2006-03-03 16:56:05'
'2708', 'how to really make someone hurt for the pain ', '2006-03-03 17:22:50'
'2708', 'how to get back on an ex lover', '2006-03-18 15:53:45'
'2708', 'how to send things anonymously', '2006-03-19 12:47:33'
'2708', 'how to send alot of junk mail to someonne wit', '2006-03-21 19:41:29'
'2708', 'how to make someone misreable', '2006-03-03 17:05:12'
'2708', 'how to make an old lover suffer', '2006-03-05 15:02:13'
'2708', 'how to move on from a broken heart', '2006-03-18 16:01:28'
'2708', 'how to ruin someone's credit', '2006-03-19 13:19:47'
A class action lawsuit filed in September 2006, however, put an end to this exploration. Where possible, data sources were removed; the few that remained disappeared over time (both from web servers and from file-sharing networks).
With the help of the Wayback Machine (https://archive.org/web/web.php), however, I was able to retrieve a copy of the data, put it into a MySQL database, and start exploring it. I was immediately captivated by the strange poetry of the search queries themselves and the users portrayed by them. This is why, for my data correlation project, I will correlate the AOL search log data with the Seattle Public Library data from the same time period, March 1 - May 31, 2006. As a working concept, I will look at "advice" given by the Internet and by the books in the library by simply querying both data sources for the word "how", hence the working title of my project.
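To make the "how" query concrete, here is a minimal sketch in Python rather than the MySQL setup I actually use. It assumes the log files are tab-separated, start with a header row, and follow the field order given in the readme; the delimiter, the header row, the names LogEvent, parse_events and how_queries, and the click-through row in SAMPLE are all illustrative assumptions.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Optional


class LogEvent(NamedTuple):
    """One line of the log, in the field order given in the readme."""
    anon_id: str
    query: str
    query_time: str
    item_rank: Optional[int]  # None for query-only events
    click_url: Optional[str]  # None for query-only events


# Match "how" as a whole word, so that "show" does not count.
HOW = re.compile(r"\bhow\b", re.IGNORECASE)


def parse_events(lines: Iterable[str]) -> Iterator[LogEvent]:
    """Parse tab-separated log lines (assumed layout) into LogEvent records."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip malformed lines and the assumed header row.
        if len(fields) < 3 or fields[0] == "AnonID":
            continue
        # Query-only events carry data in the first three fields only.
        rank = int(fields[3]) if len(fields) > 3 and fields[3] else None
        url = fields[4] if len(fields) > 4 and fields[4] else None
        yield LogEvent(fields[0], fields[1], fields[2], rank, url)


def how_queries(events: Iterable[LogEvent]) -> List[LogEvent]:
    """Keep only events whose query contains the word 'how'."""
    return [event for event in events if HOW.search(event.query)]


# Two queries quoted from user 2708 plus one made-up click-through row.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "2708\thow to stop loving someone\t2006-03-21 23:16:51\t\t\n"
    "2708\thow to send email anonymously\t2006-03-21 19:43:35\n"
    "1\texample query\t2006-03-01 00:00:00\t1\thttp://www.example.com\n"
)

if __name__ == "__main__":
    for event in how_queries(parse_events(SAMPLE.splitlines())):
        print(event.anon_id, event.query_time, event.query)
```

The same whole-word filter could be applied to the book titles in the library data, which is why the matching is kept separate from the parsing.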
Visualization
For the visualization, I am currently working in the style of the last assignment (the format is DIN A4). Right now, all I have is a very rough draft that is missing a lot of features but at least has all the data. I spent way too much time implementing "3D picking" to reduce the amount of information on the screen. Anyway, here is a first look at the sketch (including a wonderful poem by user 670808 and the corresponding book titles from the Seattle Public Library):