DATA: Breaking Down A Movie

This semester I worked with Miguel Bermudez on breaking down a movie script to “auto-magically generate” movie trailers for any movie in any genre.

 

To do this, we created a data schema with the “Character” as its most important component.  Technically, we used Python to break down the script based on dialogue lines.  All of the code (and further description) can be found HERE.

 

We also used sentiment analysis to locate lines of dialogue that were either positive or negative.  To further explore the distribution of positive vs. negative, we used R to plot the occurrence of positive and negative lines throughout the timeline of a film.  The plot is pictured below.
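A minimal sketch of that tagging pass, assuming an AFINN-style lexicon (word scores from -5 to 5; only a handful of illustrative entries are shown here). Pairs of (line index, score) like these could then be exported for plotting in R:

# Minimal sketch: score each dialogue line with an AFINN-style lexicon.
# Only a few illustrative entries are shown; a real list has thousands
# of words scored from -5 to 5.
import re

AFINN_SAMPLE = {
    "love": 3, "beautiful": 3, "happy": 3, "sexy": 2,
    "trouble": -2, "hate": -3, "dead": -3, "afraid": -2,
}

def score_line(line):
    # sum the scores of every known word in the line
    words = re.findall(r"[a-z']+", line.lower())
    return sum(AFINN_SAMPLE.get(w, 0) for w in words)

dialogue = ["I think they make you look sexy", "All my plants are dead!"]
for i, line in enumerate(dialogue):
    s = score_line(line)
    label = "positive" if s > 0 else "negative" if s < 0 else "neutral"
    print("%d %d %s" % (i, s, label))   # (index, score) pairs for the R plot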

 

 

We’d like to continue working on this project and form an “emotional” model for the timeline of different genres of film.  We hope this will better inform our ability to break down films into their specific parts and more accurately isolate the dialogue most relevant for a trailer of any film in any particular genre.

RWET: Final – Movie Trailer Mashups

For my final project in “Reading Writing Electronic Text” I continued working on the Trailer Mashup concept in collaboration with the amazingly talented Miguel Bermudez. Our ultimate goal was to have a website where a user could choose a movie from a given list, choose a movie genre of interest, click one button, and “auto-magically” generate a movie trailer for the chosen film in that genre.

 

As an initial use case, we worked with the film “Jaws” and attempted to auto-generate a trailer for this movie in the genre of “Romantic Comedy.”  We devised a “template” for a Romantic Comedy trailer that would follow this structure:

 

VOICE OVER LINE 1 – Introduce World/Location

We would use Stanford’s NER library to determine the most frequently occurring location in the film’s script (a sketch of this step follows the template below).

 

MOVIE DIALOGUE LINE 1 – A line about the chosen location

 

VOICE OVER LINE 2 – Introduce Main Male & Female Characters

 

MOVIE DIALOGUE LINE 2 – Line from Male Lead

MOVIE DIALOGUE LINE 3 – Line from Female Lead

 

VOICE OVER LINE 3 – Establish Current Situation Between The Leads

 

MOVIE DIALOGUE LINES 4 & 5 – Interaction of Male Lead to Female Lead

MOVIE DIALOGUE LINES 6 & 7 – Interaction of Female Lead to Male Lead

 

VOICE OVER LINE 4 – Convey A Positive Tone

 

MOVIE DIALOGUE LINES 8 & 9 – Using sentiment analysis, we would tag each line of dialogue with an emotional score ranging from -5 to 5.  A random sampling of lines scoring above 2 would be chosen for the positive lines needed in this section.

 

VOICE OVER LINE 5 – Introduce Primary Conflict

 

MOVIE DIALOGUE LINES 10 & 11 – Again using sentiment analysis, we would choose a random sampling of lines that had a score below -2.

 

VOICE OVER LINE 6 – DATE/TIME

Determine the most frequently occurring date/time in the film’s script, then randomly choose a date/time term within that time frame for the VO.

 

MOVIE DIALOGUE LINES 12-15 – Exclamatory moments

 

VOICE OVER LINE 7 – A CALL TO ACTION

 

MOVIE DIALOGUE LINES 16 & 17 – “Action” dialogue

 

VOICE OVER LINE 8 – A WRAP-UP

 

MOVIE DIALOGUE LINES 18 & 19 – “Meaningful” dialogue

 

VOICE OVER LINE 9 – MOVIE TITLE

 

MOVIE DIALOGUE LINE 20 – A “button” moment – could be comical, sexy, cliffhanger, etc.
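As a rough illustration of the NER step referenced in Voice Over Line 1, here is a hedged sketch using NLTK’s wrapper for the Stanford NER tagger. The wrapper class name has varied across NLTK versions, and the model/jar paths and script filename are placeholders:

# Hedged sketch: count LOCATION entities in the script and keep the
# most frequent one. Paths below are placeholders, not real install paths.
from collections import Counter
from nltk.tag.stanford import StanfordNERTagger
from nltk.tokenize import word_tokenize

tagger = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",  # placeholder model path
    "stanford-ner.jar")                       # placeholder jar path

tokens = word_tokenize(open("jaws.txt").read())
locations = Counter(tok for tok, label in tagger.tag(tokens)
                    if label == "LOCATION")
print(locations.most_common(1))   # e.g. [('Amity', <count>)]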

 

 

To achieve this type of output, we realized we’d need (1) a systematic way to break down and organize the text of the film to pull appropriate movie dialogue lines, and (2) a corpus of voice-over lines from which to pull/generate/create the Voice Over Lines.

 

In terms of the movie script, we loaded the “Jaws” script into Final Draft and were able to export an XML file that tagged all of the script’s content based on scene, character, direction, and dialogue.  Using this as a starting point, we established a data hierarchy with the “Character” being the most important component.  The model looks like this:

 

Script
|
Characters
|
Scenes
|
Dialogue

 

We figured that if we could determine which characters were speaking most frequently in a given film, we could discern the important characters, assign their roles in the trailer based on the selected genre, and then generate our output based on the assigned roles.
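A minimal sketch of that ranking step, assuming the Final Draft export is XML made of Paragraph elements with a Type attribute (Character, Dialogue, etc.) wrapping Text elements; the layout and the .fdx filename are assumptions, not a verified spec:

# Count dialogue lines per character from the Final Draft XML export.
# The Paragraph/Type/Text layout is an assumption about the .fdx format.
from collections import Counter
import xml.etree.ElementTree as ET

tree = ET.parse("jaws.fdx")
counts = Counter()
current_character = None

for para in tree.iter("Paragraph"):
    text = "".join(t.text or "" for t in para.iter("Text")).strip()
    if para.get("Type") == "Character":
        current_character = text
    elif para.get("Type") == "Dialogue" and current_character:
        counts[current_character] += 1

# the top names become the leads; genre roles get assigned to them
print(counts.most_common(5))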

 

We then tied everything together by writing a series of methods that allow us to access the data through the characters.  For the Voice Over Lines, we built our own corpus of lines, through a variety of means, for each particular line.  We incorporated Adam Parrish’s “Markov.py” code as a starting point for the algorithmically generated VO.
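Adam’s Markov.py itself isn’t reproduced here; as a stand-in, here is a minimal word-level bigram chain over an assumed corpus file of voice-over lines:

# Minimal bigram Markov sketch (a stand-in, not Adam Parrish's Markov.py).
# "vo_corpus.txt" is an assumed filename for the voice-over line corpus.
import random

words = open("vo_corpus.txt").read().split()
transitions = {}
for a, b in zip(words, words[1:]):
    transitions.setdefault(a, []).append(b)

current = random.choice(words)
output = [current]
for _ in range(10):                  # generate up to 10 more words
    nexts = transitions.get(current)
    if not nexts:
        break
    current = random.choice(nexts)
    output.append(current)
print(" ".join(output))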

 

All of the code has been posted HERE on Miguel’s website.

 

The output our code currently produces is interesting, but not entirely coherent.  With some “subjective finesse” we compiled the following text version of “Jaws” as a Romantic Comedy.

 

 

IN A ROMANTIC TOWN OF PASSION

All I’m saying is that Amity is a summer town   (Vaughn)

 

TWO TOTAL SOULS

I’m responsible for public safety around here   (Brody)

I saw a show with sea otters, and a big turtle   (Ellen)

 

SUDDENLY FOUND DESTINED TO MEET

How do I look? Older, huh?   (Brody)

I think they make you look sexy   (Ellen)

You’re very tight, y’know?  (Ellen)

I know   (Brody)

 

THEY THOUGHT EVERYTHING WAS IN THEY NEVER THOUGHT LIFE COULD GO WRONG

Go on and help yourself to whatever you need, Chief   (Lynwood)

Here’s to swimmin’ with bowlegged women   (Quint)

 

BUT ONE ARRIVES WEEKEND EVERYTHING CHANGED

Y’know I used to hate the water  (Brody)

He’s got this childhood thing, there’s a clinical word for it   (Ellen)

I wouldn’t put it that way.  But I love sharks  (Hooper)

 

THIS MEMORIAL DAY

Beautiful day, Chief!   (Meadows)

Three’ll do it! He’s havin’ trouble with two!   (Quint)

Let him do it!  Go-Go-Go-Go-Go!   (Charlie)

 

INSPIRED BY A TRUE STORY ABOUT LOVE STORIES ARE MEANING COMEDY OF LOVES STRANGEST PARTS

What’ll I tell the kids?   (Ellen)

Tell ‘em i went fishin’!   (Brody)

 

WILL SET YOUR HEART LEAD TO LIFETIME

Right there. Mary Ellen Moffit broke my heart   (Hooper)

Love a cup of tea with lemon   (Brody)

 

JAWS

“C’mere and give daddy a kiss” (Brody)

 

 

We also constructed a video version that we hope better demonstrates what we’re ultimately trying to achieve.  For the video version, the visuals for the movie dialogue are all the sync movie clips tied to those dialogue lines.  The visuals occurring during the Voice Over Lines were determined based on (1) proximity to the subsequent movie dialogue combined with (2) the text description of the visuals, which is included in the movie script/XML file as scene description.   And the music is “Parentheses” by The Blow.

 

 

While both the final text and video versions are a combination of computer output and human curation, a high-quality computer-only generated system is definitely within reach.  Plus, we got the sweet URL trailermashups.com, so stay tuned!

 

RWET: Rom Com Phrases

As a follow up to my “Reading & Writing Electronic Text” mid-term project (Trailer Mashups), I used Adam Parrish’s N-Gram Python sketch to explore common language phrases occurring in Romantic Comedy scripts.  My corpus of movie scripts included “He’s Just Not That Into You”, “Love Actually”, “Sex And The City”, “Sleepless In Seattle”, and “When Harry Met Sally.”

 

The analysis compared each unique 3-word phrase from each script and produced a list of the phrases that occurred in all of the movie scripts.  Note that I was not checking for frequency per movie script, so this analysis only tells us which phrases showed up at least once in all 5 of the movie scripts.

 

Common 3-word phrases:

 

(‘to’, ‘meet’, ‘you.’)

(‘going’, ‘to’, ‘get’)

(‘you’, ‘have’, ‘a’)

(‘you’, ‘going’, ‘to’)

(‘I’, “don’t”, ‘know’)

(‘not’, ‘going’, ‘to’)

(‘one’, ‘of’, ‘the’)

(‘have’, ‘to’, ‘get’)

(‘know’, ‘what’, ‘I’)

(‘on’, ‘the’, ‘phone’)

(‘I’, “can’t”, ‘believe’)

(‘Do’, ‘you’, ‘want’)

(‘What’, ‘are’, ‘you’)

(‘I’, ‘have’, ‘to’)

(‘in’, ‘love’, ‘with’)

(“I’m”, ‘going’, ‘to’)

(‘it’, ‘in’, ‘the’)

(‘but’, ‘I’, “don’t”)

(‘you’, ‘have’, ‘to’)

(‘going’, ‘to’, ‘be’)

(‘want’, ‘me’, ‘to’)

(‘is’, ‘going’, ‘to’)

(‘to’, ‘talk’, ‘about’)

(‘I’, ‘have’, ‘a’)

(‘Do’, ‘you’, ‘think’)

(‘you’, ‘like’, ‘to’)

(‘I’, “don’t”, ‘think’)

(‘go’, ‘to’, ‘the’)

(‘out’, ‘of’, ‘my’)

(‘I’, ‘want’, ‘to’)

(‘a’, ‘lot’, ‘of’)

(‘to’, ‘talk’, ‘to’)

(‘How’, ‘do’, ‘you’)

(‘do’, ‘you’, ‘think’)

(‘you’, ‘want’, ‘to’)

(‘I’, “don’t”, ‘want’)

(“don’t”, ‘want’, ‘to’)

(‘I’, ‘thought’, ‘you’)

 

Common 4-word phrase:

 

(‘I’, “don’t”, ‘want’, ‘to’)

 

And here is the Python code (for the 3-word analysis):
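(What follows is a reconstruction sketched from the description above rather than the original listing; it assumes each script is a plain-text file passed on the command line.)

# Collect each script's unique 3-word phrases, then intersect the sets.
# Usage (illustrative): python ngrams3.py script1.txt ... script5.txt
import sys

def trigrams(path):
    words = open(path).read().split()
    return set(zip(words, words[1:], words[2:]))

common = trigrams(sys.argv[1])
for path in sys.argv[2:]:
    common &= trigrams(path)

for phrase in common:
    print(phrase)   # prints tuples like ('I', "don't", 'know')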

 

RWET: Trailer Mashups – First Attempt at the Auto-Trailer

For the RWET midterm, I messed around with the possibility of creating algorithmically-generated movie trailers.  Having spent a few years working as a movie trailer editor in Los Angeles, I’ve wondered for a while what it would be like if you could press a single button and auto-magically create an instant yet original film trailer.  So I took the first steps in what I hope will evolve into my final project for the class by trying to automatically generate complete trailer scripts.

 

I started by sourcing a number of “discarded” trailer-copy scripts from a range of films. (Copy scripts are essentially what the voice-over and/or graphics would say in a trailer.  They provide the core architecture for the piece and they usually don’t include specific film dialogue.)   I tried to include scripts from a range of genres such as action, romantic comedy, drama, and horror.  Overall, I was able to procure over 150 unused copy scripts of varying length for roughly two dozen films.

I decided to constrain the copy scripts I would use to versions that had six sentences of copy, and planned to add the film’s title as the seventh and final line of copy.  I also tried to select scripts that were movie-agnostic, in that they could not easily be identified with a specific film, character, actor, director, etc.  At the same time, I wanted each script to be applicable to any film within its specific genre.  I ended up with close to 90 total 6-line copy scripts.

 

In reviewing the scripts, I noticed some interesting patterns in their structure and decided a decent initial approach would be to break each script up into 3 pairs of copy lines: lines 1 & 2, 3 & 4, and 5 & 6.  Those pairs were often logically and grammatically connected, which I hoped would lend a hint of coherency to the piece.  In terms of code, the plan was to randomly choose three pairs of lines (one 1-2 pair, one 3-4 pair, and one 5-6 pair) from all the sets of pairs and then use the selected six lines to form the copy script for the “auto-trailer.”

 

The second half of the project involved parsing an actual movie script.  For this initial pass, I decided to work with only one film, “The Big Lebowski.”  It’s a relatively quotable film that I felt was appropriate for the project.  Grabbing the script off the internetz, I copied it into a .txt file and used that as my source file.  I first attempted to exclude any text from the script that was not dialogue, but I had some trouble parsing those lines.  So everything in the movie script, including direction and set-up, was included as potential lines of dialogue for the trailer that would eventually be placed in between the copy lines.

 

I then thought about a general approach to constructing/editing trailers and how there is often a traditional “3-act structure” employed in the creation of the first pass of a film trailer.  Speaking very generally, this traditional structure would look something like the following:

 

Act 1

  • I. Start with an opening hook
  • II. Introduce the world and the characters

Act 2

  • III. Reveal the main conflict
  • IV. Elaborate on why this is mind-blowing

Act 3

  • V. Transition to backend montage – highlights, production value, stars, accolades
  • VI. Offer an emotional connection leading into the Main Title
  • VII. Finish with a button (anything from a light-hearted moment to a cliff-hanger)

So, for the auto-trailer, it would begin with a random Trailer Copy Line 1 of 2, followed by a Movie Script Line, followed by the Trailer Copy Line 2 of 2, followed by a Movie Script Line, followed by a random Trailer Copy Line 3 of 4, etc…  In terms of choosing the movie script lines, I went with the following structure, which I hoped would complement the traditional 3-act structure outlined above:

 

Movie Script Line 1: Starts with “I”

Movie Script Line 2: Starts with “He”, “She”, or “They”

Movie Script Line 3: Starts with “Wh[aeoy]…” (i.e. What, When, Where, Who, Why)

Movie Script Line 4: Starts with “This”

Movie Script Line 5: Contains at least one of the following words – “life”, “live”, “death”, “dead”

Movie Script Line 6: Contains at least one emotional word (see the list of emotional words in the Python code below)

Movie Script Line 7: Contains a curse word

 

 

Here are a few auto-trailers created by the script:

 

 

Auto Trailer 1

 

THIS THANKSGIVING

i want a fucking lawyer, man.  i

ITS ALL ABOUT

he scowls.

TO HIM ITS A WAY OF LIFE

what?

THIS FALL

this is our concern, dude.

THERES ALWAYS SOMEONE

all my plants are dead!

WHO MAKES IT WORTH THE TRIP

has the whole world gone crazy?  am

THE_BIG_LEBOWSKI

does the pope shit in the woods?

 

 

 Auto Trailer 2

 

THIS SUMMER

i am the walrus.

THE TRUTH ABOUT LOVE

he fixes the cable?

OF LOVE COURAGE AND FRIENDSHIP

what’s that, walter?

EVEN IN OUR DARKEST HOURS

this will not stand, ya know, this

THIS SUMMER

all my plants are dead!

PREPARE FOR THE WORST

you happy, you crazy fuck?

THE_BIG_LEBOWSKI

fuck you.  fuck the three of you.

 

 

 

 Auto Trailer 3

 

IN AN INVESTIGATION

i don’t see any connection to vietnam,

WHERE EVERYONE HAS A SECRET

they hung up, walter!  you fucked it

AFFAIRS COME AND GO

what the fuck is he talking about?

BUT IN THE INTELLIGENCE COMMUNITY

this was a valued, uh.

WHEN LIFE DOESNT GO AS EXPECTED

all my plants are dead!

TAKE THE UNEXPECTED ROUTE

well sir, it’s this rug i have, really

THE_BIG_LEBOWSKI

shit dude, i’m sorry–

 

 

 Auto Trailer 4

 

IN THE MOST DESOLATE PLACE ON EARTH

i the only one here who gives a shit

YOU CAN DIE FROM MANY THINGS

he gives walter a weaker shove.  walter seems dazed, then

SOMETIMES FIRST IMPRESSIONS

who’s in pyjamas, walter?

ARE MEANT TO HAVE SECOND CHANCES

this is its simplicity. if the plan

THE DAY SHE THOUGHT WOULD NEVER COME

break only if it’s a matter of life

HAS BECOME A NIGHTMARE THAT WILL NEVER END

nuts.

THE_BIG_LEBOWSKI

ah fuck it.

 

 

And here is the python code:

import sys
import random
import re

source1 = sys.argv[1]  # trailer copy scripts
source2 = sys.argv[2]  # movie script

# holds the trailer copy scripts, split on commas
trailerScripts_lines = list()

# holds the copy-line pairs
one_two_combos = list()
three_four_combos = list()
five_six_combos = list()

for line in open(source1):
    line = line.strip()
    phrases = line.split(",")
    trailerScripts_lines.append(phrases)

for line in trailerScripts_lines:
    if len(line) >= 6:
        one_two_combos.append((line[0], line[1]))
        three_four_combos.append((line[2], line[3]))
        five_six_combos.append((line[4], line[5]))

movieScript_lines = list()

this_line = list()
i_line = list()
wh_line = list()
heshe_line = list()
lifedeath_line = list()
emotional_line = list()
curseWord_line = list()

for line in open(source2):
    line = line.strip()
    line = line.lower()
    movieScript_lines.append(line)

for line in movieScript_lines:

    # "I" lines
    if re.search(r"^i ", line):
        i_line.append(line)

    # he/she/they lines
    if re.search(r"^(he|she|they) ", line):
        heshe_line.append(line)

    # wh... lines (what, when, where, who, why)
    if re.search(r"^wh[aeoy]", line):
        wh_line.append(line)

    # "this" lines
    if re.search(r"^this ", line):
        this_line.append(line)

    # life/death lines
    if re.search(r"\b(death|dead|li[vf]e)\b", line):
        lifedeath_line.append(line)

    # emotional lines
    if re.search(r"\b(insane|nuts|crazy|wow|bizarre|really|whoa|stupid)\b", line) or re.search(r"surprise", line):
        emotional_line.append(line)

    # curse word lines
    if re.search(r"\b(fuck|shit|bullshit|shoot)\b", line):
        curseWord_line.append(line)

def constructMovieTrailer():

    # choose a random pair from each copy-line batch
    choice1 = random.choice(one_two_combos)
    choice2 = random.choice(three_four_combos)
    choice3 = random.choice(five_six_combos)

    line1 = choice1[0]
    line2 = random.choice(i_line)
    line3 = choice1[1]
    line4 = random.choice(heshe_line)
    line5 = choice2[0]
    line6 = random.choice(wh_line)
    line7 = choice2[1]
    line8 = random.choice(this_line)
    line9 = choice3[0]
    line10 = random.choice(lifedeath_line)
    line11 = choice3[1]
    line12 = random.choice(emotional_line)
    line13 = source2[:-4]  # movie title, taken from the script filename
    line14 = random.choice(curseWord_line)

    for l in (line1, line2, line3, line4, line5, line6, line7,
              line8, line9, line10, line11, line12, line13, line14):
        print l

constructMovieTrailer()

RWET: “How To Ruin Four Articles”

For the second assignment in Reading & Writing Electronic Text, I focused on the recent barrage of articles/comments/discussion/attention surrounding the NY Times Magazine article “How Companies Learn Your Secrets” by Charles Duhigg. It was an interesting piece that explored how large retail companies, specifically Target, track customer behavior, use that data to better understand their customers, and then turn those results into actionable business/marketing strategies.

 

A journalist from Forbes, Kashmir Hill, blogged about the piece, but used the title “How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did” for her piece. Hill’s title was based on a key aspect of Duhigg’s original piece, and because of its more provocative title copy, Hill’s piece started getting more web traffic than Duhigg’s original. A day or two later, this led a third blogger, Nick O’Neill, to write a piece titled “How Forbes Stole A New York Times Article and Got All the Traffic.” That piece started to get some decent web traction, and so, after a few days, a fourth blogger, Jim Romenesko, wrote a piece titled “NYT Reporter Defends Forbes Writer Accused of Stealing His Work,” in which Romenesko interviewed all parties involved (Duhigg, Hill, and O’Neill) to let everyone say their piece about what happened.  And, of course, this piece got some decent web traffic too.

 

So, as a fifth layer of meta-analysis, I took all four of these articles and wrote a Python script that mashes them up into a combo piece built from sentences containing the longest words that recur across all four articles.  More specifically, here is what the script does:

 

  1. creates an independent set of each article’s unique words
  2. compares the four sets to determine which words occur in all four articles (concurrence value = 4)
  3. takes this new set of shared words and keeps only the longest ones (in this case, words longer than 5 letters)
  4. which leaves a set of six words: “before”, “companies”, “didn’t”, “father”, “started”, and “target”
  5. then, from each article, grabs the sentences that contain each of those words
  6. randomly picks one sentence per word per article, i.e. six sentences per article, each with one of the unique words
  7. then randomly shuffles those sentences to create a final 24-sentence output
Hope that makes some sense.

 

The following are two versions of the script’s output:

 

didn’t

before

started

target

companies

father

i ran with that anecdote and the sexy privacy issue duhigg dug up — target’s use of predictive analytics — distilling that from the larger piece for my privacy-interested audience.

the manager didn’t have any idea what the man was talking about.

with all the talk these days about the data grab most companies are engaged in, target’s collection and analysis seem as expected as its customers’ babies.

nothing illustrates it better than this recent target article.

pole didn’t answer my e-mails or phone calls when i visited minneapolis.

the reality is that in the world of newsfeeds and streams, titles matter more than ever before.

on the phone, though, the father was somewhat abashed.

“then we started mixing in all these ads for things we knew pregnant women would never buy, so the baby ads looked random.

nick o’neill started a little kerfuffle over the weekend with his post, “how forbes stole a new york times article and got all the traffic.”

on the phone, though, the father was somewhat abashed.

duhigg shares an anecdote — so good that it sounds made up — that conveys how eerily accurate the targeting is.

kashmir hill, a writer at forbes, realized this and quickly developed a condensed version of the article with a far more powerful title: “how target figured out a teen girl was pregnant before her father did“.

but as a writer who has covered the privacy beat for four years, what leaped out at me as the gold mine of the piece was the anecdote about target data-mining its way into customers’ wombs so effectively that it picked up on a teen’s pregnancy before her father did.

if they could entice those women or their husbands to visit target and buy baby-related products, the company’s cue-routine-reward calculators could kick in and start pushing them to buy groceries, bathing suits, toys and clothing, as well.

andrew pole had just started working as a statistician for target in 2002, when two colleagues from the marketing department stopped by his desk to ask an odd question: “if we wanted to figure out if a customer is pregnant, even if she didn’t want us to know, can you do that?

target, for example, has figured out how to data-mine its way into your womb, to figure out whether you have a baby on the way long before you need to start buying diapers.

it was an extremely long article which discussed how large companies like walmart and target collect data about your individual consumption patters to figure out how to most efficiently make you happy.

but as a writer who has covered the privacy beat for four years, what leaped out at me as the gold mine of the piece was the anecdote about target data-mining its way into customers’ wombs so effectively that it picked up on a teen’s pregnancy before her father did.

it was a great piece but there was one problem: it didn’t have the title it deserved.

or because your kids have started eating?

if companies can identify pregnant shoppers, they can earn millions.

for one, it’s too generic a title in the age of the wall street journal’s ‘what they know’ series, which has explored over and over again how companies grab data about us in ways we wouldn’t expect.

we’ll be sending you coupons for things you want before you even know you want them.”

i suspect i drove a ton of traffic to the new york times that they wouldn’t have otherwise gotten because they hadn’t sold their story quite as well as i did and didn’t create a short version of it that was easy to share and digest online.

 

 

didn’t

before

started

target

companies

father

 

target has a baby-shower registry, and pole started there, observing how shopping habits changed as a woman approached her due date, which women on the registry had willingly disclosed.

o’neill is right about the new york times’ headline –”how companies learn your secrets” — not resonating online.

“then we started mixing in all these ads for things we knew pregnant women would never buy, so the baby ads looked random.

even if you’ve fully stalked the person on facebook and google beforehand, pretend like you know less than you do so as not to creep the person out.

kashmir hill, a writer at forbes, realized this and quickly developed a condensed version of the article with a far more powerful title: “how target figured out a teen girl was pregnant before her father did“.

what’s odd is that the editors clearly knew that target knowing a customer is pregnant is a juicy story as they put it in the lede.

but as a writer who has covered the privacy beat for four years, what leaped out at me as the gold mine of the piece was the anecdote about target data-mining its way into customers’ wombs so effectively that it picked up on a teen’s pregnancy before her father did.

with all the talk these days about the data grab most companies are engaged in, target’s collection and analysis seem as expected as its customers’ babies.

but as a writer who has covered the privacy beat for four years, what leaped out at me as the gold mine of the piece was the anecdote about target data-mining its way into customers’ wombs so effectively that it picked up on a teen’s pregnancy before her father did.

if companies can identify pregnant shoppers, they can earn millions.

andrew pole had just started working as a statistician for target in 2002, when two colleagues from the marketing department stopped by his desk to ask an odd question: “if we wanted to figure out if a customer is pregnant, even if she didn’t want us to know, can you do that?

nick o’neill started a little kerfuffle over the weekend with his post, “how forbes stole a new york times article and got all the traffic.”

the manager didn’t have any idea what the man was talking about.

what graybiel and her colleagues found was that, as the ability to navigate the maze became habitual, there were two spikes in the rats’ brain activity — once at the beginning of the maze, when the rat heard the click right before the barrier slid away, and once at the end, when the rat found the chocolate.

what target discovered fairly quickly is that it creeped people out that the company knew about their pregnancies in advance.

on the phone, though, the father was somewhat abashed.

her house was clean, though not compulsively tidy, and didn’t appear to have any odor problems; there were no pets or smokers.

it was a great piece but there was one problem: it didn’t have the title it deserved.

i suspect i drove a ton of traffic to the new york times that they wouldn’t have otherwise gotten because they hadn’t sold their story quite as well as i did and didn’t create a short version of it that was easy to share and digest online.

it all began friday when the new york times published an article “how companies learn your secrets“.

kashmir hill, a writer at forbes, realized this and quickly developed a condensed version of the article with a far more powerful title: “how target figured out a teen girl was pregnant before her father did“.

on the phone, though, the father was somewhat abashed.

charles duhigg’s piece is a masterful look at how target gathers information about its customers and mines it to keep them loyal and better market to them.

it was clear to target’s computers that i was on a business trip.

 

 

 

And here is the Python code.  It’s definitely not optimized, but I think it’s working right, so I guess that’s a good thing.

 

import sys
import random

# set up each source as an argument passed in on the command line
source1 = sys.argv[1]
source2 = sys.argv[2]
source3 = sys.argv[3]
source4 = sys.argv[4]

# split an article into lowercase sentences
def load_sentences(path):
    sentences = list()
    for line in open(path):
        line = line.strip()
        line = line.lower()
        line = line.replace(". ", ". \n")
        line = line.replace("! ", "! \n")
        line = line.replace("? ", "? \n")
        for sent in line.split("\n"):
            sentences.append(sent)
    return sentences

source1_lines = load_sentences(source1)
source2_lines = load_sentences(source2)
source3_lines = load_sentences(source3)
source4_lines = load_sentences(source4)

# build the set of unique words in one article
def unique_words(source):
    words = set()
    for line in source:
        for word in line.split():
            words.add(word)
    return words

words1 = unique_words(source1_lines)
words2 = unique_words(source2_lines)
words3 = unique_words(source3_lines)
words4 = unique_words(source4_lines)

# pool all four word sets, then count how many articles each word appears in
combo_list = list()
for word_set in (words1, words2, words3, words4):
    for word in word_set:
        combo_list.append(word)

wordDict = dict()
for word in combo_list:
    if word in wordDict:
        wordDict[word] += 1
    else:
        wordDict[word] = 1

# keep only the words that occur in all four articles
wordsFilter = set()
for word in wordDict.keys():
    if wordDict[word] == 4:
        wordsFilter.add(word)

outputList = list()

# grab one random sentence containing the word from the given article
def lineGrabber(source, word):
    finalList = list()
    for line in source:
        if line.find(word) != -1:
            finalList.append(line)
    random.shuffle(finalList)
    outputList.append(finalList[0])

for word in wordsFilter:
    if len(word) >= 6:
        print word
        print "--"

        lineGrabber(source1_lines, word)
        lineGrabber(source2_lines, word)
        lineGrabber(source3_lines, word)
        lineGrabber(source4_lines, word)

random.shuffle(outputList)
for line in outputList:
    print line

 

RWET: “The” “Goddamn” “Jabberwocky”

For the first assignment in “Reading Writing Electronic Text,” I played with the poem “Jabberwocky” by Lewis Carroll.  In 6th grade I memorized it for an English class poetry reading.  I remember it not making a ton of sense to me back then, but I really loved it, so I thought it would be an appropriate piece for a first crack at Python. I wondered how the poem would hold up if all the lines were randomized.  Using the “randomize_lines” Python script Adam gave us, here was the initial output:

 

 

And the mome raths outgrabe.
‘Beware the Jabberwock, my son!
‘Twas brillig, and the slithy toves
The Jabberwock, with eyes of flame,
The vorpal blade went snicker-snack!
All mimsy were the borogoves,

Did gyre and gimble in the wabe;
He left it dead, and with its head

The jaws that bite, the claws that catch!
O frabjous day! Callooh! Callay!’

‘Twas brillig, and the slithy toves

He went galumphing back.
All mimsy were the borogoves,
He chortled in his joy.

One, two! One, two! And through and through
And the mome raths outgrabe.
Come to my arms, my beamish boy!
He took his vorpal sword in hand:
Came whiffling through the tulgey wood,
The frumious Bandersnatch!’
And as in uffish thought he stood,
So rested he by the Tumtum tree,
‘And hast thou slain the Jabberwock?
And stood awhile in thought.
Beware the Jubjub bird, and shun
And burbled as it came!

Long time the manxome foe he sought–
Did gyre and gimble in the wabe;

 

 

Next, I wanted to see how the poem would hold up if I kept only the lines that contained the word “the” (or “The”), again randomized.  Would it still be fun and make some sense?  I integrated Adam’s “forfinder” script (replacing “for” with “the”) with the first script, and here was the first output:

 

 

Beware the Jubjub bird, and shun
So rested he by the Tumtum tree,
And the mome raths outgrabe.
‘Twas brillig, and the slithy toves
The jaws that bite, the claws that catch!
‘And hast thou slain the Jabberwock?
And the mome raths outgrabe.
Long time the manxome foe he sought–
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
Did gyre and gimble in the wabe;
Came whiffling through the tulgey wood,
All mimsy were the borogoves,
‘Twas brillig, and the slithy toves
‘Beware the Jabberwock, my son!

 

 

Then I thought it’d be nice to add a little Samuel L. Jackson into the mix.  So I added code to determine which remaining lines contained exclamations, specifically “!” and “?”.  For those lines, I inserted a “goddamn” after each “the.”  And here was the final output:

 

 

‘Beware the goddamn Jabberwock, my son!
All mimsy were the borogoves,
Long time the manxome foe he sought–
The goddamn jaws that bite, the goddamn claws that catch!
Came whiffling through the tulgey wood,
‘Twas brillig, and the slithy toves
And the mome raths outgrabe.
‘Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
So rested he by the Tumtum tree,
And the mome raths outgrabe.
Beware the Jubjub bird, and shun
All mimsy were the borogoves,
‘And hast thou slain the goddamn Jabberwock?
Did gyre and gimble in the wabe;

 

 

Here is the final Python code:

 

 

import sys
import random

new_lines = list()

for line in sys.stdin:

    line = line.strip()

    # keep only lines containing "the" or "The"
    lower_the = line.find(" the ")
    upper_The = line.find(" The ")

    if lower_the != -1 or upper_The != -1:
        # for exclamatory/questioning lines, insert "goddamn" after each "the"
        if line.find("!") != -1 or line.find("?") != -1:
            line = line.replace(" the ", " the goddamn ")
            line = line.replace("The ", "The goddamn ")
        new_lines.append(line)

random.shuffle(new_lines)

for line in new_lines:
    print line

 

Dynamic Web: Spring 2012 Project Ideas

Here are a handful of projects I’m interested in working on this semester.  I’m hoping to use the Dynamic Web class as a means to execute some, if not all, of these projects.

 

1) Thesis – Create a Game for Early Stage Alzheimer’s Patients

 

At the Alzheimer’s Association here in NYC, support groups get together a couple of times a week to talk about their condition and play “mental exercise” games.  I’m hoping to adapt one of the games they play from a pen-and-paper version to a computer/web-based version they could use during their sessions.

 

A possible example of a game would be matching news headlines with pictures of the people relevant to the news story. So I would pull in current news headlines, probably from the NY Times API, and current images, possibly from the Google Images API, store them in my own database, and then use them to play the game.  I’m currently researching the types of games they play and what the best interaction for them would be.
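For example, here is a hedged sketch of the headline pull using the NY Times Article Search API; the v2 endpoint and response fields shown are assumptions based on the current API, and the API key is a placeholder:

# Pull a few current headlines from the NYT Article Search API.
# Endpoint and fields are assumptions (v2 API); the key is a placeholder.
import requests

resp = requests.get(
    "https://api.nytimes.com/svc/search/v2/articlesearch.json",
    params={"q": "health", "api-key": "YOUR_NYT_API_KEY"})

for doc in resp.json()["response"]["docs"][:5]:
    print(doc["headline"]["main"])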

 

2) Patience – An Assistive Tech App for Physical Therapy

 

This is a project I worked on in the Assistive Tech class with John Schimmel in Spring 2011.  The current version allows a user to capture a phone’s real-time accelerometer values through a web site and, also in real time, plot those sensor values on a graph in a separate browser.  I’d like to continue working on this app, specifically allowing values to be stored in a database, and adding more functionality.  It would be great to then take the app across the street to the Physical Therapy department and see how well it performs under real-world circumstances.

 

3) The Twitter Scorecard – Evaluating Tweets from the Super Bowl

 

Last semester, I worked on a project that pulled in tweets from a couple of NFL games. I’d like to pull in tweets from the Super Bowl coming up next Sunday (through the Twitter Streaming API), store those tweets, and then create a site that lets users explore the content in a fun and engaging way.

 

4) Burger Breakdown

 

The concept here is to create a web site that would allow users to “visually” build their favorite kind of hamburger.  As the burger assembles, possible options to print to the screen would be (1) a calorie counter indicating how many calories are in a user’s custom burger, (2) an exercise regimen that would be necessary to work off the burger’s calories, and/or (3) the best/closest place to actually get this type of burger.

 

I’d also consider letting people submit their favorite burger spots and maybe even add some sort of voting/ranking.  It could also be fun to tap into the user’s geo-location to find the best burger spots closest to them.  Another possibility would be to use the Foursquare API as a source of data on local hamburger restaurants and their number of check-ins.

 

5) Run Keeper / Nike Plus

 

I recently started using the RunKeeper App and the Nike+ Sensor App to keep track of my runs.  I’d love to tap into RunKeeper’s Health Graph API and explore interesting ways to work with and visualize that data.  I’d also love to get my hands on a FitBit, a Jawbone (although they’ve been recalled), and one of the new Nike FuelBands.  It could be interesting to compare the data from each of them, possibly indicating which device is the most/least accurate.

Data Rep: Final Project – The Twitter Scorecard

 

 

 

For my final project in Jer Thorp’s Data Representation class, I explored the nature of tweets in relation to live sports. This past summer (2011) I worked on a concept for an app called “Tailgate” (formerly “Huddl”) that would provide users with curated Twitter feeds for live sporting events on TV. During my research for that project, I found that following the sports “action” on Twitter was not only entertaining, but extremely insightful. If I wanted to know what was going on in a particular game, reading recent tweets was a much faster and easier way to get a sense of the game than tuning in on TV. Also, reading tweets after a game had the potential to tell a much richer, more nuanced story than traditional forms of media, specifically box scores and highlight reels.

 

For my project, I sought to execute a proof of concept, through data visualization, that twitter does provide an accurate and rich depiction of the “story” of a live sporting event. Using a Ruby sketch to access the Twitter streaming API, I pulled in tweets from two NFL football games. I stored the tweets in a MongoLab database and then exported the data in CSV format. I then brought the CSV into Processing to create the visualizations.
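The collector itself was written in Ruby; purely for illustration, here is an equivalent Python sketch using tweepy’s pre-4.0 streaming interface, with placeholder credentials and illustrative hashtags:

# Illustrative Python stand-in for the Ruby collector: stream tweets
# matching game hashtags and print them (the real version wrote to MongoLab).
import tweepy  # tweepy 3.x API; StreamListener was removed in 4.0

class GameListener(tweepy.StreamListener):
    def on_status(self, status):
        print("%s %s" % (status.created_at, status.text))

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

stream = tweepy.Stream(auth=auth, listener=GameListener())
stream.filter(track=["#cowboys", "#dolphins"])  # illustrative hashtags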

 

The first game from which I pulled in tweets was the Dallas Cowboys vs. Miami Dolphins, played on Thanksgiving Day, Nov. 24, 2011. For this game, I queried the Twitter streaming API for 12 game-specific hashtags. Over the course of roughly 3 hours, I pulled in close to 35,000 tweets.

 

The Twitter streaming API sends back an extensive JSON object for each tweet. Per Jer’s advice, I started out by focusing specifically on the tweet message and the time it occurred. Using a simple date object in Processing to determine each tweet’s exact timestamp, I plotted the frequency of all the collected tweets along the timeline of the game. The x-axis represented time, progressing from left to right, with a point plotted for every tweet and y-values spread randomly. Here is the first visualization:
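For reference (the plotting itself was done in Processing), the timestamp-to-x-position math looks roughly like this in Python; the format string matches the streaming API’s raw created_at field, and the kickoff/end times are illustrative:

# Map a raw Twitter created_at string to an x position on the game timeline.
from datetime import datetime

def tweet_x(created_at, game_start, game_end, width):
    t = datetime.strptime(created_at, "%a %b %d %H:%M:%S +0000 %Y")
    span = (game_end - game_start).total_seconds()
    return (t - game_start).total_seconds() / span * width

start = datetime(2011, 11, 24, 16, 15)  # illustrative kickoff (UTC)
end = datetime(2011, 11, 24, 19, 30)    # illustrative final whistle (UTC)
print(tweet_x("Thu Nov 24 17:05:12 +0000 2011", start, end, 900.0))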

 

 

As you can see in the image above, there were areas of significant density representing higher tweet volume. (Note: I lost the feed during halftime, which explains the gap of tweets in the center of the sketch.) Common sense suggests these dense areas were moments of significance during the game, probably times when points were scored. So I then plotted the important scoring events of the game along the timeline below the tweets.

 

 

 

 

Touchdowns clearly sparked the highest volume of tweets along with the ending of the game, which was a game-winning field goal by the Cowboys with no time remaining.

 

To see what was actually being said, I then parsed out words from individual tweets to confirm what was being discussed and when it was happening. The images below show the occurrence of different words throughout the game.

 

 

Tweets with the word “Touchdown” or “TD”

 

 

 

 

Tweets with the word “Field Goal” or “FG”

 

 

 

Tweets with the word “Fumble” (this one is particularly interesting in that you could assume the fumble resulted in an opponent touchdown shortly after)

 

 

 

Tweets with the word “Interception” or “int”

 

 

 

Going beyond basic football terms, I started looking at the occurrence of words that had more “emotional significance.” Here’s a sketch comparing words of frustration (damn, shit, crap, fuck) with words of elation (amazing, nuts, crazy), and words of laughter (Ha, LOL).

 

 

 

 

For the most part, these occurrences appeared to line up with the significant moments of the game.

 

I then repeated this entire process for the New York Giants vs. Green Bay Packers game, which was played on Sunday Dec. 4th. For this game, I searched for 10 game-specific key words instead of hashtags. Through 3 quarters of the game, I pulled in over 150,000 tweets. (I unfortunately lost my feed during the beginning of the 4th quarter of the game.)

 

For my final class presentation, I developed a sketch that is primarily an exploratory tool. It allows a user to search for any word that occurred during either game, plot when and how often it occurred, and plot the relative frequency of that word throughout the game. The relative frequency helps distinguish when a particular word spiked in occurrence, regardless of its total volume.
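A minimal sketch of the relative-frequency idea, with tweets pre-bucketed by minute (the data layout here is assumed):

# Per-minute relative frequency: the share of that minute's tweets containing
# the search word, so spikes show up even during quiet stretches.
from collections import Counter

def relative_frequency(tweets, word):
    # tweets: list of (minute_index, text) pairs -- an assumed layout
    totals, hits = Counter(), Counter()
    for minute, text in tweets:
        totals[minute] += 1
        if word.lower() in text.lower():
            hits[minute] += 1
    return dict((m, float(hits[m]) / totals[m]) for m in totals)

sample = [(0, "TOUCHDOWN Cowboys!"), (0, "nice drive"), (1, "fumble!!")]
print(relative_frequency(sample, "touchdown"))   # {0: 0.5, 1: 0.0}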

 

I also added rollover functionality: the actual tweet message is displayed when the mouse rolls over a particular tweet in the lower graph. The visible message in the middle of some of the sketch images is the result of the mouse rolling over the densest area and picking a tweet to display. Here are some images of the final interactive sketch. I am working to get the sketch online, since the real charm of the piece is being able to search for any and all words to see where and when they occurred, but still images will have to do for now.

 

And special thanks to Jeremy Scott Diamond, Martin Bravo, Rune Madsen, and Greg Borenstein for their help on this project.

 

 

Game 01 – Overall

 

 

 

Game 01 – “Cheerleader” (a player on the Cowboys knocked over a cheerleader at the beginning of the 4th quarter, the television broadcast replayed the moment several times)

 

 

 

Game 02 – Overall

 

 

 

Game 02 – “Sack”

 

 

 

Game 02 – “Fumble”

 

 

 

 

 

 

Design for UNICEF: Lifecycle

 

 

 

In my “Design For UNICEF” class, I worked with Alvin Chang, Emily Webster, Jamie Lin, and Lia Martinez on a project called Lifecycle. This semester, the brief for the class was to come up with solutions that could be implemented in a specific region – Northern Uganda. The thought was that if something could work in Northern Uganda, it could work anywhere.

 

The class was split into four groups, each free to choose its own topic and focus. Our group was interested in designing something that could improve the rural healthcare system there. A problem we found in Northern Uganda is that health care clinics are sparsely located throughout the region, often miles away from villages, forcing people to walk (or bike) for hours to reach them. Ambulances are practically non-existent, so bikes are the most common form of transport, but the roads are often so bumpy or uneven that riding can be a challenge in itself. For someone who is sick, elderly, or pregnant, the distance can simply be too far to bear.

 

Given the lack of medical transport devices (i.e. wheelchairs, stretchers, ambulances), we wondered if it was possible to design a system that would allow people to easily connect two bikes in parallel, forming a stable 4-wheel unit that could be used to transport patients (or goods in non-emergency situations). Bikes are not readily available in this region for repurposing, but they do exist, and a simple low-cost kit that lets two bikes temporarily connect and then disconnect, so each bike can maintain its original use, could be a viable solution.

 

What we came up with is “Lifecycle.” All of our documentation and work can be found here at Lifecyclebike.org.

 

 

 

 

Essentially, there are 3 main points of connection: first, the seat holes, which form the initial stable connection; second, the handlebars; and third, the lower bar of the main frame. We purchased pipes/parts at Home Depot and used recycled tire tubes to form the seat netting. As a whole, the entire kit cost a little under $30. This is definitely an early prototype, and we recognize these materials are not readily available in Uganda. In addition, we’d love to maintain the ability to ride the bikes; as it stands now, they can only be pushed or pulled.

 

There is much room for improvement, and we encourage people to offer feedback and/or build off our research to continue this line of work.  Here are a couple of documentation videos that show the assembly of the bikes as well as a road test out in NY.

 

 

 

 

DIY Health: “The Airing” Final Video

Here is the final video project for my DIY Health class.

 

 

Music By: Junesex “The Road Seems Endless But At Least You’ve Found A Path”
