Using big data to infer how people would’ve answered

I recently wrote an algorithm that would use the answers from 57,000 stories to predict what three topics people might choose for a story with similar words in it.

How does it work?

People tell a lot of stories, and the words they use are correlated with the topics they choose. So if the correlation is strong enough, a computer algorithm can correctly “guess” the topic the person would have chosen. The guess is based on (1) generating a dictionary of words and their frequency of use in stories a human has assigned to one of ten topics then (2) scoring a test story by adding up the relevance of each word in that story to the topic, based on that topic dictionary.
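The two steps above can be sketched roughly as follows. This is a minimal illustration, not the original code: the post does not say how "relevance" is computed, so the sketch assumes it is a word's share of all word occurrences in a topic's stories. The names `tokenize`, `build_topic_dicts`, `score_story`, and `predict_topics` are hypothetical.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase the text and keep runs of letters and apostrophes as words.
    return re.findall(r"[a-z']+", text.lower())

def build_topic_dicts(labeled_stories):
    # labeled_stories: iterable of (story_text, topic) pairs tagged by a human.
    # Returns {topic: Counter mapping word -> number of occurrences}.
    topic_dicts = defaultdict(Counter)
    for text, topic in labeled_stories:
        topic_dicts[topic].update(tokenize(text))
    return dict(topic_dicts)

def score_story(text, topic_dicts):
    # Assumed relevance: each word adds its relative frequency within the topic.
    words = tokenize(text)
    scores = {}
    for topic, counts in topic_dicts.items():
        total = sum(counts.values())
        scores[topic] = sum(counts[w] / total for w in words) if total else 0.0
    return scores

def predict_topics(text, topic_dicts, n=3):
    # Return the n highest-scoring topics, best first.
    scores = score_story(text, topic_dicts)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Calling `predict_topics(story, topic_dicts)` on an unseen story returns the three topics whose dictionaries best match its words.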

The rigorous way to do this is to set aside 10-20% of the data to test the algorithm and use the rest to “train” it, then run the algorithm on the test set to estimate how likely it is to choose the correct topic from among these 10 choices:

topic question from story form
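A hold-out test of this kind might look like the sketch below. The function names, the 20% default, and the fixed random seed are assumptions made for illustration; `predict` stands for any function that maps a story's text to a single topic.

```python
import random
from collections import defaultdict

def split_train_test(records, test_fraction=0.2, seed=0):
    # Shuffle a copy so the held-out set is a random sample of the data.
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy_by_topic(test_records, predict):
    # test_records: (story_text, human_topic) pairs.
    # Returns the percent of stories per topic where predict() agrees.
    hits, totals = defaultdict(int), defaultdict(int)
    for text, topic in test_records:
        totals[topic] += 1
        if predict(text) == topic:
            hits[topic] += 1
    return {t: round(100.0 * hits[t] / totals[t], 1) for t in totals}
```

Reporting accuracy per topic rather than as one overall number is what exposes the large differences between topics.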

I was surprised to see that the reliability of this approach depends on which topic you mean:

Fetched 19343 records, 1 fields, with 8010659 characters. Conn: Closed food
Fetched 15743 records, 1 fields, with 6517898 characters. Conn: Closed sec
Fetched 22009 records, 1 fields, with 9192587 characters. Conn: Closed fam
Fetched 19246 records, 1 fields, with 8186335 characters. Conn: Closed fre
Fetched 24335 records, 1 fields, with 10342187 characters. Conn: Closed phy
Fetched 30365 records, 1 fields, with 12079326 characters. Conn: Closed know
Fetched 16717 records, 1 fields, with 6985293 characters. Conn: Closed self
Fetched 8678 records, 1 fields, with 3274556 characters. Conn: Closed resp
Fetched 14378 records, 1 fields, with 5838550 characters. Conn: Closed cre
Fetched 5633 records, 1 fields, with 2050559 characters. Conn: Closed fun

Accuracy rates (percent match between the algorithm and what people choose)
{'kno': 95.7, 
'fre': 6.2, 
'res': 67.5, 
'cre': 16.8, 
'phy': 85.5, 
'sec': 2.8, 
'fam': 47.2,
'fun': 67.2, 
'slf': 0.4, 
'foo': 6.1}

That means that I can accurately predict stories about “knowledge” 96% of the time, but “security” stories only 2.8% of the time. Accuracy correlates poorly with the number of stories tagged with a topic. Fun is a seldom-used topic but matches with 67% accuracy; self-esteem is only 0.4% accurate, yet was tagged in 3X as many stories as fun.

Next I thought, “maybe the most common words in each reference dictionary are too similar among all 10 topics.” I noticed the top words are similar in many of the 10 topics. Words like ‘school’, ‘organization’, and ‘community’ are present in all stories, and so offer no differentiating ability. I should remove them.

creativity [(‘organization’, 5200.40795559667), (‘school’, 3543.777062566668), (‘community’,
3248.152150989258), (‘child’, 2862.176422375521), (‘day’, 1558.2406172604306), (‘helped’,
1528.985994397759), (‘village’, 1518.7758112094393), (‘area’, 1459.6429306441655), (‘organisation’,
1431.8204927035933), (‘aid’, 1339.7938144329896),…]

security [(‘organisation’, 1797.8478738427743), (‘helped’, 839.5979011322839), (‘hiv’,
758.4263051629651), (‘pupil’, 757.4545341769011), (‘school’, 749.9667855960569), (‘month’,
731.5803097814555), (‘provides’, 578.9633375474084), (“i’am”, 544.4925373134329), (‘child’,
522.5630079912575), (‘business’, 519.3864168618267), (‘standard’, 480.0096525096525), (‘money’,
464.8109119558795), (‘aid’, 460.0247422680413), (‘just’, 455.86785009861933), (‘happy’,
422.7314842729374), (‘mzesa’, 405.6), (‘thanks’, 402.25015556938394), (‘gulu’, 395.3125763125763), …]

knowledge [(‘child’, 36082.86353391162), (‘school’, 33818.189588161),
(‘community’, 32907.04868545692), (‘helped’, 32814.49078786444),
(‘organisation’, 32659.84693237094), (‘group’, 18383.11439114391), (‘life’, 16962.78238448316), (‘woman’, 14369.672232361278), (‘money’, 14049.368721686034),
(‘good’,13293.343451864701), (‘youth’, 13202.397977609246), (‘food’, 12707.99451382372), (‘living’,
12395.504079003864), (‘poor’, 12331.987821235045), (‘parent’, 11596.92557475659),
(‘education’, 11534.22393346681), (‘aid’, 11186.01649484536), …]

When you exclude all words that are in the 60th percentile of frequency or above, you get the opposite pattern for accuracy:

{'kno': 0.8,
'fre': 51.8,
'res': 2.6,
'cre': 24.4,
'phy': 0.8,
'sec': 79.8,
'fam': 2.9,
'fun': 3.3,
'slf': 97.3,
'foo': 59.1}

Well, that won’t do either. So I decided I needed to get serious. Oddly, in Python that means writing a whopping five more lines of code instead of the usual single line that does something amazing like “take all the words in all dictionaries and drop the words that are present at the 60th percentile or greater.”
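The percentile filter itself is not shown in this post. A rough sketch of it, assuming a nearest-rank cut-off applied to one topic dictionary at a time (the function name `drop_frequent_words` is hypothetical), might be:

```python
def drop_frequent_words(word_freqs, percentile=60):
    # word_freqs: {word: frequency} for one topic.
    # Find the frequency at the given percentile (nearest-rank method),
    # then keep only the words strictly below that frequency.
    if not word_freqs:
        return {}
    ranked = sorted(word_freqs.values())
    # Integer ceiling of len(ranked) * percentile / 100, at least 1.
    rank = max(1, -(-len(ranked) * percentile // 100))
    cutoff = ranked[rank - 1]
    return {w: f for w, f in word_freqs.items() if f < cutoff}
```

Applied to every topic dictionary, this removes the most frequent words and leaves only the rarer ones.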

Python code typically looks like this:

    def inall(key, topic_dicts):
        # Return True if key is present in every dict of topic_dicts, else False.
        in_all = 0
        for k, v in topic_dicts.items():
            if key in v:
                in_all += 1
        # If every dictionary has the word, these will match.
        return len(topic_dicts) == in_all

    # For each topic, keep only the words that are not shared by all topics.
    alt_topic_dicts = {}
    for k, v in topic_dicts.items():
        alt_topic_dicts[k] = {x: y for x, y in v.items() if not inall(x, topic_dicts)}

On my third try, I decided to exclude any words that are present in all 10 topics from each of the 10 respective topic (word:frequency) dictionaries. It took 75 seconds to rerun all the analysis, and the accuracy was much better:

{'kno': 86.7,
'fre': 83.8,
'res': 73.1,
'cre': 57.3,
'phy': 64.8,
'sec': 58.9,
'fam': 62.4,
'fun': 43.2,
'slf': 61.1,
'foo': 60.8}

So with the exception of stories with the topic “fun,” I can use this simple algorithm to predict the topic of a story (from a list of ten topics representing the hierarchy of human needs) correctly over 50% of the time.  The probability of randomly picking the right topic would be one in ten — 10% success — so I’m quite happy with this result.

But is 65% accuracy (on average) “good”?

In 2009 we ran this experiment with humans. This is what storytellers chose:

What people talked about in stories from Kenya

And this is what human “experts” predicted:


When we asked 65 aid experts to pick the top 6 out of 12 topics in that survey question, and rank-order them, only one out of 65 got #1 correct! And later, he admitted in an email that he just guessed. Overall, people performed worse than chance (8%) at this task, because they were biased by what they thought the main topics would be for everyone.

So in that context, this algorithm does surprisingly well, and much better than humans for this specific task.

By another measure, in the sense of Shannon Information Theory, it provides 3X to 6X more information than we would have about the story had we not included this new “metadata.” The exact number is tricky to calculate (at 3am) because storytellers were asked to choose 3 of 10 topics on the form, and if the algorithm’s #1 choice is in the top 3, then I count that as a hit. A rigorous result would only count cases where all three topics matched the human’s choice as correct. That’s a bit more involved than what I care about. This does bring up an interesting point about surveys. Most questions on forms only allow for one right answer, and we required 3 of 10 answers. That makes it easier for the algorithm to “learn” how to be mostly right, because each story has multiple topics that overlap. It would be good to think about doing this on more surveys in the future Big Data Era.
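The difference between the lenient rule used here and the stricter one can be made concrete with a small sketch. The function names are hypothetical, and each storyteller is assumed to have picked exactly three distinct topics.

```python
def lenient_hit(predicted_ranked, human_topics):
    # Hit if the algorithm's #1 topic is among the storyteller's choices.
    return predicted_ranked[0] in set(human_topics)

def strict_hit(predicted_ranked, human_topics):
    # Hit only if the algorithm's top three equal the storyteller's three.
    return set(predicted_ranked[:3]) == set(human_topics)

def hit_rate(pairs, hit):
    # pairs: list of (predicted_ranked, human_topics); returns percent hits.
    if not pairs:
        return 0.0
    return 100.0 * sum(hit(p, h) for p, h in pairs) / len(pairs)
```

Under the lenient rule, and with that same three-distinct-topics assumption, a uniformly random #1 guess would count as a hit 3 times out of 10, so the chance baseline for this scoring is closer to 30% than 10%.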

The Big Idea Behind Big Data

This topic prediction approach works because of some very simple math and a huge, rather complete amount of empirical data (57,000 stories about the types of things people talk about when they describe community efforts in East Africa). International Development suffers from having the smallest and most disconnected data systems on Earth. This is a rather large training data set, where poverty is concerned. But once you have this, you can do a lot more with it – such as categorize future narratives along a hierarchy of needs with about 65% accuracy – without having to collect more data and waste more people’s time.

Learning can happen faster.

People can take action quicker.

It’s not a replacement for listening, but it can aid our understanding.

And importantly, this approach can work with other questions that we included in our survey.

Read more: The future of big data is quasi-unstructured

Which was quoted in this Wired blog: The growing importance of natural language processing

This is the kind of thing described in the book, “The Secret Life of Pronouns.”


Predicting GlobalGiving Project Report Topics

I extended this test by applying the ten topic dictionaries to a totally new set of narratives: 24,392 project reports on GlobalGiving from 2006-2013. All of these are about real project work, though the words people use are different. According to these topic dictionaries, the breakdown of topics among the GlobalGiving project reports is as follows:

Sum of top three assigned topics:

{'knowledge': 24045,
'freedom': 33,
'respect': 19467,
'creativity': 181,
'physical needs': 18675,
'security': 24,
'family': 1035,
'fun': 9689,
'self-esteem': 10,
'food & shelter': 17}
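A tally like the one above could be produced along these lines. This is only a sketch: `tally_top_topics` is a hypothetical name, and `score` stands for whatever function turns a report's text into a score per topic.

```python
from collections import Counter

def tally_top_topics(reports, score, n=3):
    # reports: iterable of report texts.
    # score: function mapping a text to a {topic: score} dict.
    # Counts how often each topic lands in a report's top n.
    tally = Counter()
    for text in reports:
        scores = score(text)
        top = sorted(scores, key=scores.get, reverse=True)[:n]
        tally.update(top)
    return dict(tally)
```

With three topics counted per report, the totals add up to three times the number of reports, which matches the figures above (73,176 assignments for 24,392 reports).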

Clearly, this method does not assign topics to updates in the same proportion that people assigned these topics to their stories. This could be because the narrative words are quite different for the subjects that are underrepresented. These scores are both a measure of how similar the language (words) is between reports and stories on a topic, and a measure of how many reports contain these topics.

Organizations probably use very different language to describe security, freedom, self-esteem, and food-shelter projects on GlobalGiving from the way people talk about them in stories.

Knowledge (education) and physical needs are described similarly in both places.

Respect is overrepresented in project-speak. There is no corresponding project theme on GlobalGiving, although “women” and “children” projects are the largest category on the site.

Food & shelter is described in terms of disaster relief on GlobalGiving, but appears more in the context of poverty in stories.

Freedom in stories maps to human rights and democracy projects on GlobalGiving.

Coherence between story role and predicted story point of view based on pronoun use

In general, people use “I” and “me” in stories where they were affected or played an active part. And they use fewer personal pronouns in observer stories:

Fetched 39714 records, 1 fields, with 15281132 characters.
'Saw it happen','Heard about it happening'
third plural 39.1%
first plural 20.4%
third singular 17.5%
fourth 17.1%
first singular 5.9%
Fetched 13346 records, 1 fields, with 6136291 characters.
'Was affected by what happened'
first singular 29.3%
first plural 24.1%
third plural 22.9%
third singular 12.5%
fourth 11.2%
Fetched 7756 records, 1 fields, with 3468508 characters.
'Helped make it happen'
third plural 26.4%
first plural 23.7%
first singular 18.8%
third singular 17.3%
fourth 13.8%

“Fourth POV” is my shorthand for stories that contain more organization words than pronouns. They are impersonal and lacking in details, more like press releases. Luckily, they are not too common overall.
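A point-of-view classifier in this spirit can be sketched in a few lines. The pronoun lists and the organization word list below are assumptions made for illustration; the post does not give the lists that were actually used.

```python
import re
from collections import Counter

# Assumed pronoun vocabularies for each point of view.
PRONOUNS = {
    "first singular": {"i", "me", "my", "mine", "myself"},
    "first plural": {"we", "us", "our", "ours", "ourselves"},
    "third singular": {"he", "him", "his", "she", "her", "hers",
                       "himself", "herself"},
    "third plural": {"they", "them", "their", "theirs", "themselves"},
}
# Hypothetical list of organization words.
ORG_WORDS = {"organization", "organisation", "ngo", "project",
             "program", "programme"}

def point_of_view(text):
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for w in words:
        for pov, vocab in PRONOUNS.items():
            if w in vocab:
                counts[pov] += 1
    org_count = sum(1 for w in words if w in ORG_WORDS)
    # "Fourth POV": more organization words than pronouns of any kind.
    if org_count > sum(counts.values()):
        return "fourth"
    if not counts:
        return None  # no pronouns and no organization words to go on
    return counts.most_common(1)[0][0]
```

Running this over every story in a role group and counting the labels would give percentage breakdowns like the ones listed above.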

This analysis continues elsewhere: It turns out, telling a story from a different point of view can make a project report more compelling, leading to more donations.

