Statistical grounding for qualitative story sets

In my previous post demonstrating how algorithms can transform some kinds of qualitative data into quantitative patterns, I mentioned that a more statistically-rigorous filter was possible. Here is an example.

First Principle: Robust data should yield the same conclusion across many regions, over time, and from many perspectives to be considered “quantitative.”

Here is an example from 749 stories containing the word ‘rape’, compared with the larger story set of nearly 50,000 stories from the Globalgiving Storytelling Project:

The above map shows words from these stories. Larger words are more common. Red circles are from the point of view of a person who was part of the story he/she told. Blue circles are reports from observers. Purple words overlap with both perspectives. Words from stories that come from only a few unique individuals are automatically excluded. So this is a pretty good map of what matters.

But if we also look at the consistency of story patterns over time, we should be able to see which patterns are strongest.

What parts of these stories are more common in this set of stories, compared with the tens of thousands of other stories from the region?

If you count how often each word is used in these ‘rape’ stories, do the same for all 50,000 stories, and take the ratio of the two frequencies, you get one point per word. A ratio greater than 1 means the word is more common in these stories; a ratio less than 1 means the word is less common in these stories than in the whole set of stories. Below I plotted the LOG of this ratio because it is easier to read. (Remember, LOG of 1 = 0, so ZERO becomes the dividing line between common and uncommon words instead of 1):
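As a rough sketch of the ratio calculation described above (not the code behind the charts in this post), the log-ratio for each word could be computed like this. The function names, the simple regex tokenizer, the use of relative frequency (count divided by total words), and base-10 logs are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def word_counts(stories):
    """Count every word across a list of story texts (naive tokenizer, an assumption)."""
    counts = Counter()
    for text in stories:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def log_ratios(subset, reference):
    """Return {word: log10(frequency in subset / frequency in reference)}.

    A value above 0 means the word is relatively more common in the subset;
    below 0 means it is relatively less common.
    """
    sub = word_counts(subset)
    ref = word_counts(reference)
    sub_total = sum(sub.values())
    ref_total = sum(ref.values())
    ratios = {}
    for word, n in sub.items():
        if ref[word] == 0:
            continue  # word never appears in the reference set; ratio undefined
        ratios[word] = math.log10((n / sub_total) / (ref[word] / ref_total))
    return ratios

# Toy usage: the reference set includes the subset, as in the post.
subset = ["the police arrested him"]
reference = subset + ["the maize and the beans and the rain"]
ratios = log_ratios(subset, reference)
```

In this toy example "police" lands above zero and "the" lands below it, which is the dividing line the plot relies on.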

I have manually labeled the words that correspond to nearby points on the chart. On the top you’ll see lots of highly relevant words. On the bottom are words that are more common generally in all stories from everywhere. Reading from left to right, the words on the left appear most often in the 50,000 stories; words on the right are least commonly used. Reading from top to bottom (the adjusted prevalence for each word within the ‘rape’ story set), words on the top are more common in ‘rape’ stories than they are in the whole set of 50,000 stories.

The error bars are the standard error of the mean (SEM) for each word’s prevalence over time. I took all 749 stories, broke them into 3-month chunks, averaged each word over these chunks, and then calculated the standard error. The result shows not just whether a particular word is more common in ‘rape’ stories than in general, but how consistently it appears over time.
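A minimal sketch of that error-bar calculation follows, under several assumptions that are not from the post: stories arrive as (date, text) pairs, the 3-month chunks are calendar quarters, and a word’s prevalence in a chunk is the fraction of that chunk’s stories containing it. All names here are hypothetical.

```python
import math
import statistics
from collections import defaultdict
from datetime import date

def quarter_key(d):
    """Map a date to a (year, quarter index) chunk; calendar quarters are an assumption."""
    return (d.year, (d.month - 1) // 3)

def prevalence_by_chunk(stories, word):
    """Fraction of stories containing `word` in each 3-month chunk, in time order."""
    chunks = defaultdict(list)
    for d, text in stories:
        chunks[quarter_key(d)].append(word in text.lower().split())
    return [sum(flags) / len(flags) for _, flags in sorted(chunks.items())]

def mean_and_sem(values):
    """Mean and standard error of the mean (sample stdev / sqrt(n))."""
    if len(values) < 2:
        # With fewer than two chunks, consistency over time cannot be estimated.
        return (values[0] if values else float("nan")), float("inf")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

stories = [
    (date(2011, 1, 5), "police came"),
    (date(2011, 2, 1), "no one came"),
    (date(2011, 4, 1), "police police"),
    (date(2011, 7, 1), "rain"),
]
values = prevalence_by_chunk(stories, "police")
mean, sem = mean_and_sem(values)
```

A word that shows up heavily in one chunk and not at all in the others ends up with a large SEM relative to its mean, which is the signal that it is not consistent over time.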

The same approach can be repeated across locations in Kenya and across all the scribes who help collect stories. So after adjusting a map for at least one of these characteristics (time), the new map looks like this:

The colored map (above) and the X-Y plot of log-ratios are showing the same data. But I prefer to use the maps because they also reveal something about the interrelationship of the words. Green words are more common in ‘rape’ stories; red words are more common in stories that do not mention ‘rape’. Grey words and nodes are not particularly more or less common. The rules for coloring nodes also consider the error bars around each word. Something like MREMBO scored very high on the log-ratio plot and is highly associated with ‘rape’ stories, but the error bars were big. Mrembo stories were only collected in one calendar month, so the word fails the ‘diverse sources’ rule and does not appear in the final map.

As expected, mrembo, boyfriend, gang, and defilement all have big error bars, meaning that they are not consistently present in stories from each time interval, and thus do not appear in the red-green wordtree map.
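The post does not spell out the exact coloring rules, so the following is only one plausible way to combine a word’s log-ratio with its error bar. The thresholds, the `max_sem` cutoff, and the function name are all assumptions made for illustration.

```python
def color_word(log_ratio, sem, max_sem=0.5):
    """Assign a map color from a log-ratio and its standard error.

    Hypothetical rule: drop words whose error bar is too wide, then call a
    word green or red only if the whole error bar sits on one side of zero.
    """
    if sem > max_sem:
        return None  # too inconsistent over time; excluded from the map
    if log_ratio - sem > 0:
        return "green"  # reliably more common in the subset
    if log_ratio + sem < 0:
        return "red"  # reliably less common in the subset
    return "grey"  # error bar straddles zero

colors = {
    "police": color_word(0.8, 0.1),
    "hiv": color_word(-0.6, 0.2),
    "people": color_word(0.05, 0.1),
    "mrembo": color_word(1.2, 0.9),
}
```

Under this rule a word like mrembo, with a high log-ratio but a wide error bar, is dropped rather than colored green.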

I think this is a pretty good indication that we can automate the filtering process for any set of qualitative data, provided that there is a large reference set of similar stories. The reason I’ve never seen qualitative analysis software do this is that it assumes there is no valid reference story set. For a set to be valid it needs to be at least twice as big as the set of stories one is interested in analyzing. That is why we enforce a 2-story-rule: each person contributes two stories, and if one is deemed relevant to some analysis, the second story becomes part of the first story’s relevance filter. If I used some totally different reference set, like the New York Times or, the map would probably be far less precise at filtering.

Reading this map, I was surprised that HIV/AIDS was marked red, indicating it appears at a lower rate than in general stories. And police, jail, and arrested appear in green, indicating these are more common in ‘rape’ stories. I am already thinking about ways to use these top green words as seeds for building a better set of stories that are about ‘rape’ but do not necessarily include the word ‘rape’ anywhere. If a story, for example, included justice, boys, boyfriend, and unwanted, perhaps that story is really close to being about a rape-like situation.

Clearly the word rapist should have matched, but there are already natural language word-stemming tools that can catch rapist from a search about rape. There are no good tools that can pool together 4 unrelated words into an equivalence-dictionary. Building “equivalent word lookup dictionaries” on the fly (algorithms do it) that actually work would vastly improve the power of qualitative data for accurately informing decision makers and social science researchers, and I hope that I can convince the Santa Fe Institute to support this work in 2013.

Imagine a web page where you upload text narratives or connect an RSS feed of narratives, and it structures your data and gives you back a representation of those narratives with only the consistent patterns. It would free people up to think about collecting stories again, rather than designing surveys, which are long, boring, expensive, and don’t always capture what matters. Yes, they have their use, but I am frustrated that in 2012 we are drowning in an ocean of text and there don’t seem to be any good algorithms for getting more quantitative mileage out of these opinions, journals, and conversations.
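Returning to the idea of using the top green words as seeds: a very rough sketch of how stories that never use the search word might still be flagged is below. This is not an existing tool or the method used here; the function name, the `min_hits` threshold of three seed words, and the naive tokenizer are assumptions chosen only to make the idea concrete.

```python
import re

def seed_matches(stories, seeds, query="rape", min_hits=3):
    """Find stories that omit `query` but contain at least `min_hits` seed words."""
    seed_set = {s.lower() for s in seeds}
    hits = []
    for text in stories:
        words = set(re.findall(r"[a-z']+", text.lower()))
        if query in words:
            continue  # already caught by the plain keyword search
        found = seed_set & words
        if len(found) >= min_hits:
            hits.append((text, sorted(found)))
    return hits

stories = [
    "She wanted justice after her boyfriend made unwanted advances",
    "The boys played football",
    "A rape case went to court for justice boys boyfriend",
]
candidates = seed_matches(stories, ["justice", "boys", "boyfriend", "unwanted"])
```

Any story flagged this way would still need a human read; co-occurrence of seed words only suggests the story may be about a similar situation.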

Minimal requirements for the qual-to-quant transformation

If you just know these four things, you can build a pretty powerful filter for relevance and consistency of words or phrases from a set of about 200 (minimum) stories:

  • Who collected it
  • When it was collected
  • Where (to be added to the algorithm in the future)
  • What else these people talk about and therefore care about (2-story rule)

Lastly, you should always aim to collect a lot of short narratives from lots of people, places, and times instead of a few long narratives from just a few. You can’t build a mosaic with just a few tiles.

You can use lots of stories to build quantitatively reproducible patterns. The “reproducible” part of this claim comes from sampling all the pieces of the data, seeing how consistent the patterns in each piece are, and then showing the consistent parts in shades of green and red.
