The “interesting story” filter

I’ve stumbled onto a trick that isolates the most interesting stories from among the 20,000+ stories that are in the GlobalGiving Storytelling collection. It uses an indirect, “natural language processing” approach that compares the words in each story to the words in every other story within a set. Every story has a defined similarity with every other story:

Story similarity formula =

Each story has a degree of overlap with every other story in a set. I then average all of these scores and assign that average to the original story as an overall measure of how “typical” that story is among the set.
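
A minimal sketch of the pairwise-then-average step described above. The similarity formula itself is not reproduced in this text, so the sketch assumes a simple Jaccard word overlap (shared words divided by all distinct words); that choice, and the function names `words`, `similarity`, and `average_similarity`, are illustrative assumptions rather than the method actually used.

```python
import re


def words(story):
    """Lowercase a story and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", story.lower()))


def similarity(a, b):
    """Assumed measure: Jaccard overlap between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_similarity(stories):
    """For each story, average its similarity to every other story in the set."""
    bags = [words(s) for s in stories]
    n = len(bags)
    scores = []
    for i in range(n):
        others = [similarity(bags[i], bags[j]) for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


if __name__ == "__main__":
    demo = ["school fees paid", "school fees help", "football sports ground"]
    print(average_similarity(demo))  # [0.25, 0.25, 0.0]
```

Comparing every story with every other story is quadratic in the size of the set, which is manageable for a few thousand stories but worth keeping in mind for the full collection.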

Here’s what these average similarity scores look like when plotted for a set of the first 200 stories collected in the GlobalGiving Storytelling Project:

If that s-shaped curve pattern seems familiar to you, it should. Most people plot this kind of data as a frequency histogram (how many stories fall into each “bin” of similarity scores?). This reveals that these averaged “similarity scores” are normally distributed:

Above is a plot of story count (Y-axis) versus story similarity score (X-axis). These values are normalized to a scale of 0 to 100 using a standard normalization formula:
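
One common reading of “standard normalization” is min-max scaling, which maps the lowest score to 0 and the highest to 100. The sketch below assumes that reading (the exact formula is not reproduced in this text) and also shows how the normalized scores could be counted into histogram bins. The names `normalize_0_100` and `histogram`, and the bin width of 10, are illustrative.

```python
def normalize_0_100(scores):
    """Min-max scale scores onto 0..100 (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: nothing to spread out.
        return [0.0 for _ in scores]
    return [100.0 * (s - lo) / (hi - lo) for s in scores]


def histogram(values, bin_width=10):
    """Count 0..100 values per bin; bin_width is assumed to divide 100."""
    n_bins = 100 // bin_width
    counts = [0] * n_bins
    for v in values:
        # A value of exactly 100 is placed in the last bin.
        idx = min(int(v // bin_width), n_bins - 1)
        counts[idx] += 1
    return counts


if __name__ == "__main__":
    scaled = normalize_0_100([1.0, 2.0, 3.0])
    print(scaled)             # [0.0, 50.0, 100.0]
    print(histogram(scaled))  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 1]
```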

I was very surprised that given any set of stories, the similarity of each story in the set to all other stories in the set follows a normal distribution, meaning that in the most likely case (middle of the bell curve), a story will have a predictable amount of similarity to other stories. Here is the histogram for the first 2000 stories:

And for all Kenya stories (batch of 7584):

And 1857 stories that mention water:

And 3873 stories that mention children:

And 1252 stories that mention school fees or scholarships:

Interpreting distributions

If you look closely you’ll notice that while all of these are approximately normally distributed, some are skewed towards lower scores. This makes sense, as the “set of all stories from Kenya” should have greater diversity than the “set of all stories about school fees.” In other words, a story set whose distribution is bunched towards lower scores will, on average, be more diverse (its stories share fewer words) than a story set with less right-hand skew.
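
To put a number on how lopsided a distribution is, one option is the sample skewness (the mean of the cubed z-scores). This is a sketch under that assumption, not a statistic reported in the post; the function name `skewness` is illustrative. A positive value means a longer tail towards high scores, with the bulk of stories sitting at lower scores.

```python
from statistics import fmean, pstdev


def skewness(scores):
    """Population (Fisher-Pearson) skewness: mean of cubed z-scores."""
    mu = fmean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # No spread at all, so no skew to measure.
        return 0.0
    return fmean([((s - mu) / sigma) ** 3 for s in scores])
```

Comparing this value across story sets (for example, all Kenya stories versus stories about school fees) gives a rough, single-number way to rank sets by how strongly their scores bunch towards the low end.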

Finding the most “typical” and most “interesting” stories

At both ends of the bell curve, you’ll find two very useful types of stories. Those with very high similarity scores are “most typical” of the set as a whole, and might be useful in quickly getting a read on what all of these stories might contain. Here are the top ten “most typical” stories from the 1252 about school fees or scholarships:

  • Orphaned Children – In a certain area there were many orphans and there was no one to help them with school fees and other needs. Kibera C.D.F company
  • School – I was in a broad need of school feed, but unfortunately one teacher volunteered and helped me to pay the school fees. Sitatunga Individual
  • School dropout   –   World Vision helped me and took me bock to school because at first i was not schooling because of school fees and world vision desided to pay for me school fees. and am happy now. Famine World Vision World Vision
  • Fees – I thank God for providing me with school fees which I was sent by donors from various institutions. Sitatunga Individual
  • EDUCATION – Budukiro project has helped my children to go back to school through paying for them school fees,giving them books,uniform and now our children are at school. And the problem of school fees is solvent. Katwe Budukiro project
  • Education support – UWESO has been their to help orphans by giving them materials and school fees. It has given support to orphans hence improving their lives mentally. Kijabwemi UWESO Uweso
  • School support – World Vision gives children things like books pens to use at school also they pay school fees to some of the children hence fullfilling one of the children right that is the right to education. Kakunyu World Vision World Vision
  • School  – We had  been helped by the organization to pay our school fees.This comes across through sports.The support has quite helped parents and their children. singereti Individual
  • Education facilities –        Hope alive is an that helps children in education facilities like scholastic scholastic materials and also school fees for the children in our community hence improving education and reducing illiteracy in the community. kijabweni Hope alive
  • School fees support  –     Pator walugembe has improved on the education of the children in our community. He pay school fees for 25 pupils in our community this has prevented children who go on streets from our community. kiseka Individual Individual

You’ll immediately notice that all of these stories are indeed about what we were expecting to find. They are also very short (a side-effect of the imperfect algorithm I used to assign similarity scores).

And now… the ten “least typical” stories from this set of 1252 stories about school fees or scholarships:

  • Scholarship in sports – Many of our young youths in Kianda have benefitted from football sports, which has been sponsored by the white people in Isack sports ground. This has helped the talented youths to utilize their time properly.  Due to this our community has started growing up dums in the village coz their ris an organization their us came to support youth those dont have jobs and now theirs secutrity is seffety. KIBERA makina Organization Unnamed Organization
  • Provision of scholarships –      The need to improve the level of education have spacked several people to acquire sponsorships from several difference places within and outside the  contry.    However there are other organisations tha are currently offering scholarships to deserving citizens within the country.   In siaya district, the Jamo Kenyatta Foundation was fetching needy students to award them scholarships for futhering their studies.    this is a great boost and has promoted competition not only among schools but also among students who want to be the best and with the  award.There should be more of the same kind.  Jomo Kenyatta Foundation Jomo Kenyatta Foundation
  • EFFECTIVE SUPPORT – It is a good thing that organizations live up to their own meaning just like Access has done.For children from the less fortunate backgrounds to access their dreams and hope for their future it was just so needful that ACCESS comes in handy for them to access their hope.  Effectively ACCESS had accessed funding to needy students through scholarships that have not brought a regret to them in return. They have enhanced determination among students who work really hard when they qualify for their awards. municipality ACCES
  • NAKATO THE TWIN – Nakato the twin is a radio presenter who fights for the needy people and those who have family problems. She collects funds from the listeners after a story on air concerning a person. She has managed to secure scholarships and bursaries from within and outside the country fro the orphaned and disadvantaged children. Through her efforts, over 800 children in Uganda are staying in different schools. NTINDA INDIVIDUAL Individual
  • DAZZLING ART – Kaafiri Kariuki,a talented man has used his talent to inform the society about a lot of things.He has encouraged most of the people throug his dancing pen and dazzling art.Many people,especially tourist who are making their moves and visits in Kenya.He has also taught young men how to draw and paint and indeed the people have made a good move and able to economically support themselves.Many of the people have gotten scholarship overseas  and earn a good living. Kawangware Kaafiri Kariuki Kaafiri Kariuki
  • PROMOTING TALENTS VIA EDUCATION – Kampala International University in a a bid to promote education, it offerd two scholarships for every district in Uganda.Male and female students are selected by LC5chairman of every distric. It also offers scholarships to students who have a back ground record of being excellent in sports. It also enables those with 1st class and 2rd upper degrees to up grade their education   MASAKA TOWN KAMPALA INTERNATIONAL UNIVERSITY
  • Empowering Uganda People  – Kulika Chair table trust is a united kingdom based organisation operating in Uganda. It has done great role in providing education scholarships and grants to Ugandan students and also helps on the ground sustainable agriculture trainings to farmers in Uganda in order to improve their out put per acre or piece of land Namarwe Kulika Chairtable trust
  • FIGTING ILLETRACY – Samona Products LTD This organisation owned my Salongo Samona which is popular and famous has done much in the masses of the whole Uganda by supporting very many shows as it supported the Enkuka of CBS Fm in 2010,it helps much in fighting illetracy by offering scholarship to children and equpping them with scholastic materials like bed sheets,books,pens,extra.It hav also made the young musicians to of set their goals and reach at the climax of their aims by inputing income so as to buy equipments like computers and produce songs Mengo SAMONA PRODUCTS LIMITED
  • Reform – I have a friend called Tony waigaijo, who has been a carjacker for many years.He came to community transformers having been fed up with what he does.He believed that the only way that he could be able to transform was through our initiatives and the programmes that community transformers under take.\without wasting time our director NICK omondi readily welcomed and called upon him to volunteer.He has been part of us for quite sometime until he benefited from a scholarship from community transformers and presently he is working with an NGO. MATHARE  Community Transformers
  • COMPUTER LESSONS  – Craft silicon foundation is an organisation that helps the less fortunate by teaching them computer packages for free. their classes are mobile since they are built in a big bus that moves from one area to another especially in the slums and rural area.It also offers scholarships to students who are qualified for IT courses  LOKICHOGIO craft silicon foundation

I personally find the leftovers in the gutter to be far more interesting. Each of these “least typical” stories touches on school fees or scholarships only tangentially; the focus is on something else, such as computer education, illiteracy, sports, reforming a criminal friend, or empowerment. If you are looking for a method that can best reveal what you were not expecting to find, this approach provides an effective filter. In later versions, I combined this with another kind of filter that compares the story diversity within one set of stories with the diversity found in all of the stories, to better pull out ideas.
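
Once every story has an average similarity score, pulling out the two ends of the curve is just a sort. A minimal sketch, assuming `stories` and `scores` are parallel lists (both names, and the function `extremes`, are illustrative):

```python
def extremes(stories, scores, k=10):
    """Return the k most typical and k least typical stories by score."""
    ranked = sorted(zip(scores, stories), key=lambda pair: pair[0], reverse=True)
    most_typical = [story for _, story in ranked[:k]]
    # Reverse the tail so the least typical story comes first.
    least_typical = [story for _, story in ranked[-k:][::-1]]
    return most_typical, least_typical


if __name__ == "__main__":
    top, bottom = extremes(["a", "b", "c", "d"], [0.4, 0.1, 0.9, 0.2], k=2)
    print(top)     # ['c', 'a']
    print(bottom)  # ['b', 'd']
```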

Relationship to Shannon Information Theory

In the Shannon Information Theory sense of the word, information is the opposite of redundancy. James Gleick writes:

Information is uncertainty, surprise, difficulty, and entropy. (The Information, p219)

By uncertainty, Gleick (paraphrasing Claude Shannon) means that a message with a lot of information will contain a lot of words that you did not expect and could not have predicted. Thus another way you could describe these “similarity scores” is that they measure how much of the story is redundant with other stories in the set. Voila! It makes sense that those stories with the least similarity to the rest might contain the more surprising and potentially revealing information. However, they are only surprising if you know what the normal story is like – and so you need to read some of the most “typical” stories first for context.
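
The link to Shannon can be made concrete with surprisal: a word that occurs with probability p in the set carries -log2(p) bits, so rare words are more “surprising.” The sketch below is an illustration of that idea only, not the similarity score used above. It assumes stories are already split into word lists and estimates p from raw word counts across the whole set; the function name `surprisal_scores` is hypothetical.

```python
import math
from collections import Counter


def surprisal_scores(stories):
    """Average bits per word for each story, given word frequencies in the set."""
    counts = Counter(word for story in stories for word in story)
    total = sum(counts.values())
    scores = []
    for story in stories:
        if not story:
            scores.append(0.0)
            continue
        # Every word in a story was counted above, so counts[word] >= 1.
        bits = [-math.log2(counts[word] / total) for word in story]
        scores.append(sum(bits) / len(bits))
    return scores
```

Under this measure a story built from words that few other stories use scores high, which mirrors the low-similarity end of the bell curve.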

If this approach works, you can get a pretty good idea of what these 1252 stories contain simply from reading the right 20 stories (ten “most typical” and ten “least typical”). Another way to guide your search is to convert these scores into a filter that can be used with Cognitive-Edge’s SenseMaker(R) as a scale of “REDUNDANCY vs UNIQUENESS”:

Story similarity as a filter in SenseMaker(R):

Now you can combine a measure of story redundancy with other information the storyteller provided, to find the stories that might help you learn the most:

This is what the story similarity bell curve looks like when converted to a scale in SenseMaker(R) (using 1252 stories about school fees or scholarships):

And this is what the data looks like when combined with answers to “This story describes a Good Idea that succeeded, or a Good Idea that failed, or a Bad idea (using a triad)”:

I’ve color coded the quartiles of these redundancy / uniqueness scores so you can see where the most typical stories cluster, and where the most unique stories cluster.
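
Assigning each story to a quartile for this kind of color coding can be sketched with Python's standard library. The cut points come from `statistics.quantiles` with its default method; the function name `quartile_labels` and the 1 to 4 labeling are illustrative assumptions.

```python
from statistics import quantiles


def quartile_labels(scores):
    """Label each score 1..4 by quartile (1 = lowest scores, 4 = highest)."""
    q1, q2, q3 = quantiles(scores, n=4)  # needs at least two scores

    def label(score):
        if score <= q1:
            return 1
        if score <= q2:
            return 2
        if score <= q3:
            return 3
        return 4

    return [label(s) for s in scores]


if __name__ == "__main__":
    print(quartile_labels([10, 20, 30, 40, 50, 60, 70, 80]))
    # [1, 1, 2, 2, 3, 3, 4, 4]
```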

And when you start to slice the data by age group, you can see that 31-45 year olds are more likely to tell somewhat unique stories:

Younger age groups are more likely to tell “typical” stories, with a lot of the same words other people use. Older people are even less likely to share stories about bad ideas or good ideas that failed (not shown).

This “story similarity” scale can be combined with another derived scale of whether the story was about success or failure:

I hope this illustrates the power of combining natural language processing with “signification frameworks” to point information searchers to exactly what they are looking for.

This idea is continued on: The automatic self-bias detector

