Real NGO Attribution & Complexities in Automating It

Over the past year we’ve collected more than 36,000 stories. Each one has an attribution box for the storyteller to fill in:


In those 36,636 stories we received 12,974 unique answers to “who led the community effort in your story?” Over the past year, manual checking and correcting of these answers allowed us to reduce that number to a mere 930 organizations in Kenya and Uganda. Here are the top 100 in order of decreasing frequency of appearance:

No Answer (7263), Individual (5643), Amref (996), Government (850), School (654), Community (495), Red Cross (483), Swim, Unnamed Organization, None, Tysa, Naads, Plan International Kenya, Un, Wewasafo, Sita Kimya, Taso, Carolina For Kibera, World Bank, Usaid, Unicef, World Vision, Equity Bank, Police, Church, Green Belt Movement, Western Hiv/Aids Network, Wfp, Compassion, Concern, Family, Care, Cdc, Uganda Cares, Christian Union, Rakai Health Science Program, Who, Goal, Nacada, Peace Makers Development Group, Cdf, Fida, African Union, Cfca, Kazi Mashambani Development Project (Kamadep), Safaricom Company, Kplc, Brac, Kitovu Mobile, Plan International, Uweso, Nema Organisation, News Media, Mpira Mtaani, Maddo, Covit, Kigulu, Adra, Kazi Kwa Vijana, Aphia 2, Rotary Club, Pace, Caritas, Magoso Primary School, Global Giving Story Project, Sapta, Bucadef, Kwft, Neema Women Group, Hamlet, Feed The Children, Pap, Chedra, Binti Pamoja, Doctors Without Borders, Co-Operative Bank, Hope, European Union, Tulina Omubeezi, Kibera Youth Group, Yesu Akwagala, Mrc, Vi Agro Forestry, Dominion, Catholic Church, Child Care, Lea Toto, Fao, Upe, Wfo, Kacc, Vct, Caitas, Tuungane Tusaidiane Organisation, Knhcr, Solidares, Peepoo, Uno, Resso, Orphans, Aid Child…

The two most common answers are a significant indicator of who is really responsible for the important things happening (or not happening) in these communities: 7,263 (20%) of the stories are not about any organization at all, and 15% are about individuals. After that, AMREF, the African Medical and Research Foundation (www.amref.org), is the first NGO to appear, with nearly 1,000 stories (2.7% of all stories). I am guessing that about 3,000 stories are related to GlobalGiving partner organizations (8%), but to get an exact number, we would need a very precise method for matching stories to organizations. The intuitive method (if someone writes the name of an organization in the box, match that text to the name in GlobalGiving’s database) doesn’t work. Even fuzzy organization-name matching assigns only 674 stories to an organization (and only 8 of these are AMREF stories, though we know 996 stories are about AMREF). The problem is that the storyteller will write “Amref” but Amref calls itself “African Medical and Research Foundation.” A computer doesn’t equate acronyms and aliases with the real thing – not yet. World Vision appears to be the only name used consistently by everyone. Even the Red Cross is often called “Kenya Red Cross”, “Red Cross Foundation”, “Red Cross International”, or “Kenyans for Kenya.”
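To make the acronym problem concrete, here is a minimal sketch using Python’s difflib as a stand-in for whatever fuzzy matcher you prefer (the exact matcher I use differs, so the numbers are illustrative). An acronym shares almost no characters with the full name, so no sane threshold can recover it:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Plain character-level similarity, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

similarity("Amref", "African Medical and Research Foundation")
# ~0.22 -- far below any sane match threshold, so the acronym is lost

similarity("Kenya Red Cross", "Red Cross")
# ~0.75 -- closer, but still fails a strict 7/8 (0.875) cutoff
```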

So of course we’re building a list of all the aliases for each organization – but don’t expect that to be easy. In the meantime, here are some algorithmic tricks I’m using to get the vast majority of stories assigned to the closest possible project on GlobalGiving.
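That alias list can start as a simple lookup table consulted before any fuzzy matching. A minimal sketch, seeded only with the aliases already mentioned in this post (the real table will be far larger):

```python
# A hand-curated alias table, checked before any fuzzy matching.
# Entries here are just the examples from this post, not our full list.
ALIASES = {
    "amref": "African Medical and Research Foundation",
    "african medical research foundation": "African Medical and Research Foundation",
    "kenya red cross": "Red Cross",
    "red cross foundation": "Red Cross",
    "red cross international": "Red Cross",
    "kenyans for kenya": "Red Cross",
}

def canonical_name(raw):
    """Resolve a storyteller's answer to a canonical org name, if known."""
    return ALIASES.get(raw.strip().lower(), raw.strip())
```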

Step #1: Match all stories that lack an organization name in the attribution box to relevant projects by topic.

Relevance depends on how similar the story text is to the project description on GlobalGiving:

Topical Matching (strip, rank, phrase, drop, compare, and h-index it)

The matching algorithm isn’t too crazy. First, you strip out all the common words. Second, you rank the remaining words by importance: important words appear infrequently in stories, because if a rare word appears in one story and one project, you can be fairly sure that the story and the project are quite related. Rare words are often about specific topics, so we weight them heavily during the matching. Third, you add all common two-word phrases into the set of story and project words being matched against each other. “World Vision” has a far different meaning than the words “world” and “vision” appearing separately in the same story, so adding two-word phrases to the matching machine helps a lot. Fourth, you strip out all the words that appear only once across all stories, since they won’t match any other stories anyway. If you want to speed this up you could also remove words that appear only twice or three times, but some of those word matches could be important, so I leave them in.
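Here is a minimal sketch of that preprocessing. The stopword list is truncated for brevity, and the rarity weight (document count divided by document frequency) is my own assumption about a reasonable weighting; the post only says rare words are weighted heavily:

```python
from collections import Counter
import re

# A tiny stand-in stopword list; the real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for",
             "on", "that", "with", "was", "are", "it", "this"}

def tokenize(text):
    """Lowercase and split into words."""
    return re.findall(r"[a-z']+", text.lower())

def features(text):
    """Content words plus all two-word phrases, so 'world vision'
    survives as a unit, distinct from 'world' and 'vision'."""
    words = [w for w in tokenize(text) if w not in STOPWORDS]
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]

def build_weights(documents):
    """Weight features by rarity, and drop those seen only once,
    since a feature in a single document can never link two texts."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(features(doc)))
    n = len(documents)
    return {f: n / df for f, df in doc_freq.items() if df > 1}
```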

Next, you actually match stories and projects. The text of each story is compared with the text of every project and scored. I use the idea behind the h-index to set the threshold for how many stories are saved to each project: the top 9 stories are saved if each shares at least 9 words or phrases with the project, or the top 10 stories if each shares at least 10 words, and so on. In practice, the threshold organically settles at about 8 stories / 8 matching words per project, a result influenced by the typical word length of our thousands of stories and NGO project descriptions.
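A sketch of that cutoff, applying the h-index logic to the per-story counts of shared words and phrases (the example counts are invented to show the threshold settling at 8):

```python
def h_threshold(match_counts):
    """h-index-style cutoff: the largest h such that at least h
    stories each share at least h words/phrases with the project."""
    counts = sorted(match_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Example: shared-feature counts for one project's candidate stories.
shared = [14, 12, 11, 9, 9, 8, 8, 8, 7, 3]
h = h_threshold(shared)                      # -> 8
top_stories = [c for c in shared if c >= h]  # the 8 stories to save
```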

Matching in this way – by topical relevance – yields 2,585 projects with at least 8 stories matched to each. 28,313 of the stories were matched this way, leaving 606 that didn’t match anywhere.

As an alternative matching method, I pulled in a non-overlapping set of stories where a specific organization was named. Here, only 674 stories could be unambiguously connected to a GlobalGiving partner; I then used the topic match to assign each story to that organization’s most relevant project. This leaves thousands of stories that are about a specific NGO but can’t be matched algorithmically by organization name alone (but see my note at the end):


We are going to need to build a massive list of aliases and acronyms to definitively match the remaining gray slice of stories (21% of the total). That’s a manual process. I am also interested in using geographical proximity to further refine our topical matching for organizations operating in Kenya and Uganda (I estimate this will help with about 20% of stories). One way that refinement might look is sketched below.
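A minimal sketch, assuming each story and project carries a (latitude, longitude) pair; the haversine formula is standard, but the 50 km cutoff and the 1.5x boost are placeholders I invented for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def km_apart(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def geo_boost(topic_score, story_loc, project_loc, cutoff_km=50):
    """Nudge the topical score up when story and project are close by.
    The cutoff and multiplier are untuned placeholder values."""
    if story_loc and project_loc and km_apart(*story_loc, *project_loc) < cutoff_km:
        return topic_score * 1.5
    return topic_score
```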

What emerges is a far different picture of our global civil society than what traditional evaluations would present:

  • First, most stories are not about well-known organizations. The average person in a community either doesn’t know about these organizations or chooses not to talk about them.
  • Second, inductive attribution to organizations is messy but eventually works. (By inductive, I mean letting the storyteller choose the organization, so that communities’ combined stories determine which organizations this method evaluates.) Some of our partner organizations were missed by this method, but the fraction missed is starting to decrease. (I’ll have a precise answer once we process the gray slice of stories.) Note: I’m not even considering stories where multiple organizations are mentioned, though we know there are a lot of these. Attribution is much more complex than, say, geo-mapping.
  • Third, most of the stories about GlobalGiving partners fall in the gray (ambiguous) and striped (topic/geo relevance) categories.
  • Fourth, stories about individuals making a difference are important. I’m happy that these will soon appear in front of GlobalGiving project leaders all over the world; they could help others learn which little things make a difference, as well as what the priorities are across communities.
  • Lastly, people in organizations everywhere can find value in reading 8 well-matched stories. This is a kind of continuous monitoring that can yield data with a longer shelf life for a greater number of people, and we have only just begun to figure out how to share “what” with “whom,” “when,” and “where.”

Additional tricks for matching by organization name

With my conservative criteria, at least 7 of every 8 letters needed to match in the organization name. So “Hope Orphanage” doesn’t match “HOPTEC Orphanage”, but “Plan International” matches “Plan International USA” (just barely). I was able to increase the number of matches from 674 to 1,012 by lowering the match criteria as the organization name gets longer. This has the odd effect that “Red Cross Society” will match “Red Cross”, but “Red Cross” will not match “Red Cross Society.” And there will always be false positives until we have a rather complete list of organizations, along with a complete list of acronyms and aliases for every organization. Worldwide, that would be about 4 million NGOs and 13 million aliases! Even then, misspellings are common – much more common than for place names. I can match about 90% of all locations this way, but only about a third of all named organizations. It should be clear now why I put more faith in topical matching (where organization names are included with stories and treated as unique, seldom-used phrases) than in org-name matching.
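Here is roughly how such a length-relaxed criterion behaves, again using difflib as a stand-in for my actual matcher. The 7/8 baseline comes from the criterion above, but the 0.03-per-letter relaxation is an invented illustration; it does, however, reproduce the Red Cross asymmetry just described:

```python
from difflib import SequenceMatcher
import re

def normalize(name):
    """Lowercase and strip everything but letters."""
    return re.sub(r"[^a-z]", "", name.lower())

def name_match(query, candidate):
    """Fuzzy org-name match whose threshold relaxes as the query
    name gets longer (0.03 per letter beyond 8 is a made-up rate)."""
    a, b = normalize(query), normalize(candidate)
    ratio = SequenceMatcher(None, a, b).ratio()
    threshold = 7 / 8 - 0.03 * max(0, len(a) - 8)
    return ratio >= threshold

name_match("Red Cross Society", "Red Cross")  # True: long query, relaxed threshold
name_match("Red Cross", "Red Cross Society")  # False: short query keeps strict 7/8
```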

In case it wasn’t clear – nearly all of these stories will soon appear on related GlobalGiving project pages, and in a special “dashboard” that project leaders see when they check for new donations. We hope that giving everybody something fresh and relevant to think about can trigger more curiosity about the nature of the complex problems they struggle to solve – and some will even use our do-it-yourself storytelling kit locally.

