On Friday, Feb 15, 2013 I attended an interesting talk at the Center for Global Development, a Washington think tank just down the street from GlobalGiving, where I work:
(The speaker was Jean Ensminger, Edie and Lew Wasserman Professor of Social Science at the California Institute of Technology.)
Unlike other people, when I get excited about a good talk, I go home and build tools to exploit the possibilities. Jean Ensminger showed that one can detect fraud within 10,000 pages of World Bank financial reporting using just simple algorithms. The most revealing one is Benford’s Law, which shows that any time people count real objects (money, goats, bribes, etc.), the leading digits fall into a predictable logarithmic distribution, because you need to count 1 before you can count 2, and so on. The leading digits in real data look like this:
The probability that a number in a financial document starts with a 1 is 30%. 17.6% of numbers will start with a 2, and so on. So I built a tool that can instantly calculate and display a map of this data from any document, along with catching a bunch of other tricks that fraudsters use. Another fraud-prediction trick I especially love is the laziness that humans have in using convenient finger patterns on keypads to enter repetitive data, shown here:
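Benford’s expected frequency for a leading digit d is log10(1 + 1/d), which is where the 30% and 17.6% figures come from. Here is a minimal sketch of the check (my own illustration, not the tool’s actual code):

```python
import math
from collections import Counter

def benford_expected(digit):
    """Expected leading-digit frequency under Benford's Law: log10(1 + 1/d)."""
    return math.log10(1 + 1 / digit)

def leading_digit_frequencies(numbers):
    """Observed frequency of each leading digit 1-9 in a list of positive numbers."""
    digits = [int(str(abs(n)).lstrip("0.")[0]) for n in numbers if n != 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Benford predicts ~30.1% of numbers lead with a 1 and ~17.6% with a 2:
print(round(benford_expected(1), 3))  # 0.301
print(round(benford_expected(2), 3))  # 0.176
```

Comparing the observed frequencies against `benford_expected` for each digit is the heart of the diagnostic.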
From the fastest way to crack an ATM PIN:
26.83 percent of PINs can be guessed using the top 20 combinations. Those 20 would account for only 0.2 percent of PINs if they were randomly distributed:
Based on this pattern, I have it check for all horizontal, vertical, diagonal, and corner-zig-zag number combinations that make up parts of larger numbers. I also check for frequently repeated numbers and numbers that look rounded. I left out the pattern of numbers starting with 197x… 198x… 199x…, as these are birth years specific to passwords and unlikely to be part of a pattern someone uses when making up numbers in financial reports.
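The keypad check can be sketched like this, assuming a standard 3×3 numeric keypad with 0 below (the actual rules in the tool may differ):

```python
# Hypothetical sketch of keypad-pattern detection on a standard keypad:
#   1 2 3
#   4 5 6
#   7 8 9
ROWS = ["123", "456", "789"]     # horizontal finger runs
COLS = ["147", "258", "369"]     # vertical finger runs
DIAGS = ["159", "357"]           # diagonal finger runs

def keypad_sequences():
    """All 3-digit keypad runs, in both finger directions."""
    seqs = set()
    for s in ROWS + COLS + DIAGS:
        seqs.add(s)
        seqs.add(s[::-1])  # the reversed motion is just as convenient
    return seqs

def looks_keypad_biased(number_string):
    """True if any 3-digit keypad run appears inside the number."""
    return any(seq in number_string for seq in keypad_sequences())

print(looks_keypad_biased("4561200"))  # True: contains "456"
print(looks_keypad_biased("8072"))     # False: no keypad run
```

A real scorer would count how often such runs appear across a whole document and compare that rate to what random digits would produce.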
The crux of why this works is that people try too hard to make fake data look random, when in fact, real data is far less random.
Examples of instant heuristic auditing:
The tool is really simple. Paste a bunch of data from a PDF, spreadsheet, or Word document into the box and hit CHECK. Don’t worry about the text or columns or whatever – the algorithm will ignore everything but the numbers:
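The extraction step can be as simple as a regular expression that pulls every number out of the pasted text, treating comma-grouped digits (e.g. 11,452) as a single number. This is a sketch of the idea, not the tool’s actual code:

```python
import re

# Match comma-grouped numbers first (11,452), then plain integers/decimals.
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?")

def extract_numbers(text):
    """Return every number found in free-form pasted text, as floats."""
    return [float(m.replace(",", "")) for m in NUMBER_RE.findall(text)]

print(extract_numbers("Paid KES 11,452 on 3 invoices (avg 3817.33)"))
# [11452.0, 3.0, 3817.33]
```

Putting the comma-grouped alternative first in the pattern keeps a figure like 11,452 from being split into 11 and 452.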
CAVEAT: Jean Ensminger was careful to point out that this approach is only a diagnostic tool, and not proof of fraud. Furthermore, only invoices and other spreadsheets of the day-to-day expenses will conform to Benford’s Law, and not summaries of these, or expenses that are arbitrarily constrained (see below).
Kenya’s Constituencies Development Fund (40,693 CDF projects)
Kenya’s politicians all get a legal pork fund from which they fund local projects. Accusations of double-dipping are rampant, but are they true?
And the result? The flattened 1s and 2s look bad, but we can’t say it is fraud. It turns out that these 40,000 CDF line items are constrained by the allowances each member of parliament gets. I’ve been told to expect this kind of flattened distribution when expenses are constrained, and you can read more on why here.
What this does reveal is that 200,000 and 300,000 are among the most common numbers, representing 39% and 30% of all line items respectively. At the very least, this is some very lazy reporting, as Kenyan MPs allocated the maximum to projects instead of breaking it up to serve many more needs. This fund represented $14 billion of Kenya’s public budget in 2011.
My source is Kenya’s OpenData Project: https://opendata.go.ke/Public-Finance/CDF-Projects-2003-2010/6rxd-cfvr
Kenya School statistics (31,230 schools)
I also pasted all data for all secondary school statistics into the djotjog instant heuristic auditor:
The columns included: Pupil Teacher Ratio, Pupil Classroom Ratio, Pupil Toilet Ratio, Total Number of Classrooms, Boys Toilets, Girls Toilets, Teachers Toilets, Total Toilets, Total Boys, Total Girls, Total Enrolment, GOK TSC Male, GOK TSC Female, Local Authority Male, Local Authority Female, PTA BOG Male, PTA BOG Female, Others Male, Others Female, Non-Teaching Staff Male, and Non-Teaching Staff Female.
And the results look much better!
Let me walk through this example, because it represents a huge volume of data (31,230 schools):
- The top five most common numbers are 0, 1, 2, 3, and 4 – which is to be expected from real data. In fact, half of all numbers are zeroes. I guess reported school enrollment is not very high, but at least the numbers don’t lie.
- Very few of the numbers appear to be estimated, keypad-biased, or duplicates.
- Less than 1% of numbers are repeated.
- The total deviation is the sum of the absolute differences between the ideal and actual frequencies for each leading digit. Scores around 20-30 seem typical of the financial documents I tested, and this dataset comes in at 15 – lower is better.
- I realized I pasted some columns that are percents instead of raw numbers. This data does not conform to Benford’s law, so the actual raw numbers on school enrollment are probably even better.
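The total-deviation score described above can be sketched as follows; the tool’s exact scoring may differ in detail:

```python
import math
from collections import Counter

def total_deviation(numbers):
    """Sum of |ideal - actual| leading-digit percentages vs. Benford's Law.

    A perfectly Benford-like sample scores near 0; higher is worse.
    """
    digits = [int(str(n)[0]) for n in numbers if str(n)[0] in "123456789"]
    total = len(digits)
    if total == 0:
        return 0.0  # nothing to score
    counts = Counter(digits)
    score = 0.0
    for d in range(1, 10):
        ideal = 100 * math.log10(1 + 1 / d)       # ideal percentage for digit d
        actual = 100 * counts.get(d, 0) / total   # observed percentage
        score += abs(ideal - actual)
    return score
```

By this measure, a dataset where every number starts with the same digit scores around 140, while the school data above scored 15.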
Invoices for the GlobalGiving Storytelling Project:
Transparency starts at home, eh?
Here is an invoice from Moses, our Ugandan story coordinator from 2011-2012. He checks out okay.
And here are all 7 of my invoices for the storytelling project in 2011. They check out as well. (phew!)
Note to self: I could tell from the numbers my algorithm was pulling out that it was breaking up numbers like 11,452 and 5,523 into two numbers each. I’ve fixed that now on http://djotjog.com/audit/.
Auditing random GlobalGiving Projects
Below are 5 organizations that applied to join GlobalGiving. Two of these were flagged as suspicious (disreputable) organizations and possible frauds. The other three are current partner organizations. I think you can tell which of the five has a clearly deceptive accounting trick in play. Note that in this case I am looking at the actual amounts an organization claims to have spent on various expenses in a year, not their projected future budgets.
Implications of heuristic auditing for the world:
- Real-time diagnostics – while it can’t prove fraud, it tells you if something is fishy before you pay an invoice.
- These heuristic audits took less than 2 minutes each – 1 minute to find the data online, 10 seconds for the python algorithm to run, and 50 seconds to paste a screenshot into this blog post.
- Instant bullshit-detection can allow program managers to screen financial data in real time, thereby avoiding paying out money at the first sign of a red flag.
- Open financial data is now instantly actionable. It took the Kenya OpenData Project months to get this data public (thanks to Eric Hersman and other advocates), but mostly it just sat there in a large, obscure spreadsheet. Now, the larger the data set, the more reliable the audit.
- Concise reporting is safer: This inverts the traditional “bury ’em with data” strategy for avoiding getting caught. It is much harder to fake a large report, because computers can find patterns without reading the data itself.
- Citizens have the power to detect corruption on a large scale. If given 500 financial documents, we can screen them all in a weekend.
- The good guys have more power within institutions to stamp out fraud. According to Jean Ensminger, the World Bank’s INT (Integrity Vice Presidency) has had only four trained forensic accountants on staff at any given time in recent years. These are the people who can do this kind of audit (the old fashioned way), but they only audit projects that have a “credible complaint and suspicion of corruption” – which is only a small fraction of all the billions of dollars that the World Bank disburses each year. This tool can help every part of the World Bank (and GlobalGiving) catch fraud on a small scale before it gets to be as rampant as that seen in the Arid Lands Project.
- Soon I’ll publish an analogous version that does this with language, narratives, and “qualitative” reports. I’ve been working on it for 6 months, and we’re close to being able to launch the first version for public use. I’ve described a preliminary version here.
- Context becomes important after you see a suspicious pattern: There are legitimate reasons why anomalies occur in datasets, so it helps to ask more questions. This information is dangerous (undermines the credibility of people who fight corruption) when others who don’t understand how the tool works try to apply it out of context:
Example: Are principals lying when they report the number of classrooms in their schools in Kenya?
Here the Benford analysis is correctly applied to this statistic, and N is large (31,230 primary schools), and yet the distribution shows a preponderance of just a few numbers: 8, 9, 12, and 16.
The context here matters. Why do 28% of all primary schools appear to report 8 classrooms? Because classroom blocks are built to house an even number of classrooms, and the most common design features 8 classrooms. The same applies to schools with 12 and 16 classrooms. What’s truly surprising is the high frequency of 9-classroom schools. It might be because schools run from grade 1 to 9, or because an 8-classroom school is overpopulated and has set up a ninth “makeshift” classroom. Context matters. But numbers do too.
Now, project managers, go forth and audit yourselves. Your proactive vigilance can make corruption harder to get away with.
The tool is free – just use it! Click on the image to load it.
Since I posted this, the World Bank hosted a DataKind Hackathon where we improved the tool. Here is a presentation about it:
And my follow up post to this: Turning victims of fraud into agents of change