Standards Utopia or a Beautiful Soup Universe?

For years the UN has been funding an internal agency called UN Global Pulse to coordinate information flowing among all its member agencies. The UN organizational chart is daunting:

[Image: UN organizational chart]

UN Global Pulse is charged with the task of managing data for all of these agencies, getting them to share data with each other, and ensuring that data supports decision-making.

UN Global Pulse recently released a report, A World That Counts: Mobilizing the data revolution for global development. It outlines their vision for how to solve the data problem within the UN and the development world.

[Image: cover of A World That Counts, UN Global Pulse]

How UN Global Pulse intends to fix the data problem:

  1. Develop a global consensus on principles and standards.
  2. Share technology and innovations for the common good: create a global “Network of Data Innovation Networks” to bring together the organisations and experts in the field.
  3. New resources for capacity development: a new funding stream to support the data revolution for sustainable development should be endorsed at the “Third International Conference on Financing for Development” in Addis Ababa in July 2015.
  4. Leadership for coordination and mobilisation: start a “World Forum on Sustainable Development Data” and a “Global Users Forum for Data for Sustainable Development Goals (SDGs)”, and broker key global public-private partnerships for data sharing.
  5. Exploit some quick wins on SDG data: establish an “SDGs data lab” to support the development of a first wave of SDG indicators, develop an SDG analysis and visualisation platform using the most advanced tools and features for exploring data, and build a dashboard from diverse data sources on “the state of the world”.

I think their plan is folly. Agreeing on a set of standards doesn’t actually lead to standardized content. More groups and meetings won’t necessarily lead to more coordination. And their description of a “quick win” under item #5 sounds like one of the more complex tasks the UN has ever undertaken. Fixing the measurement, analysis, and visualization problem is not a “quick win”; it’s the whole ball game.

Consider the Internet

The Internet is truly global, and we have both “common standards” and their antithesis, BeautifulSoup, to thank. HTML standards are set by the W3C, the World Wide Web Consortium (w3.org), but they are not enforced and are often ignored. Browsers, and the programmers who must make web content understandable to billions of people, decide which parts of the W3C standards matter. Microsoft’s IE browser interprets the same HTML pages somewhat differently from the rest (Chrome, Safari, Opera, Firefox), and its share of the browser market shrinks every year as a result. People don’t like the way IE interprets the Internet, so they switch browsers. The standards are suggestions; heuristic code living inside browsers defines what the Internet actually looks like.
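To make that concrete, here is a toy sketch of my own (standard library only, nothing from the W3C) of the difference between enforcing a standard and interpreting content heuristically. The markup below is invalid because the <b> tag is never closed; a strict XML parser refuses it, while a forgiving, browser-style parser just keeps going:

from html.parser import HTMLParser
from xml.etree import ElementTree

broken = "<p>Standards are <b>suggestions</p>"  # invalid: <b> is never closed

try:
    ElementTree.fromstring(broken)   # strict, standards-enforcing parser
except ElementTree.ParseError as err:
    print("strict parser refused it:", err)

class TextCollector(HTMLParser):     # forgiving, browser-style parser
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)     # keep whatever text we can find

collector = TextCollector()
collector.feed(broken)               # no complaints about the unclosed <b>
print("".join(collector.chunks))     # "Standards are suggestions"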

If the Internet is to be a model for international development, then the lesson is that smarter code, and not smarter standards, is the solution.

The people who define the rules are rarely the people who must fit messy real-world content into software and websites for people to use. The gatekeepers are people like me, modest programmers with an immediate problem to solve and disdain for standards that slow down the work. I am aware of the current standard data format for international aid work, called IATI, but I generally ignore it. Practically none of the data *I NEED* is already in that format, and the format makes the data harder to navigate than the simpler formats I reach for instead, such as JSON.
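For a sense of the gap, here is a hedged sketch of the workflow I mean; the URL is a placeholder, not a real endpoint, and the field names are invented:

import requests

resp = requests.get("https://example.org/api/projects.json")   # hypothetical JSON endpoint
projects = resp.json()                                          # arrives as plain lists and dicts
kenya = [p for p in projects if p.get("country") == "Kenya"]    # filter in one line

Getting the same answer out of an IATI XML file means walking an element tree and knowing the schema before I can ask even that first question, which is exactly the friction that keeps me reaching for JSON.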

The Beautiful Soup approach

BeautifulSoup is a Python module built for people like me. It parses non-standard and even broken HTML with about a 98 percent success rate. This example reads the HTML of this blog post into a machine-readable form:

from bs4 import BeautifulSoup
import requests

html = requests.get('http://wp.me/plX0C-1qp')  # this blog post!
soup = BeautifulSoup(html.text, 'html.parser')  # name a parser explicitly; bs4 warns if you don't
paragraphs = [p.get_text() for p in soup.find_all('p')]  # text of every paragraph of this blog in a list

Where standards have failed, BeautifulSoup prevails. Where people have been “doing their own thing” all over the Internet, BeautifulSoup is the Rosetta Stone for reading HTML pages in every language, no exceptions. I suspect that every web browser has a section of code that works the way BeautifulSoup does. Its philosophy is “try this, then if it fails, try the next thing, and so on.” It even contains a component called UnicodeDammit that exists because yet another standards-setting body failed to get uniform adoption of its rules. Unicode is how non-English characters get stored in documents without being lost. Character encodings can be a headache for programmers to handle, and errors in encoding can lead to permanent data loss (unrecoverable gibberish). BeautifulSoup can sometimes read this gibberish using heuristics and a working knowledge of the most common errors that produce it in the first place.
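Here is roughly what that looks like in practice, a small sketch with made-up input bytes:

from bs4 import UnicodeDammit

raw = "Côte d'Ivoire".encode("latin-1")   # bytes with no declared encoding
dammit = UnicodeDammit(raw)               # tries encodings until one decodes cleanly
print(dammit.unicode_markup)              # the repaired text: "Côte d'Ivoire"
print(dammit.original_encoding)           # the encoding it settled on, e.g. "windows-1252"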

I believe in BeautifulSoup. It works. I’m doubtful a UN agency can beat this approach. So if UN Global Pulse wants to make headway, they can write standards for data, but they will also need to invest in Beautiful Soup-style solutions to the problem. These solutions include heuristic functions (one is sketched below), genetic algorithms, and web-content or legacy-document restructuring. These approaches move us closer to the Pythonic way of improving the lives of people around the world.
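As a sketch of what a “heuristic function” could mean here (a hypothetical example of mine, not anything UN Global Pulse has proposed): a loader that tries the cleanest interpretation of a messy file first and falls back to progressively more forgiving ones instead of rejecting the data.

import csv
import io
import json

def read_messy_records(raw_text):
    """Return a list of records from text that might be JSON, CSV, or worse."""
    try:
        data = json.loads(raw_text)                           # best case: valid JSON
        return data if isinstance(data, list) else [data]
    except ValueError:
        pass
    try:
        rows = list(csv.DictReader(io.StringIO(raw_text)))    # next guess: CSV with a header row
        if rows:
            return rows
    except csv.Error:
        pass
    return [{"raw": line} for line in raw_text.splitlines() if line.strip()]  # last resort: keep the raw lines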

The future is quasi-unstructured data and the path looks Beautiful (Soup).
