The future of big data is quasi-unstructured

I’ve previously talked about using Trello as a free project management tool. Today I found a natural tool to complement it in RescueTime, which tracks what you are doing on your computer without requiring any intelligent input from you, the user:

[Screenshot: RescueTime dashboard, hour-by-hour breakdown for Friday]

Note that the effect of March Madness on my programming is not as severe as the effect of taking naps during the day.

RescueTime figures out what you are doing on your computer as soon as you turn it on and produces useful graphs to track individual worker productivity.

Integrating this with Trello:

[Screenshot: Heuristic Social Auditing board in Trello]

Yields a very powerful way to track projects and all work done on computers without the burden of structuring the input or even asking the user to do any input at all!

I routinely hire freelancers to work on projects. And when I have a full-time freelancer doing a project, I expect to see 8 hours a day devoted to tasks that logically line up with that project, such as software development (the largest bar on my RescueTime chart). This tool enables me to evaluate and manage people even when the tasks are hard and complex – and don’t yield immediate results.

The information strategy here is a core part of the future “big data” revolution. Here’s why:

In the future, the most useful data will be the kind that was too unstructured to be used in the past. Algorithms that “wrap” many different kinds of structured data together (e.g. APIs for popular sites like Twitter) or apply a structure to disorganized content (e.g. Python’s BeautifulSoup module) are going to make most data easier to exchange. For example, I just built a heuristic auditing tool which accepts any kind of data and yields a report. (Test it at
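As a small sketch of the “apply structure to disorganized content” idea: the post mentions BeautifulSoup, but the example below uses Python’s built-in html.parser so it is self-contained. It pulls structured (text, link) records out of messy, even unclosed, HTML.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Pull (text, href) pairs out of messy, unstructured HTML."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

# Even sloppy markup with a missing closing tag yields structured records:
parser = LinkExtractor()
parser.feed("<p>See <a href='/one'>first</a> and <a href='/two'>second</a>")
print(parser.links)  # [('first', '/one'), ('second', '/two')]
```

BeautifulSoup offers a much richer version of the same move: take content with no usable structure and hand back something a program can query.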

In more abstract language, I am saying:

The future of Big Data is neither structured nor unstructured. Big Data will be structured by intuitive methods (e.g. “genetic algorithms”) or by inherent patterns that emerge from the data itself, not by rules imposed on data sets by humans.

Big Data means:

Information sets that approach the size of all information known about “X”. For example, instead of a sample of e-books, it means a comprehensive set of all e-books ever written (~70% to N=ALL). Big Data sets are noisier, yet they do not require us to know beforehand what questions we will pose. We can drill down into Big Data sets and ask arbitrary questions. This is a complementary method to statistics, which relies on random sampling to eliminate bias. Big Data instead assumes bias and quantifies the biases in the data set, so that they can be detected, inspected, and corrected.

Genetic Algorithms:

Seek to “evolve” a computational solution to a problem in a manner similar to how biological systems evolve over generations. The approach requires the problem to be characterized and encoded as a set of rules in a game-like program. It also requires that any candidate solution can be scored against other solutions, so that the best solutions from each population can be selected, mutated, and “mated” with each other to generate new solutions for testing in subsequent generations. See examples with goats, robots, and the Mona Lisa.
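To make that concrete, here is a toy genetic algorithm in Python that evolves a random string toward a target, in the spirit of the Mona Lisa example. The fitness function, population size, and mutation rate are arbitrary choices of mine, not taken from any of the linked examples.

```python
import random

TARGET = "big data"
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def score(candidate):
    # Fitness: how many characters already match the target.
    return sum(a == b for a, b in zip(candidate, TARGET))

def mutate(parent, rate=0.1):
    # Randomly rewrite ~10% of the characters.
    return "".join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in parent)

def crossover(a, b):
    # "Mate" two parents by splicing them at a random cut point.
    cut = random.randrange(len(TARGET))
    return a[:cut] + b[cut:]

random.seed(0)
population = ["".join(random.choice(ALPHABET) for _ in TARGET)
              for _ in range(100)]
for generation in range(1000):
    population.sort(key=score, reverse=True)
    if score(population[0]) == len(TARGET):
        break
    survivors = population[:20]  # select the fittest fifth
    population = survivors + [
        mutate(crossover(random.choice(survivors), random.choice(survivors)))
        for _ in range(80)
    ]
print(generation, max(population, key=score))
```

The loop is the whole idea: score, select, mate, mutate, repeat until a solution is good enough.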

Intuitive Algorithms:

Intuitive algorithms play a guessing game with possible ways to structure a data set, and iterate on the result until the structure is good enough.
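A minimal sketch of that guess-and-refine loop, using the simplest case I can think of (grouping an unlabeled list of numbers into k clumps, in the style of k-means; the data and parameters are made up):

```python
import random

def kmeans_1d(values, k=2, iterations=20):
    """Guess k centers, then repeatedly refine them until the
    grouping stops changing -- i.e. the structure is 'good enough'."""
    centers = random.sample(values, k)
    for _ in range(iterations):
        # Assign every value to its nearest current center.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Refine: move each center to the mean of its group.
        new_centers = [sum(g) / len(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:  # structure has settled
            break
        centers = new_centers
    return sorted(centers)

random.seed(1)
# Two obvious clumps hiding in an unlabeled list of numbers:
data = [1.0, 1.2, 0.9, 1.1, 9.8, 10.1, 10.0, 9.9]
print(kmeans_1d(data))  # roughly [1.05, 9.95]
```

No human told the algorithm where the clumps were; it guessed, applied a rule, and refined until the structure stopped moving.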


Emergence is the way complex systems and patterns arise out of a multiplicity of relatively simple interactions. The sum is different from (or at least difficult to predict from) its component parts. Meaningful data structures based on emergence are hard to develop with existing programming techniques, but intuitive algorithms and genetic/evolutionary approaches will likely make emergence structuring much more feasible in the near future.

The degree of “structure” in data sets lies along a scale, and these approaches yield different results depending on where a data set falls on it:

Types of quasi-structured data and examples of each

  • totally unstructured data — Google search results cover all websites, but are hard to categorize further without access to the Google database itself
  • intuitive structure — my wordtree algorithm accepts any pasted text and yields a network map based on similarity of language within the text, as well as proximity of words to each other. But the content is not “tagged” the way YouTube and Flickr tag images
  • emergent structure — algorithms that extract the main idea from groups of stories
  • pseudo-structuring — looking at content and assigning structure to all possible variations of a single document type, as I did with the auditing tool
  • guess, apply a rule, and refine — in this mode the algorithm tries an approach and refines it iteratively based on user feedback. If the feedback is automated in the form of a score on the result, this approach becomes evolutionary programming.

(I am still figuring out how to describe this – so some of these above examples may be the same thing.)
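As one concrete illustration of the “intuitive structure” bullet above: here is a toy word-proximity counter in Python. It is not my wordtree algorithm, just the simplest possible sketch of structuring raw text by how close words sit to each other, with no tags required.

```python
from collections import Counter

def proximity_network(text, window=3):
    """Count how often each pair of words appears near each other.
    The counts are the edges of a crude word-proximity network."""
    words = text.lower().split()
    pairs = Counter()
    for i in range(len(words)):
        # Look only at words within `window` positions of each other.
        for j in range(i + 1, min(i + window, len(words))):
            pairs[tuple(sorted((words[i], words[j])))] += 1
    return pairs

text = "big data needs big tools and big data needs structure"
net = proximity_network(text)
print(net.most_common(2))
```

Paste in any text and the same code yields a network map; the structure comes from the text itself, not from a schema someone designed for it.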

These strategies for structuring Big Data have come about as a consequence of two trends. First – 100 times more content is added online each year than the sum of all books ever written in history. Second – most of this content is structured by institutions that, for various reasons, don’t want to release the fully annotated version of the information. So pragmatic programmers like me build “wrappers” to restructure the parts that are available. Eventually there will be a universal wrapper for all content about financial records, and another one for all organization reports. These data sets will organize content into clusters that are similar enough for us to study patterns on a global scale. That’s when “big data” begins to get interesting. Today, we’re in the early stages of deconstructing the structure so that we can reconstruct larger data sets from the individual parts that each have unique yet “incompatible” structures. It is like taking apart all the cars in a junk yard so we can categorize all the parts and deliver them to customers who want to build fresh cars. You see cars go in and cars go out, but a lot happens in between.

Last year, if someone had asked you to track all the work you do on your computer, you would have probably filled out a survey (like the “time tracking” reports I fill out monthly at work). In the future your computer will fill them out for you, and in greater detail, and these data will be “mashable” with other reporting systems. This will not happen because two systems are built to work together, but because someone builds a third system that allows the two to share information. Eventually we will build “genetic algorithms” that write programs that can re-organize data into usable structures regardless of how the original data was structured. This is going to happen in the next ten years and we will ask ourselves why we didn’t do it sooner.

Continued in: Pythonic thinking for international development

Related to: Fixing the statistical power problem for international development
