I’ve been building a website that can determine whether any organization’s report, blog post, or document tackles certain themes and conforms to a certain writing style. The hard part is that the rules I’ve been given are quite abstract. Examples:
- “Lead off by framing the issue as being about people, not problems.”
- “Talk about resilience, not needs.”
I took on this challenge because I assumed that I’d be given a collection of writing examples that my algorithm could learn from. Unfortunately, I only got two examples, which isn’t nearly enough to define something as broad as “resilience.” So I built my own set of examples.
This is how I did it:
- Search through tens of thousands of reports for those that contain an obvious keyword, such as “resilience.”
- Pull out the matching reports and run an analysis of phrases that most often appear in the same sentence as resilience, or in nearby sentences.
- Use these new phrases to pull in new reports that don’t mention resilience but do mention several of these related phrases.
- Analyse those reports, find new phrases, and repeat.
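The loop above can be sketched in a few lines of Python. This is a minimal illustration, not my production code: the report snippets are invented toy data, and the co-occurrence rule (words sharing a sentence with a seed term) is the simplest version of the idea.

```python
from collections import Counter
import re

# Toy corpus standing in for "tens of thousands of reports";
# these snippets are invented for illustration.
reports = [
    "The community showed resilience after the flood. Training helped families recover.",
    "Post-war recovery training gave residents new skills and hope.",
    "Quarterly budget figures were presented to the board for approval.",
]

def sentences(text):
    return re.split(r"(?<=[.!?])\s+", text.lower())

def words(sentence):
    return re.findall(r"[a-z]+", sentence)

def expand(seeds, reports, rounds=2, top_n=5):
    """Recursively grow a set of seed terms by co-occurrence.

    Each round: find sentences containing any seed term, count the
    words that share a sentence with a seed, and promote the most
    frequent ones to seeds for the next round.
    """
    seeds = set(seeds)
    for _ in range(rounds):
        counts = Counter()
        for report in reports:
            for sent in sentences(report):
                toks = words(sent)
                if seeds & set(toks):
                    counts.update(t for t in toks if t not in seeds)
        seeds |= {w for w, _ in counts.most_common(top_n)}
    return seeds

vocabulary = expand({"resilience"}, reports)
```

In practice you would exclude stopwords before expanding, or the set drifts toward words like “the” within a round or two.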
The evolving set of reports creates a tapestry of language variations. Imagine that the different ways to describe “resilience” are represented by shades of green in a mosaic: not every shade is exactly resilience, but the mosaic as a whole is a more reliable and complete definition of “resilience” than the one you started with.
This recursive, expanding search eventually generates a dictionary. Every word in it is somewhat related to “resilience,” and beside every word is a number for how relevant it is. Given these weighted words, I can score any other text and decide whether it is about resilience or not.
This works very well when the topics are actual topics with their own jargon. It works less well when authors are trying not to use jargon, and instead describe the concept with common everyday words. Resilience can be captured in specific statements like “after school she learned how to avoid dangerous street corners at night.” It can also be highly correlated with more general words, such as training. This poses a problem: some trainings increase resilience, but not all of them. However, if the words “training” and “post-war” or “aftermath” appear together in a paragraph, that’s probably resilience too.
Finding a Good End to the Up Goer Five Problem
The extreme case of this (the ultimate natural language processing challenge!) would be to write a computer program that can answer the question, “Is this text about rocketry?” and apply it to XKCD’s Up Goer Five cartoon.
Up Goer Five describes the Saturn V rocket in extreme technical detail using only the 1000 most common English words. In this case, there is absolutely no domain-specific bias to the choice of words, and the only language data you can use are combinations of everyday words and patterns (word order). It can be done, but the solution is not a simple 10-line script. I should ask my buddy Nick of godelescherblog.wordpress.com fame what answer scikit-learn gives to this question. It is a lot like the Mona Lisa genetic algorithm problem from Stack Overflow, but for natural language processing / data mining.
Learning how to solve this challenge would yield some very practical solutions to real-world problems. You could feed the code the text from 100 of Keystone’s past clients, build a dictionary of the types of phrases that are often found, then use it to predict which of 100 other non-client organizations are the best prospects to court (e.g. business development: predictive client acquisition).
Or, if you’re an evaluator, you could search through hundreds of documents from the field and find clear examples of some concept that your organization works on. You could then create an external evidence base for reference purposes.
Writing this makes me think that I need to ask 10 experts on resilience to send me essays they’ve written about resilience in the past, as well as rewritten versions of these essays where only the “ten hundred” most common English words are allowed. This would help me train my algorithms to detect common everyday descriptions of resilience. It might also reveal more universal language patterns that people fall into when trying to write without using jargon.
Anyone up for taking the “is this text about rocketry?” challenge? If so, comment below or email me (on the about page).
Challenge accepted by @NicholasHamlin from the GodelEscherBLog
This week, my friend Marc asked me how I’d solve the Up-Goer Five problem. Specifically, how would you decide if a particular piece of text is about a certain topic (like rocketry)? What about if the text has no science or engineering jargon and only uses the most common words in the English language?
Let’s start with the first question. Before we can get a computer to help unpack this problem, we have to convert the text into something that’s easier for a computer to understand. Computers are better at numbers than they are at words, so we’ll look to translate our pieces of text into a set of numbers. We can do this using a process called “vectorization”. The simplest way to do this is to count all the times a word appears in each piece of text (we’ll call these “documents”) and then make a chart. For example, let’s vectorize “My dog is brown” and “The brown cat jumped”:
All we do is count the number of times each word appears in each document. Each word is a column, and each document is a row. The official jargon for what we’ve built is a “term-document matrix” (TDM). If we had a set of training documents that we knew were about a particular topic, we could then use that to build a TDM and look for patterns of words that happen frequently in those documents (there are tons of different ways to find these patterns, which I’ll skip).
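Building that chart for the two example sentences takes only a few lines. This sketch uses just the standard library; scikit-learn’s CountVectorizer does the same thing with more options:

```python
from collections import Counter

docs = ["My dog is brown", "The brown cat jumped"]

# Columns: the vocabulary, one entry per unique word.
tokenized = [d.lower().split() for d in docs]
vocab = sorted(set(w for toks in tokenized for w in toks))

# Rows: one count vector per document -- the term-document matrix.
tdm = [[Counter(toks)[word] for word in vocab] for toks in tokenized]
```

Each row is now a vector of word counts that a computer can compare, cluster, or feed into a classifier.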
There are a few problems with this basic approach. First, what happens when you have really long or really short documents? Chances are that the longer your documents are, the higher the values in your TDM will be. To get around this, we may want to take each value and divide it by the length of the document. Another problem happens when we have very common words. “The” probably happens pretty much everywhere, whether a document is talking about rockets or kittens. There are two ways to fix this. One easy way would be to ignore these words completely (official jargon: “excluding stopwords”). Another way is to look at how many of the documents contain the term. If a word appears in every document, it probably doesn’t give us much special information about that document. If it only appears in some situations, it’s much more likely to be useful. So, in our rocket example, we might exclude a word like “the”, but include a word like “booster”. This process of creating a TDM by asking “How frequently does a term appear in a document?” and “How many documents have this term?” is usually called (official jargon again) Term Frequency-Inverse Document Frequency Vectorization, or TF-IDF for short.
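Both fixes can be combined in the classic TF-IDF formula: length-normalized term frequency times the log of (number of documents / number of documents containing the term). Here is a minimal sketch on an invented three-document corpus (in practice you’d use scikit-learn’s TfidfVectorizer, which also handles smoothing and normalization):

```python
import math
from collections import Counter

# Invented toy corpus for illustration.
docs = [
    "the rocket booster fired",
    "the kitten chased the yarn",
    "the booster separated from the rocket",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(w for toks in tokenized for w in set(toks))

def tf_idf(term, toks):
    tf = toks.count(term) / len(toks)   # length-normalized frequency
    idf = math.log(n_docs / df[term])   # rarer terms score higher
    return tf * idf

# "the" appears in every document, so its IDF (and TF-IDF) is zero,
# while "booster" gets a positive score.
```

This is exactly the “exclude ‘the’, keep ‘booster’” behavior described above, achieved with arithmetic instead of a hand-built stopword list.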
Normally, TDMs created by the TF-IDF process are “sparse”, that is, most words don’t appear in most documents (especially if you exclude stopwords). This can make it hard to search for patterns, though there are ways of getting around this. However, in the Up-Goer Five problem, there can be only around 1000 possible words, so the TDM we’ll create should end up being more “dense” than if we were using more complex words. This may create other problems, since many words will appear in many documents, but we can get around this depending on the particular method we use to look for the patterns.
Want to try this yourself? Python’s Scikit-Learn tool is open-source, free, and great at machine learning problems like this. For example, here’s how it might work using scripts from How I Met Your Mother. In the meantime, I’ll post more actual numbers if I make progress on Marc’s problem.
Marc’s Final Note:
After running Nick’s model on my collection of narratives possibly about resilience, the top two phrases in the strongly-aligned group and the strongly-NOT-aligned group were, respectively:
“be able to” – resilience
“step closer to” – non-resilience
So we think that those phrases, along with 5000+ other phrases, when used in the aggregate, can reliably categorize a narrative as being about “resilience” or not about “resilience” 70 percent of the time.
This has hopefully been a brief but clear primer on the power of natural language processing.