Sentiment analysis: Discovering the best way to sort positive and negative feedback

Natural language processing has been available for a while, and the free stuff is starting to get good. In developing the FeedbackCommons, we’re on our third generation of classifier. A classifier is a trained machine learning algorithm that predicts whether a given chunk of text is saying something positive, neutral, or negative. This is really important when you want to sort comments from tons of people into things that leaders can respond to. This is at the heart of how Keystone trains organizations to use feedback to manage relationships.

Our consulting approach is to construct simple surveys that provide scores on relationships, then we subtract out the positive bias (inherent in all relationships in the non-profit world), then we compare those scores against normative scores for that type of relationship, in a similar context, based on all the other organizations we’ve previously worked with. We call that benchmarking. Those scores are a useful measure of progress on their own – and a proxy for impact when measuring change through indicators is not feasible.
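As a toy illustration of that benchmarking arithmetic (all numbers here are made up for the example; real biases and norms come from the accumulated survey data):

```python
def benchmark(raw_score, positive_bias, sector_norm):
    """Subtract the positive bias from a raw relationship score,
    then report how the adjusted score compares to the sector norm."""
    adjusted = raw_score - positive_bias
    return adjusted, adjusted - sector_norm

# Hypothetical numbers: raw score 8.2 on a 0-10 scale,
# an assumed positive bias of 1.5, and a sector norm of 6.0.
adjusted, gap = benchmark(8.2, 1.5, 6.0)
print(round(adjusted, 1), round(gap, 1))  # adjusted 6.7, which is 0.7 above the norm
```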

These scores are also useful because they let us group the people in those relationships into three buckets, so an organization can improve its own services based on the three types of feedback each group gives:

  • Promoters / positive – people who have a positive outlook about the organization and its work, and can usually offer some comments about what’s working
  • Detractors / negative – people who aren’t happy with the organization and the progress of its work. They often know why things aren’t working, but are not likely to speak up unless asked.
  • Nominals / neutral – the masses of people who have little opinion about the organization, and offer few insights into what it can improve.
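A minimal sketch of that bucketing, assuming a 0–10 survey score and Net-Promoter-style cutoffs (9–10 promoter, 0–6 detractor); these thresholds are my illustration, not necessarily the exact ones Keystone uses:

```python
def bucket(score):
    """Group a 0-10 relationship score into one of the three feedback buckets."""
    if score >= 9:
        return "promoter"    # positive outlook, comments on what's working
    elif score <= 6:
        return "detractor"   # unhappy, often knows why things aren't working
    return "nominal"         # little opinion either way

# Hypothetical respondents and scores
responses = {"Amina": 10, "Ben": 7, "Carla": 3}
buckets = {name: bucket(score) for name, score in responses.items()}
# {'Amina': 'promoter', 'Ben': 'nominal', 'Carla': 'detractor'}
```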

Here are some examples of the feedback associated with those responses, each scored for sentiment. Most neutral comments are not as useful as the constructive positive and negative ones, and clearly some comments are more actionable than others:


Representatives are always friendly and helpful. This is a wonderful service we are able to offer our clients. We are very grateful to have partnered with {this org}.
compound 0.9348 (pos 0.408, neu 0.592, neg 0.0)

Patients appreciate the small, easy to swallow capsules. We feel blessed to be able to distribute them to our patients.
compound 0.8591 (pos 0.358, neu 0.642, neg 0.0)

On one shipment, labels came loose. No other problems.
compound 0.8205 (pos 0.2, neu 0.761, neg 0.039)


You should provide products for both children and adults.
compound -0.128 (pos 0.0, neu 0.897, neg 0.103)

We are living in rural area and there is no courier service. Service by courier is really painful for us.
compound -0.659 (pos 0.0, neu 0.803, neg 0.19)

The children’s vitamins are not well liked by the children.
compound -0.7137 (pos 0.0, neu 0.71, neg 0.29)

Although using survey scores to group the feedback is one way to do it, a simpler way, and one that also works for unstructured feedback, is to let a computer model read the comments directly and predict whether each one is positive, neutral, or negative. The examples above were categorized using a trained machine learning sentiment classifier, not the scores from surveys.

If you can reliably group comments by sentiment (positive, neutral, negative), then you can skip the structure of a survey altogether and sort comments from a suggestion box, from social media, from email, or from virtually any other medium. So much of what people say isn’t acted on because it never makes it into a dashboard. At Keystone, we are determined to find computer-aided ways to prioritize key, actionable pieces of feedback and get them in front of leaders.
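If the classifier emits VADER-style compound scores, the sorting step is trivial. Here is a sketch using excerpts of the scored examples above, with the commonly used cutoffs of +/-0.05 on the compound score:

```python
# (comment excerpt, compound score) pairs taken from the examples earlier in this post
scored = [
    ("Representatives are always friendly and helpful.", 0.9348),
    ("On one shipment, labels came loose. No other problems.", 0.8205),
    ("You should provide products for both children and adults.", -0.128),
    ("The children's vitamins are not well liked by the children.", -0.7137),
]

def label(compound):
    """Map a compound score to a sentiment bucket using the usual VADER cutoffs."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Surface the most negative comments first, so leaders see them.
for comment, score in sorted(scored, key=lambda pair: pair[1]):
    print(label(score), comment)
```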

Using open text feedback is radical simplicity, and a perfect complement to tools like HappyOrNot. Last weekend I saw this kiosk in 10 different parts of an Ikea I visited:

HappyOrNot charges companies, airports, and ballparks hundreds of thousands of dollars to install and manage simple kiosks like this one that feed a database and dashboard with net promoter scores, but no comments from customers.

Wouldn’t it be nice to get the scores AND the comments, from just the comments? That’s doable today, and FeedbackCommons offers it within our suite of tools. I’m also playing around with Google Cloud Vision, a simple way of pulling text out of images people share with you, which you can then run through a sentiment analyzer. You can upload a test image for free and see how powerful it is. I recommend uploading an image full of handwriting or whiteboard writing; it can read that too! So imagine using that: you could turn a suggestion box of handwritten cards into text, and then into sentiment. It isn’t hard anymore!

Comparing sentiment analyzers: Vader wins!

And now for the technical side. We started out using TextBlob, a simplified tool built on Python’s nltk that lets you quickly get a sentiment score. But the out-of-the-box model wasn’t very good: it was trained on movie reviews, and the way people talk about movies is nothing like the way they talk about government agencies, projects, aid, public services, and power relationships.

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer  # trained on nltk's movie_reviews corpus
blob = TextBlob("I find your lack of faith disturbing.", analyzer=NaiveBayesAnalyzer())
blob.sentiment
# Sentiment(classification='pos', p_pos=0.817, p_neg=0.182) -- it reads Vader's negative line as positive

Then I had a better idea. Why not take those 67,000 stories from the GlobalGiving storytelling project, each of which is labeled with a positive, neutral, or negative outcome, and use them to build a smarter analyzer? Well, I did that. I ran many different versions of the data through a training model, and the best version classified sentiment using only the words that appeared twice or more across the stories, while ignoring the most common words. The best trained model predicted 74 percent of stories’ outcomes correctly. However, while I was plugging this back into our FeedbackCommons site, I stumbled across notes on a better analyzer that could classify sentiment even more reliably.
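That training experiment looked roughly like this. This is a toy sketch: the four labeled stories stand in for the 67,000 real ones, and the feature selection keeps words seen at least twice while dropping the most frequent ones, as described above:

```python
from collections import Counter
import nltk

# Toy stand-ins for the labeled GlobalGiving stories.
stories = [
    ("the clinic helped my family recover", "positive"),
    ("the clinic staff were kind and helped everyone", "positive"),
    ("the project stalled and nothing helped", "negative"),
    ("the project failed and nothing improved", "negative"),
]

# Keep words that appear at least twice, but drop the most common ones.
counts = Counter(word for text, _ in stories for word in text.split())
common = {word for word, _ in counts.most_common(2)}
vocab = {word for word, count in counts.items() if count >= 2} - common

def features(text):
    """Bag-of-words features restricted to the selected vocabulary."""
    words = set(text.split())
    return {word: (word in words) for word in vocab}

train_set = [(features(text), outcome) for text, outcome in stories]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(features("the staff helped my family")))  # positive
```

On the real data, accuracy on held-out stories topped out at the 74 percent mentioned above.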

This “vader” lexicon is an add-on to the widely used Python natural language toolkit, or nltk.

import nltk
nltk.download('vader_lexicon')  # do this once: grab the trained lexicon from the web
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores("I find your lack of faith disturbing.")
# {'neg': 0.491, 'neu': 0.263, 'pos': 0.246, 'compound': -0.4215}

The “vader” name fits Darth Vader, who has a lot of negative dialogue that other analyzers miss, though it officially stands for Valence Aware Dictionary and sEntiment Reasoner. It outperforms my storytelling-trained classifier, even though the storytelling data is very similar to the feedback Keystone collects. Why? Because vader parses the negating language in sentences and treats “faith” and “lack of faith” as opposites. It isn’t tricked by double negatives, and that alone seems to make it more reliable. A really good longer writeup with example code about vader is here, and I recommend it. It was a quick drop-in replacement for, and an immediate improvement over, TextBlob’s movie-review model, which is far more commonly used but really doesn’t work well.

Finally, the Star Wars quote that comes to mind when thinking of “Vader” and difficult, ambiguous-to-parse language.

Next post: I will take the predicted sentiment from text and compare it with the overall scores based on surveys from the same people. I predict the prediction will be about 75% accurate, meaning that if you just had the comments from people, you could predict how satisfied they are with a relationship and be right 75% of the time. That gives an estimated score a pretty narrow margin of error – good enough to be useful, in the absence of any better data.
