Complexity of building survey benchmarks

A colleague recently emailed me, asking, “Can you please tell me what is the net promoter score (NPS) for all responses to this question:”

To what extent do you believe that {{organization X}} will use your answers to this survey to improve its work?

That is what we call a benchmark. The average for a group is a useful reference value when you want to know whether your performance is good, average, or poor. She wanted to know so she could advise a Keystone client on whether the responses from their community were good or bad. This is why we built the Feedback Commons. But in case this sounds trivial, let me illustrate all the ways that benchmarking can be harder than it sounds.

Me: Sure. That takes a database query. Not hard…

  1. I found the questions, out of the 800+ we have absorbed from past surveys, that most closely match the question she asked (one way to do this matching is sketched right after the list):
#136: Do you feel that {{org}} will use your answers to this survey to improve its services?
#545: Do you expect that {{org}} will use the feedback from this survey to improve its work?
#1298: I am sure that {{org}} will use my answers to this survey to improve its services.
#1433: I think that {{org}} will use my answers to this survey to improve its services.
[---]: *Do you believe {{org}} will use your feedback in this survey effectively?
* That last question was not on a 0-10 scale but used a 5-point Likert scale, so its responses are not comparable with (or mergeable into) the rest.
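
For readers curious how that matching step can work, here is a minimal sketch, assuming the question texts live in a plain Python dict keyed by question id; the variable names and the 0.75 cutoff are illustrative, not what our importer actually uses:

    import difflib

    # Hypothetical question bank: question id -> question text ({{org}} placeholders left in).
    question_bank = {
        'q136': 'Do you feel that {{org}} will use your answers to this survey to improve its services?',
        'q545': 'Do you expect that {{org}} will use the feedback from this survey to improve its work?',
    }

    target = ('To what extent do you believe that {{org}} will use your answers '
              'to this survey to improve its work?')

    def similar_questions(target_text, bank, cutoff=0.75):
        """Return (question id, similarity) pairs roughly similar to the target, best first."""
        matches = []
        for qid, text in bank.items():
            ratio = difflib.SequenceMatcher(None, target_text.lower(), text.lower()).ratio()
            if ratio >= cutoff:
                matches.append((qid, round(ratio, 2)))
        return sorted(matches, key=lambda pair: -pair[1])

    print(similar_questions(target, question_bank))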

2. I fetched all responses to any of these questions from a MongoDB database (built to handle unstructured data like surveys) and merged them into one long list of numbers (the merge step is sketched just below the query).

data = list(mongodb.find({'$or':[{'q136':{'$exists':True}}, {'q545':{'$exists':True}}, {'q1298':{'$exists':True}}, {'q1433':{'$exists':True}}]},{'q136':1,'q545':1,'q1298':1,'q1433':1}))
4,144 responses in all
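
In case "merged them into one long list of numbers" is opaque, here is a minimal sketch of the flattening step, assuming data is the list of dicts returned by the query above and that each document holds answers to one or more of the four fields:

    # Flatten the pymongo documents into one long list of raw answers.
    question_ids = ['q136', 'q545', 'q1298', 'q1433']

    raw_answers = []
    for doc in data:
        for qid in question_ids:
            if qid in doc:
                raw_answers.append(doc[qid])

    print(len(raw_answers))  # 4,144 raw answers in this case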
3. Oops. Some of those 4,144 stored responses to a question are not valid. Answers like None or “” or ” ” (a space) or -1 or 11 appeared. So I had to write another line of code to filter out nonsense answers. I know from past experience that one organization decided that ’11’ would mean NO ANSWER for their 0-10 scale survey questions. Another group used a web tool that didn’t allow ZERO for an answer, so their 0-10 scale is stored as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11. I’d tried to clean this stuff up when I imported it, but it isn’t always fixed. Mysterious, isn’t it? Probably it is because I restored a piece of data from an unclean backup at some point. This is the kind of data crap I have been building the Feedback Commons to handle for people. Humans are error-prone, and conventions inside each organization break comparability. But with more massaging (a rough sketch follows below), I got the answer:
Question: On a scale of 0 to 10, how much do you feel that {{org}} will use your answers to this survey to improve its services?
The answer is 7.8
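
Here is a minimal sketch of the cleaning and averaging from step 3, starting from the raw_answers list above; the filters reflect the conventions described in this post, and the NPS calculation at the end uses the standard definition (percent of 9-10 promoters minus percent of 0-6 detractors) rather than the exact code I ran:

    # Keep only answers that are plausible 0-10 ratings.
    clean = []
    for value in raw_answers:
        if value in (None, '', ' '):       # blanks
            continue
        try:
            score = int(value)
        except (TypeError, ValueError):
            continue
        if score in (-1, 11):              # "no answer" conventions used by some organizations
            continue
        if 0 <= score <= 10:
            clean.append(score)
    # Responses from the tool that stored its 0-10 scale as 1-11 would also need a
    # shift of -1, which requires knowing which survey each answer came from.

    average = sum(clean) / len(clean)
    print(round(average, 1))               # the benchmark average: 7.8 for this data

    # The NPS view of the same numbers.
    promoters = sum(1 for s in clean if s >= 9)
    detractors = sum(1 for s in clean if s <= 6)
    nps = 100.0 * (promoters - detractors) / len(clean)
    print(round(nps))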
On further analysis, I found that nearly all the data came from one version of this question and the rest came from a second version; two of the questions had no data at all. The average scores for the two versions were similar.
Later, I realized that since 90% of the data came from one version of this question, and all questions had already been pre-grouped by our Feedback Commons data import engine, I could have gotten the same answer in 10 seconds. All I’d have had to do is:
  1. Log in to FeedbackCommons.org.
  2. Go to Survey Builder.
  3. Click to analyze a survey where this question was asked.
  4. Read the benchmark from the chart:
[Chart: NPS vs. average score for the “will use feedback” question]
There is a 0.2 difference in score due to the missing 10% of the data, but for 10 seconds of effort, this benchmark is fine for making a decision.
Interoperability is ugly and messy. When you use the Feedback Commons, you can leave those details to us at Keystone. Instead, focus on improving your organization’s capacity to respond to the data you do collect in a more proactive, less reactive way. That is organizational change. What good is good data if your team doesn’t consult it in the first place?
This is an introduction to the complexity involved in asking one question and merging its responses with the responses others have collected from similar questions to create a benchmark analysis.