A review of two books on survey-making

Reading time: 15 mins. Epistemic status:

Simplicio: I have a question. When people make surveys, how do they make sure that the questions measure what they want to measure?
Salviati: Woe be upon me.

Introduction

The two books reviewed, The Power of Survey Design (TPOSD) and Improving Survey Questions: Design and Evaluation (ISQ), have given me an appreciation of the biases and problems that are likely to pop up when people complete surveys. This knowledge might be valuable both for those in the EA and rationality communities who find themselves making surveys and for those who have to interpret them.

For the eyes of those who are designing a survey

You might want to read this review for the kicks, then:

a) If you don’t want to spend too much time, the Pareto-principle thing to do might be to use this checklist and this list of principles, both from the books under review. I’ve also found this summary to be very nonthreatening.

b) If you want to spend a moderate amount of time:

Chapter 3 of The Power of Survey Design (68 pages) and/or Chapter 4 of Improving survey questions (22 pages) for general things to watch out for when writing questions. Chapter 3 of The Power of Survey Design is the backbone of the book.
Chapter 5 of The Power of Survey Design (40 pages) for how to use the dark arts to have more people answer your questions willingly and happily.

c) For even more detail:

Chapters 2 and 3 of Improving survey questions (38 and 32 pages, respectively) for considerations on gathering factual and subjective data, respectively.
Chapter 5 of Improving survey questions (25 pages) for how to evaluate/test your survey before the actual implementation.
Chapter 6 of Improving survey questions (12 pages) for advice about trying to find something like hospital records to validate your questionnaire with, or about repeating some important questions in slightly different form and getting really worried if respondents answer them differently.
The introductions, i.e. Chapters 1 and 2 of The Power of Survey Design (9 and 22 pages, respectively), and Chapter 1 of Improving survey questions (7 pages) if introductions are your thing, or if you want to plan your strategy. In particular, Chapter 2 of TPOSD has a cool Gantt chart.

Here is the index for Improving Survey Questions and here is the index for The Power of Survey Design. libgen.io will be of use if you want an electronic copy. Note that using that webpage might only be legal if you already own a physical copy of the books, depending on your jurisdiction. Also note that the World Bank offers The Power of Survey Design for free.

Both books are dated in some respects; neither mentions online surveys, and both place more emphasis on field surveys. However, I think that on the broad principles and considerations, both books remain useful guides. Nonetheless, I don’t have any particular attachment to these two books; I’d expect that any book on survey making by an author who has worked in the field professionally, or published by a university press, is likely to be roughly as useful as the two above.

Some ways in which people are inconsistent or incoherent when answering survey questions

For the casual reader, here is a nonexhaustive collection of curious anecdotes mentioned in the first book.

A Latinobarometro poll in 2004 showed that while a clear majority (63 percent) in Latin America would never support a military government, 55 percent would not mind a nondemocratic government if it solved economic problems.
When asked about a fictitious “Public Affairs Act”, one-third of respondents volunteered an answer.
The choice of numeric scales has an impact on response patterns: Using a scale which goes from -5 to 5 produces a different distribution of answers than using a scale that goes from 0 to 10.
The order of questions influences the answer, and so does the wording: framing the question with the term “welfare” instead of with the formulation “incentives for people with low incomes” produces a large effect.
Options that appear at the beginning of a long list seem to have a higher likelihood of being selected. For example, when alternatives are listed from poor to excellent rather than the other way around, respondents are more likely to use the negative end of the scale. The exception is when the options are read out loud, as in a phone interview, in which case the last options are more likely to be chosen.
When asked whether they had visited a doctor in the last two weeks, if respondents have had a recent doctor visit, but not one within the last 14 days, there is a tendency to want to report it. In essence, they feel that accurate reporting really means that they are the kind of person who saw a doctor recently, if not exactly and precisely within the last two weeks.
The percentage of people supporting US involvement in WW2 almost doubled if the word “Hitler” appeared in the question.

These examples highlight the point that people do not always have consistent opinions which are elicited by the question, but that the spectrum of answers is influenced by the wording of the question.

Influencing respondents

Chapter 5 of The Power of Survey Design: A User’s Guide for Managing Surveys, Interpreting Results, and Influencing Respondents goes over the influencing respondents part. The survey’s author has spent far more time thinking about the topic than the survey-taker, and can thus nudge him.

On the one hand, the author could write biased questions with the intention of eliciting the answers he wishes to obtain. However, this chapter places more emphasis on the following: once good questions have been written, how do you convince people, perhaps initially reluctant, to take part in your survey? How do you get them to answer sensitive questions truthfully? Some mechanisms to do this are:

Legitimacy and appearances.

At the beginning, make sure to assure legal confidentiality, maybe research the relevant laws in your jurisdiction and make reference to them. Name drop sponsors, include contact names and phone numbers. Explain the importance of your research, its unique characteristics and practical benefits.

There is a part of signalling confidentiality, legitimacy, and competence which involves actually doing the thing. This is also the case for explaining the purpose of your survey, and arguing that its goals are aligned with the goals of the respondent. For example, if you assure legal confidentiality, but then ask for information which would permit easy deanonymization, people might notice and get pissed. But another part consists of merely being aware of the legitimacy dimension.

The first questions should be easy, pleasant, and interesting. Build up confidence in the survey’s objective, stimulate their interest and participation by making sure that the respondent is able to see the relationship between the question asked and the purpose of the study. Don’t ask sensitive questions at the beginning of your survey. Instead, allow time for the respondent’s System 1 to accept your claim of legitimacy. It is also recommended that sensitive questions be made longer, as they are then perceived as less threatening. Part of that length can be a sentence or two explaining that all the options are ultimately acceptable, and that the answerer won’t be judged for them.

Don’t bore the answerer.

Cooperation will be highest when the questionnaire is interesting and when it avoids items that are difficult to answer, time-consuming, or embarrassing. An example of this might be starting with a prisoner’s dilemma with real payoffs, which might double as the monetary incentive to complete the survey. Or, more generally, starting with open questions.

It serves no purpose to ask the respondent about something he or she does not understand clearly or that is too far in the past to remember correctly; doing so generates inaccurate information.

Don’t ask a long sequence of very similar questions. This bores and irritates people, which leads them to answer mechanically. A term used for this is acquiescence bias: in questions with an “agree-disagree” or “yes-no” format, people tend to agree or say yes even when the meaning is reversed. In long lists of questions on a “0-5” scale, people tend to choose 2.

On the other hand, don’t make questions too hard. In general, telling the respondents a definition and asking them to classify themselves is too much work.

Elite respondents

Elites are apparently quickly irritated if the topic of the questions is not of interest to them. Vague queries generate a sense of frustration, and lead to a perception that the study is not legitimate. Oversimplifications are noticed and disliked.

To mitigate this, one might start with a narrative question, and add open questions at regular intervals throughout the form. Elites “resent being encased in the straightjacket of standardized questions” and feel particularly frustrated if they perceive that the response alternatives do not accurately address their key concern.

In general, a key component of herding elite respondents is to match the level of cognitive complexity of the question with the respondent’s level of cognitive ability, as not doing so leads to frustration. Overall, it seems to me that the concept of “elite respondent” applies to the highly intelligent and ambitious crowd characteristic of the EA and rationality movements.

Useful categories.

Memory

Events less than two weeks into the past can be remembered without much error. There are several ways in which people can estimate the frequency with which something happens, i.e., to answer questions of the form “How often does X?”, chiefly:

Availability heuristic: How easy is it to remember or come up with instances of X?
Episodic enumeration: Recalling and counting occurrences of an event. How many individual instances of X can you come up with?
Resorting to some sense of normative frequency: How often should one wash one’s hands?
etc.

Of these, episodic enumeration turns out to be the most accurate, and people employ it more the fewer instances of the event in question there are. The wording of the question might be changed to facilitate episodic enumeration, even explaining the concept explicitly and asking the respondent to commit to using it.

Asking a longer question and communicating the significance of the question to respondents both have a positive effect on the accuracy of the answer. For example, one might employ phrasings such as “please take your time to answer this question,” “the accuracy of this question is particularly important,” or “please take at least 30 seconds to think about this question before answering”.

If you want to measure knowledge, take into account that recognizing is easier than recalling. More people will be able to recognize a definition of effective altruism than be able to produce one on their own. Furthermore, if you use a multiple-choice question with n options, and x% of people know the answer whereas (100-x)% don’t, you might expect (assuming that those who don’t know guess uniformly at random) that (100-x)/n % of respondents didn’t know the answer but guessed correctly by chance, so you’d see that y% = x% + (100-x)/n % selected the correct option.
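As a rough sketch of the arithmetic above (mine, not from either book): the relationship y = x + (100-x)/n can be inverted to x = (n·y - 100)/(n - 1), which recovers an estimate of the share of people who actually knew the answer from the share who picked the correct option. The function names are hypothetical, and the whole thing assumes that people who don't know the answer guess uniformly at random among the n options.

```python
def observed_correct_share(knowers_pct: float, n_options: int) -> float:
    """Forward model from the text: y = x + (100 - x) / n."""
    return knowers_pct + (100.0 - knowers_pct) / n_options


def estimated_knowers_share(observed_pct: float, n_options: int) -> float:
    """Invert the forward model: x = (n * y - 100) / (n - 1).

    Assumes uniform random guessing by those who don't know the answer.
    The result is clamped at 0, since an observed share below chance
    level (100 / n) would otherwise give a negative estimate.
    """
    if n_options < 2:
        raise ValueError("need at least two answer options")
    estimate = (n_options * observed_pct - 100.0) / (n_options - 1)
    return max(0.0, estimate)


# If 40% truly know the answer and there are 4 options,
# 40 + 60/4 = 55% will select the correct one.
print(observed_correct_share(40, 4))   # 55.0
# Going the other way, 55% correct with 4 options implies 40% knew it.
print(estimated_knowers_share(55, 4))  # 40.0
```

In practice respondents rarely guess uniformly (some options look more plausible than others), so this is probably best read as a crude correction rather than as a measurement.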

Consistency and Ignorance.

In one of our examples at the beginning, one third of respondents gave an opinion about a fictitious Act. This generalizes; respondents rarely admit ignorance. It is thus a good idea to offer an option for “I don’t know”, or “I don’t really care about this topic”. With regard to consistency, it is a good idea to ask similar questions in different parts of the questionnaire to check the consistency of answers.
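One way such a consistency check might look once the data is in, as a minimal sketch: the item names, the example answers, and the tolerance of one scale point are all made up for illustration, and the two items are assumed to share the same numeric scale.

```python
def inconsistent_respondents(answers, item_a, item_b, tolerance=1):
    """Return ids of respondents whose answers to two paired items
    differ by more than `tolerance` points.

    `answers` maps a respondent id to a dict of {item name: numeric answer}.
    Respondents who skipped either item are ignored rather than flagged.
    """
    flagged = []
    for respondent_id, responses in answers.items():
        a = responses.get(item_a)
        b = responses.get(item_b)
        if a is None or b is None:
            continue  # can't compare if one of the two items is missing
        if abs(a - b) > tolerance:
            flagged.append(respondent_id)
    return flagged


# Hypothetical data: the same satisfaction question asked twice,
# early (q3) and late (q17) in the questionnaire, on a 1-5 scale.
answers = {
    "r1": {"q3_satisfaction": 4, "q17_satisfaction": 4},
    "r2": {"q3_satisfaction": 5, "q17_satisfaction": 1},
    "r3": {"q3_satisfaction": 2, "q17_satisfaction": 3},
    "r4": {"q3_satisfaction": 3},  # skipped the second item
}
flagged = inconsistent_respondents(answers, "q3_satisfaction", "q17_satisfaction")
print(flagged)  # ['r2']
```

A flagged respondent isn't necessarily careless; the two wordings might simply be measuring different things, which is itself worth knowing.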

Subjective vs objective questions

The author of Improving Survey Questions views the distinction between objective and subjective questions as very important. Apparently, there are serious metaphysical implications to the fact that there is no direct way to know about people’s subjective states independent of what they tell us. To this, the author devotes a whole chapter.

Anyways, despite the lack of an independent measure, there are still things to do, chiefly:

Place answers on a single well defined continuum
Specify clearly what is to be rated.

For example, “how depressed are you?” is neither on a well defined continuum, nor is it clear what is to be rated (as of yet). The Beck Depression Inventory gives a score on a well defined continuum and clearly specifies what is to be rated (but is, for some purposes, too long).

The author warns us about the dangers of misinterpreting subjective questions:

“The concept of bias is meaningless for subjective questions. By changing wording, response order, or other things, it is possible to change the distribution of answers. However, the concept of bias implies systematic deviations from some true score, and there is no true score… Do not conclude that ‘most people favor gun control’, ‘most people oppose abortions’… All that happened is that a majority of respondents picked response alternatives to a particular question that the researcher chose to interpret as favorable or positive.”

Test your questionnaire

I appreciated the pithy phrases “Armchair discussions cannot replace direct contact with the population being analyzed” and “Everybody thinks they can write good survey questions”. With respect to testing a questionnaire, the books go over different strategies and argue for some reflexivity when deciding what type of test to undertake.

In particular, the intuitive or traditional way to go about testing a questionnaire would be a focus group: you have some test subjects, have them take the survey, and then talk with them or with the interviewers. This, the authors argue, is messy, because some people might dominate the conversation out of proportion to the problems they encountered. Additionally, random respondents are not actually very good judges of questions.

Instead, no matter what type of test you’re carrying out, having a spreadsheet with issues for each question, filled individually and before any discussion, makes the process less prone to social effects.

Another alternative is to try to get in the mind of the respondent while they’re taking the survey. To this effect, you can ask respondents:

to paraphrase their understanding of the question.
to define terms
for any uncertainties or confusions
how accurately they were able to answer certain questions and how likely they think they or others would be to distort answers to certain questions
if the question called for a numerical figure, how they arrived at the number.

For example, for the question: “Overall, how would you rate your health: excellent, very good, fair, or poor?”, a followup question might be: “When you said that your health was (previous answer), what did you take into account or think about in making that rating?”

Considerations about tiring the answerer still apply: a long list of similar questions is likely to induce boredom. For this reason, ISQ recommends testing “half a dozen” questions at a time.

As an aside, if you want to measure the amount of healthcare consumed in the last 6 months, you might come up with a biased estimate even if your questions aren’t problematic, and this would be because the people who just died consume a lot of healthcare, but can’t answer your survey.
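To make the size of this effect concrete, here is a toy calculation. All the numbers are invented for illustration (they come from neither book nor from real data): a population of 1,000 people, of whom 20 died during the period after consuming 50 units of healthcare each, while the 980 survivors consumed 2 units each.

```python
def mean_consumption(groups):
    """Average units consumed per person.

    `groups` is a list of (number_of_people, units_per_person) pairs.
    """
    total_people = sum(n for n, _ in groups)
    total_units = sum(n * units for n, units in groups)
    return total_units / total_people


# Hypothetical figures, chosen only to illustrate the mechanism.
survivors = (980, 2.0)   # can answer the survey
deceased = (20, 50.0)    # heavy consumers, but cannot answer

true_mean = mean_consumption([survivors, deceased])
surveyed_mean = mean_consumption([survivors])

print(f"true mean:     {true_mean:.2f}")      # 2.96
print(f"surveyed mean: {surveyed_mean:.2f}")  # 2.00
```

With these made-up numbers, a survey with perfectly worded questions would still understate average consumption by about a third, simply because of who is available to answer.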

Tactics

Be aware of the biases

Be aware of the ways a question can be biased. Don’t load your questions: don’t use positive or negative adjectives in your question. Take into account social desirability bias: “Do you work?” has implications with regard to status.

A good example given, which tries to reduce social desirability bias, is the following:

Sometimes we know that people are not able to vote, because they are not interested in the election, because they can’t get off from work, because they have family pressures, or for many other reasons. Thinking about the presidential elections last November, did you actually vote in that election or not?

Incidentally, self-administered surveys are great at not creating bias because of the interviewer; answerers don’t feel a need to impress.

There is also the aspect of managing self-images: it’s not only that the respondent may want to impress, it’s also that they may want to think about themselves in certain ways. You don’t want to have respondents feel they’re put in a negative (that is, inaccurate) light. Respondents “are concerned that they’ll be misclassified, and they’ll distort the answers in a way they think will provide a more accurate picture”. One countermeasure against this is to allow the respondent to give context. For example:

How much did you drink last weekend?
Do you feel that this period is representative?
What is a normal amount to drink in your social context?

Thus, the survey-maker can get into their head and manage the way they perceive the questions, so as to minimize the sense that certain answers will be negatively valued. “Permit respondents to present themselves in a positive way at the same time they provide the information needed”.

In questions for which biases are likely to pop up, consider explicitly explaining to respondents that giving accurate answers is the most important thing they can do. Have respondents make a commitment to give accurate answers at the beginning; it can’t hurt.

This, together with legitimacy signaling, has been shown to reduce the number of books which well-educated people report reading.

Don’t confuse question objective with question.

The soundest advice any person beginning to design a survey instrument could receive is to produce a good, detailed list of question objectives and an analysis plan that outlines how the data will be used. “If a researcher cannot match a question with an objective and a role in the analysis plan, the question should not be asked”, the authors claim.

Further, it is not enough to simply put your objective in question form. For example, in the voting example above, the question objective could be finding out what proportion of the population votes, and simply translating it into a question (e.g., “Did you vote in the last presidential election?”) is likely to turn up all kinds of interesting biases. To reduce them, one should employ bias-mitigation strategies like the above.

An avalanche of advice.

Together, the two books contain a plethora of advice, as well as examples which help build one’s intuitions. To recap, if you can’t afford to take the time to read a book, the Pareto-principle thing to do might be to read this checklist, this list of principles, or this nonthreatening summary. Some advice which didn’t quite fit in the previous sections, but which I couldn’t leave unmentioned, is:

Ask one question at a time. For example: “Compared to last year, how much are you winning at life?” is confusing, and would be less so if it was divided into: “How much are you winning at life today?” and “how much were you winning at life last year?”. If the question was important, a short paragraph explaining what you mean by winning at life would be in order.
Explanatory paragraphs come before the question, not after. After the respondent thinks she has read a question, she will not listen to the definition provided afterwards. Ditto for technical terms.
Not avoiding the use of double negatives makes for confusing sentences, like this one.
Avoid the use of different terms with the same meaning.
Make your alternatives mutually exclusive and exhaustive.
Don’t make all your questions too long. As a rule of thumb, keep your question under 20 words and 3 commas (unless you’re trying to stimulate recall, or if it’s a question about sensitive topics).
The longer the list of questions, the lower the quality of the data.
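The rule of thumb about question length is easy to check mechanically. Below is a minimal sketch of such a check; the function name is hypothetical, words are counted naively by splitting on whitespace, and I'm assuming that "under 20 words and 3 commas" means strictly fewer than 20 words and strictly fewer than 3 commas.

```python
def question_length_issues(question, max_words=20, max_commas=3):
    """Return a list of rule-of-thumb violations for one question.

    Assumption: "under" is read as a strict bound, so a question with
    exactly `max_words` words or `max_commas` commas is flagged.
    """
    issues = []
    word_count = len(question.split())  # naive whitespace tokenization
    comma_count = question.count(",")
    if word_count >= max_words:
        issues.append(f"{word_count} words (limit: under {max_words})")
    if comma_count >= max_commas:
        issues.append(f"{comma_count} commas (limit: under {max_commas})")
    return issues


short_question = "Did you vote in the last presidential election?"
print(question_length_issues(short_question))  # []

# A stand-in for an overlong question: 20 placeholder words.
long_question = " ".join(["word"] * 20)
print(question_length_issues(long_question))
# ['20 words (limit: under 20)']
```

Note that the exceptions mentioned above (questions meant to stimulate recall, or questions about sensitive topics) are deliberately long, so a flag from a check like this is a prompt to look again, not a verdict.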

Closing thoughts.

The rabbit hole of designing questionnaires is deep, but seems well mapped. I don’t think that this needs to become common knowledge, but I expect that a small number of people might benefit greatly from the pointers given here. Boggling at the concept of a manual, I am grateful to have access to the effort of someone who has spent an inordinate amount of time studying the specific topic of interviews.