Showing posts from February, 2010

What's the goal?

The goal of the stopping rule for surveys is to govern the data collection process with a metric that is related to nonresponse bias (under an assumed but reasonable model).

Although we don't discuss it much in the article, I like to speculate about the effect this might have on data collection practices. If the response rate is the key metric, then the data collection process should focus on interviewing the easiest-to-interview set of cases that meets the target response rate. Of course, it's a bit more 'random' than that in practice. But that should be the logic.

What would the logic be under a different key metric (or stopping rule)? Would data collection organizations need to learn which cases get you to your target most efficiently? How would those cases be different? It seems that in this situation there would be a bigger reward for exploring the entire covariate space on the sampling frame. How do you go about doing that?

There are a set of alternative indicators out the…

Stopping Rules Article

The article that Raghu and I wrote on stopping rules for surveys is available now as an early view article at Statistics in Medicine. The basic idea is that you should stop collecting data once the probability that new data will change your estimate is sufficiently small. As always, you want good paradata and sampling frame data to aid with this decision.
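To make the idea concrete, here is a minimal sketch of a rule of this general shape. It is not the rule from the article. It assumes a simple survey mean, a normal approximation, and that a model built on frame data and paradata supplies a predicted mean and standard deviation for the cases not yet interviewed. The function names, the `tolerance`, and the `max_prob` threshold are all hypothetical choices made for illustration.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_estimate_changes(current_mean: float, n_observed: int,
                          predicted_mean: float, predicted_sd: float,
                          n_new: int, tolerance: float) -> float:
    """Approximate probability that adding n_new cases moves the mean
    by more than `tolerance` (illustrative normal approximation)."""
    # The new cases get weight n_new / (n_observed + n_new) in the updated mean.
    weight = n_new / (n_observed + n_new)
    # Expected shift in the estimate, and the standard deviation of that shift.
    shift_mean = weight * (predicted_mean - current_mean)
    shift_sd = weight * predicted_sd / math.sqrt(n_new)
    if shift_sd == 0.0:
        return 1.0 if abs(shift_mean) > tolerance else 0.0
    # P(shift > tolerance) + P(shift < -tolerance)
    upper = 1.0 - normal_cdf((tolerance - shift_mean) / shift_sd)
    lower = normal_cdf((-tolerance - shift_mean) / shift_sd)
    return upper + lower


def should_stop(current_mean: float, n_observed: int,
                predicted_mean: float, predicted_sd: float,
                n_new: int, tolerance: float,
                max_prob: float = 0.05) -> bool:
    """Stop once the chance of a meaningful change is small enough.
    The 0.05 default is an assumed threshold, not one from the article."""
    prob = prob_estimate_changes(current_mean, n_observed, predicted_mean,
                                 predicted_sd, n_new, tolerance)
    return prob <= max_prob
```

Under these assumptions, if the remaining cases are predicted to look like the respondents already in hand, the probability of a meaningful change is small and the sketch says to stop. If they are predicted to differ, it says to keep going. The quality of that prediction is exactly where good paradata and sampling frame data come in.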

What are "paradata" anyway?

In preparing a manuscript, I found myself defining paradata to include interviewer observations. Interviewer observations are questions asked of the interviewer while they are in the process of attempting or completing an interview. They might include information about the neighborhood of the sampled unit, guesses about who might live in the sampled unit, or information about the person who was just interviewed.

Classifying these observations as paradata seemed logical to me at the time since they are not sampling frame data and they aren't reports from the sampled unit. But are they paradata? Are they generated by the process of data collection? Or does anything that doesn't come from the sampling frame or the respondent count as paradata by default?

It's probably worth the exercise of developing a more precise definition, if only for the economy of language that such precision should afford.