## Posts

Showing posts from 2013

### Equal Effort... or Equal Probabilities

I've been reading an article on locating respondents in a panel survey. The authors were trying to determine what the protocol should be. They reviewed the literature to see what the maximum number of calls should be.

As I noted in my last post, I was recently involved in a series of discussions on the same topic. But when I was reading this article, I thought immediately about how much variation there is between call sequences with the same number of calls. To take the most extreme case, calling a case three times in one day is not the same as calling it three times over the course of three weeks.

I think the goal should be to apply protocols that have similar rates of being effective, i.e. produce similar response probabilities. But there aren't good metrics to measure the effectiveness of the many different possibilities. Practitioners need something that can evaluate how the chain of calls produces an overall probability of response. Using call-level estimates might be one of…

### Simulation of Limits

In my last post, I advocated against truncating effort. In this post, I'm going to talk about doing just that. Go figure.

We were discussing call limits on a project that I'm working on. This is a study that we plan to repeat in the future, so we're spending a fair amount of time experimenting with design features on this first wave.

There is a telephone component to the survey, so we've been working on the question of how to specify the calling algorithm and, in particular, what if any ceiling we should place on the number of calls.

One way to look at it is to look at the distribution of final outcomes by call number -- sort of like a life table. Early calls are generally more productive (i.e. produce a final outcome) than late calls. You can look at the life table and see after which call very few interviews are obtained. You might truncate the effort at that point.
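
A minimal sketch of that life-table view, in Python. The counts are made up, and `call_life_table` is a hypothetical helper rather than anything from the project. It only tabulates what was observed: for each call number, how many cases were still active and what share of them reached a final outcome on that call.

```python
from collections import Counter


def call_life_table(final_call_numbers, n_cases):
    """Rows of (call number, cases still active, finalized at this call, rate)."""
    finalized = Counter(final_call_numbers)
    active = n_cases
    rows = []
    for call in range(1, max(finalized) + 1):
        done = finalized.get(call, 0)
        rate = done / active if active else 0.0
        rows.append((call, active, done, rate))
        active -= done  # finalized cases leave the active sample
    return rows


# Hypothetical data: the call number on which each finalized case got its final outcome.
final_calls = [1] * 40 + [2] * 25 + [3] * 15 + [4] * 8 + [5] * 4 + [6] * 2 + [7] * 1
life_table = call_life_table(final_calls, n_cases=120)
for call, active, done, rate in life_table:
    print(f"call {call}: active={active:3d}  finalized={done:2d}  rate={rate:.3f}")
```

With these invented counts the rate falls steadily across calls, which is the pattern that would tempt you to pick a cutoff.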

The problem is that simulating what would happen if you place a ceiling on the number of calls isn'…

### On the mutability of response probabilities

I am still thinking about the estimation and use of response propensities during data collection. One tactic that may be used is to identify low propensity cases and truncate effort on them. This is a cost saving measure that makes sense if truncating the effort doesn't lead to a change in estimates.

I do have a couple of concerns about this tactic. First, each step back may seem quite small. But if we take this action repeatedly, we may end up with a cumulative change in the estimate that is significant. One way to check this is to continue the truncated effort for a subsample of cases.

Second, and more abstractly, I am concerned that our estimates of response propensities will become reified in our minds. That is, a low propensity case is always a low propensity case and there is nothing to do about that. In fact, the propensity is always conditional upon the design under which it is estimated. We ought to be looking for design features that change those probabilities. Prefera…

### More on changing response propensities

I've been thinking some more about this issue. A study that I work on monitors the estimated mean response propensities every day. The models are refit each day and the estimates updated. The mean estimated propensity of the active cases for each day is then graphed. Each day they decline.

The study has a second phase. In the second phase, the response probabilities start to go up. Olson and Groves wrote a paper using these data. They argue that the changed design has changed the probabilities of response. I agree with that point of view in this case.

But I also recently finished a paper that looked at the stability of the estimated coefficients over time from models that are fit daily on an ever increasing dataset. The coefficients become quite stable after the first quarter. So the increase in probabilities in the second phase isn't due to changes in the coefficients.

The response probabilities we monitor don't account for the second phase (there's no predictor for…

### Do response propensities change with repeated calling?

I read a very interesting article by Mike Brick. The discussion of changing propensities in section 7 on pages 341-342 was particularly interesting. He discusses the interpretation of changes in average estimated response propensities over time. Is it due to changes in the composition of the active sample? Or, is it due to within-unit decreases in probability caused by repeated application of the same protocol (i.e. more calls)?

To me, it seems evident that people's propensities to respond do change. We can increase a person's probability of response by offering an incentive. We can decrease another person's probability by saying "the wrong thing" during the survey introduction.
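
The composition explanation is easy to illustrate with a small calculation. This is only a sketch under a strong assumption: every case keeps exactly the same per-call propensity on every call, so nothing changes within a unit. The two groups and their propensities are hypothetical.

```python
def mean_active_propensity(groups, n_calls):
    """groups: list of (count, per-call propensity), held fixed across calls.

    Returns the expected mean propensity among still-active cases before each call.
    """
    active = [float(count) for count, _ in groups]
    props = [p for _, p in groups]
    means = []
    for call in range(n_calls):
        total = sum(active)
        means.append(sum(a * p for a, p in zip(active, props)) / total)
        # Only the expected nonrespondents remain active for the next call.
        active = [a * (1 - p) for a, p in zip(active, props)]
    return means


# Hypothetical sample: 500 easier cases (0.5 per call) and 500 harder cases (0.1 per call).
means = mean_active_propensity([(500, 0.5), (500, 0.1)], n_calls=5)
print([round(m, 3) for m in means])
```

The average among active cases declines call after call even though no individual propensity moves, because the easier cases leave the active sample faster. A declining average by itself therefore cannot distinguish the two explanations.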

But the article specifically discusses whether additional calls actually change the callee's probability of response. In most models, the number of calls is a very powerful predictor. Each additional call lowers the probability of response. Brick points out that there are two interpret…

### Optimal Resource Allocation and Surveys

I just got back from Amsterdam where I heard the defense of a very interesting dissertation. You can find the full dissertation here. One of the chapters is already published and several others are forthcoming.

The dissertation uses optimization techniques to design surveys that maximize the R-Indicator while controlling measurement error for a fixed budget. I find this to be very exciting research as it brings together two fields in new and interesting ways. I'm hoping that further research will be spurred by this work.

### Daily Propensity Models

We estimate daily propensity models for a number of reasons. A while ago, I started looking at the consistency of the estimates from these models. I worry that the estimates may be biased early in the field period.

I found this example a couple of years ago where the estimates seemed pretty consistent.

I went back recently to see what examples of inconsistent estimates I could find. I have this example where an estimated coefficient (for the number of prior call attempts) in a daily model varies a great deal over the first few months of a study.

It turns out that some of these coefficient estimates are significantly different.

The model from this example was used to classify cases. The estimated propensities were split into tertiles. It turns out that these differences in estimation only made a difference in the classification of about 15% of the cases. But that is 15% that get misclassified at least some of the time.
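
As a sketch of what that comparison involves, the snippet below splits two sets of estimated propensities for the same cases into tertiles and reports the share of cases whose tertile differs. The function names and the six propensities are invented for illustration, and ties are broken by rank order, which is an assumption.

```python
def tertile_labels(propensities):
    """Label each case 0 (low), 1 (middle), or 2 (high) by rank-based tertile."""
    n = len(propensities)
    order = sorted(range(n), key=lambda i: propensities[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = min(2, rank * 3 // n)
    return labels


def share_reclassified(p_day_a, p_day_b):
    """Share of cases that land in a different tertile under the two sets of estimates."""
    a, b = tertile_labels(p_day_a), tertile_labels(p_day_b)
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical estimates for the same six cases from models fit on two different days.
day_a = [0.05, 0.10, 0.20, 0.25, 0.40, 0.50]
day_b = [0.06, 0.22, 0.12, 0.24, 0.41, 0.48]
print(share_reclassified(day_a, day_b))
```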

### Speaking of costs...

I found another interesting article that talked about costs. This one, from Teitler and colleagues, described the apparent nonresponse biases present at different levels of cost per interview. This cuts to the chase on the problem. The basic conclusion was that, at least in this case, the most expensive interviews didn't change estimates.

This enables discussing the tradeoffs in a more specific way. With a known amount of the budget that didn't prove to change estimates, could you make greater improvements by getting more cases that cost less? Spending more on questionnaire design? etc.

Of course, that's easy to say after the fact. Before the fact, armed with less than complete knowledge, one might want to go after the expensive cases to be sure they are not different. Of course, I'd argue that you'd want to do that in a way that controlled costs (subsampling) until you achieve more certainty about the value of those data.

### Keeping track of the costs...

I'm really enjoying this article by Andresen and colleagues on the costs and errors associated with tracking (locating panel members). They look at both sides of the problem. I think that is pretty neat.

There was one part of the article that raised a question in my mind. On page 46, they talk about tracking costs. They say "...[t]he average tracing costs per interview for stages 1 and 2 were calculated based on the number of tracing activities performed at each stage." An assumption here -- I think -- is that each tracing activity (they list 6 different manual tracing activities) takes the same amount of time. So take the total time from the tracing team, and divide it by the number of activities performed, and you have the average time per activity.
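
Here is how I read that calculation, as a sketch. This is my interpretation, not the authors' code, and every number (hours, activity counts, interviews, hourly rate) is hypothetical.

```python
def tracing_cost_per_interview(total_hours, activities_by_stage, interviews_by_stage, hourly_rate):
    """Average tracing cost per interview by stage, assuming every activity takes equal time."""
    total_activities = sum(activities_by_stage.values())
    hours_per_activity = total_hours / total_activities  # the equal-time assumption
    return {
        stage: hours_per_activity * n_activities * hourly_rate / interviews_by_stage[stage]
        for stage, n_activities in activities_by_stage.items()
    }


# Hypothetical inputs.
costs = tracing_cost_per_interview(
    total_hours=400,
    activities_by_stage={"stage 1": 600, "stage 2": 1000},
    interviews_by_stage={"stage 1": 300, "stage 2": 100},
    hourly_rate=20,
)
print(costs)
```

If some activity types actually take much longer than others, the per-stage figures shift toward whichever stage uses more of the slow activities, which is where the equal-time assumption matters.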

This is perfectly reasonable and fairly robust. You might do better with a regression model predicting hours from the types and numbers of activities performed in a week. Or you might ask for more specific information on t…

I spent a few posts cataloging design features that could be considered adaptive. No one labelled them that way in the past. But if we were already doing it, why do we need the new label?

I think there are at least two answers to that:

1. Thinking about these features allows us to bring in the complexity of surveys. Surveys are multiple phase activities, where the actions at different phases may impact outcomes at later phases. This makes it difficult to design experiments. In clinical trials, some have labelled this phenomenon as "practice misalignments." They note that trials that focus on single-phase, fixed-dose treatments are not well aligned with how doctors actually treat patients. The same thing may happen for surveys. When something doesn't work, we don't usually just give up. We try something else.

2. It gives us a concept to think about these practices. It is an organizing principle that can help identify common features, useful experimental methods, …

### Panel Studies as a Place to Explore New Designs

I really enjoyed this paper by Peter Lynn on targeting cases for different recruitment protocols. He makes a solid case for treating cases unequally, with the goal of equalizing response probabilities across subgroups. It also includes several examples from panel surveys.

I strongly agree that panel surveys are a fertile ground for trying out new kinds of designs. They have great data and there is a chain of interactions between the survey organization and the panel member. This is more like the adaptive treatment setting that Susan Murphy and colleagues have been exploring. I believe that panel surveys may be a fertile ground for bringing together ideas about adaptive treatment regimes and survey design.

### Persuasion Letters

This is a highly tailored strategy. The idea is that certain kinds of interviewer observations about contact with sampled households will be used to tailor a letter that is sent to the household. For example, if someone in the household says they are "too busy" to complete the survey, a letter is sent that specifically addresses that concern.

It's pretty clear that this is adaptive. But here again, thinking about it as an adaptive feature could improve a) our understanding of the technique, and b) -- at least potentially -- its performance.

In practice, interviewers request that these letters be sent. There is variability in the rules they use about when to make that request. This could be good or bad. It might be good if they use all of the "data" that they have from their contacts with the household. That's more data than the central office has. On the other hand, it could be bad if interviewers vary in their ability to "correctly" identify case…

### Sorry I missed you...

This is another post in a series on currently used survey design features that could be "relabeled" as adaptive. I think it is helpful to relabel for a couple of reasons. 1) It demonstrates a kind of feasibility of the approach, and 2) it would help us think more rigorously about these design options (for example, if we think about refusal conversions as a treatment within a sequence of treatments, we may design better experiments to test various ways of conducting conversions).

The design feature I'm thinking of today has to do with a card that interviewers leave behind sometimes when no one is home at a face-to-face contact attempt. The card says "Sorry I missed you..." and explains the study and that we will be trying to contact them.

Interviewers decide when to leave these cards. In team meetings with interviewers, I heard a lot of different strategies that interviewers use with these cards. For instance, one interviewer said she leaves them every time, eve…

In the same vein as previous posts, I'm continuing to think about current practices that might be recast as adaptive.

Call limits are a fairly common practice. But they are also, at least for projects that I have worked on, notoriously difficult to implement. For example, it may happen that when project targets for numbers of interviews are not being met, then these limits will be violated.

We might even argue that, since the timing of the calls is not always well regulated, it is difficult to claim that cases have received equal treatments prior to reaching the limit. For example, three calls during the same hour are not likely to be as effective as three calls placed on different days and times of day. Yet both would reach a three-call limit. [As an aside, it might make more sense to place a lower limit on "next call" propensities estimated from models that include information about the timing of the calls, as Kreuter and Kohler do here.]
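
To make the contrast concrete, here is a sketch with invented call-level propensities. Each value is the probability that a call succeeds given that all earlier calls failed, so the chance of a response somewhere in the sequence is one minus the product of the failure probabilities. The `stop_calling` rule and its floor are hypothetical stand-ins for a limit on next-call propensity, not the model from the cited paper.

```python
import math


def cumulative_response_probability(call_probs):
    """P(response by the end of the sequence), where call_probs[k] is the
    probability that call k succeeds given that all earlier calls failed."""
    return 1 - math.prod(1 - p for p in call_probs)


def stop_calling(next_call_prob, floor):
    """Stopping rule on the estimated next-call propensity instead of a call count."""
    return next_call_prob < floor


# Hypothetical propensities for two sequences that both hit a three-call limit.
same_hour = [0.10, 0.02, 0.02]    # three calls within the same hour
spread_out = [0.10, 0.12, 0.15]   # three calls on different days and times

print(round(cumulative_response_probability(same_hour), 3))
print(round(cumulative_response_probability(spread_out), 3))
print(stop_calling(next_call_prob=0.02, floor=0.05))
```

With these made-up numbers the same-hour sequence ends at roughly 0.14 and the spread-out sequence at roughly 0.33, even though a call counter treats the two cases identically.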

In any event, subject …

### Again on Refusal Conversions

This isn't a technique that gets much attention. I can think of three articles on the topic. I know of one article (Fuse and Xie, 2007) that investigates refusal conversions in telephone surveys and collects information (observations) from interviewers. And I just googled another one (Beullens, et al., 2010) that investigates the effects of time between initial refusal and first conversion attempt.

There is a third article (Burton, et al. 2006) on refusal conversions in panel studies. This one adds another element in that a key consideration is whether refusers that are converted will remain in the panel in subsequent waves. This problem seems to fit really well into the sequential decisionmaking framework. The decision is at which waves, for any given case that refuses, should you try a refusal conversion. You might, for instance, optimize the expected number of responses (completed interviews) over a certain number of waves. Or, you might maximize other measures of data quality.

I have two feelings about talking about adaptive or responsive designs. The first feeling is that these are new concepts, so we need to invent new methods to implement them. The second feeling is that although these are new concepts, we can point to actual things that we have always (or for a long time) done and say, "that's an example of this new concept" that existed before the concept had been formalized.

I think refusal conversions are a good example. We never really applied the same protocol to all cases. Some cases got a tailored or adaptive design feature. The rule is something like this: if the case refuses to complete the interview, then change the interviewer, make another attempt, and offer a higher incentive.
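
Written down explicitly, a rule like that is just a function from the last outcome to the next protocol. The sketch below is one way to express it; the `Protocol` fields, the outcome labels, and the incentive amounts are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Protocol:
    interviewer: str
    incentive: int
    attempts_remaining: int


def next_protocol(current, last_outcome, conversion_interviewer, raised_incentive):
    """Refusal-conversion rule: on a refusal, change the interviewer, add one
    more attempt, and raise the incentive. Other outcomes leave the protocol alone."""
    if last_outcome == "refusal":
        return replace(
            current,
            interviewer=conversion_interviewer,
            incentive=max(current.incentive, raised_incentive),
            attempts_remaining=current.attempts_remaining + 1,
        )
    return current


start = Protocol(interviewer="A", incentive=20, attempts_remaining=0)
print(next_protocol(start, "refusal", conversion_interviewer="B", raised_incentive=40))
```

Once the rule is explicit, each piece (who the new interviewer is, how much the incentive rises, how many extra attempts) becomes something that can be varied and tested.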

I'm trying to think systematically about these kinds of examples. Some are trivial ("if there is no answer on the first call attempt, then make a second attempt"). But others may not be. The more of these we can root out, the more we can fo…

### New Objective Functions...

I've argued in previous posts that the response rate has functioned like an objective function that has been used to design "optimal" data collections. The process has been implicitly defined this way. And it is probably the case that the designs are less than optimal for maximizing the response rate. Still, data collection strategies have been shaped by this objective function.

Switching to new functions may be difficult for a number of reasons. First, we need other objective functions. These are difficult to define as there is always uncertainty with respect to nonresponse bias. Which function may be the most useful? R-Indicators? Functions of the relationships between observed Y's and sampling frame data?

There are theoretical considerations, but we also need empirical tests. What happens empirically when data collection has a different goal? We haven't systematically tested these other options and their impact on the quality of the data. That should be high o…

### Empirical Data on Survey Costs

I pointed out an interesting (if older) book by Seymour Sudman a few posts ago -- "Reducing the Cost of Surveys" from 1967. There is another book that talks about survey costs -- Groves' "Survey Errors and Survey Costs" from 1989.

Groves talks about cost models for a telephone facility. The models are quite detailed. He notes that computerized telephone facilities can quite accurately estimate many of the parameters in the model. He does give a long, detailed table comparing costs for a telephone and face-to-face survey.

Most of the discussion in Groves' book is of telephone facilities. But the same modeling approach could be taken to face-to-face surveys. The problem is that in that kind of survey, we can't rely on computers to keep track of time that different tasks take. So estimation of the model parameters is going to be more difficult. But, at least conceptually, this would be a useful approach. That would allow us to bring in costs to more facets of the s…

I was at a very interesting workshop today on adaptive interventions. Most of the folks at the workshop design interventions for chronic conditions and would be used to testing their interventions using a randomized trial.

Much of the discussion was on heterogeneity of treatment effects. In fact, much of their research is based on the premise that individualized treatments should do better than giving everyone the same treatment. Of course, the average treatment might be the best course for everyone, but they have certainly found applications where this is not true. It seems that many more could be found.

I started to think about applications in the survey realm. We do have the concept of tailoring, which began in our field with research into survey introductions. But do we use it much? I have two feelings on this question. No, there aren't many examples like the article I linked to above. We usually test interventions (design features like incentives, letters, etc.) on the whole …

I just got back from JSM where I saw some presentations on responsive/adaptive design. The discussant did a great job summarizing the issues. He raised one of the key questions that always seems to come up for these kinds of designs: If you have those data available for all the cases, why bother changing the data collection when you can just use nonresponse adjustments to account for differences along those dimensions?

This is a big question for these methods. I think there are at least two responses (let me know if you have others).

First, in order for those nonresponse adjustments to be effective, and assuming that we will use weighting cell adjustments (the idea extends easily to propensity modeling), the respondents within any cell need to be equivalent to a random sample of the cell. That is, the respondents and nonrespondents need to have the same mean for the survey variable. A question might be, at what point does that assumption become true? Of course, we don't know. But …

### Survey Costs

I'm reading an interesting book, Seymour Sudman's "Reducing the Cost of Surveys." It was written in 1967, so some of the book is about "high tech" methods like using the telephone and scanning forms.

The part I'm interested in is the interviewer cost models. I'm used to the cost models in sampling texts, which are not very elaborate. Sudman has much more elaborate cost models. For example, the costs of surveys can vary across different types of PSUs and for interviewers who live different distances from their sample clusters.

It brings to mind Groves' book on Survey Errors and Survey Costs, only because they are among the few examples that have looked closely at costs.

The problem in my work is that it is often difficult to estimate costs. Things get lumped together. Interviewers estimate how much time various activities take. It seems like we've been really focused on the "errors" part of the equation and assumed that the "costs"…

### Exploration vs exploitation

Once more on this theme that I discussed on this blog several times last year. This is a central problem for the field of research known as reinforcement learning. I'd recommend taking a look at Sutton and Barto's book if you are interested. It's not too technical and can be understood by someone without a background in machine learning.

As I mentioned in my last post, I think learning in the survey environment is a tough problem. The paper that proposed the upper confidence bound rule said it works well for short run problems -- but the short run they envisioned was something like 100 trials.

In the survey setting, there aren't repeated rewards. We're usually looking for one interview. You might think of gaining contact as another reward, but still. We're usually limited to a relatively small number of attempts (trials). We also often have poor estimates of response and contact probabilities to start with. Given that reward structure, poor prior information, a…

### Contact Strategies: Strategies for the Hard-to-Reach

One of the issues with looking at average contact rates (like with the heat map from a few posts ago) is that it's only helpful for average cases. In fact, some cases are easy to contact no matter what strategy you use, other cases are easy to contact when you try a reasonable strategy (i.e. calling during a window with an average high contact rate), but what is the best strategy for the hard-to-reach cases? I've proposed a solution that tries to estimate the best time to call using the accruing data.

I know other algorithms might explore other options more quickly. For instance, choosing the window with the highest upper bound on a confidence interval. It might be interesting to try these approaches, particularly for studies that place limits on the number of calls that can be made. The lower the limit, the more exploration may pay off.
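
For the curious, here is a sketch of one common form of that idea, the UCB1 index (observed rate plus an exploration bonus that shrinks as a window is tried more often). Whether this exact index suits a survey setting is an open question; the window names and counts are made up.

```python
import math


def ucb_window(contacts, attempts):
    """Pick the call window with the highest UCB1 index for a single case."""
    total = sum(attempts.values())
    best, best_score = None, -math.inf
    for window, n in attempts.items():
        if n == 0:
            return window  # untried windows are explored first
        score = contacts[window] / n + math.sqrt(2 * math.log(total) / n)
        if score > best_score:
            best, best_score = window, score
    return best


def greedy_window(contacts, attempts):
    """Pick the window with the highest observed contact rate (no exploration)."""
    return max(attempts, key=lambda w: contacts[w] / attempts[w] if attempts[w] else 0.0)


# Hypothetical history of attempts and contacts by window.
attempts = {"weekday day": 10, "weekday evening": 10, "weekend": 2}
contacts = {"weekday day": 2, "weekday evening": 4, "weekend": 0}
print(ucb_window(contacts, attempts), "|", greedy_window(contacts, attempts))
```

In this example the greedy rule keeps calling on weekday evenings, while the upper-bound rule tries the weekend again because two failed attempts are too few to rule that window out.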

### Optimization of Survey Design

I recently pointed out this article by Calinescu and colleagues that uses techniques (specifically Markov Decision Process models) from operations research to design surveys. One of the citations from Calinescu et al. is to this article, which I had never seen, about using nonlinear programming techniques to solve allocation problems in stratified sampling.

I was excited to find these articles. I think these methods have the promise of being very useful for planning survey designs. If nothing else, posing the problems in the way these articles do at least forces us to apply a rigorous definition of the survey design.

It would be good if more folks with these types of skills (operations research, machine learning, and related fields) could be attracted to work on survey problems.

### Call Windows as a Pattern

The paradata book, edited by Frauke Kreuter, is out! I have a chapter in the book on call scheduling.

One of the problems that I mention is how to define call windows. The goal should be to create homogenous units. For example, I made the following heatmap that shows contact rates by hour for a face-to-face survey. The figure includes contact rates for all cases and for the subset of cases that were determined to be eligible.

I used this heatmap to define contiguous call windows that were homogenous with respect to contact rates. I used ocular inspection to define the call windows.

I think this could be improved. First, clustering techniques might produce more efficient results. I assumed that the call windows had to be contiguous, but this might not be true.
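
As a rough illustration of dropping the contiguity assumption, the sketch below groups hours purely by contact rate, cutting the sorted rates at the largest gaps. This is a deliberately simple one-dimensional clustering, not the method used for the chapter, and the hourly rates are invented.

```python
def cluster_hours_by_rate(rates, k):
    """Group hours into k windows by cutting the sorted contact rates
    at the k - 1 largest gaps. Windows need not be contiguous in time."""
    ordered = sorted(rates, key=rates.get)
    gap_positions = sorted(
        range(1, len(ordered)),
        key=lambda i: rates[ordered[i]] - rates[ordered[i - 1]],
        reverse=True,
    )[: k - 1]
    windows, start = [], 0
    for cut in sorted(gap_positions) + [len(ordered)]:
        windows.append(sorted(ordered[start:cut]))
        start = cut
    return windows


# Hypothetical contact rates by hour of day (24-hour clock).
rates = {9: 0.10, 10: 0.12, 12: 0.22, 14: 0.11, 17: 0.35, 18: 0.38, 19: 0.36, 20: 0.24}
print(cluster_hours_by_rate(rates, k=3))
```

With these made-up rates, noon and 8 pm end up in the same window, something an eyeballed contiguous grouping would never produce.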

Second, along what dimension do we want these windows to be homogenous? The contact rate is really a proxy. We want them to be homogenous with respect to the result of the next call on any case, or really our final goal of interviewing the…

### What is "responsive design"?

This is a question that I get asked quite frequently. Most of what I would want to say on the topic is in this paper I wrote with Mick Couper a couple of years ago.

I have been thinking that a little historical context might help in answering such a question. I'm not sure the paper we wrote does that. I imagine that surveys of old were designed ahead of time, carried out, and then evaluated after they were complete. Probably too simple, but it makes sense. In field surveys, it was hard to even know what was happening until it was all over.

As response rates declined, it became more difficult to manage surveys. The uncertainty grew. Surveys ended up making ad hoc changes more and more frequently. "Oh no, we aren't hitting our targets. Increase the incentive!" That seems like a bad process. There isn't any planning, so bad decisions and inefficiency are more likely. And it's hard to replicate a survey that includes a "panic" phase.

Not to put words in…

### Incentive Experiments

This post by Andy Peytchev got me thinking about experimental results. It seems like we spend a lot of effort on experiments that are replicated elsewhere. I've been part of many incentive experiments. Only some of those results are published. It would be nice if more of those results were widely available.

Each study is a little different, and may need to evaluate incentives for its specific "essential conditions." And some of that replication is good, but it seems that the overall design of these experiments is pretty inefficient. We typically evaluate incentives at specific points in time, then change the incentive. It's like a step function.

I keep thinking there has to be inefficiency in that process. First, if we don't choose the right time to try a new experiment then we will experience losses in efficiency and/or response rates. Second, we typically ignore our prior information and allocate half the sample to each of two conditions. Third, we set up ad ho…

### Reinventing the wheel...

This blog on how machine learning reinvented many of the techniques first developed in statistics got me thinking. When I dip into non-survey methods journals to see how research from survey methodology is used, it sometimes seems like people in other fields are either not aware of our research or only vaguely aware. For instance, it seems like there is research on incentives in several substantive fields that goes on without awareness across the disciplines.

It's not that everyone needs to be a survey methodologist, but it would be nice if there were more awareness of our research. Otherwise there is the risk that researchers in other fields will simply reinvent the wheel.

### Nonresponse Bias Analysis

I've been thinking about a nonresponse bias analysis that I am working on for a particular project. We often have this goal for these kinds of analyses of saying we could lower response rates without increasing the relative bias. I wrote about the risks of this approach in a recent post.

Now I'm wondering about alternatives. There may be risks from lowering response rates, but is it wise to continue sinking resources into producing high response rates as a protection against potential biases? I recently read an article by Holle and colleagues where they actually worked out the sample size increase they could afford under reduced effort (fewer calls, no refusal conversion, etc.). They made explicit tradeoffs in this regard between the risk of nonresponse bias (judged to be minimal) and sampling error.
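
One way to frame that kind of tradeoff is through mean squared error: reduced effort buys more interviews (less sampling variance) at the risk of some bias. The sketch below is not the calculation from the article. It assumes simple random sampling with no design effect, and the budget, costs, standard deviation, and bias values are all hypothetical.

```python
import math


def rmse_of_mean(budget, cost_per_interview, sd, bias):
    """Root mean squared error of a sample mean: sqrt(bias^2 + sd^2 / n)."""
    n = budget // cost_per_interview  # completed interviews the budget buys
    return math.sqrt(bias ** 2 + sd ** 2 / n)


# All numbers are hypothetical.
full_effort = rmse_of_mean(budget=100_000, cost_per_interview=100, sd=0.5, bias=0.0)
reduced_small_bias = rmse_of_mean(budget=100_000, cost_per_interview=50, sd=0.5, bias=0.01)
reduced_large_bias = rmse_of_mean(budget=100_000, cost_per_interview=50, sd=0.5, bias=0.02)
print(full_effort, reduced_small_bias, reduced_large_bias)
```

In this toy setup, doubling the sample under reduced effort wins if the bias it introduces is one percentage point and loses if it is two, so the conclusion rests entirely on how well the bias is known.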

I'm still not completely satisfied with this approach. I'd like to see a design that considers the risks and allocates resources proportional to the risk in some way. So th…

### Renamed the blog

I wanted to rename the blog from the moment that I first named it. Which just means that I should have mulled it over a little more back then. Oh well, what's in a name anyway...

### Lowering response rates may be a slippery slope

I have read a couple of recent studies that compared early to late responders and concluded that late responders did not add anything to estimates.

I have a couple of concerns. The first concern is with this approach. A simulation of this sort may not lead to the same results if you actually implement the truncated design. If interviewers know they are aiming for a lower response rate, then they may recruit differently. So, at a lower response rate, you may end up with a different set of respondents than this type of simulation would indicate.

My second concern is that it is always easy to conclude that a lower response rate yields the same result. But you could imagine a long series of these steps edging up to lower and lower response rates. None of the steps changes estimates, but cumulatively they might.

I have this feeling that we might need to look at studies like this in a new way. Not as an indication that it is OK to lower response rates, but as a challenge to redesign what we…

### Balancing Response II

My last post was about balancing response. I expressed the view that lowering response rates for subgroups to that of the lowest responding group might not be beneficial. But I left open the question of why we might benefit from balancing on covariates that we have and can use in adjustment.

At AAPOR, Barry Schouten presented some results of an empirical examination of this question. Look here for a paper he has written on this question. I have some thoughts that are more theoretical or heuristic on this question.

I start from the assumption that we want to improve response rates for low-responding groups. While it is true that we can adjust for these response rate differences, we can at least empirically verify this by improving response for some groups. Does going from a 40% to a 60% response rate for some subgroup change estimates for that group? Particularly when that movement in response rates results from a change in design, we can partially verify our assumptions that nonresponders a…

### Balancing Response

I have been back from the AAPOR conference for a few days. I saw several presentations that had me thinking about the question of balancing response. By "balancing response," I mean actively trying to equalize response rates across subgroups. I can define the subgroups using data that are complete (i.e. on the sampling frame or paradata available for responders and nonresponders).

I think there probably are situations where balancing response might be a bad thing. For instance, if I'm trying to balance response across two groups, persons 18-44 and 45+, and I have a 20% response rate among 18-44 year olds and a 70% response rate among 45+ persons, I might "balance response" by stopping data collection for 45+ persons when I reach a 20% response rate. It's always easy to lower response rates. It might even be less expensive to do so.
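
A quick calculation shows why a balance measure alone can reward this. The sketch uses the R-indicator in its usual form, R = 1 - 2 S(rho), with S computed here as a share-weighted population standard deviation of the group response rates. The equal group sizes and the third scenario are assumptions added for illustration.

```python
import math


def overall_rate(response_rates, shares):
    """Overall response rate as the share-weighted mean of group rates."""
    return sum(s * r for r, s in zip(response_rates, shares))


def r_indicator(response_rates, shares):
    """R = 1 - 2 * S(rho), with S the share-weighted standard deviation of group rates."""
    mean = overall_rate(response_rates, shares)
    var = sum(s * (r - mean) ** 2 for r, s in zip(response_rates, shares))
    return 1 - 2 * math.sqrt(var)


shares = [0.5, 0.5]  # assumption: the two age groups are the same size
designs = {
    "as observed": [0.20, 0.70],
    "truncate the 45+ group": [0.20, 0.20],
    "raise the 18-44 group": [0.40, 0.70],
}
for name, rates in designs.items():
    print(f"{name}: response rate={overall_rate(rates, shares):.2f}, R={r_indicator(rates, shares):.2f}")
```

Truncating the 45+ group gives perfect balance (R = 1) at a 20% response rate, while raising the 18-44 group improves both the response rate and the balance, just less dramatically on the indicator.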

But I think such a strategy avoids the basic problem. How might I optimize the data collection to reduce the risk of nonrespons…

### The participation decision -- it only matters to the methodologist!

I'm reading a book, Kluge, about the working of the human mind. The author takes an evolutionary perspective to explain the odd ways in which the brain functions. Newer functions were grafted onto older functions. The whole thing doesn't work very smoothly for certain situations, particularly modern social life.

In one example, he cites experimental evidence (I believe, using vignettes) that says people will drive across town to save \$25 on a \$100 purchase, but won't drive across town to save \$25 on a \$1,000 purchase. It's the same savings, but different relative amounts.

I tend to think that the decision to participate in surveys is not very important to anyone but the methodologist. And that's why it seems so random to us -- for example, our models predicting whether anyone will respond are so relatively poor. This book reminded me that decisions that aren't very important end up being run through mental processes that don't always produce rational outcom…

The normal strategy for a publicly-released dataset is for the data collector to impute item missing values and create a single weight that accounts for probability of selection, nonresponse, and noncoverage. This weight is constructed under a model that needs to be appropriate across every statistic that could be published from these data. The model needs to be robust, and may be less efficient for some analyses.

More efficient analyses are possible. But in order to do that, data users need more data. They need data for nonresponders. In some cases, they may need paradata on both responders and nonresponders. At the moment, one of the few surveys that I know of that is releasing these data is the NHIS. The European Social Survey is another. Are there others?

Of course, not everyone is going to be able to use these data. And, in many cases, it won't be worth the extra effort. But it does seem like there is a mismatch between the theory and practice in this case.

Not only would th…

### Sequentially estimated propensity models

We've been estimating response propensity models during data collection for a while. We have at least two reasons for doing this:

1) We monitor the average response probability for active cases.

2) I use estimates from these models to determine the next step in experiments.

There is some risk to estimating models in this way, particularly for the second purpose. The data used to make the estimates are accumulating over time. And those data don't come in randomly -- the easiest cases come in early and the difficult cases tend to come in later.

If I'm interested in the average impact of adding an 8th call to active cases, I might get a different estimate early in the field period than later.
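To make the mechanism concrete, here is a minimal sketch in expected values rather than real data. It assumes just two hypothetical kinds of cases ("easy" and "hard") with made-up per-call response probabilities, and it computes the pooled response rate per call that I would see if I stopped and estimated after every case had received up to `max_call` calls. None of these numbers come from an actual survey.

```python
def pooled_rate(max_call, p_easy=0.5, p_hard=0.1, share_easy=0.5):
    """Expected completes per call using all call records through max_call.

    p_easy, p_hard and share_easy are hypothetical values chosen only to
    illustrate how the accumulating data shift toward difficult cases.
    """
    calls = 0.0
    completes = 0.0
    active_easy = share_easy
    active_hard = 1.0 - share_easy
    for _ in range(max_call):
        # Every still-active case receives one more call.
        calls += active_easy + active_hard
        done_easy = active_easy * p_easy
        done_hard = active_hard * p_hard
        completes += done_easy + done_hard
        # Completed cases leave the active pool, so the pool gets harder.
        active_easy -= done_easy
        active_hard -= done_hard
    return completes / calls


for k in (1, 2, 4, 8):
    print(k, round(pooled_rate(k), 3))
```

The pooled rate falls as more calls accumulate, because the later call records come disproportionately from the hard cases. An estimate made early in the field period is built mostly from the easy ones.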

In practice, the impact of this isn't as severe as you might think and there are remedies. Which leads me to the self-promotion part of this post ... I'll be presenting on this topic at AAPOR this year.

The Journal of Official Statistics has a special issue on systems and architecture that looks very interesting. Many of the authors mention the phenomenon of "silos" or "stovepipes." This is the situation where production is organized around projects rather than tasks. This kind of organization can lead to multiple projects independently developing software tools to do the same thing.

I think this phenomenon also has an effect on the paradata. Since these silos are organized around projects, the opportunity to collect methodologically relevant paradata may be lost. The focus is on collecting the data for the project.

New systems do present an opportunity to develop new paradata. It seems like defining cross-project tasks and developing unified systems is the better option. Within that framework, it might be helpful to think of methodologists as performing a task and, therefore, include them in the design of new systems.

That's the…

### More thoughts on the cost of paradata....

Matt Jans had some interesting thoughts on costs on his blog. I like the idea of small pilot tests. In fact, we do a lot of turning on and off of interviewer observations and other elements. In theory, this creates nearly experimental data that I have failed to analyze. My guess is that the amount of effort created by these few elements is too small to be detected given the sample sizes we have (n=20,000ish). That's good, right? The marginal cost of any observation is next to zero.

At a certain point, adding another observation will create a problem. It will be too much. Just like adding a little more metal to a ball bearing will transform it into a... lump of metal. Have we found that point yet?

Last week, we did find an observation that was timed using keystroke data. We will be taking a look at those data.

### Responsive Design and Information

It seems odd to say, but "Responsive Design" has now been around for a while. Groves and Heeringa published their paper in 2006. The concept has probably been stretched in all directions at this point.

I find it helpful to go back to the original problem statement: we design surveys as if we know the results of each design decision. For example, we know what the response rate will be given a certain design (mode, incentive, etc. -- the "essential conditions"). How would we act if we had no idea about the results? We would certainly expend some resources to gain some information.

Responsive design is built upon this idea. Fortunately, in most situations, we have some idea about what the results might be, at least within a certain range. We experiment within this range of design options in order to approach an optimal design. We expend resources relatively inefficiently in order to learn something that will improve the design of later phases.

I've seen people workin…

One of the hidden costs of paradata is the time spent analyzing these data. Here, we've spent a lot of time trying to find standard ways to convert these data into useful information. But many times, we end up doing specialized analyses, searching for an explanation of some issue. And, sometimes, this analysis doesn't lead to clear-cut answers.

In any event, paradata aren't just collected, they are also managed and analyzed. So there are costs for generating information from these data. We could probably think of this in a total survey error perspective. "Does this analysis reduce total error more than increasing the number of interviews?" In practice, such a question is difficult to answer. What is the value of the analysis we never did? And how much would it have cost?

There might be two extreme policies in this regard. One is "paralysis by analysis." Continually seeking information and delaying decisions. The other extreme is "flying by the sea…

Still on this topic... We looked at the average time to complete a set of questions. These actions may be repeated many times (each time we have contact with a sampled household), but it still amounts to a trivial portion of total interviewer time (about 0.4%). They don't have to add much value to justify those costs.

On the other hand, there are still a couple of questions. 1) Could we reduce measurement error on these questions if we spent more time on them? Brady West has looked at ways to improve these observations. If a few seconds isn't enough time, would more time improve the measurements?

My hunch is that more time would improve the observations, but it would have other consequences. Which leads me to my second question: 2) Do these observations interfere with other parts of the survey process? For example, can they distract interviewers from the task of convincing sampled persons to do the interview?  My hunch on the latter question is that it is possible, but our curr…

I'm interested in this question again. I wrote about the costs of paradata a while ago. These costs can vary quite a lot depending upon the situation. There aren't a lot of data out there about these costs. It might be good to start looking at this question.

One big question is interviewer observations. The technical systems that we use here have some limitations. Our sample management system doesn't create "keystroke files" that would allow us to determine how long call records take. But when we use our CAPI software, we can capture those data.

Such timing data will allow us to answer the question about how much time it takes to create them (a key element of their costs). But they won't allow us to answer questions about how collecting those data impacts other parts of the process. For instance, does having interviewers create these data distract them from the conversation with sampled persons sufficiently to reduce response rates? The latter question probab…

### Undercoverage issues

I recently read an article by Tourangeau, Kreuter, and Eckman on undercoverage in screening surveys. One of several experiments on which they report explores how the form of the screening questions can impact eligibility rates. They compare taking a full household roster to asking if there is anyone within the eligible age range. The latter produces lower eligibility rates.

There was a panel at JSM years ago that discussed this issue. Several major screening surveys reported similar undercoverage issues.

Certainly the form of the question makes a difference. But even on screening surveys that use full household rostering, there can be undercoverage. I'm wondering what the mechanism is. If the survey doesn't advertise the eligibility criteria, how is it that some sampled units avoid being identified as eligible? This might be a relatively small source of error in the survey, but it is an interesting puzzle.

I recently found a paper by some colleagues from VU University in Amsterdam and Statistics Netherlands. The paper uses dynamic programming to identify an optimal "treatment regime" for a survey. The treatment is the sequence of modes by which each sampled case is contacted for interview. The paper is titled "Optimal resource allocation in survey designs" and is in the European Journal of Operational Research. I'm pointing it out here since survey methods folks might not follow this journal.

I'm really interested in this approach as the methods they use seem to be well-suited for the complex problems we face in survey design. Greenberg and Stokes and possibly Bollapragada and Nair are the only other examples that do anything similar to this in surveys. I'm hoping that these methods will be used more widely for surveys. Of course, there is a lot of experimentation to be done.
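For readers who haven't seen dynamic programming applied to this kind of problem, here is a toy sketch of the general idea. It is not the formulation from the paper. The modes, costs, and response probabilities below are all made up, and I assume each mode has a fixed response probability no matter what was tried before. The recursion picks, at each attempt, the mode that maximizes the chance of eventually getting an interview within a per-case budget.

```python
from functools import lru_cache

# Hypothetical modes: (cost per attempt, response probability per attempt).
MODES = {"web": (1, 0.10), "phone": (3, 0.25), "face_to_face": (8, 0.50)}


def best_regime(max_attempts, budget):
    """Return (response probability, mode sequence) for one sampled case."""

    @lru_cache(maxsize=None)
    def value(attempt, remaining):
        # No attempts left: nothing more can be gained.
        if attempt == max_attempts:
            return 0.0, ()
        best = (0.0, ())  # option of stopping here
        for mode, (cost, p) in MODES.items():
            if cost > remaining:
                continue  # cannot afford this mode
            future, plan = value(attempt + 1, remaining - cost)
            # Respond now with probability p, otherwise continue optimally.
            total = p + (1.0 - p) * future
            if total > best[0]:
                best = (total, (mode,) + plan)
        return best

    return value(0, budget)


prob, plan = best_regime(max_attempts=3, budget=10)
print(plan, round(prob, 3))
```

Because the probabilities here don't depend on history, the order of the modes doesn't really matter. The interesting cases are the ones where it does, for example when a failed web invitation changes how a later phone call performs. That is where the state space grows and this kind of recursion earns its keep.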

### Estimating Daily Contact Models in Real-Time

A couple of years ago I was running an experiment on a telephone survey. The results are described here. As part of the process, I estimated a multi-level logistic regression model on a daily basis. I had some concern that early estimates of the coefficients and resulting probabilities (which were the main interest) could be biased. The more easily interviewed cases are usually completed early in the field period. So the "sample" used for the estimate is disproportionately composed of easy responders. To mitigate the risk of this happening, I used data from prior waves of the survey (including early and late responders) when estimating the model. The estimates also controlled for level of effort (number of calls) by including all call records and estimating household-level contact rates.

During the experiment I monitored the estimated coefficients on a daily basis. They were remarkably stable over time.
Of course, nothing says it had to turn out this way. I have found exam…

### Interesting Experiment

I recently read an article about a very interesting experiment. Luiten and Schouten report on an experiment to improve the Statistics Netherlands' Survey of Consumer Sentiment. Their task was to improve representativity (defined as increasing the R-Indicator) of the survey without increasing costs and without lowering the response rate. This sounds like a difficult task. We can debate the merits of lowering response rates in "exchange" for improved representativity. But who can argue with increasing representativity without major increases in costs or decreases in response rates?
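For anyone who hasn't run into the R-Indicator, it is commonly written as R = 1 - 2S(ρ), where S(ρ) is the standard deviation of the estimated response propensities. Here is a minimal sketch with made-up propensities. It uses an unweighted population standard deviation as a simplifying assumption, whereas a real application would use propensities estimated from a model and the design weights.

```python
from statistics import pstdev


def r_indicator(propensities):
    """R = 1 - 2 * S(rho); 1 means every case has the same propensity."""
    if not propensities:
        raise ValueError("need at least one propensity")
    return 1.0 - 2.0 * pstdev(propensities)


# Hypothetical estimated response propensities for sampled cases.
balanced = [0.5, 0.5, 0.5, 0.5]
unbalanced = [0.2, 0.2, 0.8, 0.8]

print(round(r_indicator(balanced), 3))
print(round(r_indicator(unbalanced), 3))
```

Both hypothetical samples have the same average propensity (and so the same expected response rate), but the second has a much lower R-Indicator. That is the sense in which representativity and response rate are separate targets.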

The experiment has a number of features all built with the goal of meeting these constraints. One of the things that makes their paper so interesting is that each of the design features is "tailored" to the specifics of the sampled units. For those of you who like the suspense of a good survey experiment, spoiler alert: they managed to meet their objectives.

We recently finished the development of nonresponse adjustments for a large survey. We spent a lot of time modelling response probabilities and the key variables from the survey. One of our more interesting findings was that the number of calls (modeled in a number of different ways) was not predictive of key variables but was highly predictive of response. In the end, we decided not to include this predictor. It could only add noise.

But this raises a question in my mind. There might be (at least) three sources of the noise:

1) The number of calls it takes to reach someone (as a proxy of contactability) is unrelated to the key variables. Maybe we could speculate that people who are more busy are not different from those who are less busy on the key statistics (health, income, wealth, etc.).

2) The number of calls it takes to reach someone is effectively random. Interviewers make all kinds of choices that aren't random. These choices create a mismatch between contactability and the n…

### Degrees of NMAR

I've been working on a paper on nonresponse bias. As part of the setup, we describe the MAR and NMAR mechanisms first defined by Little and Rubin. In short, Missing-at-Random means we can fix the bias due to the missingness using the available data, while Not-Missing-at-Random means we can't repair the bias with the available data.

It can be hard to discuss the problem with this divide. We were looking at situations where the bias could be smaller and it could be bigger. The NMAR/MAR distinction doesn't capture that very well. There is another formulation that is actually pretty good for discussing different degrees of bias remaining after adjustment. It's due to the following article (I couldn't find it online):

Kalton, G. and D. Kasprzyk (1986). "Treatment of missing survey data." Survey Methodology 12: 1-16.

They define bias as having two components: A and B. One of the components is susceptible to adjustment and the other is not. In some situations, yo…