
Posts

Showing posts from 2012

There are call records, and then there are call records...

In my last post, I talked about how errors in call records might lead to problems. If these errors are systematic (i.e. interviewers always underreport and never overreport calls -- which seems likely), then adjustments based on call records can create (more) bias in estimates. I pointed to the simulation study that Paul Biemer and colleagues carried out. Their adjustment strategy used the call number. There are other ways to use the data from calls. For instance, if I'm using logistic regression to estimate the probability of response, I can fit a model with a parameter for each call number. Under that approach, I'm not making an assumption about the functional form of the relationship between call number and response. It's like the Kaplan-Meier estimator in survival analysis. If there is a relationship, then I can fit a logistic regression model with fewer parameters. Maybe as few as one if I think the relationship is linear. That smooths over some of the observed differences and assumes they...
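A minimal sketch of these two specifications (not the models from the Biemer et al. study), assuming a call-level data frame with hypothetical columns responded and call_number, might look like this:

```python
# A minimal sketch: two ways to use the call number when estimating response
# propensities with logistic regression. The data and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

calls = pd.DataFrame({
    "call_number": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    "responded":   [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# 1) One parameter per call number: no assumption about the functional form
#    of the relationship (analogous to a Kaplan-Meier / discrete-time hazard).
saturated = smf.logit("responded ~ C(call_number)", data=calls).fit(disp=0)

# 2) A single slope for call number: assumes the relationship is linear on the
#    logit scale, smoothing over call-to-call differences.
linear = smf.logit("responded ~ call_number", data=calls).fit(disp=0)

print(saturated.params)
print(linear.params)
```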

Errors in Call Records

I've been working with call records for a long while now. I started working with them in telephone labs. The quality of the record wasn't much of an issue there. But then I saw Paul Biemer give a presentation where he investigated this issue. I've been thinking about it a lot more over the last year or so. I recently saw that Biemer and colleagues have now published a paper on the topic. I read it over the weekend. They call these data "level-of-effort" (LOE) paradata. I agree with their conclusion that "if modelling with LOE data is to progress, the issue of data errors and their effects should be studied" (p. 17).

Historical Controls in Incentive Experiments

We often run experiments with incentives. The context seems to matter a lot, and the value of the incentive keeps changing, so we need to run many experiments to find the "right" amount to offer. These experiments are often run on repeated cross-sectional designs where we have a fair amount of experience. That is, we have already repeated the survey several times with a specific incentive payment. Yet, when we run experiments in this situation, we ignore the evidence from prior iterations. Of course, there are problems with the evidence from past iterations. There can be differences over time in the impact that a particular incentive has. For example, it might be that as the value of the incentive declines through inflation, the impact on response rates lessens. There may be other differences that are associated with other changes that are made to the survey over time (even undocumented, seemingly minor changes). On the other hand, to totally discount this evidence seems...
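One way to avoid totally discounting the historical evidence would be to downweight it rather than ignore it. Here is a toy, power-prior-style sketch of that idea; the counts and the discount factor a0 are entirely hypothetical:

```python
# Toy sketch: partially borrow historical evidence about an incentive arm by
# downweighting prior iterations instead of ignoring them or pooling them fully.
def pooled_response_rate(curr_resp, curr_n, hist_resp, hist_n, a0=0.3):
    """Combine current and historical results for one incentive arm.

    a0 in [0, 1] scales the effective sample size of the historical data:
    a0 = 0 ignores history, a0 = 1 pools it fully.
    """
    eff_hist_n = a0 * hist_n
    rate = (curr_resp + a0 * hist_resp) / (curr_n + eff_hist_n)
    return rate, curr_n + eff_hist_n  # pooled rate and effective sample size

rate, eff_n = pooled_response_rate(curr_resp=180, curr_n=300,
                                   hist_resp=1450, hist_n=2500, a0=0.3)
print(f"pooled response rate = {rate:.3f} (effective n = {eff_n:.0f})")
```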

How do you maximize response rates?

It might be worth thinking about this problem as a contrast to maximizing something else. I've been thinking of response rate maximization as if it were a simple problem: "Always go after the case with the highest remaining probability of response." It has an intuitive appeal. But is it really that simple? We've been working really hard on this problem for many years. I think, in practice, our solutions are probably more complicated than that. If you focus on the easy-to-respond cases early, will that really maximize the response rate? If we looked at the whole process, and set a target response rate, we might do something different to maximize the response rate. We might start with the difficult cases and then finish up with the easy cases. Groves and Couper (1998) made suggestions along these lines. Greenberg and Stokes (1990) essentially work the problem out very formally using a Markov decision model. They minimize calls and the nonresponse rate. Their solution...
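As a toy illustration (and not Greenberg and Stokes' Markov decision model), here is a small simulation showing that the order in which cases are worked changes the final response rate for the same fixed budget of call attempts; the propensities, budget, and call cap are all made up:

```python
# Toy simulation: compare working the easiest remaining cases first versus the
# hardest first, under a fixed budget of call attempts. Everything is hypothetical.
import numpy as np

rng = np.random.default_rng(42)
p = np.concatenate([np.full(500, 0.5), np.full(500, 0.05)])  # easy and hard cases
budget = 2000  # total call attempts allowed

def simulate(order, p, budget, max_calls_per_case=6):
    done = np.zeros(len(p), dtype=bool)
    calls = np.zeros(len(p), dtype=int)
    used = 0
    for i in order:
        while not done[i] and calls[i] < max_calls_per_case and used < budget:
            calls[i] += 1
            used += 1
            done[i] = rng.random() < p[i]
    return done.mean()  # final response rate

easy_first = np.argsort(-p)   # highest propensity first
hard_first = np.argsort(p)    # lowest propensity first
print("easy first:", simulate(easy_first, p, budget))
print("hard first:", simulate(hard_first, p, budget))
```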

How much has the response rate shaped our methods?

In recent posts, I've been speculating about what it might mean to optimize survey data collections for something other than the response rate. We might also look at the "inverse" problem -- how has the response rate shaped what we currently do? Of course, the response rate does not dominate every decision that gets made on every survey. But it has had a far-reaching impact on practice. Why else would we need to expend so much energy reminding ourselves that it isn't the whole story? The outlines of that impact are probably difficult to determine. For example, interviewers are often judged by their response rates (or possibly conditional response rates). If they were judged by some other criterion, how would their behavior change? For example, if interviewers were judged by how balanced their set of respondents was, how would that impact their moment-to-moment decision-making? What would their supervisors do differently? What information would sample management...

Which objective function?

In my last post, I argued that we need to take a multi-faceted approach to examining the possibility of nonresponse bias -- using multiple models, different approaches, etc. But any optimization problem requires that an objective function be defined: a single quantity that is to be minimized or maximized. We might argue that the current process treats the response rate as the objective function and that all decisions are made with the goal of maximizing it. It's probably the case that most survey data collections aren't fully 'optimized' in this regard, but they may be close to optimal. If we want to optimize differently, then we still need some kind of indicator to maximize (or minimize, depending on the indicator). A recent article in Survey Practice tried several different indicators in this role using simulation. Before placing a new indicator in this role, I think we need at least two things: 1) Experimental research to determine the impact of being tuned to a different...

Baby and the Bathwater

This post is a follow-up to my last. Since my last post, I came across an interesting article at Survey Practice. I'm really pleased to see this article, since this is a discussion we really need to have. The article, by Koen Beullens and Geert Loosveldt, presents the results of a simulation study on the impact of using different indicators to govern data collection. In other words, they simulate the consequences of maximizing different indicators during data collection. The three indicators are the response rate, the R-Indicator (Schouten et al., 2009), and the maximal bias (also developed by Schouten et al., 2009). The simulation shows a situation where you would get a different result from maximizing either of the latter two indicators than from maximizing the response rate. Maximizing the R-Indicator, for example, led to a slightly lower response rate than the data collection strategy that maximizes the response rate. This is an interesting simulation. It pretty clearly...
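For reference, a minimal sketch of the two Schouten et al. (2009) quantities, computed from a vector of estimated response propensities (the propensity model itself is omitted, and the propensity values are hypothetical):

```python
# R-Indicator: R = 1 - 2 * S(rho_hat), where S is the standard deviation of the
# estimated response propensities. The maximal bias follows from the bound
# |bias(ybar_r)| <= S(rho) * S(y) / mean(rho); here it is reported per unit of S(y).
import numpy as np

rho_hat = np.array([0.62, 0.55, 0.71, 0.40, 0.38, 0.66, 0.52, 0.47])

def r_indicator(rho):
    return 1.0 - 2.0 * np.std(rho, ddof=1)

def maximal_bias(rho):
    # Upper bound on |bias of the respondent mean|, in units of S(y).
    return np.std(rho, ddof=1) / np.mean(rho)

print("response rate (mean propensity):", round(np.mean(rho_hat), 3))
print("R-Indicator:", round(r_indicator(rho_hat), 3))
print("maximal bias (per unit S(y)):", round(maximal_bias(rho_hat), 3))
```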

Do you really believe that?

I had an interesting discussion with someone at a conference recently. We had given a presentation that included some discussion of how response rates are not good predictors of when nonresponse bias might occur. We showed a slide from Groves and Peytcheva. Afterwards, I was speaking with someone who was not a survey methodologist. She asked me if I really believed that response rates didn't matter. I was a little taken aback. But as we talked some more, it became clear that she thought we were trying to argue for achieving low response rates. I thought it was interesting that the argument could be perceived that way. To my mind, the argument wasn't about whether we should be trying to lower response rates. It was more about what tools we should be using to diagnose the problem. In the past, the response rate was used as a summary statistic for discussing nonresponse. But the evidence from Groves and Peytcheva calls into question the utility of that single statistic...

Estimating effort in field surveys

One of the things that I miss about telephone surveys is being able to accurately estimate how much various activities cost, or even how long each call takes. Since everyone on a telephone survey works on a centralized system, and everything gets time-stamped, you can calculate how long calls take. It's not 100% accurate -- weird things happen (someone takes a break and it doesn't show up in the data, networks collapse, etc.) -- but usually you can get pretty accurate estimates. In the field, the interviewers tell us what they did and when. But they have to estimate how many hours each subactivity (travel, production, administration) takes, and they don't report anything at the call level. I've been using regression models to estimate how long each call takes in field studies. The idea is pretty simple: regress the hours worked in a week on the counts of the various types of calls made that week. The estimated coefficients are estimates of the average time each type of call takes...
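A minimal sketch of that regression, assuming a weekly data frame with hypothetical columns for hours and counts of call types:

```python
# Regress weekly interviewer hours on counts of call types made that week;
# the coefficients estimate the average hours per call of each type.
import pandas as pd
import statsmodels.formula.api as smf

weeks = pd.DataFrame({
    "hours":       [22.5, 18.0, 25.0, 30.5, 12.0, 27.5],
    "contacts":    [10,    8,   12,   15,    5,   13],
    "noncontacts": [40,   30,   45,   55,   20,   50],
    "interviews":  [3,     2,    4,    5,    1,    4],
})

# No intercept: a week with zero calls should imply roughly zero call time.
model = smf.ols("hours ~ 0 + contacts + noncontacts + interviews", data=weeks).fit()
print(model.params)  # estimated average hours per call of each type
```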

Call Scheduling in Cluster Samples

A couple of years ago, I tried to deliver recommended times to call housing units to interviewers doing face-to-face interviewing on an area probability sample. Interviewers drive to sampled area segments and then visit several housing units while they are there. This is how cost savings are achieved. The interviewers didn't use the recommendations -- we had experimental evidence to show this. I had thought the recommendations might help them organize their work. In talking with them afterwards, they didn't see the utility, since they plan trips to segments, not to single housing units. I decided to try something simpler. To make sure that calls are being made at different times of day, identify segments that have not been visited in all call windows, or have been visited in only one call window. This information might help interviewers schedule trips if they haven't noticed that this situation has occurred in a segment. If this is helpful, then maybe this recommendation...
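A minimal sketch of that check, assuming call records with hypothetical segment_id and call_window columns:

```python
# Flag area segments that have been visited in only one (or zero) call windows.
import pandas as pd

calls = pd.DataFrame({
    "segment_id":  [101, 101, 101, 102, 102, 103],
    "call_window": ["weekday_day", "weekday_day", "weekday_eve",
                    "weekend", "weekend", "weekday_day"],
})

windows_visited = calls.groupby("segment_id")["call_window"].nunique()
needs_attention = windows_visited[windows_visited <= 1]
print(needs_attention)  # segments 102 and 103 have only one window covered
```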

Exploitation and Exploration Again

I found this really interesting article ("Deciding what to observe next") from the field of machine learning. The authors address the problem of building a regression model using data from a "data stream," that is, incoming data. The example they use is daily measurements of weather at different locations. But monitoring paradata during data collection also has this flavor. They use statistical techniques that I've seen before -- the Lasso for model selection and the EM algorithm for dealing with "missing" data. The missing data in this case are variables that you choose not to observe at certain points. The neat thing is that their method continues to explore data that are judged to be "not useful" (i.e. not included in the model) at certain points.

Exploitation vs Exploration

I'm still thinking about this problem that gets posed in machine learning -- exploitation vs. exploration. If you want to read more on the topic, I'd recommend Sutton and Barto's Reinforcement Learning. The idea is deceptively simple. Should you take the action with the highest reward given what you currently know, or explore actions for which you don't know the reward? In machine learning, they try to balance the two objectives. For example, in situations of greater uncertainty, you might spend more resources on exploration. In surveys, I've tended to look at experiments as discrete events. Run the experiment. Get the results. Implement the preferred method. But we know that the efficacy of methods changes over time. The simplest example is incentive amounts. What's the right amount? One way to answer this question would be to run an experiment every so often and change your incentive amounts based on each experiment. Another approach might be to keep a low level of exploration...
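A toy epsilon-greedy sketch of what that low level of exploration might look like for incentive amounts; the arms, response probabilities, and epsilon are all hypothetical:

```python
# Epsilon-greedy: mostly offer the incentive that looks best so far, but keep a
# small, ongoing fraction of cases for exploration of the other amounts.
import numpy as np

rng = np.random.default_rng(0)
arms = [5, 10, 20]                       # incentive amounts in dollars
true_p = {5: 0.30, 10: 0.38, 20: 0.41}   # unknown in practice
epsilon = 0.05                           # fraction of cases used for exploration

responses = {a: 0 for a in arms}
trials = {a: 0 for a in arms}

for case in range(5000):
    if rng.random() < epsilon or min(trials.values()) == 0:
        arm = arms[rng.integers(len(arms))]                       # explore
    else:
        arm = max(arms, key=lambda a: responses[a] / trials[a])   # exploit
    trials[arm] += 1
    responses[arm] += rng.random() < true_p[arm]

print({a: round(responses[a] / trials[a], 3) for a in arms})
print("most-used arm:", max(arms, key=lambda a: trials[a]))
```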

Passively Collected Paradata

In my last post, I talked about the costs of collecting interviewer observations. These observations have to reduce total survey error (most likely nonresponse error) in order to justify this cost. The original definition of paradata was that they were a by-product of the data collection process. Computers stored information on keystrokes at basically no extra cost. Call records were necessary to manage the data collection process, but they were also found to be useful for developing nonresponse adjustments and for other purposes. At some point, interviewer observations became paradata. I think many of these observations started out as information that interviewers recorded for their own purposes. For example, listers would record whether there were any barriers to entering the segment (e.g. gated communities or locked buildings) so that interviewers would know about them before traveling to the segment. These could be thought of as having no cost. But we have added a lot of observations that the interviewers...

Paradata and Total Survey Error

At the recent Joint Statistical Meetings I was part of an interesting discussion on paradata and nonresponse. At one point, someone reported that their survey had reduced the number of observations being recorded by interviewers. They said the observations were costly in a double sense. First, it takes interviewer time to complete them. Second, it diverts attention from the task of gathering data from persons willing to respond to the survey. I have to say that we certainly haven't done a very good job of determining the cost of these interviewer observations. First, we could look at keystroke files to estimate the costs. This is likely to be an incomplete picture, as there are times when observations are entered later (e.g. after the interviewer returns home). Second, we could examine whether these observations reduce the effectiveness of interviewers at their other tasks. This would require experiments of some sort. Once these costs are understood, then we can place...

Case-level Phases

Groves and Heeringa conceived of design phases as a survey-level attribute. But nothing prevents us from using the technique at the case level. In the last post, I talked about formalizing the concept of "phase capacity" using stopping rules. It might be that a similar logic could be used to formalize decisions at the case level. For instance, a few years ago on a telephone survey we implemented two-phase sampling. We implemented the phase boundary using a case-level rule: after a certain number of calls, a case entered the second phase. There might be other ways in which stopping rules like this could be used to change the design at the case level.

Responsive design phases

In the paper on responsive design, Groves and Heeringa define "design phases." They argue that each phase has a capacity. Once that capacity has been reached, i.e. the current design has exhausted its possibilities, then a design change may be needed. A difficulty in practice is knowing when this capacity has been reached. There are two related issues: 1. Is there a statistical rule that can be applied to define the end of the phase? 2. Can we identify when the threshold has been met immediately after it occurs, or is there a time lag? I don't know that anyone has done much to specify these sorts of rules. I would think they are generalizations of stopping rules. A stopping rule says when to stop the last phase, but the same logic could be applied to stopping each phase. I had a paper on stopping rules for surveys a few years back, and there is another from Rao, Glickman, and Glynn. I don't know that anyone has tried this sort of extension. But I think it is...
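As a sketch of what such a rule might look like (not a published rule), one could declare a phase exhausted when a monitored quantity -- here a cumulative response rate -- has stopped moving for several consecutive days; the threshold and window are hypothetical:

```python
# Declare "phase capacity" reached when the last few day-over-day changes in a
# monitored estimate are all below a tolerance.
def phase_exhausted(daily_estimates, tol=0.001, window=3):
    """Return True when the last `window` day-over-day changes are all < tol."""
    if len(daily_estimates) < window + 1:
        return False
    changes = [abs(daily_estimates[i] - daily_estimates[i - 1])
               for i in range(len(daily_estimates) - window, len(daily_estimates))]
    return all(c < tol for c in changes)

cum_rr = [0.10, 0.18, 0.24, 0.28, 0.2805, 0.2808, 0.2810]
print(phase_exhausted(cum_rr))  # True: time for a design change (next phase)
```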

Balancing Estimated Response Propensities

One objective for field data collection, other than achieving the highest response rate possible, might be to achieve the most balanced response possible (possibly subject to some minimum response rate). One issue with this is that we are estimating the response propensities in a dynamic setting. The estimated propensities surely have sampling error, but they also vary as the data used to estimate them change. This could lead to some bad decisions. For instance, if we target some cases one day, perhaps the next day their estimated propensities have changed and we would make a different decision about which cases to target. This may be just a loss of efficiency. In the worst case, I suppose it could actually increase the variation in response propensities.

Missing Data and Response Rates

I'm getting ready to teach a seminar on the calculation of response rates. Although I don't work on telephone surveys much anymore (and maybe fewer and fewer other people do), I am still intrigued by the problem of calculating response rates in RDD surveys. The estimation of "e" is a nice example of a problem where we can say with near certainty that the cases with unknown eligibility are not missing at random. This should be a nice little problem for folks working on methods for nonignorable nonresponse. How should we estimate "e" when we know that the cases for which eligibility is unobserved are systematically different from those for which it is observed? The only thing that could make this a more attractive toy problem would be if we knew the truth for each case. This problem probably seems less important than it did a few years ago. But we still need estimates of "e" for other kinds of surveys (even if they play a less important role in...
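For context, a minimal sketch of how "e" typically enters the calculation, in the spirit of AAPOR's RR3: estimate the eligibility rate among resolved cases and apply it to the unresolved ones. The counts are hypothetical, and the whole point above is that this missing-at-random-style estimate is probably too optimistic:

```python
# Simplified RR3-style response rate with an estimated eligibility rate "e"
# applied to cases of unknown eligibility.
interviews = 600
eligible_nonrespondents = 400      # refusals and noncontacts known to be eligible
known_ineligible = 3000            # businesses, nonworking numbers, etc.
unknown_eligibility = 2000         # never resolved

resolved = interviews + eligible_nonrespondents + known_ineligible
e = (interviews + eligible_nonrespondents) / resolved   # eligibility rate among resolved cases
rr3 = interviews / (interviews + eligible_nonrespondents + e * unknown_eligibility)

print(f"e = {e:.3f}")
print(f"RR3-style response rate = {rr3:.3f}")
```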

Is there value in balancing response?

A few posts ago, I talked about the value of balancing response across subgroups defined by data on the frame (or paradata available for all cases). The idea was that this provides some empirical confirmation of whether the subgroups are related to the variables of interest. Paradoxically, if we balance the response rates across these subgroups, then we reduce the utility of these variables for adjustment later. That's the downside of this practice. As I said earlier, I think this does provide confirmation of our hypothesis. It also reduces our reliance on the adjustment model, although we still have to assume the model is correct and that there aren't unobserved covariates actually driving the response process. Is there an additional advantage to this approach? It seems that we are at least trying to provide an ordered means of prioritizing the sample. We can describe it. Even if there are departures, we can say something about how the decisions were made to prioritize certain cases...

The Relative Value of Paradata and Sampling Frame Data

In one of my favorite non-survey articles, Rossi and colleagues looked at the relative value of purchase history data and demographic information in predicting the impact of coupons with different values. The purchase history data were more valuable for prediction. I believe a similar situation applies to surveys, at least in some settings. That is, paradata might be more valuable than sampling frame data. Of course, many of the surveys that I work on have very weak data on the sampling frame. In any event, I fit random-intercept logistic regression models predicting contact that include some sampling frame data from an RDD survey. The sampling frame data are generally neighborhood characteristics. I recently made this chart, which shows the predicted vs. observed contact rates for households in a particular time slot (call window). The dark circles are the predictions plotted against observed values (household contact rates) for the multi-level model. I also fit a marginal logistic regression...

Imputation and the Impact of Nonresponse

I've been thinking about the evaluation of the risk of nonresponse bias a bit lately. Imputation seems to be a natural way to evaluate those risks. In my setup, I impute the unit nonresponders. Then I can use imputation to evaluate the value of the data that I observed (a retrospective view) and to predict the value of different parts of the data that I did not or have not yet observed (a prospective view). Allow me to use a little notation. Y_a is a matrix of observed data collected under protocol a. Y_b is a matrix of observed data collected under protocol b. Y_m is the matrix of data for the nonresponders; it is missing. I could break Y_m into two pieces: Y_m1 and Y_m2. 1) Retrospective. I can delete data that I observed and impute those values plus all the other missing values (i.e. the unit nonresponse). I can impute Y_b and Y_m conditional on Y_a. I can also impute Y_m conditional on Y_b and Y_a. It might be interesting to compare the estimates from these two...
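A rough sketch of the retrospective comparison using this notation, with sklearn's IterativeImputer standing in for a proper multiple-imputation procedure and entirely simulated data:

```python
# Treat Y_b as if it had not been collected, impute it (and Y_m) from Y_a, and
# compare the resulting estimate to one that also conditions on Y_b.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)                      # covariate observed for everyone
y = 2.0 + 1.5 * x + rng.normal(size=n)      # survey variable

resp_a = rng.random(n) < 0.4                # respondents under protocol a
resp_b = ~resp_a & (rng.random(n) < 0.3)    # additional respondents under protocol b

data = np.column_stack([x, y])
y_a_only = data.copy()
y_a_only[~resp_a, 1] = np.nan               # keep only Y_a; Y_b and Y_m are missing
y_ab = data.copy()
y_ab[~(resp_a | resp_b), 1] = np.nan        # keep Y_a and Y_b; only Y_m is missing

est_a = IterativeImputer(random_state=0).fit_transform(y_a_only)[:, 1].mean()
est_ab = IterativeImputer(random_state=0).fit_transform(y_ab)[:, 1].mean()
print(f"estimate conditioning on Y_a only:   {est_a:.3f}")
print(f"estimate conditioning on Y_a and Y_b: {est_ab:.3f}")
```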

Quasi-Experiments and Nonresponse

In my last post, I talked about using future data collection as a quasi-experimental validation of hypotheses about nonresponse. I thought I'd follow up on that a bit more. We often have this controversy when discussing nonresponse bias: if I can adjust for some variable, then why do I need to bother making sure I get good response rates across the range of values that variable can take on? Just adjust for it. That view relies on some assumptions. We assume that no matter what response rate I end up at, the same model applies. In other words, the missing data depend only on that variable at every response rate I could choose (Missing at Random). But the missing data might depend only on that variable at some response rates and not others. In most situations, we're going to make some assumptions about the missingness for adjustment purposes. We can't test those assumptions. So no one can ever prove you wrong. I like the idea that we have a hypothesis at an interim point...

Constellation of views

I'm spending time looking at patterns in the nonresponse to a large survey we recently completed. I'm looking at the problem from a number of different angles, and it is really very useful to be going over the details in this way. This is reinforcing a couple of things that I've been saying: 1. We need multi-faceted views of the problem to replace reliance on a key statistic (i.e. the response rate). 2. We need to make a leap beyond the data with reasonable assumptions. Given the uncertainty about the nonresponse bias, multi-faceted views can't give us much more than a better sense of the risks. With reasonable assumptions, we should be OK. We will be repeating this survey, so this information may help with future waves. We can use it to guide interventions into the data collection strategy. We might even think of this as quasi-experimental validation of our hypotheses about the nonresponse to prior waves.

Signal and Noise in Feedback

Still on this theme of feedback. It would be nice if we got very clear signals from sampled units about why they don't want to do our surveys. However, this isn't usually the case. It seems that our best models are still pretty weakly predictive of whether someone will respond. Part of this could be that we don't have the 'right' data. We could improve our paradata and build better models. Another part might never be captured. This is the situational part. Sampled persons might not be able to say why they refuse a survey on one day and agree to do another on a different day. The decision is highly sensitive to small differences in the environment that we may never be able to capture in our data. If that is the case, then the signal we can pick up for tailoring purposes is going to be weak. The good news is that it seems like we still haven't hit the limit of our ability to tailor. Onward!

A Twist on Feedback

In my last post, I talked about treating data collected between attempts or waves as "feedback" from sampled units. I suggested that maybe the protocol could be tailored to this feedback. Another way to express this is to say that we want to increase everyone's probability of response by tailoring to their feedback. Of course, we might also make the problem more complex by "tailoring" the tailoring. That is, we may want to raise the response probabilities of some individuals more than those of other individuals. If so, we might consider a technique that is more likely to succeed in that subset. I'm thinking of this as a decision problem. For example, assume we can increase response probabilities by 0.1 for all cases with tailoring. But we notice that two different techniques have this same effect. 1) The first technique increases everyone by 0.1. 2) The second technique increases a particular subgroup (say half the population) by 0.15 and everyone...

Feedback from Sampled Units

A while ago, I wrote about developing algorithms that determine when to switch modes. I noted that the problem is that in many multiple-mode surveys, there is very little feedback. For instance, in mail and web surveys, the only feedback is a returned letter or email. We also know the outcome -- whether the mode succeeded or failed. I still think the most promising avenues for this type of switching are in interviewer-administered modes. For instance, can we pick up clues from answering machine messages that would indicate that we should change our policy (mode)? It may also be that panel studies with multiple modes are a good setting for developing this sort of algorithm. An event observed at one time period, or responses to questions that predict which mode is more likely to induce response, might be useful "feedback" in such a setting.

Sequence in a Mixed Mode Survey, II

There was another interesting finding from our mixed-mode experiment that varied the sequence of modes. As a reminder, the survey was a screening survey. It identified households with an eligible person. An interviewer later returned to eligible households to complete a 'main' interview. We found that eligible households in locked buildings that had received the FtF-Mail treatment on the screening interview responded at higher rates to the main interview -- better even than Mail-FtF. The results were significant, even when accounting for the clustering in the sample. It may be that the FtF-Mail sequence displays our earnestness to the respondent more clearly. It would be nice to replicate these results.

Sequence in a Mixed Mode Survey

We recently tried a mixed-mode approach to a large screening survey that is usually done face-to-face (FtF). We wanted to be sure that modes with lower response rates didn't "contaminate" the sample. If they don't, there may be cost savings in using a mailed version of the screening survey. We varied the sequence of the mixed-mode approach to see if that had any impact. We did FtF-Mail, Mail-FtF, and FtF. We also monitored response rates to the main interview, which is conducted only with eligible persons. The response rates to the screening survey were very similar across the three arms of the trial. But it turns out that one mode combination did better on the main interview response rate for cases in locked buildings. That mode combination was FtF-Mail. This might be a fluke, but it is definitely worth exploring.

The Long Perspective on Call Scheduling

I recently went back and read papers written during the early days of the development of CATI. I found this very interesting quote from a 1983 article by J. Merrill Shanks: “Among the procedures that are supported by (or related to) CATI systems, none has proved more difficult to discuss than the algorithms or options available for management of interviewers’ time and the scheduling or assignment of actual calls to specific interviewers. Most observers agree that computer-assisted systems can yield improvements in the efficiency or productivity of interviewer labor by scheduling the calls required to contact respondents in a particular household across an appropriately designed search pattern, and by keeping track of the ‘match’ between staff availability and the schedule of calls to be made” (Shanks, 1983, p. 133). It seems like today people would still agree that this is a "difficult to discuss" problem. I'm not sure that there is a sense that there are large gains...

Interviewer Variability

In face-to-face surveys, interviewers play a very important role. They largely determine when they work, at which times they call cases, and how to address the concerns of sampled persons. Several studies have looked at the variability among interviewers in achieving contact and cooperation. Durrant and Steele (2009) provide a particularly good example. It is also the case that interviewers have only a partial view of the data being collected. They cannot detect imbalances that may occur at higher levels of aggregation. For these reasons, it seems like controlling this variability is a useful goal. This may be done through improved training (as suggested by Groves and McGonagle, 2001), or by providing specific recommendations for actions to interviewers. We have had some success in this area. In NSFG Cycle 7, we ran a series of 16 experiments that asked interviewers to prioritize a set of specified cases over the other cases in their workload. The results were positive...

Response Rates as a Reward Function

I recently saw a presentation by Melanie Calinescu and Barry Schouten on adaptive survey design. They have been using optimization techniques to design mixed-mode surveys. In the optimization problems, they seek to maximize a measure of sample balance (the R-Indicator) for a fixed cost by using different allocations to the modes for different subgroups in the population (for example, <35 years of age and 35+). The modes in their example are web and face-to-face. In their example, the older group is more responsive in both modes, so they get allocated at higher rates to web. You can read their paper here to see the very interesting setup and results. In the presentation, they showed what happens when you use the response rate as the quantity you are seeking to maximize. At some of the lower budgets, the optimal allocation was to simply ignore the younger group. You could not get a higher response rate by doing anything other than using all your resources on the older group. Once...
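A toy version of that kind of allocation problem (not Calinescu and Schouten's actual optimization) shows the same flavor of result: maximizing the response rate under a budget pushes the expensive mode toward the more responsive group, while maximizing a balance measure spreads it differently. All response probabilities, costs, and the budget are made up:

```python
# Brute-force search over mode allocations for two groups under a budget,
# comparing the allocation that maximizes the response rate with the one that
# maximizes a simple balance measure (1 - spread of group response rates).
import itertools
import numpy as np

groups = {"under35": 0.5, "35plus": 0.5}            # population shares
resp = {("under35", "web"): 0.15, ("under35", "ftf"): 0.35,
        ("35plus", "web"): 0.30, ("35plus", "ftf"): 0.55}
cost = {"web": 1.0, "ftf": 10.0}
budget = 4.0                                        # per sampled case, on average

best = {"rr": None, "balance": None}
grid = np.linspace(0, 1, 21)                        # share of each group sent to FtF
for a_u, a_o in itertools.product(grid, grid):
    alloc = {"under35": a_u, "35plus": a_o}
    c = sum(groups[g] * (alloc[g] * cost["ftf"] + (1 - alloc[g]) * cost["web"])
            for g in groups)
    if c > budget:
        continue
    rr_g = {g: alloc[g] * resp[(g, "ftf")] + (1 - alloc[g]) * resp[(g, "web")]
            for g in groups}
    rr = sum(groups[g] * rr_g[g] for g in groups)
    balance = 1 - abs(rr_g["under35"] - rr_g["35plus"])
    if best["rr"] is None or rr > best["rr"][0]:
        best["rr"] = (rr, alloc)
    if best["balance"] is None or balance > best["balance"][0]:
        best["balance"] = (balance, alloc)

print("max response rate:", best["rr"])
print("max balance      :", best["balance"])
```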

Call Record Problems

A couple of years ago I did an experiment where I recommended times to call sampled units in a face-to-face survey based on an area probability cluster sample. The recommendations were based on estimates from multi-level logistic regression models. The interviewers ignored the recommendations. In meetings with the interviewers, several said that they didn't follow the recommendations because they call every case on every trip to an area segment. The call records certainly didn't reflect that claim. But it got me thinking that maybe the call records don't reflect everything that happens. Biemer, Chen and Wang (2011) surveyed interviewers, who reported that they do not always create a call record for a call. Sometimes they would not report a call in order to keep a case alive (since the number of calls on any case was limited), or because they had just driven by the sampled unit and seen that no one was home. Biemer, Chen, and...

Are we ready for a new reward function?

I've been thinking about the harmful effects of using the response rate as a data quality indicator. It has been a key -- if not THE key -- indicator of data quality for a while. One of the big unknowns is the extent to which the pervasive use of the response rate as a data quality indicator has malformed the design of surveys. In other words, has the pursuit of high response rates led to undesirable effects? It is easy to say that we should be more focused on bias, but harder to do. Generally, we don't know the bias due to nonresponse. So if we are going to do something to reduce bias, we need a "proxy" indicator. For example, we could impute values to estimate the bias. This requires that the bias of an unweighted mean be related to things that we observe and that we specify the right model. No matter which indicator we select, we need some sort of assumptions to motivate this "proxy" indicator. Those assumptions could be wrong. When we are wrong, do we...