

Showing posts from 2010

Exploration and Refusal Conversions

I'm still struggling to find a method that improves contact rates in the refusal conversion process for the experiment with call scheduling. As a reminder, the experimental method improves contact rates for calls prior to a refusal, but calls after a first refusal have lower contact rates than calls to comparable cases in the control group. Ouch. I already tried calling households at times other than the time at which the first refusal was taken. The hypothesis was that people were screening us out and that calling at a different time might lead to someone else in the household picking up the phone. But that didn't work. In looking at the data, searching for a reason why this is happening, I noticed that the control group seemed to be "exploring" better than the experimental group. The figures below demonstrate this. The upper figure shows calls prior to a refusal. It shows the average number of windows that have been called by call number for c
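To make the comparison concrete, here is a minimal sketch of how that exploration measure could be computed from call records. The column names (case_id, arm, call_number, window) and the input file are hypothetical; the idea is just the cumulative count of distinct call windows tried, averaged by call number within each arm.

```python
import pandas as pd

# Hypothetical call-record data: one row per call attempt.
# Columns assumed: case_id, arm ("control"/"experimental"),
# call_number (1, 2, 3, ...), window (e.g., "weekday_eve").
calls = pd.read_csv("call_records.csv")
calls = calls.sort_values(["case_id", "call_number"])

# Cumulative count of *distinct* windows tried by each case
# as of each call attempt.
calls["windows_explored"] = (
    calls.groupby("case_id")["window"]
    .transform(lambda w: (~w.duplicated()).cumsum())
)

# Average exploration by call number within each arm --
# roughly what the figures described above plot.
exploration = (
    calls.groupby(["arm", "call_number"])["windows_explored"]
    .mean()
    .unstack("arm")
)
print(exploration.head(10))
```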

Adaptive Design and Refusal Conversions

For me, the idea of adaptive design was influenced by work from the field of clinical trials on multi-stage treatments. Susan Murphy introduced me to adaptive treatment regimes as an approach to the problem. She points to methods developed in the field of reinforcement learning as useful approaches to problems of sequential decision-making. Reinforcement learning describes some policies (i.e. a set of decision rules for a set of sequential decisions) as myopic. A policy is myopic if it only looks at the rewards available at the next step. I'm reading Decision Theory by John Bather right now. He uses an example similar to the following simple game to demonstrate this issue. The goal is to get from the yellow square to the green square with the lowest cost. The number in each square is the cost of moving there. Diagonal moves are not allowed. The myopic policy looks only at the next option and goes down a path that ends up with only expensive options to reac
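The grid below is made up (it is not Bather's exact example, and the figure from the post isn't reproduced here), but this sketch shows the contrast: the myopic policy takes whichever adjacent square is cheapest right now, while a look-ahead method (Dijkstra here) minimizes the total cost to the goal.

```python
import heapq

# A made-up cost grid: the number is the cost of moving INTO that square.
# Start at top-left, goal at bottom-right, no diagonal moves.
COSTS = [
    [0, 4, 4, 4],
    [1, 5, 5, 1],
    [1, 5, 5, 1],
    [1, 9, 9, 0],
]
START, GOAL = (0, 0), (3, 3)

def neighbors(cell):
    r, c = cell
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(COSTS) and 0 <= nc < len(COSTS[0]):
            yield (nr, nc)

def myopic_cost():
    """Greedy policy: always move to the cheapest unvisited neighbor."""
    cell, total, visited = START, 0, {START}
    while cell != GOAL:
        options = [n for n in neighbors(cell) if n not in visited]
        if not options:
            return float("inf")  # greedy painted itself into a corner
        cell = min(options, key=lambda n: COSTS[n[0]][n[1]])
        visited.add(cell)
        total += COSTS[cell[0]][cell[1]]
    return total

def optimal_cost():
    """Dijkstra: minimize total cost over the whole path."""
    best = {START: 0}
    queue = [(0, START)]
    while queue:
        cost, cell = heapq.heappop(queue)
        if cell == GOAL:
            return cost
        for n in neighbors(cell):
            new = cost + COSTS[n[0]][n[1]]
            if new < best.get(n, float("inf")):
                best[n] = new
                heapq.heappush(queue, (new, n))
    return float("inf")

print("myopic policy cost: ", myopic_cost())   # chases cheap next steps
print("optimal policy cost:", optimal_cost())  # looks ahead, pays less overall
```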

"Responsive Design" and "Adaptive Design"

My dissertation was entitled "Adaptive Survey Design to Reduce Nonresponse Bias." I had been working for several years on "responsive designs" before that. As I was preparing my dissertation, I really saw "adaptive" design as a subset of responsive design. Since then, I've seen both terms used in different places. As both terms are relatively new, there is likely to be confusion about their meanings. I thought I might offer my understanding of the terms, for what it's worth. The term "responsive design" was developed by Groves and Heeringa (2006). They coined the term, so I think their definition is the one that should be used. They defined "responsive design" in the following way:

1. Pre-identify a set of design features that affect cost and error tradeoffs.
2. Identify indicators for these costs and errors. Monitor these during data collection.
3. Alter the design features based on pre-identified decision rules based on
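A toy sketch of what those three steps might look like in code. The indicator, threshold, and phase names are hypothetical; the point is only that the decision rules are specified before data collection and then checked against monitored indicators.

```python
from dataclasses import dataclass

@dataclass
class PhaseRule:
    """A pre-identified decision rule: switch design phases when a
    monitored indicator crosses a threshold.  The indicator and
    threshold here are hypothetical, just to make the steps concrete."""
    indicator: str          # e.g., mean response propensity of active cases
    threshold: float
    next_phase: str         # e.g., "phase 2: new protocol, subsample"

def check_rules(daily_indicators, rules, current_phase):
    """Step 2: monitor the indicators; step 3: alter features per the rules."""
    for rule in rules:
        if daily_indicators.get(rule.indicator, float("inf")) < rule.threshold:
            return rule.next_phase
    return current_phase

# Toy usage with made-up numbers.
rules = [PhaseRule("mean_propensity_active", 0.05, "phase 2: new protocol, subsample")]
today = {"mean_propensity_active": 0.04, "response_rate": 0.52}
print(check_rules(today, rules, current_phase="phase 1"))
```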

Interviewer Variance in Face-to-Face Surveys

There have been several important studies of interviewer variance in face-to-face surveys. O'Muircheartaigh and Campanelli (1998) report on a study that used an interpenetrated design to evaluate the impact of interviewers on variance estimates. There are also studies that show interviewers vary in their ability to establish contact (Campanelli et al., "Can you hear me knocking?", 1999) and to elicit response (Durrant, Groves, Staetsky, and Steele, 2010). Although O'Muircheartaigh and Campanelli account for the clustering of the sample design, they don't account for differences in response (due to contact or refusal). Variation in response rates or in the composition of respondents may explain some (certainly not all) of the interviewer variation. If that is the case, then attempting to control interviewer recruitment protocols (like call timing) might help reduce interviewer variance.

Refusal Conversions, Some Results

We just completed a month of data collection on the RDD survey that is running my experiment on call scheduling. I discussed an interesting problem in a previous post . Basically, the algorithm seems to work for calls prior to any refusal. But it is actually less efficient for calls made after an initial refusal (i.e. refusal conversion calls). One hypothesis about why this occurred was that the person who refused would be screening calls and would not pick up if they saw that we were calling again. The model, which is tuned to contact, might lead you to call back during the same call window as that in which the first refusal was taken. If you call during another call window, you might reach another person in the household or, perhaps, the person who refused would be less likely to be screening calls. The change was to make the window in which the first refusal occurred the lowest priority window. The results were... no change (12.0% contact rate for controls, 10.1% for experimen
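Here is a rough sketch of that kind of re-prioritization. The column names and contact-probability scores are hypothetical (the production scheduler is not written in Python); it just shows the window of the first refusal being pushed to the bottom of each case's window ordering.

```python
import pandas as pd

# Hypothetical per-case data: estimated contact probability for each
# call window, plus the window in which the first refusal was taken.
windows = ["weekday_day", "weekday_eve", "weekend"]
cases = pd.DataFrame({
    "case_id": [101, 102],
    "refusal_window": ["weekday_eve", "weekend"],
    "p_weekday_day": [0.10, 0.25],
    "p_weekday_eve": [0.35, 0.15],
    "p_weekend": [0.20, 0.30],
})

def prioritize(row):
    """Rank windows by estimated contact probability, but push the
    window in which the first refusal occurred to the bottom."""
    scores = {w: row[f"p_{w}"] for w in windows}
    scores[row["refusal_window"]] = -1.0   # deprioritize regardless of score
    return sorted(windows, key=lambda w: scores[w], reverse=True)

cases["window_priority"] = cases.apply(prioritize, axis=1)
print(cases[["case_id", "refusal_window", "window_priority"]])
```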

How do they do it?

The experiment on call scheduling in a telephone survey required specialized programming to make it work. We use Blaise SMS in our telephone facility. My colleagues here, Joe Matuzak and Dave Dybicki, are planning to present what they did to make this experiment work at the International Blaise Users Conference (IBUC) in October. They asked me to show some of the results. The first problem we faced was how to make sure that the experimental and control groups were called at the same pace. I produce files every day that show how I want the sample sorted. The control group is sorted using a different algorithm. But we had to make sure that the cases were mixed up -- we didn't want to call one group and then the other. Dave wrote a program that reads the sorted list for each group (experimental and control). It pulls a record from each list and then checks if it is still active. Maybe it was finalized after the sort occurred. When it finds 5 active cases from the top of the sort in
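The sketch below is my own illustration of that interleaving logic, not Dave's actual program: it alternates between the two sorted lists, skips cases that were finalized after the sort ran, and takes active cases in blocks (5 in the real system; the block size is a parameter here).

```python
def interleave_active(experimental, control, finalized, block_size=5):
    """Merge two sorted call lists so neither arm is called ahead of the
    other: take up to `block_size` still-active cases from the top of
    each list, alternating, and skip anything already finalized."""
    merged = []
    iters = [iter(experimental), iter(control)]
    exhausted = [False, False]
    while not all(exhausted):
        for i, it in enumerate(iters):
            taken = 0
            while taken < block_size:
                case = next(it, None)
                if case is None:
                    exhausted[i] = True
                    break
                if case in finalized:   # finalized after the sort ran
                    continue
                merged.append(case)
                taken += 1
    return merged

# Toy usage: case IDs sorted by each arm's own algorithm.
exp_sorted = [11, 12, 13, 14, 15, 16, 17]
con_sorted = [21, 22, 23, 24, 25, 26, 27]
print(interleave_active(exp_sorted, con_sorted, finalized={12, 22}, block_size=2))
```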

Refusal Conversions and Timing of the Call

In the previous post, I talked about an interesting problem that a call scheduling experiment produced in relation to refusal conversions. In the experiment, calls prior to a first refusal were more efficient under the experimental algorithm. But calls after the refusal were less efficient (in terms of contact), such that the experimental condition was only as efficient as the control when looking at calls pre- and post-first refusal combined. Ouch. My question is: what can be done to change the scheduling of calls after the refusal to improve their efficiency? My colleagues here in Survey Research Operations suggested that calling back at the same time as the first refusal might be bad. You might get the same person. For that to be the case, it seems as if the person who refused would have to be screening their calls, while the control algorithm calls back at different times and finds someone else at home. As a test of this hypothesis, I changed the algorithm to put the window in whi

Refusal Conversions are Different

The experiment with call scheduling has been running on refusal conversions for a while. It looks as if the experimental method works well for calls prior to the first refusal, but not after that. In retrospect, this makes sense. Refusal conversions are a different problem than establishing contact. When I presented the results to my colleagues at Survey Research Operations, they suggested that it might make more sense to try calling at times other than when the initial refusal came. This would help if we are able to contact someone else in the household who would be more cooperative. I'm thinking about how to set up a different model for this part of the process. A different problem requires a different tool...

Measurement Error in Paradata

Paradata are often quite messy. I guess it shouldn't be that surprising, since they are often the by-product of a process (survey interviewing) that can itself be messy. And, at least initially, they were a means to an end and not the end itself. But there are some issues that run a little deeper than just messiness. Brady West had a very interesting paper at AAPOR that looked at measurement error in interviewer observations. On a large face-to-face survey, we ask interviewers to make guesses about key characteristics of selected persons. These guesses are (relatively) highly correlated with survey outcome variables. This is a useful property for many reasons -- monitoring for the risk of bias, adjustment, etc. But, as Brady points out, the measurement or misclassification error reduces their effectiveness. I've been thinking about another kind of error. In talking with interviewers on the same face-to-face survey, they say they visit every sampled housing unit every time they visit a se
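A small simulation with entirely made-up numbers (not Brady's data) illustrates the attenuation: flipping a modest share of the interviewer observations noticeably weakens their correlation with the outcome they are meant to predict.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# A binary household characteristic and a survey outcome correlated with it.
true_obs = rng.binomial(1, 0.4, n)
outcome = rng.binomial(1, np.where(true_obs == 1, 0.7, 0.3))

# Interviewer observation = true characteristic, misclassified 20% of the time.
error = rng.binomial(1, 0.2, n)
recorded_obs = np.where(error == 1, 1 - true_obs, true_obs)

print("corr(outcome, true characteristic):  %.3f" % np.corrcoef(outcome, true_obs)[0, 1])
print("corr(outcome, recorded observation): %.3f" % np.corrcoef(outcome, recorded_obs)[0, 1])
```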

Seasonal effects

I've been running an experiment on a relatively small survey (300 RDD interviews per month). Since the survey is small, I need to run the experiment over many months to accumulate enough data. One unintended consequence of this long field period for the experiment is that I observe fluctuations over the course of the year that may indicate seasonal effects. April is the most profound example. In every other month, the experimental method produced higher contact rates than the control. But not April. In April, the control group did better. I have at least two hypotheses about why:

1. April is one of the toughest months for contacting households. Something about the experimental method interacts with the seasonal effect to produce lower contact rates for the experimental method. Seems unlikely.
2. Sampling error. If you run the experiment in enough months, one of them will come up a loser (see the sketch below). More likely.
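A quick way to check the sampling-error hypothesis is a two-proportion comparison for April. The counts below are hypothetical stand-ins, and calls are treated as independent (which they aren't quite), so this is only a rough sanity check.

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical April counts (the real monthly numbers aren't shown here):
# contacts / call attempts in each arm.
contacts_ctrl, calls_ctrl = 820, 6500
contacts_exp,  calls_exp  = 760, 6500

p1, p2 = contacts_ctrl / calls_ctrl, contacts_exp / calls_exp
p_pool = (contacts_ctrl + contacts_exp) / (calls_ctrl + calls_exp)
se = sqrt(p_pool * (1 - p_pool) * (1 / calls_ctrl + 1 / calls_exp))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))   # treats calls as independent -- optimistic

print(f"control {p1:.3f} vs experimental {p2:.3f}, z = {z:.2f}, p = {p_value:.3f}")
```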

Imputation of "e" as an extension of a survival model approach

The genesis of the idea for imputing "e" came from my process for estimating the fraction of missing information for an ongoing survey. I had to impute eligibility for cases each day so that I could impute survey values for the subset of eligible cases (including those with imputed eligibility). I thought, "hey, I'm already imputing 'e.' I just need to set it up that way." Along the way, I had to compare the method to the life table product-limit approach advocated by Brick et al. (POQ, 2002). I found a very nifty article by Efron (JASA, 1988) that compares life table methods to logistic regression. Essentially, for the discrete time case, the life table model produces the same results as a logistic regression model with a dummy variable for each time point. Efron then parameterizes the model with fewer parameters ($t$, $t^2$, and $t^3$, I believe) and shows how this compares to the life table product-limit nonparametric estimate. This artic
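The sketch below illustrates Efron's point on simulated person-period data (the hazard and the data are made up). The saturated logistic model, with a dummy for each time point, reproduces the life-table hazards; the cubic-in-time model smooths them with far fewer parameters.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulate person-period data for a discrete-time setting: one row per
# case per call attempt, event = case resolved at that attempt.
n_cases, max_t = 2000, 8
rows = []
for case in range(n_cases):
    for t in range(1, max_t + 1):
        p = 0.30 * np.exp(-0.25 * (t - 1))      # made-up declining hazard
        event = rng.random() < p
        rows.append({"case": case, "t": t, "event": int(event)})
        if event:
            break
pp = pd.DataFrame(rows)

# Saturated model: one dummy per time point.  Its fitted hazards match
# the life-table (product-limit) estimates exactly in discrete time.
saturated = smf.logit("event ~ C(t)", data=pp).fit(disp=0)

# Efron-style smooth parameterization: a cubic in t.
cubic = smf.logit("event ~ t + I(t**2) + I(t**3)", data=pp).fit(disp=0)

grid = pd.DataFrame({"t": np.arange(1, max_t + 1)})
print(pd.DataFrame({
    "t": grid["t"],
    "life_table_hazard": saturated.predict(grid),
    "cubic_hazard": cubic.predict(grid),
}))
```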

Imputation of "e"

I'm finishing up a presentation that I'll be giving at AAPOR (Saturday, May 15th at 2:15) on using imputation methods to estimate "e." I posted on this topic a while ago. I wanted to post one of the graphics that I developed for that presentation. I start from a very simple model that predicts eligibility using the natural logarithm of the last call number as the only predictor. That generates the following distribution of imputed eligible cases. The blue line shows the eligibility rates for the cases for which the eligibility status is known. The blue dashed line shows the model (a logistic regression model predicting eligibility using the natural log of the call number) prediction of eligibility. The green line shows the eligibility for the cases where the eligibility flag is imputed. The green line is the line used to estimate "e."
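A minimal sketch of that imputation step, with simulated data standing in for the real survey: fit the logistic model for eligibility on the cases with known status, then draw imputed eligibility flags for the unresolved cases from the fitted probabilities.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Hypothetical case-level data: last call number plus an eligibility flag
# that is observed for resolved cases and missing (NaN) otherwise.
cases = pd.DataFrame({"last_call": rng.integers(1, 25, size=5000).astype(float)})
cases["log_call"] = np.log(cases["last_call"])
true_p = 1 / (1 + np.exp(-(1.0 - 0.6 * cases["log_call"])))   # made-up truth
cases["eligible"] = rng.binomial(1, true_p).astype(float)
cases.loc[rng.random(len(cases)) < 0.4, "eligible"] = np.nan   # unresolved cases

known = cases.dropna(subset=["eligible"])
unknown = cases[cases["eligible"].isna()].copy()

# The simple model from the post: eligibility ~ log(last call number).
model = smf.logit("eligible ~ log_call", data=known).fit(disp=0)

# Impute eligibility for unresolved cases by drawing from the fitted
# probabilities (a single stochastic imputation; repeat for multiples).
p_hat = model.predict(unknown)
unknown["eligible_imputed"] = rng.binomial(1, p_hat)

# An estimate of "e": the eligibility rate among the imputed cases.
print("imputed eligibility rate (e):", round(unknown["eligible_imputed"].mean(), 3))
```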

Manual override II

We figured out a way to record every time a sample line is manually pulled up and called. As I mentioned in a previous post, I was concerned that this type of "manual override" might confound the results of an experiment we've been doing that compares different call scheduling algorithms. The good news is that it happens very infrequently. There were 166 such calls in March and 153 in April. In any given month, there are 13,000 to 14,000 calls made. So these manual overrides are a pretty insignificant part of that effort (about 1%). In addition, they don't seem to be concentrated in one experimental arm over the other.

Fraction of Missing Information (FMI) article

Robert M. Groves and colleagues created a list of alternatives to the response rate as a measure of data quality. One of the alternatives that was mentioned in that article is the Fraction of Missing Information (FMI). I now have an article in POQ on the use of FMI as a measure of survey data quality.

Manual Override

I've been running an experiment with our call scheduling algorithm in our telephone facility for a number of months now. I've written about some of the strange results that we've had. We modified the algorithm to cover almost all calls. But we've continued to have some strange results. I've always known that supervisors can manually override the algorithm and manage the sample "by hand." I've assumed that this activity has been minimal. Just to be sure, we've added a routine that records when this happens. This routine should allow us to determine the impact of this type of supervisor intervention. Of course, an experiment comparing an algorithm that allowed this type of intervention versus one that didn't would be ideal. For now, we'll see what the level of supervisor intervention is and which cases it impacts. If this doesn't explain the strange results, then I may be left to conclude that the two algorithms do actually contact differ

What's the goal?

The goal of the stopping rule for surveys is to govern the process with something that is related to the nonresponse bias (under an assumed, but reasonable, model). Although we don't discuss it much in the article, I like to speculate about the effect this might have on data collection practices. If the response rate is the key metric, then the data collection process should focus on interviewing the easiest-to-interview set of cases that meets the target response rate. Of course, it's a bit more 'random' than that in practice. But that should be the logic. What would the logic be under a different key metric (or stopping rule)? Would data collection organizations need to learn which cases get you to your target most efficiently? How would those cases be different? It seems that in this situation there would be a bigger reward for exploring the entire covariate space on the sampling frame. How do you go about doing that? There are a set of alternative indicators out

Stopping Rules Article

The article that Raghu and I wrote on stopping rules for surveys is available now as an early view article at Statistics in Medicine. The basic idea is that you should stop collecting data once the probability that new data will change your estimate is sufficiently small. As always, you want good paradata and sampling frame data to aid with this decision.
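The sketch below is not the rule from the article, just an illustration of the general idea under simple assumptions: impute plausible values for the not-yet-interviewed cases from a model using a frame covariate, and see how likely it is that completing them would move the estimate by more than a chosen threshold.

```python
import numpy as np

rng = np.random.default_rng(3)

def prob_estimate_changes(y_observed, x_observed, x_remaining,
                          threshold=0.01, n_sims=2000):
    """Sketch of a stopping check: simulate plausible values for the
    cases not yet interviewed (here, from a simple normal regression on
    one frame covariate x) and estimate the probability that finishing
    the remaining cases would shift the overall mean by more than
    `threshold`.  A made-up model; real applications need better models,
    good paradata, and frame data."""
    X = np.column_stack([np.ones(len(x_observed)), x_observed])
    beta, *_ = np.linalg.lstsq(X, y_observed, rcond=None)
    resid_sd = np.std(y_observed - X @ beta)

    current = y_observed.mean()
    n_obs, n_rem = len(y_observed), len(x_remaining)
    Xr = np.column_stack([np.ones(n_rem), x_remaining])

    shifts = np.empty(n_sims)
    for s in range(n_sims):
        y_sim = Xr @ beta + rng.normal(0, resid_sd, n_rem)
        full_mean = (y_observed.sum() + y_sim.sum()) / (n_obs + n_rem)
        shifts[s] = abs(full_mean - current)
    return (shifts > threshold).mean()

# Toy usage with fabricated data: stop when this probability is small.
x_all = rng.normal(size=1000)
y_all = 0.5 + 0.3 * x_all + rng.normal(0, 1, 1000)
responded = rng.random(1000) < 0.6
p_change = prob_estimate_changes(y_all[responded], x_all[responded],
                                 x_all[~responded], threshold=0.05)
print("P(estimate moves by more than 0.05):", round(p_change, 3))
```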

What are "paradata" anyway?

In preparing a manuscript, I found myself defining paradata to include interviewer observations. Interviewer observations are questions asked of the interviewer while they are in the process of attempting or completing an interview. They might range from information about the neighborhood of the sampled unit, to guesses about who might live in the sampled unit, to information about the person who was just interviewed. Classifying these observations as paradata seemed logical to me at the time, since they are not sampling frame data and they aren't reports from the sampled unit. But are they paradata? Are they generated by the process of data collection? Or does anything that doesn't come from the sampling frame or the respondent count as paradata by default? It's probably worth the exercise of developing a more precise definition. If only for the economy of language that such precision should afford.

Response Rates in Calling Experiment

I've been continuing with the experiments in call scheduling. January was the first month where there was a difference in response rates by treatment group. Generally, the response rates across the treatment arms (control and experimental) have been similar. But that doesn't necessarily mean the two methods obtain the same result. When I look at response rates by phase, even in prior months, it appears that the experimental method has a higher response rate in the calls prior to a refusal and a lower response rate in the calls after a refusal (even though both sets of calls are now governed by the algorithm). The following table shows the results from December and January (AAPOR RR2 by refusal status, NOT overall RR): January is the first month where the experimental group outperformed the control group in the refusal conversion phase. At first, I thought the refusal calls were more difficult in the experimental group than in the control. But maybe not. We are repeating the
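For reference, AAPOR RR2 counts complete plus partial interviews over all eligible and unknown-eligibility cases. The sketch below computes it from disposition counts; the numbers are made up, not the December or January results.

```python
def aapor_rr2(I, P, R, NC, O, UH, UO):
    """AAPOR Response Rate 2: complete plus partial interviews over all
    eligible and unknown-eligibility cases.
    I = complete interviews, P = partial interviews, R = refusals,
    NC = non-contacts, O = other eligible non-interviews,
    UH = unknown if household, UO = unknown other."""
    return (I + P) / ((I + P) + (R + NC + O) + (UH + UO))

# Hypothetical disposition counts, split by whether a case ever refused.
never_refused = dict(I=180, P=10, R=0, NC=220, O=15, UH=300, UO=40)
post_refusal  = dict(I=45,  P=5,  R=260, NC=90, O=10, UH=0, UO=5)

print("RR2, no prior refusal:   %.3f" % aapor_rr2(**never_refused))
print("RR2, refusal conversion: %.3f" % aapor_rr2(**post_refusal))
```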

Early Results from the Modified Experimental Calling Algorithm

The experiment on the timing of calls in an RDD survey has been continuing. It looks like the change that we made in the experimental algorithm has had an impact. The change basically extended the algorithm to govern the timing of refusal conversion calls as well. These calls were not governed by the algorithm (mainly for technical reasons) in the first few months of the experiment. The following table shows the experimental ("MLE") and control ("CON") groups broken down by whether the case was governed by the original algorithm (In Algorithm=1) or whether it was not (In Algorithm=0, i.e. the refusal conversion calls). In prior months, the "In Algorithm=0" cases were much less efficient, such that the overall efficiency was about the same for the experimental and control groups. The efficiency of the refusal conversion calls is still a bit lower for the experimental group. I'm still trying to understand why that's the case. It's better, but still not what