
Posts

Showing posts from 2014

Context and Daily Surveys

I've been reading a very interesting book on daily diary surveys. One of the chapters, by Norbert Schwarz, makes some interesting points about how frequent measurement might not be the same as a one-time measurement of similar phenomena. Schwarz points to his well-known studies that varied the scale of measurement. One of the questions asked how much TV people watch. One scale had a maximum of something like 10 or more hours per week, while the other had a maximum of 2.5 hours per week. The reported distributions changed across the two scales. It seems that people were taking normative cues from the scale, i.e. if 2.5 hours is a lot, "I must view less than that," or "I don't want to report that I watch that much TV when most other people are watching less." He points out that daily surveys may provide similar context clues about normative behavior. If you ask someone about depressive episodes every day, they may infer that…

Device Usage in Web Surveys

As I have been working on a web survey, I'm following more closely the devices that people are using to complete web surveys. The results from Pew make it seem that the younger generation will move away from PCs and access the internet through portable devices like smartphones. Some of these "portable" devices have become quite large. This trend makes sense to me. I can do many/most things from my phone. I heard on the news the other day that 25% of Cyber Monday shopping was done with tablets and phones. But some things are easier to do with a PC. Do surveys fit into the latter group? Peter Lugtig posted about a study he is working on that tracks the device used across waves of a panel survey. It appears that those who start on a PC stay on a PC, but those who start on a tablet or phone are more likely to switch to a PC. He also notes that if you used a tablet or phone in an early wave, you are less likely to do the survey at all in the next wave. I didn't read…

Tiny Data...

I came across this interesting post about building a Bayesian model with careful specification of priors. The problem is that they have "tiny" data, so the priors play an important role in the analysis. I liked this idea of "tiny" data. The rush to solve problems for "big data" has obscured the fact that there are interesting problems in situations where you don't have much data. Frost Hubbard and I looked at a related problem in a recently published article. We look at the problem of estimating response propensities during data collection. In the early part of the data collection, we don't have much data to estimate these models. As a result, we would like to use "prior" data from another study. However, this prior information needs to be well-matched to the current study -- i.e. have the same design features, at least approximately. This doesn't always work. For example, I might have a new study with a different incentive than I…
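As a rough illustration of how a prior from another study behaves when the current data are tiny, here is a minimal sketch in Python. The counts, the downweighting factor, and the conjugate Beta-Binomial setup are all my own assumptions for illustration, not the model from the post or from our article.

```python
# Sketch: an informative prior for an early-field response propensity.
# All numbers are hypothetical; the point is how much a "prior study"
# dominates the estimate when the current study has tiny data.
from scipy.stats import beta

# Prior study: 45 responses out of 300 cases, downweighted because the
# design features only approximately match the current study.
prior_responses, prior_nonresponses = 45, 255
downweight = 0.2  # shrink the prior's effective sample size

a0 = 1 + downweight * prior_responses
b0 = 1 + downweight * prior_nonresponses

# Current study, early in the field period: only 12 cases resolved so far.
y, n = 1, 12

posterior = beta(a0 + y, b0 + (n - y))
print("Posterior mean propensity:", round(posterior.mean(), 3))
print("95% credible interval:", [round(q, 3) for q in posterior.interval(0.95)])
```

With only 12 resolved cases, even the downweighted prior carries most of the information; a badly mismatched prior (say, from a study with a very different incentive) would pull the estimate in the wrong direction, which is the matching problem described above.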

Interviewer Travel and New Forms of Data

The Director of the Census Bureau, John Thompson, recently blogged about a field test for the 2020 Decennial Census Nonresponse Follow-up. They are testing a number of new features, including the use of smartphones in data collection. I've been working with GPS data from smartphones used by field interviewers. The data are complex, but may offer new insights into interviewer travel. Think of travel as a broad concept -- it's not just an expense or efficiency issue. The order in which calls are made may also relate to field outcomes like contact and response rates. Perhaps these GPS data can help us understand how interviewers currently make decisions about how to work their sample. For example, do they move past sampled housing units when they first arrive in the area segment? Is this action associated with higher contact rates? Of course, travel is also an expense or efficiency issue. I wouldn't want pushing for more efficient travel to interfere with other aspects of…

Happy Halloween!

OK, this actually is a survey-related post. I read this short article about an experiment where some kids got a candy bar and other kids got a candy bar and a piece of gum. The latter group was less happy. It seems counter-intuitive, but in the latter group, the "trajectory" of the quality of the treats is getting worse. It turns out that this is a phenomenon that other psychologists have studied. This might be a potential mechanism to explain why sequence matters in some mixed-mode studies, assuming that other factors aren't confounding the issue.

Quantity becomes Quality

A big question facing our field is whether it is better to adjust data collection or to make post-data collection adjustments to the data in order to reduce nonresponse bias. I blogged about this a few months ago. In my view, we need to do both. I'm not sure how the argument goes that says we only need to adjust at the end; I'd like to hear more of it. In my mind, it must rest on an assumption that once you condition on the frame data, the biases disappear, and that this assumption is valid at all points during the data collection. That must be a caricature -- which is why I'd like to hear more of the argument from a proponent of the view. In my mind, that assumption may or may not be true. That's an empirical question. But it seems likely that at some point in the process of collecting data, particularly early on, that assumption is not true. That is, the data are NMAR, even when I condition on all my covariates (sampling frame and paradata). Put another way, in a cell adjustment…
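To make that assumption concrete, here is a small simulation sketch (entirely synthetic data, my own construction): when response depends on Y even within cells of the frame variable X, a weighting-class ("cell") adjustment leaves bias behind.

```python
# Sketch: a cell (weighting-class) adjustment when the data are NMAR within
# cells of the frame variable. Synthetic data only.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

x = rng.integers(0, 2, n)                  # frame variable defining two cells
y = 10 + 5 * x + rng.normal(0, 2, n)       # survey outcome, related to x

# Response depends on x AND on y: NMAR even after conditioning on x.
p_resp = 1 / (1 + np.exp(-(-1 + 0.5 * x + 0.3 * (y - y.mean()))))
r = rng.random(n) < p_resp

# Cell adjustment: weight respondent means up to each cell's share of the frame.
est = 0.0
for cell in (0, 1):
    in_cell = x == cell
    cell_mean = y[in_cell & r].mean()      # respondent mean within the cell
    est += in_cell.mean() * cell_mean

print("True mean:              ", round(y.mean(), 3))
print("Unadjusted respondents: ", round(y[r].mean(), 3))
print("Cell-adjusted estimate: ", round(est, 3))   # still biased
```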

Decision Support and Interviewer Compliance

When I was working on my dissertation, I got interested in a field of research known as decision support. It uses technical systems to help people make decisions. These technical systems help to implement complex algorithms (i.e. complicated if... then decision rules) and may include real-time data analysis. One of the reasons I got interested in this area was that I was wondering about implementing complicated decision algorithms (e.g. highly tailored, including to incoming paradata) in the field. One of the problems associated with decision support has to do with compliance. Fortunately, Kawamoto and colleagues did a nifty systematic review of the literature to see what factors were related to compliance in a clinical setting. These features might be useful in a survey setting as well. They are:
1. The decision support should be part of the workflow.
2. It should deliver recommendations, not just information.
3. The support should be delivered at the time the decision is made…

Training Works... Until it Doesn't

I recently had need for several citations showing that training interviewers works. Of course, Fowler and Mangione show that training can improve interviewer performance in delivering a questionnaire. Groves and McGonagle also show that training can have an impact on cooperation rates. But then I also thought of the example from Campanelli and colleagues where experienced interviewers preferred to make call attempts during the day, when these attempts would be less successful, despite training that other times would work better. So, an interesting question: when does training work? And when does it not?

Sensitivity Analysis and Nonresponse Bias

For a while now, when I talk about the risk of nonresponse bias, I have suggested that researchers look at the problem from as many different angles as possible, employing varied assumptions. I've also pointed to work by Andridge and Little that uses proxy pattern-mixture models and a range of assumptions to do sensitivity analysis. In practice, these approaches have been rare. A couple of years ago, I saw a presentation at JSM that discussed a method for doing sensitivity analyses for binary outcomes in clinical trials with two treatments. The method they proposed was graphical and seemed like it would be simple to implement. An article on the topic has now come out. I like the idea and think it might have applications in surveys. All we need are binary outcomes where we are comparing two groups. It seems that there are plenty of those situations.
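I don't have the article's exact method in front of me, but the general flavor of this kind of sensitivity analysis is easy to sketch: for a binary outcome compared across two groups, sweep an assumed prevalence among the nonrespondents in each group and see how far the estimated difference can move. The response rates and prevalences below are hypothetical.

```python
# Sketch: sensitivity analysis for a binary outcome compared across two
# groups, sweeping assumptions about the nonrespondents. Hypothetical numbers;
# the published method differs in its details.
import numpy as np

def full_sample_rate(resp_rate, resp_prev, assumed_nonresp_prev):
    """Blend the respondent prevalence with an assumed nonrespondent prevalence."""
    return resp_rate * resp_prev + (1 - resp_rate) * assumed_nonresp_prev

groups = {"A": dict(resp_rate=0.60, resp_prev=0.35),
          "B": dict(resp_rate=0.45, resp_prev=0.30)}

print("assumed nonresp prev (A, B) -> estimated difference A - B")
for prev_a in np.arange(0.2, 0.61, 0.2):
    for prev_b in np.arange(0.2, 0.61, 0.2):
        est_a = full_sample_rate(**groups["A"], assumed_nonresp_prev=prev_a)
        est_b = full_sample_rate(**groups["B"], assumed_nonresp_prev=prev_b)
        print(f"  ({prev_a:.1f}, {prev_b:.1f}) -> {est_a - est_b:+.3f}")
```

Plotting a grid like this is one way to get the kind of graphical display the presenters seemed to have in mind.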

Web Panels vs Mall Intercepts

I saw this interesting article that just came out. It called to my mind a talk that was hosted here a few (8?) years ago. The talk was by someone from a major corporation who described how they switched product testing from church basements to online panels. They found that once they switched, the data became worse. The online panels picked products that ended up failing at higher rates. This seemed like a tough problem. There isn't much of a "nonresponse" kind of relationship here. But at least understanding the mechanism that got people into online panels, and how they were then selected and agreed to participate in this kind of product testing, seemed important. It's not my area, so I'm wondering if this has ever been done. Not that anyone fully understood the process of recruiting people to participate in product testing in church basements, but that process at least worked. This new article looks at an old process -- mall intercepts -- for recruiting people…

Identifying all the components of a design, again...

In my last post I talked about identifying all the components of a design. At the least, identifying them is an important step if we want to consider randomizing them. Of course, it's not necessary... or even feasible... or even desirable to do a full factorial design for every experiment. But it is still good to at least mentally list the potentially active components. I first started thinking about this when I was doing a literature review for a paper on mixed-mode designs. Most of these designs seemed to confound some elements of the design. The main thing I was looking for: could I find any examples where someone had varied only the sequence of modes? The problem was that most studies also varied the dosage of modes. For example, in a mixed-mode web-telephone design, I could find studies that had web-telephone and telephone-web comparisons, but these sequences also varied the dosage. So, telephone first gets up to 10 calls, but telephone second gets 2 calls. Web first gets 3 emails…

Identifying all the active components of the design...

I've been reading papers on email prenotification and reminders. They are very interesting. There are usually several important features for these emails: how many are sent, the lag between messages, the subject line, the content of the email (length, etc.), the placement of the URL, and so on. A full factorial design with all these factors is nearly impossible. So folks do the best they can and focus on a few of these features. I've been looking at papers on how many messages were sent, but I find that the lag time between messages also varies a lot. It's hard to know which of these dimensions is the "active" component. It could be either or both, and there may even be synergies (aka "interactions") between the two (and between other dimensions of the design as well). Linda Collins and colleagues talk about methods for identifying the "active components" of the treatments in these complex situations. Given the complexity of these designs, with a large number…
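Just to make the combinatorial problem concrete, here is a toy sketch that enumerates a full factorial over a handful of email features. The factors and levels are invented; the point is only how quickly the number of cells grows.

```python
# Sketch: the size of a full factorial design over email prenotification
# features. Factors and levels are invented for illustration.
from itertools import product

factors = {
    "n_messages":    [1, 2, 3],
    "lag_days":      [2, 4, 7],
    "subject_line":  ["plain", "personalized"],
    "length":        ["short", "long"],
    "url_placement": ["top", "bottom"],
}

cells = list(product(*factors.values()))
print("Cells in the full factorial:", len(cells))   # 3 * 3 * 2 * 2 * 2 = 72

# A few example treatment combinations:
for combo in cells[:3]:
    print(dict(zip(factors, combo)))
```

Screening designs of the kind Collins and colleagues describe (e.g. fractional factorials) are about getting at the active components without running all of those cells.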

Big Data and Survey Data

I missed Dr. Groves' blog post on this topic. It is an interesting perspective on the strengths and weaknesses of each data source. His solution is to "blend" data from both sources to compensate for the weaknesses of each. Dr. Couper spoke along similar lines at the ESRA conference last year. An important takeaway from both of these is that surveys have an important place in the future. Surveys gather, relative to big data, rich data on individuals that allow the development and testing of models that may be used with big data. Or they can provide benchmarks for big data estimates where the characteristics of the population are only vaguely known. In any event, I'm not worried that surveys or even probability sampling have outlived their usefulness. But it is good to chart a course for the future that will keep survey folks relevant to these pressing problems.

Probability Sampling

In light of the recent kerfuffle over probability versus non-probability sampling, I've been thinking about some of the issues involved with this distinction. Here are some thoughts that I use to order the discussion in my own head:
1. The research method has to be matched to the research question. This includes cost versus quality considerations. Focus groups are useful methods that are not typically recruited using probability methods. Non-probability sampling can provide useful data. Sometimes non-probability samples are called for.
2. A role for methodologists in the process is to test and improve faulty methods. Methodologists have been looking at errors due to nonresponse for a while. We have a lot of research on using models to reduce nonresponse bias. As research moves into new arenas, methodologists have a role to play there. While we may (er... sort of) understand how to adjust for nonresponse, do we know how to adjust for an unknown probability of getting into an online panel?

The Dual Criteria for a Useful Survey Design Feature

I've been working on a review of patterns of nonresponse to a large survey on which I worked. In my original plan, I looked at things that are related to response, and then I looked at things that are related to key statistics produced by the survey. "Things" include design features (e.g. number of calls, refusal conversions, etc.) and paradata or sampling frame data (e.g. Census Region, interviewer observations about the sampled unit, etc.). We found that there were some things that heavily influenced response (e.g. calls) but did not influence the key statistics. Good, since more or less of that feature, although important for sampling error, doesn't seem important with respect to nonresponse bias. There were also some that influenced the key statistics but not response -- for example, the interviewer observations we have for the study. The response rates are close across subgroups defined by these observations. As a result, I won't have to rely on large weights to get to unbiased estimates…
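A minimal sketch of that dual screen, with made-up data and hypothetical variable names: check each auxiliary variable's association with response and with a key survey outcome, and pay attention to the ones that matter for both.

```python
# Sketch: screening auxiliary variables on the dual criteria -- association
# with response AND with a key survey outcome. Data and names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "n_calls":       rng.poisson(4, n),
    "census_region": rng.integers(1, 5, n),
    "ivw_obs_kids":  rng.integers(0, 2, n),   # interviewer observation
})
df["responded"] = rng.random(n) < 1 / (1 + np.exp(-(-0.5 + 0.2 * df.n_calls)))
df["key_y"] = 2 * df.ivw_obs_kids + rng.normal(0, 1, n)  # key outcome (used only for respondent rows below)

for var in ["n_calls", "census_region", "ivw_obs_kids"]:
    corr_resp = df[var].corr(df.responded.astype(float))
    corr_y = df.loc[df.responded, var].corr(df.loc[df.responded, "key_y"])
    print(f"{var:15s} corr with response {corr_resp:+.2f}, "
          f"corr with key Y (respondents) {corr_y:+.2f}")
```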

"Go Big, or Go Home."

I just got back from JSM, where I participated in a session on adaptive design. Mick Couper served as a discussant for the session. The title of this blog post is one of the points from his talk. He said that innovative, adaptive methods need to show substantial results; otherwise, they won't be convincing. As he pointed out, part of the problem is that we are often tinkering with marginal changes on existing surveys. These kinds of changes need to be low risk, that is, they can't cause damage to the results and should only help. However, these kinds of changes are often limited in what they can do. His point was that making the big changes that will show big effects may require some risk. This made sense to me. It would be nice to have some methodological studies that aren't constrained by the needs of an existing survey. I suppose this could be a separate, large sample with the same content as an existing survey. However, I wonder if this is a chicken-or-egg type of problem.

Better to Adjust with Weights, or Adjust Data Collection?

My feeling is that this is a big question facing our field. In my view, we need both of these to be successful. The argument runs something like this: if you are going to use those variables (frame data and paradata) for your nonresponse adjustments, then why bother using them to alter your data collection? Wouldn't it be cheaper to just use them in your adjustment strategy? There are several arguments that can be used when facing these kinds of questions. The main point I want to make here is that I believe this is an empirical question. Let's call X my frame variable and Y the survey outcome variable. If I assume that the relationship between X and Y is the same no matter what the response rate is within categories of X, then, sure, it might be cheaper to adjust. But that doesn't seem to be true very often. And that is an empirical question. There are two ways to examine this question. [Well, whenever someone says definitively that there are "two ways of doing something"…
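Here is one way to sketch the empirical check, using synthetic data (my own construction, not from any real survey): within each category of X, compare the respondent mean of Y at a low response rate (say, only the cases that responded by the third call) to the mean once everyone has responded. If those means differ, adjusting on X alone at the lower response rate would not have removed the bias.

```python
# Sketch: does the X-Y relationship among respondents change as the response
# rate climbs? Synthetic data; the call at which a case responded stands in
# for effort level. Within each category of X, compare early vs. final means.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 20_000
x = rng.integers(0, 3, n)                  # frame variable, 3 categories
y = 1.0 * x + rng.normal(0, 1, n)          # survey outcome
# Later responders have higher y within each x (so MAR-given-x fails early):
resp_call = rng.integers(1, 11, n) + (y > y.mean())

df = pd.DataFrame({"x": x, "y": y, "resp_call": resp_call})
early = df[df.resp_call <= 3]              # low response rate snapshot
final = df                                 # everyone has responded by the end

summary = pd.DataFrame({
    "early_mean_y": early.groupby("x").y.mean(),
    "final_mean_y": final.groupby("x").y.mean(),
})
print(summary.round(3))
```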

Responsive Design and Uncertainty

To my mind, a key reason for responsive designs is uncertainty. This uncertainty can probably occur in at least two ways. First, at the survey level, I can be uncertain about what response rate a certain protocol can elicit. If I don't obtain the expected response rate after applying the initial protocol, then I can change the protocol and try a different one. Second, I can be uncertain about which protocol to apply at the case level, but I know what the protocol will be after I have observed a few initial trials of some starting protocol. For example, I might call a case three times on the telephone with no contact before I conclude that I should attempt the case face-to-face. In either situation, I'm not certain about which protocol specific cases will get. But I do have a pre-specified plan that will guide my decisions during data collection. There is a difference, though, in that in the latter situation (case level), I can predict that a proportion of cases will receive the…
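The case-level version of that uncertainty can be written down as a pre-specified rule. A trivial sketch, using the example thresholds from the post (the function and field names are mine):

```python
# Sketch: a pre-specified, case-level protocol rule of the kind described
# above -- switch a case to face-to-face after three telephone attempts
# with no contact. Thresholds and names are illustrative.

def next_protocol(phone_attempts: int, any_contact: bool) -> str:
    """Return the protocol to apply to a case given its call history so far."""
    if any_contact:
        return "telephone"          # keep working the case by phone
    if phone_attempts >= 3:
        return "face-to-face"       # uncertainty resolved by the observed trials
    return "telephone"

# Example: a case with 3 no-contact phone attempts gets switched.
print(next_protocol(phone_attempts=3, any_contact=False))   # face-to-face
print(next_protocol(phone_attempts=2, any_contact=False))   # telephone
```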

Classification Problems with Daily Estimates of Propensity Models

A few years ago, I ran several experiments with a new call-scheduling algorithm. You can read about it here. I had to classify cases based upon which call window would be the best one for contacting them. I had four call windows. I ranked them in order, for each sampled number, from best to worst probability of contact. The model was estimated using data from prior waves of the survey (cross-sectional samples) and the current data. For a paper that will be coming out soon, I looked at how often these classifications changed when you used the final data compared to the interim data. The following table shows the difference between the two rankings:

Change in Ranking    Percent
0                    84.5
1                    14.1
2                    1.4
3                    0.1

It looks like the rankings didn't change much. 85% were the same. 14% changed one rank. What is difficult to know is what difference these classification errors might make in the…
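A small sketch of how a comparison like the one in the table can be computed, with invented propensities standing in for the interim and final model estimates: rank the four call windows under each set of estimates and tabulate how far the best window's rank moves.

```python
# Sketch: how often does the best-to-worst ranking of four call windows change
# between interim and final propensity estimates? Propensities are invented.
import numpy as np

rng = np.random.default_rng(3)
n_cases, n_windows = 1000, 4

final = rng.random((n_cases, n_windows))             # "final-data" propensities
interim = final + rng.normal(0, 0.05, final.shape)   # noisier interim estimates

# Rank windows 0 (best) to 3 (worst) for each case under each set of estimates.
rank_final = np.argsort(np.argsort(-final, axis=1), axis=1)
rank_interim = np.argsort(np.argsort(-interim, axis=1), axis=1)

# For each case, how far did its (final) best window move in the interim ranking?
best = np.argmin(rank_final, axis=1)
shift = np.abs(rank_interim[np.arange(n_cases), best]
               - rank_final[np.arange(n_cases), best])

values, counts = np.unique(shift, return_counts=True)
for v, c in zip(values, counts):
    print(f"rank change of {v}: {100 * c / n_cases:.1f}%")
```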

Responsive Design is not just Two-Phase Sampling

I recently gave, along with Brady West, a short course on paradata and responsive design. We had a series of slides on what "responsive design" is. I had a slide with a title similar to that of this post. I think it was "Responsive Design is not equal to Two-Phase Sampling." I sometimes have a discussion with people about using "responsive design" on their survey, but I get the sense that what they really want to know about is two-phase sampling for nonresponse. In fact, two-phase sampling, to be efficient, should have different cost structures across the phases. But the requirements for a responsive design are higher than that. Groves and Heeringa also argued that the phases should have 'complementary' design features. That is, each phase should be attractive to different kinds of sampled people. The hope is that nonresponse biases of prior phases are cancelled out by the biases of subsequent phases. Further, responsive designs can exist without…

Formalizing the Optimization Problem

I heard Andy Peytchev speak about responsive design recently. He raised some really good points. One of these was a "total survey error" kind of observation. He pointed out that different surveys have different objectives and that these may be ranked differently. One survey may prioritize sampling error while another has nonresponse bias as its biggest priority. As there are always tradeoffs between error sources, the priorities indicate which way those decisions were or will be made. Since responsive design has largely been thought of as a remedy for nonresponse bias, this idea seems novel. Of course, it is worth recalling that Groves and Heeringa did originally propose the idea from a total survey error perspective. On the other hand, many of their examples were related to nonresponse. I think it is important to 1) think about these tradeoffs in errors and costs, 2) explicitly state what they are for any given survey, and 3) formalize the tradeoffs. I'm not sure that…

"Failed" Experiments

I ran an experiment a few years ago that failed. I mentioned it in my last blog post. I reported on it in a chapter in the book on paradata that Frauke edited. For the experiment, I offered a recommended call time to interviewers. The recommendations were delivered for a random half of each interviewer's sample. They followed the recommendations at about the same rate whether they saw them or not (20% compliance). So, basically, they didn't follow the recommendations. In debriefings, interviewers said "we call every case every time, so the recommendations at the housing unit were a waste of time." This made sense, but it also raised more questions for me. My first question was, why don't the call records show that? Either they exaggerated when they said they call "every" case every time, or there is underreporting of calls. Or both. At that point, using GPS data seemed like a good way to investigate this question. Once we started examining the GPS data…

Setting an Appointment for Sampled Units... Without their Assent

Kreuter, Mercer, and Hicks have an interesting article in JSSAM on a panel study, the Medical Expenditure Panel Survey (MEPS). They note my failed attempt to deliver recommended calling times to interviewers. They had a nifty idea... preload the best time to call as an appointment. Letters were sent to the panel members announcing the appointment. Good news: this method improved efficiency without harming response rates. There was some worry that setting appointments without consulting the panel members would turn them off, but that didn't happen. It does remind me of another failed experiment I did a few years ago. Well, it wasn't really an experiment, just a design change. We decided that it would be good to leave answering machine messages on the first telephone call in an RDD sample. In the message, we promised that we would call back the next evening at a specified time. Like an appointment. Without experimental evidence, it's hard to say, but it did seem to increase…

Costs of Face-to-Face Call Attempts

I've been working on an experiment where evaluating cost savings is an important outcome. It's difficult to measure costs in this environment. Timesheets and call records are recorded separately. It's difficult to parse out the travel time from other time. One study actually shadowed a subset of interviewers in order to generate more accurate cost estimates. This is an expensive means to evaluate costs that may not be practical in many situations. It might be that increasing computerization does away with this problem. In a telephone facility, everything is timestamped so we can calculate how long most call attempts take. It might be that we will be able to do this in face-to-face studies soon/already.
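For the telephone-facility case, the computation from timestamps is straightforward. A small sketch with made-up call records (the column names are hypothetical):

```python
# Sketch: deriving call-attempt durations from timestamps in a call record
# file. The records and column names are made up for illustration.
import pandas as pd

calls = pd.DataFrame({
    "interviewer": ["A", "A", "A", "B", "B"],
    "start": pd.to_datetime([
        "2014-03-01 18:00", "2014-03-01 18:07", "2014-03-01 18:21",
        "2014-03-01 18:02", "2014-03-01 18:30",
    ]),
    "end": pd.to_datetime([
        "2014-03-01 18:05", "2014-03-01 18:19", "2014-03-01 18:25",
        "2014-03-01 18:28", "2014-03-01 18:41",
    ]),
})

calls["minutes"] = (calls.end - calls.start).dt.total_seconds() / 60
print(calls)
print("Mean minutes per attempt, by interviewer:")
print(calls.groupby("interviewer").minutes.mean())
```

The hard part in face-to-face studies is that there is no "end" timestamp for travel, which is why the timesheets and call records have to be reconciled.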

Tracking, Again

Last week, I mentioned an experiment that we ran with changing the order of tracking steps. I noted that the overall result was that the original, expert-chosen order worked better than the new, proposed order. In this example, the costs weren't all that different. But I could imagine situations where there are big differences in the costs of the different steps. In that case, the order could have big cost implications. I'm also thinking that a common situation is where you have lots of cheap (and somewhat ineffective) steps and one expensive (and effective) step. I'm wondering if it would be possible to identify cases that should skip the cheap treatments and go right to the expensive treatment, just as a cost savings measure. It would have to result in the same chance of locating the person. In other words, the skipped steps would have to provide the same or less information than the costly step. My hunch is that such situations actually exist. The trick is finding…

Tracking: Does Sequence Matter?

I've wanted to run an experiment like this for a while. When we do tracking here, we either run a standard protocol, which is a series of "tracking steps" carried out in a specific order, or we let the tracking team decide which order to run the steps in. In cases where we run a standard protocol, experts decide which order to run the steps in. Generally, the cheapest steps are first on the list. The problem is that you can't evaluate the effectiveness of each step because each one deals with a different subgroup (i.e. those that didn't get found in the previous step). I only know of one experiment that varied the order of steps. Well, I finally found a survey where varying the order wasn't too objectionable, and I got them to vary it. We recently finished the survey and found that... the original order worked better. The glass-half-full view: it did make a difference which order you used. And the experts did choose that one.

More on Measurement Error

I'm still thinking about this problem. For me, it's much simpler conceptually to think of this as a missing data problem. Andy Peytchev's paper makes this point. If I have the "right" structure for my data, then I can use imputation to address both nonresponse and measurement error. If the measurement error is induced differently across different modes, then I need to have some cases that receive measurements in both modes. That way, I can measure differences between modes and use covariates to predict when this happens. The covariates, as I discussed last week, should help identify which cases are susceptible to measurement error. There is some work on measuring whether someone is likely to be influenced by social desirability. I think that will be relevant for this situation. That sounds sort of like, "so you don't want to tell me the truth about x, but at least you will tell me that you don't want to tell me that." Or something like that…

Covariates of Measurement Error

I've been working on some mixed-mode problems where nonresponse and measurement error are confounded. I recently read an interesting article on using adjustment models to disentangle the two sources of error. The article is by Vannieuwenhuyze, Loosveldt, and Molenberghs. They suggest that you can make adjustments for measurement error if you have things that predict when those errors occur. They give specific examples: things that measure social conformity and other hypothesized mechanisms that lead to response error. This was very interesting to read about. I suppose that, just as with nonresponse, the predictors of this error -- in order to be useful -- need to predict both when those errors occur and the survey outcome variables themselves. This is a new and difficult task... but one worth solving, given the push to use mixed-mode designs.

Proxy Y's

My last post was a bit of crankiness about the term "nonresponse bias." There is a bit of terminology, on the other hand, that I do like -- "Proxy Y's." We used this term in a paper a while ago. The thing that I like about this term is that it puts the focus on the prediction of Y. Based on the paper by Little and Vartivarian (2005), this seemed like a more useful thing to have, and we spent time looking for things that could fit the bill. If we have something like this, the difference between responders and the full sample might be a good proxy for bias in the actual Y's. I'm not backtracking here -- it's still not "nonresponse bias" in my book. It's just a proxy for it. The paper we wrote found that good proxy Y's are hard to find. Still, it's worth looking. And, as I said, the term keeps us focused on finding these elusive measures.

When should we use the term "nonresponse bias"?

Maybe I'm just being cranky, but I'm starting to think we need to be more careful about when we use the term "nonresponse bias." It's a simple term, right? What could be wrong here? The situation that I'm thinking about is when we are comparing responders and nonresponders on characteristics that are known for everyone. This is a common technique. It's a good idea. Everyone should do this to evaluate the quality of the data. My issue is when we start to describe the differences between responders and nonresponders on these characteristics as "nonresponse bias." These differences are really proxies for nonresponse bias. We know the value for every case, so there isn't any nonresponse bias. The danger, as I see it, is that naive readers could miss that distinction. And I think it is an important distinction. If I say "I have found a method that reduces nonresponse bias," what will some folks hear? I think such a statement is…

The Nonresponse-Measurement Error Nexus... in Reverse

I saw this very interesting post linking measurement error and nonresponse in a new way. Instead of looking at whether difficult-to-respond cases exhibit more measurement error, Peter Lugtig looks at whether cases with poor measurement attrit from a panel. If this works, these kinds of behaviors during the survey could be a very useful tailoring variable. They can be signals of impending attrition. One hypothesis about these cases is that they may not have sufficient commitment to the task. They do it poorly and opt out more quickly. The million dollar question is, how do we get them to commit to the task?

Defining Phases, Again

The other thing I should have mentioned in my last post is the level at which the phase is defined. We tend to think of phases as points in time for area probability samples. This is because in a cluster sample, we want to save on travel, and taking a subsample of cases within a cluster doesn't save on travel. So, we tend to use time to find the point at which sampling could occur. But we could trigger these decisions using some other criteria. A few years ago, I tried to develop a model that detected when there was a change in the cost structure -- that is, when costs go up. The problem was that the model couldn't detect the change until a few days later. Sometimes, it never detected it at all. Still, I like the idea of dynamically detecting the boundary of the phases.

Defining phases

I have been working on a presentation on two-phase sampling. I went back to an old example from an RDD CATI survey we did several years ago. In that survey, we defined phase 1 using effort level. The first 8 calls were phase 1. A subsample of cases was then selected to receive 9+ calls. It was nice in that it was easy to define the phase boundary, which meant that it was easy to program. But the efficiency of the phased approach relied upon there being differences in costs across the phases. Which, in this case, means that we assume that cases in phase two require similar levels of effort to be completed. This is like assuming a propensity model with calls as the only predictor. Of course, we usually have more data than that. We probably could create more homogeneity in phase 2 by using additional information to estimate response probabilities. I saw Andy Peytchev give a presentation where they implemented this idea. Even just the paradata would help. As an example, consider two…
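A quick sketch of that phase boundary, with made-up data: phase 1 ends after 8 calls, and the still-active cases are subsampled for phase 2, carrying a weight for the subsampling.

```python
# Sketch: a phase boundary defined by effort level (8 calls), with a
# subsample of remaining active cases sent to phase 2. The data and the
# subsampling rate are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 10_000
df = pd.DataFrame({
    "case_id": range(n),
    "calls_so_far": 8,
    "resolved": rng.random(n) < 0.55,   # finalized (interview or other) in phase 1
})

phase2_pool = df[~df.resolved]
subsample_rate = 0.5
selected = phase2_pool.sample(frac=subsample_rate, random_state=1).copy()
selected["phase2_weight"] = 1 / subsample_rate   # weight for the phase 2 subsample

print("Active after 8 calls:", len(phase2_pool))
print("Selected for phase 2 (9+ calls):", len(selected))
```

Using estimated response probabilities instead of a flat subsampling rate is the refinement described above: sample phase 2 cases at rates that make the phase 2 workload more homogeneous.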

Monitoring Daily Response Propensities

I've been working on this paper for a while. It compares models estimated in the middle of data collection with those estimated at the end of data collection. It points out that these daily models may be vulnerable to biased estimates akin to the "early vs. late" dichotomy that is sometimes used to evaluate the risk of nonresponse bias. The solution is finding the right prior specification in a Bayesian setup, or using the right kind and amount of data from a prior survey, so that estimates will have sufficient "late" responders. But I did manage to produce this figure, which shows the estimates from the model fit each day with the data available that day ("Daily") and the model fit at the end of data collection ("Final"). The daily model is overly optimistic early on. For this survey, there were 1,477 interviews. The daily model predicted there would be 1,683. The final model predicted 1,477. That's the average "optimism."

What would a randomized call timing experiment look like?

It's one thing to compare different call-scheduling algorithms. You can compare two algorithms and measure the performance using whatever metrics you want (efficiency, response rate, survey outcome variables). But what about comparing estimated contact propensities? There is an assumption often employed that these calls are randomly placed. This assumption allows us to predict what would happen under a diverse set of strategies -- e.g. placing calls at different times. Still, this had me wondering what a truly randomized experiment would look like. The experiment would best be randomized sequentially, as this can result in more efficient allocation. We'd then want to randomize each "important" aspect of the next treatment. This is where it gets messy. Here are two of these features: 1. Timing. The question is how to define this. We can define it using "call windows." But even the creation of these windows requires assumptions... and tradeoffs…

More methods research for the sake of methods...

In my last post, I suggested that it might be nice to try multiple survey requests on the same person. It reminded me of a paper I read a few years back on response propensity models that suggested continuing to call after the interview is complete, just so that you can estimate the model. At the time, I thought it was sort of humorous to suggest that. Now I'm drawing closer to that position. Not for every survey, but it would be interesting to try. In addition to validating estimated propensities at the person level, this might be another way to assess predictors of nonresponse that we can't normally assess. Peter Lugtig has an interesting paper and blog post about assessing the impact of personality traits on panel attrition. He suggests that nonresponse to a one-time, cross-sectional survey might have a different relationship to personality traits. Such a model could be estimated for a cross-sectional survey of employees who have all taken a personality test…

Estimating Response Probabilities for Surveys

I recently went to a workshop on adaptive treatment regimes. We were presented with a situation where they were attempting to learn about the effectiveness of a treatment for a chronic condition like addiction to smoking. The treatment is applied at several points over time, and can be changed based on changes in the condition of the person (e.g. they report stronger urges to smoke). In this setup, they can learn effective treatments at the patient level. In surveys, we only observe successful outcomes one time. We get the interview, we are done. We estimate response propensities by averaging over sets of cases; within any set, we assume that each person is exchangeable. We don't estimate them by observing responses to multiple survey requests from the same person. Even panel surveys are only a little different. The follow-up interviews are often only with cases that responded at t=1. Even when there is follow-up with the entire sample, we usually leverage the fact that this is a follow-up to…

Use of Prior Data in Estimation of Daily Propensity Models

I'm working on a paper on this topic. One of the things that I've been looking at is the accuracy of predictions from models that use data collected during the field period. I think of this as a missing data problem. The daily models can yield estimates that are biased. For example, estimates based on today might overestimate the number of interviews tomorrow. This can happen if my estimate of the number of interviews to expect on the third call is based on a select set of cases that responded more easily (compared to the cases that haven't received a third call). One of the examples in the paper comes from contact propensity models I built for a monthly telephone survey a few years ago. Since it is monthly, I could use data from prior months. Getting the right set of prior data (or, in a Bayesian perspective, priors) is important. I found that the prior months' data had a contact rate of 9.4%. The current month had a contact rate of 10.9%, but my estimates for the current month…
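As a back-of-the-envelope sketch of the blending, using the 9.4% and 10.9% figures from the post but with hypothetical call counts and a hypothetical prior weight:

```python
# Sketch: blending prior months' call records with early current-month data
# to estimate a contact rate. The 9.4% and 10.9% rates come from the post;
# the counts and the prior's effective size are hypothetical.
prior_rate = 0.094                         # contact rate in prior months
current_contacts, current_calls = 55, 500  # early current-month counts (made up)

prior_effective_calls = 2000               # how much weight to give prior months
a0 = prior_rate * prior_effective_calls
b0 = (1 - prior_rate) * prior_effective_calls

blended = (a0 + current_contacts) / (a0 + b0 + current_calls)
print(f"Prior-months rate:  {prior_rate:.3f}")
print(f"Early-current rate: {current_contacts / current_calls:.3f}")
print(f"Blended estimate:   {blended:.3f}")
```

Choosing the prior's effective size is exactly the "right set of prior data" problem: too much weight and the estimate is stuck near 9.4%, too little and the early, selective current-month data dominate.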

Are we really trying to maximize response rates?

I sometimes speculate that we may be in a situation where the following is true:
1. Our goal is to maximize the response rate.
2. We research methods to do this.
3. We design surveys based on this research.
Of course, the real world is never so "pure." I'm sure there must be departures from this all the time. Still, I wonder what the consequences of maximizing (or minimizing) something else would be. Could research on increasing response still be useful under a new guiding indicator? I think that in order for older research to be useful under a new guiding indicator, the information about response has to be linked to some kind of subgroups in the sample. Indicators other than the response rate would place different values on each case (the response rate places the same value on each case). So for methods to be useful in a new world governed by some other indicator, those methods would have to be useful for targeting some cases. On the simplest level, we don't want the average effect on…

Tracking Again...

I'm still thinking about this one, and I had an additional thought. It is possible to predict which cases are likely to be difficult to locate. Couper and Oftedal have an interesting chapter on the topic in the book Methodology of Longitudinal Surveys. I also recall that the NSFG Cycle 5 documentation had a model for predicting the probability of locating someone. Given that information, it should be easy to stratify samples for differential effort. For instance, it might be better to use expensive effort early on some cases that are expected to be difficult, if this saves on the early inexpensive steps. The money saved might be trivial. But the time could be important. If you find them more quickly, perhaps you can more easily interview them.

Tracking Experiment

I've been blogging about the dearth of experiments on methods for tracking. Such experiments can be hard to do when there are big differences in costs and effectiveness among steps. But when at least some steps are close in cost, it's more difficult to assume that one order is better than another. I liked the paper by Koo and colleagues since it actually experimented with which service to use for searching and found a specific order that worked better. I'm now working on a project that uses tracking. We decided to use different orderings of the steps with different groups. Not a perfect experiment, but the groups are relatively homogeneous, so it won't be a huge leap to infer from the results to a broader population. We'll have some results... in a few months.

Tracking, Again

So I finished reading a large number of studies on tracking. One thing that I noticed: there is a general assumption that you should start with cheaper methods and go to more expensive ones. But that might not always be right. For instance, what if a cheap method almost never returns a result, while something more expensive produces more leads? I could imagine skipping the cheap step, or putting it after the expensive step. In any event, it is really a sequence of steps that needs to be optimized. Doing this involves both the costs and the expected returns. But since each of those is only known conditional on whatever was done prior to the current step, we need experiments that vary the order of the steps to find out what the optimal sequence is going to be.
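To make "optimize the sequence" concrete, here is a toy sketch that computes the expected cost per case of each ordering of three tracking steps, assuming (unrealistically) that each step's hit rate does not depend on the steps tried before it. The costs and hit rates are invented; real hit rates are conditional on earlier steps, which is exactly why order experiments are needed.

```python
# Sketch: expected cost per case of different orderings of tracking steps,
# assuming each step's hit rate is independent of what was tried before.
# Costs and hit rates are invented for illustration.
from itertools import permutations

steps = {                      # cost per attempt, probability of locating
    "internet_search": (1.0, 0.05),
    "database_vendor": (5.0, 0.30),
    "field_visit":     (40.0, 0.60),
}

def expected_cost(order):
    """Expected cost per case: each step is paid only for still-unlocated cases."""
    cost, p_not_found = 0.0, 1.0
    for name in order:
        step_cost, hit_rate = steps[name]
        cost += p_not_found * step_cost
        p_not_found *= (1 - hit_rate)
    return cost, 1 - p_not_found            # expected cost, overall locate rate

for order in permutations(steps):
    cost, located = expected_cost(order)
    print(f"{' -> '.join(order):55s} cost {cost:6.2f}, located {located:.2f}")
```

Under the independence assumption, every order locates the same share of cases, so only the cost differs; once hit rates depend on what came before, both the cost and the locate rate can change with the order.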

Tracking Research: A Lack of Experimental Studies

I've been reading a number of papers on tracking (aka tracing or locating) of panel members in longitudinal research. Many of the papers are case studies, reporting on what particular studies did. Very few actually conduct experiments. Survey methodologists have produced a few recent experimental papers. Research on HRS showed that higher incentives had persistent effects on response at later waves. McGonagle and colleagues looked at the effects of between-wave contact methods and incentives. Fumagelli and colleagues also explore between-wave contact methods. These experiments all involve contacting panel members. I found one interesting paper that actually experimented with the order of the steps in the tracking process. Usually, the order starts with the cheapest things to do and goes to the more expensive. If steps have a similar cost, then you just choose an order. This paper by Koo et al actually randomized the order of the steps (two different websites). I haven't seen…

Tracking Costs

As I mentioned in my last post, I have been reading an enormous number of papers on locating respondents in panel studies. One interesting thing that I have found is that tracking costs are often described in a manner different than I would have expected. I'm used to thinking of the costs of activities -- telephone calls, internet searches, face-to-face calls, etc. These activity costs can be summed up to total costs, and then averaged over the number of cases located or the number of cases interviewed. Instead, I found a lot of papers that reported costs as FTEs. This seemed a lot simpler. I found one review paper that summarized several other studies and reported all the results as FTEs. This was nifty in that it was simple, and somewhat impervious to inflation and differences in pay rates -- so better than reporting dollar costs. The downside is that the costs can't be rescaled when there are differences among panels in the difficulty of being tracked. Some are more difficult and require more…