Posts

What is Data Quality, and How to Enhance it in Research

  We often talk about “data quality” or “data integrity” when discussing the collection or analysis of one type of data or another. Yet the definitions of these terms can be unclear, or they may vary across contexts. In any event, the terms are somewhat abstract, which can make data quality difficult, in practice, to improve. That is, we need to know what we are describing with these terms before we can improve it. Over the last two years, we have been developing a course on Total Data Quality, soon to be available on Coursera. We start from an error classification scheme adopted by survey methodology many years ago. Known as the “Total Survey Error” perspective, it focuses on the classification of errors into measurement and representation dimensions. One goal of our course is to extend this classification scheme from survey data to other types of data. The figure shows the classification scheme as we have modified it to include both survey data and organic forms of data.
Recent posts

Total Data Quality Update

 We have been working hard on applying the Total Survey Error (TSE) concept to hybrid data sources, that is, data sources that combine designed and gathered data. We use the term "designed" for data that are collected with analysis in mind. Gathered data, on the other hand, are not collected for analysis. We find ourselves relying more and more on multiple sources of data, and we wanted to bring our quality perspective to those problems. It feels to me like our survey experience with quality assessment is highly relevant both for hybrid data situations and for gathered data on its own. TSE gives us a way to think through the issues. We have been offering a series of webinars on the topic for the last few summers. We are working toward a larger course. More on that topic soon...

Total Data Quality

In an earlier post, I suggested that survey methodologists are "data quality specialists." Our focus on "total survey error" (TSE) is, in many ways, the central defining concept of our field. This focus on data quality could be an important contribution that survey methodologists make to the emerging field of data science. But in order to make that contribution, we may need to test the fit of the TSE concept on evaluations of non-survey data. One of the sources of error that we examine in surveys is "nonresponse." Does this concept apply to other sources of data? Certainly, other sources of data have missing data. But nonresponse is a specific mechanism: we sample a unit and then request data, but the unit fails to supply the data. How does this concept apply to other sources of data? I wouldn't say that Twitter data suffer from "nonresponse" due to the fact that not everyone has a Twitter account or even that not every

Predictions of Nonresponse Bias

One issue that we have been discussing is indicators of the risk of nonresponse bias. Some indicators use observed information (i.e., largely sampling frame data) to determine whether respondents and nonrespondents are similar. The R-Indicator is an example of this type of indicator; it's not the only one, and there are several other sample balance indicators. The implicit model is that the observed characteristics are related to the survey data, so controlling for them will also control the potential for nonresponse bias. Another type of indicator uses the observed data, including the observed survey data, and a model to fill in the missing survey data. The goal here is to predict whether nonresponse bias is likely to occur; here, the model is explicit. An issue that affects both of these approaches is that if you are able to predict the survey variables with the sampling frame data, then why bother addressing imbalances on them during data collection? One answ
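To make the first type of indicator concrete, here is a minimal sketch, not from the post, of an R-indicator-style balance measure: given response propensities estimated from sampling frame data (e.g., by a logistic regression of a response indicator on frame variables), the indicator is one minus twice the standard deviation of the propensities, so that a perfectly balanced response, with constant propensities, scores 1. The propensity values below are made up for illustration.

```python
import numpy as np

# Hypothetical estimated response propensities for eight sampled units,
# e.g. from a logistic regression of response on sampling frame variables.
rho_hat = np.array([0.45, 0.52, 0.60, 0.48, 0.55, 0.50, 0.47, 0.53])

def r_indicator(propensities):
    """R-indicator: 1 - 2 * sd(propensities).

    Equals 1 when every unit has the same response propensity
    (a perfectly balanced response) and shrinks toward 0 as the
    propensities spread out.
    """
    return 1 - 2 * np.std(propensities, ddof=1)

print(round(r_indicator(rho_hat), 3))  # close to 1: fairly balanced
```

In practice the propensities are themselves estimates, so the indicator can only detect imbalance on the frame variables used in the model, which is exactly the limitation discussed above.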

Data Quality Specialists

I have been talking to undergraduates about survey methodology. The students I talk to have learned either some social research methods or some statistics. I think that many are interested in data science and/or big data. From these conversations, I found it was useful to describe survey methodologists as "data quality specialists." Survey methodology is not a field that most undergraduates are even aware of. But when I started talking about how we evaluate the quality of data, I could see ears perking up. It reinforced for me the idea that the Total Survey Error perspective is valuable for Big Data. We can talk about nonresponse and measurement error in a coherent way. Raising questions about the quality of the data, the need to understand the processes that generated those data, and methods for evaluating the data were all ideas that seemed to resonate with undergraduates... well, at least some. It was energizing and exciting to speak with them. Hopefully they bring that

Surveys and Other Sources of Data

Linking surveys and other sources of data is not a new idea; it has been around for a long time. It's useful in many situations, for example when respondents would have a difficult time supplying the information (such as exact income). Much of the previous research on linkage has focused either on the ability to link data, possibly in a probabilistic fashion, or on the biases associated with the willingness to consent to linkage. It seems that new questions are emerging with the pervasiveness of data generated by devices, especially smartphones. I read an interesting article by Melanie Revilla and colleagues about trying to collect data from a tracking application that people install on their devices. They examine how the "meter," as they call the application, might incompletely cover the sample. For example, persons might have multiple devices and only install it on some of them. Or, persons might share devices and no
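As an illustration of the probabilistic-linkage idea mentioned above, here is a minimal sketch of Fellegi-Sunter-style match weights, which underlie much probabilistic record linkage. The linking fields and the m- and u-probabilities (agreement rates among true matches and among non-matches) are hypothetical; a real application would estimate them from the data and choose classification thresholds.

```python
import math

# Hypothetical m-probabilities (P(fields agree | true match)) and
# u-probabilities (P(fields agree | non-match)) for each linking field.
FIELDS = {
    "surname":    {"m": 0.95, "u": 0.01},
    "birth_year": {"m": 0.98, "u": 0.05},
    "zip_code":   {"m": 0.90, "u": 0.02},
}

def match_weight(agreements):
    """Fellegi-Sunter log2 likelihood-ratio weight for a record pair.

    `agreements` maps field name -> True/False (do the two records
    agree on that field?). Large positive totals favor classifying
    the pair as a link; large negative totals favor a non-link.
    """
    total = 0.0
    for field, p in FIELDS.items():
        if agreements[field]:
            total += math.log2(p["m"] / p["u"])       # agreement weight
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight
    return total

# A pair agreeing on all three fields gets a strongly positive weight
print(round(match_weight(
    {"surname": True, "birth_year": True, "zip_code": True}), 2))
```

The consent and coverage questions in the post sit upstream of this step: even a well-calibrated linkage model can only work with the records people allow to be linked.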

Survey Modes and Recruitment

I've been struggling with the concept of "mode preference." It's a term we use to describe the idea that respondents might have preferences for a mode, and that if we can identify or predict those preferences, then we can design a better survey (i.e., by giving people their preferred mode). In practice, I worry that people don't actually prefer modes. If you ask people what mode they might prefer, they usually say the mode in which the question is asked. In other settings, responses to that sort of question are only weakly predictive of actual behavior. I'm not sure the distinction between stated and revealed preferences is going to advance the discussion much either. The problem is that the language builds in an assumption that people actually have a preference. Most people don't think about survey modes. Most don't consider modes abstractly in the way methodologists might. In fact, these choices are likely probabilistic functions that hinge on