We often talk about “data quality” or “data integrity” when discussing the collection or analysis of one type of data or another. Yet the definitions of these terms may be unclear, or may vary across contexts. In any event, the terms are somewhat abstract, which can make data quality difficult to improve in practice. That is, we need to know what these terms describe before we can improve it. Over the last two years, we have been developing a course on Total Data Quality, soon to be available on Coursera. We start from an error classification scheme adopted by survey methodology many years ago. Known as the “Total Survey Error” perspective, it classifies errors along measurement and representation dimensions. One goal of our course is to expand this classification scheme from survey data to other types of data. The figure shows the classification scheme as we have modified it to include both survey data and organic forms of d
We have been working hard on applying the Total Survey Error (TSE) concept to hybrid data sources, that is, sources that include both designed and gathered data. We use the term "designed" for data that are designed for analysis. Gathered data, on the other hand, are not designed for analysis. We find ourselves relying more and more on multiple sources of data, and we want to bring our quality perspective to those problems. It feels to me like our survey experience with quality assessment is highly relevant both for hybrid data situations and for gathered data. TSE gives us a way to think through the issues. We have been offering a series of webinars on the topic for the last few summers. We are working toward a larger course. More on that topic soon...