I mentioned the R-Indicators in a recent post. In addition to the article in Survey Methodology, the authors also maintain a very useful website. The website includes a number of papers and presentations on the topic.
I heard another interesting episode of the Data Skeptic podcast. They were discussing how a classifier could be assessed (episode 121). Many machine learning models are so complex that a human being can't really interpret the meaning of the model. This can lead to problems. They gave an example involving a bunch of posts from two discussion boards. One board was atheist and the other was composed of Christians. They tried to classify each post as coming from one board or the other. There was one poster who posted heavily on the Christian board. His name was Keith. Sadly, the model learned that if the person posting was named Keith, then they were Christian. The problem is that this isn't very useful for prediction. It's an artifact of the input data. Even cross-validation wouldn't eliminate this problem, since the same poster shows up in every fold. A human being can see the issue, but a model can't. In any event, the proposed solution was to build interpretable models in local areas of the data.
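Here is a minimal sketch of how that kind of artifact can be made visible to a human. It is my own illustration, not from the episode: it assumes scikit-learn and uses the 20 newsgroups atheism/Christianity categories as a stand-in for the two boards. Fitting a simple bag-of-words classifier and then inspecting the largest coefficients tends to surface tokens like poster names and e-mail fragments, even when headline accuracy looks fine.

```python
# Sketch only: train a simple text classifier on two newsgroups and look at
# which tokens carry the most weight.  Names and header fragments often rank
# near the top, even though they say nothing about the content of a post.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

cats = ["alt.atheism", "soc.religion.christian"]   # stand-ins for the two boards
train = fetch_20newsgroups(subset="train", categories=cats)  # headers left in on purpose

vec = TfidfVectorizer()
X = vec.fit_transform(train.data)
clf = LogisticRegression(max_iter=1000).fit(X, train.target)

# Inspect the most influential tokens for the fitted model.
terms = np.array(vec.get_feature_names_out())
weights = clf.coef_[0]
top = np.argsort(np.abs(weights))[::-1][:20]
for term, w in zip(terms[top], weights[top]):
    print(f"{term:20s} {w:+.3f}")
```

A human scanning that list can spot the "Keith problem" immediately; the model, left to itself, just keeps using the feature.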
In some respects, the R-indicator runs counter to your fraction of missing information. For example, more auxiliary data can lead to a lower R-indicator (bad) and a lower fraction of missing information (good). Do you see a middle-ground approach, despite how different the two are in theory, assumptions, and implementation?
I can only say something very general on this question at the moment.
Both of these measures are model-dependent; that is, they are only as good as the model chosen for their implementation. As a result, we'll need multiple views of the problem. In addition, we'll be uncertain about the results produced by either statistic, so actions based on those statistics may need to be tempered in some fashion. If multiple statistics converge to the same solution, we should probably feel pretty good about that solution.
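To make the tension concrete, here is a toy sketch. It is my own illustration with made-up propensities rather than a fitted model, and it leans on the R-indicator definition R = 1 - 2·S(ρ), where S(ρ) is the standard deviation of the response propensities: a richer propensity model spreads the estimated propensities out more, which pushes the R-indicator down, even though the same auxiliary data can reduce the fraction of missing information by improving the adjustment.

```python
# Toy illustration: the R-indicator falls as the propensity model gets richer.
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 10_000))        # two auxiliary variables
true_logit = -0.2 + 0.8 * x1 + 0.6 * x2      # "true" response propensity model

def r_indicator(propensities):
    # R(rho) = 1 - 2 * SD(rho)
    return 1.0 - 2.0 * propensities.std()

# Propensities implied by a model that uses only x1 vs. one that uses both
# (taken here as the corresponding logistic means rather than an actual fit).
p_small = 1.0 / (1.0 + np.exp(-(-0.2 + 0.8 * x1)))
p_full = 1.0 / (1.0 + np.exp(-true_logit))

print(r_indicator(p_small))   # higher R: less detected variation in propensities
print(r_indicator(p_full))    # lower R: richer model reveals more variation
```

The point isn't the particular numbers; it's that both statistics move with the model we feed them, which is why no single one of them should drive decisions on its own.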
I hope to address these questions. I'm working on it.