Statistical Perspectives on Data Mining

Note that although Hand discusses data mining in general, for this class we should think about educational data mining (EDM) specifically.

Core Questions

How is the view of validity in data mining, as represented by Hand, similar to and different from the viewpoints on validity previously studied in this class? (Note that Hand is snarkier about data mining than many statisticians, but more positive about its potential than many others.)

Is statistical significance still meaningful in very large data sets? If so, when? If not, what should substitute for it?

Hand et al. suggest that the problem of statistical significance not being useful in very large data sets could be addressed by simply sampling from the data set to reduce its size. Do you like this solution?
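As a starting point for the two questions above on significance and sample size, here is a minimal sketch of how a p-value depends on the number of observations when the effect itself is held fixed. It is not taken from Hand; the function name, the 0.01 standard deviation difference, and the equal-group, known-variance z-test are all assumptions chosen for illustration.

```python
import math


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized
    groups, using a normal approximation with a known, shared standard
    deviation (a simplifying assumption for this sketch)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical values: a difference of 0.01 standard deviations, which would
# rarely matter in practice, evaluated at increasing sample sizes.
MEAN_DIFF = 0.01
SD = 1.0

for n in (1_000, 100_000, 10_000_000):
    p = two_sample_z_p_value(MEAN_DIFF, SD, n)
    print(f"n per group = {n:>10,}   p = {p:.3g}")
```

The effect size is identical in every row; only n changes. Under these assumptions, sampling the data down moves the p-value back above conventional thresholds without changing what the data say about the size of the difference, which may be worth keeping in mind when evaluating the sampling proposal.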

Hand claims that selection bias is a particularly big problem for data used in data mining. Is this true? And if so, what are the consequences and how could they be addressed or mitigated?

How can over-fitting and the finding of spurious patterns be reduced, in Hand's (1998) view? Which of his preferred approaches do you find most useful? Are there other, superior alternatives?

Hand et al. claim that data mining can find common patterns, but that their value and meaningfulness can only be determined by a domain expert. Do you agree?

Secondary Questions

Hand argues that clean data cannot be expected for very large data sets, whereas Romero argues for the value of cleaning data prior to data mining. Under what circumstances is data cleaning important?

Hand et al. claim that any pattern which cannot be explained should be treated as suspect. What are the benefits and drawbacks of this perspective?

Hand (1998) states that "Statistics as a discipline has a poor record for timely recognition of important ideas... statisticians have later made very significant advances in all of these fields, but the fact that the perceived natural home of these areas lies not in statistics but in other areas is demonstrated by the key journals for these areas -- they are not statistical journals. Data mining seems to be following this pattern." Is Hand's prophecy accurate for EDM/LAK? If so, why might this be?

Hand claims that data mining is "almost by definition" concerned with atheoretical, purely empirical models, rather than models based on theory. Does this match the perspectives previously discussed in this class?