Tuesday, October 25, 2011

ART and the Curse of Dimensionality

I will begin my discussion of ART ("a rule of three," as discussed in a post last week) with a look at the statistical problem that motivates it.

In this recent presentation at Texas A&M, Achen focused on two motivations for ART: [1] the frailty of high-dimensional models, and [2] strong linearity assumptions. I may return to the linearity problem later (I am a little less worried about it than Chris Achen seems to be), but I want to focus on dimensionality for a moment.

The "curse of dimensionality" has been the subject of some discussion within social science models -- but the development of more complicated statistical models has continued unabated.  The core idea is that finding the maximum of a high dimensional space is hard -- sometimes very hard.  In the context of social science models, parameter estimates of high dimensional models may be subject to a great deal of influence or leverage.  For example, you may have a large sample overall but have very few Hispanic respondents.  The typical strategy of including a dummy variable for Hispanic ethnicity assumes that Hispanic respondent vary only in the intercept (and respond similarly to non-Hispanic respondents in regards to all other variables).  The alternative approach (in something akin to a hierarchical model - with ethnicity as a level) is to allow both the intercept and other slopes vary by ethnicity -- but this places remarkable demands on the sample of data.  Achen's concern is that statistical models will almost always give us some answer.  Relying on variation within a small sub-sample (say, Hispanic respondents in a particular management survey), we will get a coefficient but that coefficient may be unreliable.

Some of this problem is identifiable through careful assessment of leverage diagnostics. If you look at Cook's D values and the like, you can diagnose situations where small sub-sample sizes within your study create fragile coefficients prone to leverage from a small number of observations. Even this becomes difficult with non-continuous independent and dependent variables, though (but see this paper for an interesting strategy).
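As a sketch of what that diagnostic pass might look like (again Python and statsmodels, with simulated data standing in for a real survey), Cook's D is available from a fitted model's influence object:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated data: 200 respondents, only 5 in the small subgroup
n = 200
x = rng.normal(size=n)
group = np.zeros(n)
group[:5] = 1
y = x + 2.0 * group + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x, group, x * group]))
fit = sm.OLS(y, X).fit()

# Cook's distance per observation; a common rule of thumb treats
# values above 4/n as worth a closer look.
cooks_d, _ = fit.get_influence().cooks_distance
flagged = np.where(cooks_d > 4.0 / n)[0]
print("flagged observations:", flagged)
print("flagged observations in the small subgroup:", int(group[flagged].sum()))
```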

Achen argues that we can avoid this situation entirely by more carefully selecting a sample. If we want to look at the behavior of Hispanic administrators, we can select a sample of Hispanic administrators. If we think ethnicity matters but don't have enough Hispanic respondents to ensure stable, reliable parameter estimates, we are better off constructing a sample without variation in ethnicity (by omitting Hispanic respondents) and then noting that the sample is homogeneous in terms of ethnicity.
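In code terms, the design choice reads something like the following sketch (Python with pandas and statsmodels; the file survey.csv and the variables outcome, workload, and ethnicity are all hypothetical), trading a statistical control for a scope restriction:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical survey file and variable names, for illustration only
df = pd.read_csv("survey.csv")

# Strategy 1: control for ethnicity statistically. Fragile when one
# category contains only a handful of respondents.
controlled = smf.ols("outcome ~ workload + C(ethnicity)", data=df).fit()

# Strategy 2 (Achen's suggestion): hold ethnicity constant by design.
# Restrict the sample, drop the control, and scope any claims to the
# homogeneous sample that remains.
homogeneous = df[df["ethnicity"] == "non-Hispanic"]
restricted = smf.ols("outcome ~ workload", data=homogeneous).fit()

print(controlled.summary())
print(restricted.summary())
```

The restricted model estimates fewer parameters from the same data, which is Achen's point; the cost is that it says nothing about the respondents it excludes.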

I will return to the subject of sampling as a solution, but I will leave you with two thoughts: [1] Is this approach useful outside the NES world of tens of thousands of observations, from which one can carve homogeneous sub-samples? [2] Will this ghettoize the study of racial and ethnic minorities as "normal" research proceeds to test hypotheses with racially and ethnically homogeneous samples (read: white, male, moderately educated non-southerners)?

In the short term, are you concerned that large models in public management (those containing many independent variables) are fragile? Are you intrigued by the strategy of selecting a sample that is uniform with respect to the variables you would otherwise control for, rather than introducing control variables?
