Tuesday, October 25, 2011

ART and the Curse of Dimensionality

I will begin my discussion of ART ("a rule of three," as discussed in a post last week) with a look at the statistical problem that motivates it.

In this recent presentation at Texas A&M, Achen focused on two motivations for ART: [1] the frailty of high-dimensional models, and [2] strong linearity assumptions. I may return to the linearity problem later (I am a little less worried about it than Chris Achen seems to be), but I want to focus on dimensionality for a moment.

The "curse of dimensionality" has been the subject of some discussion within social science models -- but the development of more complicated statistical models has continued unabated.  The core idea is that finding the maximum of a high dimensional space is hard -- sometimes very hard.  In the context of social science models, parameter estimates of high dimensional models may be subject to a great deal of influence or leverage.  For example, you may have a large sample overall but have very few Hispanic respondents.  The typical strategy of including a dummy variable for Hispanic ethnicity assumes that Hispanic respondent vary only in the intercept (and respond similarly to non-Hispanic respondents in regards to all other variables).  The alternative approach (in something akin to a hierarchical model - with ethnicity as a level) is to allow both the intercept and other slopes vary by ethnicity -- but this places remarkable demands on the sample of data.  Achen's concern is that statistical models will almost always give us some answer.  Relying on variation within a small sub-sample (say, Hispanic respondents in a particular management survey), we will get a coefficient but that coefficient may be unreliable.

Some of this problem is identifiable through careful assessment of leverage diagnostics. If you look at Cook's D values and the like, you can diagnose situations where small sub-sample sizes within your study create fragile coefficients prone to leverage from a small number of observations. Even this becomes difficult with non-continuous independent and dependent variables, though (but see this paper for an interesting strategy).
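As a sketch of what that diagnostic pass might look like (again Python and statsmodels, with simulated data standing in for a real survey), Cook's D is available from a fitted model's influence object:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated data: 200 respondents, only 5 in the small subgroup
n = 200
x = rng.normal(size=n)
group = np.zeros(n)
group[:5] = 1
y = x + 2.0 * group + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x, group, x * group]))
fit = sm.OLS(y, X).fit()

# Cook's distance per observation; a common rule of thumb treats
# values above 4/n as worth a closer look.
cooks_d, _ = fit.get_influence().cooks_distance
flagged = np.where(cooks_d > 4.0 / n)[0]
print("flagged observations:", flagged)
print("flagged observations in the small subgroup:", int(group[flagged].sum()))
```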

Achen argues that we can avoid this situation entirely by more carefully selecting a sample. If we want to look at the behavior of Hispanic administrators, we can select a sample of Hispanic administrators. If we think ethnicity matters but don't have enough Hispanic respondents to ensure stable, reliable parameter estimates, we are better off constructing a sample without variation in ethnicity (by omitting Hispanic respondents) and then noting that the sample is homogeneous in terms of ethnicity.
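In code terms, the design choice reads something like the following sketch (Python with pandas and statsmodels; the file survey.csv and the variables outcome, workload, and ethnicity are all hypothetical), trading a statistical control for a scope restriction:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical survey file and variable names, for illustration only
df = pd.read_csv("survey.csv")

# Strategy 1: control for ethnicity statistically. Fragile when one
# category contains only a handful of respondents.
controlled = smf.ols("outcome ~ workload + C(ethnicity)", data=df).fit()

# Strategy 2 (Achen's suggestion): hold ethnicity constant by design.
# Restrict the sample, drop the control, and scope any claims to the
# homogeneous sample that remains.
homogeneous = df[df["ethnicity"] == "non-Hispanic"]
restricted = smf.ols("outcome ~ workload", data=homogeneous).fit()

print(controlled.summary())
print(restricted.summary())
```

The restricted model estimates fewer parameters from the same data, which is Achen's point; the cost is that it says nothing about the respondents it excludes.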

I will return to the subject of sampling as a solution, but I will leave you with two thoughts: [1] Is this approach useful outside the NES world of tens of thousands of observations, from which one can carve homogeneous sub-samples? [2] Will this ghettoize the study of racial and ethnic minorities as "normal" research proceeds to test hypotheses with racially and ethnically homogeneous samples (read: white, male, moderately educated non-southerners)?

In the short term, are you concerned that large models in public management (those containing many independent variables) are fragile? Are you intrigued by the strategy of selecting a sample that is uniform with respect to the variables you would otherwise control for, rather than introducing control variables?
