I will begin my discussion of ART ("a rule of three," as discussed in a post last week) with a look at the statistical problem that motivates it.
In this recent presentation at Texas A&M, Achen focused on two motivations for ART: [1] the frailty of high-dimensional models, and [2] strong linearity assumptions. I may return to the linearity problem later (I am a little less worried about it than Chris Achen seems to be), but I want to focus on dimensionality for a moment.
The "curse of dimensionality" has been the subject of some discussion within social science models -- but the development of more complicated statistical models has continued unabated. The core idea is that finding the maximum of a high dimensional space is hard -- sometimes very hard. In the context of social science models, parameter estimates of high dimensional models may be subject to a great deal of influence or leverage. For example, you may have a large sample overall but have very few Hispanic respondents. The typical strategy of including a dummy variable for Hispanic ethnicity assumes that Hispanic respondent vary only in the intercept (and respond similarly to non-Hispanic respondents in regards to all other variables). The alternative approach (in something akin to a hierarchical model - with ethnicity as a level) is to allow both the intercept and other slopes vary by ethnicity -- but this places remarkable demands on the sample of data. Achen's concern is that statistical models will almost always give us some answer. Relying on variation within a small sub-sample (say, Hispanic respondents in a particular management survey), we will get a coefficient but that coefficient may be unreliable.
Some of this problem is identifiable through careful assessment of leverage diagnostics. If you look at Cook's D values and the like, you can diagnose situations where small sub-sample sizes within your study create fragile coefficients prone to leverage from a small number of observations. Even this becomes difficult with non-continuous independent and dependent variables, though (but see this paper for an interesting strategy).
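As a rough illustration, and continuing the hypothetical simulation above (so m2 and df refer to the interaction model and data frame from that sketch), Cook's D values are easy to pull out of a fitted statsmodels result; the 4/n cutoff used below is just a common rule of thumb, not a hard test.

```python
# Leverage/influence diagnostics for the interaction model in the sketch above.
import numpy as np

infl = m2.get_influence()
cooks_d = infl.cooks_distance[0]         # one Cook's D value per observation

# Flag observations exceeding the common 4/n rule-of-thumb cutoff.
flagged = np.where(cooks_d > 4 / len(df))[0]
print(f"{len(flagged)} influential observations; "
      f"{int(df.loc[flagged, 'hispanic'].sum())} of them are Hispanic respondents")
```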
Achen argues that we can avoid this situation entirely by selecting a sample more carefully. If we want to look at the behavior of Hispanic administrators, we can select a sample of Hispanic administrators. If we think ethnicity matters but we don't have enough Hispanic respondents to ensure stable, reliable parameter estimates, we are better off constructing a sample without variation in ethnicity (by omitting Hispanic respondents) and then stating explicitly that the sample is homogeneous in terms of ethnicity.
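In code terms (still the same hypothetical data from the sketches above), the sampling strategy amounts to subsetting before estimation rather than adding a control:

```python
# Sampling instead of controlling: restrict the sample so ethnicity does not
# vary, then drop the dummy entirely. Continues the hypothetical df above.
homog = df[df["hispanic"] == 0]          # ethnically homogeneous sub-sample
m3 = smf.ols("satisfaction ~ budget_support", data=homog).fit()
print(m3.summary().tables[1])
# Any findings are then explicitly scoped to non-Hispanic respondents only.
```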
I will return to the subject of sampling as a solution, but I will leave you with two questions: [1] Is this approach useful outside the NES world of tens of thousands of observations, from which one can construct homogeneous sub-samples? [2] Will this ghettoize the study of race and ethnic minorities as "normal" research proceeds to test hypotheses with racially and ethnically homogeneous samples (read "white, male, moderately educated non-southerners")?
In the short term, are you concerned that large models in public management (those containing many independent variables) are fragile? Are you intrigued by the strategy of selecting a sample that is uniform with respect to the variables you would otherwise control for, rather than introducing control variables?