Yacila

Reputation: 13

Am I applying weights indiscriminately to my survey data?

I have a large dataset (N=12000) from a survey. I am using weights in my regressions because this sample is the part of the whole eligible sample that gave blood for analysis. My results made sense until I started analyzing subgroups, e.g., respondents with genetic markers within the sample. I suspect this is because I am still weighting the regression when I shouldn't. My thinking is that, since the genetic marker defines a subgroup, that subgroup is already a sample representing the population and the weights are introducing noise. I have looked for reasonable sources and explanations, but so far I haven't found anything. Maybe you can help me.

Upvotes: 0

Views: 699

Answers (1)

Nico Hambauer

Reputation: 1

Let me try to help a bit here, to the best of my knowledge. I am most familiar with Interpretable Machine Learning, especially Generalised Additive Models, so I have read a bit of "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman.

Survey weights are generally used to correct for selection bias and make the survey results more representative of the population. The weights are usually derived based on the design of the survey and the probability of each individual being selected in the survey. They allow you to extrapolate findings from the sample to the population that the sample is supposed to represent.
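To make the idea of design weights concrete, here is a minimal sketch using inverse selection probabilities. The values in `y` and `p_select` are made up for illustration and are not from your survey.

```python
import numpy as np

# Hypothetical sample: outcome values and each respondent's
# probability of having been selected into the sample.
y = np.array([10.0, 12.0, 20.0, 22.0, 21.0, 19.0])
p_select = np.array([0.1, 0.1, 0.5, 0.5, 0.5, 0.5])

# Design weight = inverse selection probability: each respondent
# "stands in" for 1 / p people in the population.
w = 1.0 / p_select

unweighted_mean = y.mean()
weighted_mean = np.average(y, weights=w)

# The weights sum to an estimate of the population size.
estimated_population_size = w.sum()

print(unweighted_mean, weighted_mean, estimated_population_size)
```

The first two respondents were hard to reach (low selection probability), so they get large weights and pull the weighted mean towards their values.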

When you subset the data (e.g., by selecting only respondents with certain genetic markers), the original weights may no longer be appropriate because the subgroup may not represent the population in the same way the overall sample does.

In your case, if the genetic marker is not related to the likelihood of being selected into the sample (i.e., it does not affect the survey design), then the original weights can still be used when analyzing this subgroup. This is because, from the perspective of the survey design, this subgroup is just a random subset of the overall sample.

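However, if the genetic marker is related to the likelihood of being selected into the sample, you might need to adjust the weights. One possibility is to renormalize the weights so that they sum to 1 within this subgroup, which treats the subgroup as a new population under the assumption that the survey design is the same within it. Note that rescaling all subgroup weights by one constant leaves the weighted regression coefficients unchanged, so this step alone does not correct for selection that is related to the marker; that would require changing the weights relative to each other.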
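It may help to check what normalizing the weights within a subgroup actually does to a weighted regression. The sketch below uses simulated data and a hypothetical helper `wls_coefs` (plain NumPy, not a statsmodels function); `in_subgroup` is a made-up marker indicator.

```python
import numpy as np

def wls_coefs(X, y, w):
    """Solve the weighted normal equations (X' W X) b = X' W y."""
    XtW = X.T * w  # scales each observation's column by its weight
    return np.linalg.solve(XtW @ X, XtW @ y)

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])        # intercept + one predictor
y = 1.0 + 2.0 * x + rng.normal(size=n)
w = rng.uniform(0.5, 5.0, size=n)           # simulated survey weights
in_subgroup = rng.random(n) < 0.3           # hypothetical marker indicator

# Subset rows and their original weights together.
Xs, ys, ws = X[in_subgroup], y[in_subgroup], w[in_subgroup]

b_original = wls_coefs(Xs, ys, ws)
b_normalized = wls_coefs(Xs, ys, ws / ws.sum())  # weights now sum to 1

coefs_match = np.allclose(b_original, b_normalized)
print(b_original, b_normalized, coefs_match)
```

Dividing every weight by the same constant cancels out of the normal equations, so the coefficients are identical. Only a change in the relative weights would move the estimates.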

Furthermore, the weights could indeed make your estimates noisier. Unequal weights reduce the effective sample size, so weighted estimates typically have larger standard errors than unweighted ones. In a small subgroup this can lead to wider confidence intervals and potentially non-significant results.
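One rough way to see how much precision the weights cost is Kish's approximate effective sample size, (sum of weights)^2 / (sum of squared weights). This is only an approximation, and `kish_effective_n` is a name I made up for this sketch.

```python
import numpy as np

def kish_effective_n(w):
    """Kish's approximate effective sample size for a set of weights."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Equal weights: no loss, the effective n equals the actual n.
equal = np.ones(100)

# Very unequal weights (made-up values): half the sample counts 9x as much.
unequal = np.concatenate([np.full(50, 1.0), np.full(50, 9.0)])

n_eff_equal = kish_effective_n(equal)
n_eff_unequal = kish_effective_n(unequal)

print(n_eff_equal, n_eff_unequal)
```

Running this on the weights of your marker subgroup would give a feel for how many "equally weighted" respondents the subgroup is worth.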

For more specific guidance on weighted regression and survey sampling, "Sampling: Design and Analysis" by Sharon L. Lohr and "Complex Surveys: A Guide to Analysis Using R" by Thomas Lumley are good options.

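Are you currently coding it in Python using statsmodels? The WLS class takes a y and an X input, plus a weights argument, which is an array-like object with one weight per observation. Keep in mind that WLS treats the weights as inverse-variance weights, so the standard errors it reports are not design-based survey standard errors.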

import statsmodels.api as sm

# y (outcome), X (design matrix) and weights (survey weights) are assumed to be defined already
model = sm.WLS(y, X, weights=weights)
results = model.fit()
print(results.summary())

Upvotes: 0
