Laura Bas

Reputation: 41

Categorical variables in linear regression: one level with only one value, rest NA

I have a theoretical question about how R works when producing model summaries. I am fitting linear regression models in which two of my variables are categorical, each with 3 levels corresponding to genotypes. I know that only two of the levels will show in the model summary, since one level has to be the reference. However, each of these variables has a count of only 1 for one of its levels, as in:

Variable 1 levels: TT 176 counts / TC 45 counts / CC 1 count (This out of 223 individuals genotyped).

Now, this CC level usually doesn't show up in the model summary, and I'm assuming that is because, with only 1 observation, R isn't taking it into account. All I need then is a literature reference to confirm or refute my assumption. I've tried googling this in different ways and going through R's help for lm (?lm) and related topics, but either I haven't found what I'm looking for, or I have and didn't recognise it as such.

Any help would be greatly appreciated!

Upvotes: 0

Views: 1092

Answers (1)

Gregor Thomas

Reputation: 146119

Your assumption is incorrect.

The first level will be the reference level, and the default ordering is alphabetical. Because CC comes first alphabetically, it is the reference level in your model.
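To see this for yourself, here is a minimal sketch. The data are made up: the genotype counts mimic the ones in the question, while `geno`, `y`, and the random response are assumptions for illustration only.

```r
# Hypothetical data mimicking the counts in the question (TT 176, TC 45, CC 1)
set.seed(1)
geno <- factor(rep(c("TT", "TC", "CC"), times = c(176, 45, 1)))
y <- rnorm(length(geno))  # made-up response, for illustration only

levels(geno)       # default level order is alphabetical, so "CC" comes first

fit <- lm(y ~ geno)
names(coef(fit))   # terms for TC and TT only; CC is the baseline in the intercept
```

The CC observation is still used in the fit. It just has no coefficient of its own, because the intercept is the estimate for the reference level.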

It is good practice (reduces variance of other estimates) to use a relatively common value as the reference level. Thus I would suggest modifying the alphabetical default to make TT the reference level. This should be as easy as

your_data$var = relevel(your_data$var, ref = "TT")

(of course substituting whatever your data frame and variable names are).
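As a rough sketch of the effect, again with made-up data (the `y` column and its values are assumptions, not your data), you can compare the standard error of the TC coefficient before and after releveling. Note that relevel() only works on factors, so wrap a character column in factor() first.

```r
# Hypothetical data frame with the genotype counts from the question
set.seed(1)
your_data <- data.frame(
  var = factor(rep(c("TT", "TC", "CC"), times = c(176, 45, 1))),
  y   = rnorm(222)  # made-up response, for illustration only
)

fit_cc <- lm(y ~ var, data = your_data)   # default: CC is the reference

your_data$var <- relevel(your_data$var, ref = "TT")
fit_tt <- lm(y ~ var, data = your_data)   # TT is now the reference

levels(your_data$var)   # "TT" moved to the front, other levels keep their order

# Standard errors of the TC coefficient under each choice of reference
se_cc <- summary(fit_cc)$coefficients[, "Std. Error"]
se_tt <- summary(fit_tt)$coefficients[, "Std. Error"]
se_cc["varTC"]   # TC compared with the single CC observation
se_tt["varTC"]   # TC compared with the 176 TT observations
```

Both fits are the same model with the same fitted values. Only the comparisons that the coefficients represent change, and a comparison against a single observation is necessarily noisy.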

How a factor's levels are coded into model terms is called its "contrasts". ?contrasts is a good place to begin reading, and with that search term you should be able to find other docs/references as well. (There are options other than "compare everything to the reference level", but that is out of scope here.)
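For a quick look at what the default coding does, here is a small sketch. It assumes R's default options (treatment contrasts for unordered factors), and the factor `geno` is a toy example.

```r
# Toy factor with the three genotype levels (default alphabetical order)
geno <- factor(c("TT", "TC", "CC"))

# Default treatment contrasts: the reference level is the all-zero row
contrasts(geno)
#    TC TT
# CC  0  0
# TC  1  0
# TT  0  1

# After releveling, TT becomes the all-zero (reference) row instead
contrasts(relevel(geno, ref = "TT"))
```

Each column of the matrix is one dummy variable in the model, which is why a 3-level factor yields two coefficients.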

Separately, it sounds suspect to include a level at all that has only a single observation, but that is a statistical question and not a programming one (and would require more information than is in your question), so I won't address it further here.

Upvotes: 1
