Reputation: 989
I have a dataframe df with a column foo containing data of type factor:
df <- data.frame("bar" = c(1:4), "foo" = c("M", "F", "F", "M"))
When I inspect the structure with str(df$foo), I get this:
Factor w/ 3 levels "","F",..: 2 2 2 2 2 2 2 2 2 2 ..
Why does it report 3 levels when there are only 2 in my data?
Edit:
There seems to be a missing value "" that I clean up by assigning it NA.
When I call table(df$foo), it still seems to count the "missing value" level, but finds no occurrences:
  F M
0 2 2
However, when I call df$foo, I find it reports only two levels:
Levels: F M
How is it possible that table still counts the empty level, and how can I fix that behaviour?
Upvotes: 2
Views: 521
Reputation: 2289
Check whether your dataframe really has no missing values, because the str output suggests that it does have one. Try this:
# works because factor-levels are integers, internally; "" seems to be level 1
which(as.integer(df$MF) == 1)
# works if your missing value is just ""
which(df$MF == "")
You should then clean up your dataframe to properly reflect missing values. A factor will handle NA:
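# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no longer converted to factors by default
df <- data.frame("rest" = c(1:5), "sex" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)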
df$sex[which(as.integer(df$sex) == 1)] <- NA
Once you have cleaned your data, you will have to drop unused levels to avoid tabulations such as table counting occurrences of the empty level.
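If several factor columns are affected, the same clean-up can be applied to the whole data frame at once. The following is only a minimal sketch: the data frame and its column names (id, sex, grp) are made up for illustration and are not taken from your data.

```r
# Hypothetical data frame: two factor columns, each containing one empty string
df <- data.frame(
  id  = 1:5,
  sex = c("M", "F", "F", "M", ""),
  grp = c("a", "", "b", "a", "b"),
  stringsAsFactors = TRUE  # needed in R >= 4.0.0 to get factor columns
)

# Replace the empty strings with NA in every factor column
is_fac <- sapply(df, is.factor)
df[is_fac] <- lapply(df[is_fac], function(x) {
  x[x == ""] <- NA
  x
})

# The now-unused "" level is still registered at this point
levels(df$sex)

# droplevels() on a data frame drops unused levels from all of its factor columns
cleaned <- droplevels(df)
levels(cleaned$sex)
levels(cleaned$grp)
```

Calling droplevels on the data frame rather than on each column saves repeating the step, at the cost of also dropping unused levels that you may have wanted to keep in other columns.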
Observe this sequence of steps and its outputs:
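# Build a dataframe to reproduce your behaviour
# (stringsAsFactors = TRUE is required in R >= 4.0.0 to get a factor column)
> df <- data.frame("Restaurant" = c(1:5), "MF" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)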
# notice the empty level "" for the missing value
> levels(df$MF)
[1] "" "F" "M"
# notice how a tabulation counts the empty level:
# it is the first column, with a count of 1 (its header is blank
# because the level's label is the empty string "")
> table(df$MF)
  F M
1 2 2
# find the culprit and change it to NA
> df$MF[which(as.integer(df$MF) == 1)] <- as.factor(NA)
# AHA! So although we changed the value, the factor's levels
# were not updated! I wonder what happens if we tabulate the column...
> levels(df$MF)
[1] "" "F" "M"
# Indeed, the empty level is still present in the factor, but there are
# no occurrences!
> table(df$MF)
  F M
0 2 2
# droplevels to the rescue:
# it is used to drop unused levels from a factor or, more commonly,
# from factors in a data frame.
> df$MF <- droplevels(df$MF)
# factors fixed
> levels(df$MF)
[1] "F" "M"
# tabulation fixed
> table(df$MF)
F M
2 2
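As an alternative to the two steps above (assign NA, then droplevels), the level itself can be set to NA, which removes it from the factor and turns its occurrences into NA in one go. This is a minimal sketch on a standalone factor; the variable name mf is only illustrative.

```r
# Standalone factor with an empty-string level, as in the example above
mf <- factor(c("M", "F", "F", "M", ""))

# Setting a level to NA removes that level and marks its occurrences as missing
levels(mf)[levels(mf) == ""] <- NA

levels(mf)      # only "F" and "M" remain
table(mf)       # no column for the empty level any more
sum(is.na(mf))  # the former "" entry is now counted as missing
```

This avoids relying on "" being level 1, which the as.integer(...) == 1 approach assumes.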
Upvotes: 3