Reputation: 133
I have a variable df1$StudyAreaVisitNote which I turn into a factor. But when I subset df1 into BS, this variable did not seem to remain a factor: running table() on the subsetted data showed results that looked like what table() would return on the original data.
Why does this happen?
The two workarounds I found were:
Code:
# My dataset can be found here: http://textuploader.com/9tx5 (I'm sure there's a better way to host it, but I'm new, sorry!)
# Load Initial Dataset (df1)
df1 <- read.csv("/Users/user/Desktop/untitled folder/pre_subset.csv", header=TRUE,sep=",")
# Make both columns factors
df1$Trap.Type <- factor(df1$Trap.Type)
df1$StudyAreaVisitNote <- factor(df1$StudyAreaVisitNote)
# Subset out site of interest
BS <- subset(df1, Trap.Type=="HR-BA-BS")
# Export to Excel, save as CSV after it's in excel
library(WriteXLS)
WriteXLS("BS", ExcelFileName = "/Users/user/Desktop/test.xlsx", col.names = TRUE, AdjWidth = TRUE, BoldHeaderRow = TRUE, FreezeRow = 1)
# Load second Dataset (df2)
df2 <- read.csv("/Users/user/Desktop/untitled folder/post_subset.csv", header=TRUE, sep=",")
# both datasets should be identical, and they are superficially, but...
# Have a look at df2
summary(df2$StudyAreaVisitNote) # Looks good, only counts levels that are present
# Now, look at BS from df1
summary(BS$StudyAreaVisitNote) # sessions not present in the subsetted data (but present in df1?) are included???
# Make BS$StudyAreaVisitNote a factor...Again??
BS$StudyAreaVisitNote <- factor(BS$StudyAreaVisitNote)
# Try the summary again
summary(BS$StudyAreaVisitNote) # this time it works, why is factor not carried through during subset?
Upvotes: 0
Views: 186
Reputation: 206232
A factor remains a factor even after subsetting. I'm sure class(BS$StudyAreaVisitNote) is still "factor". However, factors don't automatically drop their unused levels. This can be helpful when you are doing things like
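set.seed(16)
# stringsAsFactors = TRUE is needed in R >= 4.0.0, where data.frame()
# no longer converts strings to factors by default
dd <- data.frame(
  gender = sample(c("M", "F"), 25, replace = TRUE),
  age = rpois(25, 20),
  stringsAsFactors = TRUE
)
dd
table(subset(dd, age < 15)$gender)
# Output from the original run (exact counts can differ across R versions):
# F M
# 0 3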
Here the factor remembers that it had both M's and F's, and even if the subset doesn't have any F's, the levels are still retained. You may explicitly call droplevels() if you want to get rid of unused levels.
table(droplevels(subset(dd, age<15))$gender)
# M
# 3
(now it forgot about the F's)
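To check this directly, here is a minimal sketch using a small hand-made data frame (the values are invented for illustration, not taken from the question's data). It shows that the subset column is still a factor, that the unused level is still listed, and that calling factor() again has the same effect as droplevels(), which is why the second workaround in the question works.

```r
# Toy data; the values are invented for illustration.
dd <- data.frame(
  gender = factor(c("M", "F", "M", "F", "M")),
  age    = c(12, 30, 14, 25, 13)
)
young <- subset(dd, age < 15)

is.factor(young$gender)   # TRUE: the column is still a factor
levels(young$gender)      # "F" "M": the unused level "F" is kept
table(young$gender)       # F 0, M 3

# Both of these rebuild the factor with only the levels that occur
levels(droplevels(young)$gender)  # "M"
levels(factor(young$gender))      # "M"
```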
So instead of summary, compare the results of table on your two data.frames.
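As a rough sketch of that comparison, the example below uses a hypothetical stand-in for df1 (the "OTHER" trap type and the session labels S1 to S4 are made up) and a temporary CSV file in place of the Excel export. It assumes the file is read back with stringsAsFactors = TRUE, which was the default before R 4.0.0. Reading the file rebuilds the factor from the values that are actually present, which would explain why df2 looked "clean" while BS did not.

```r
# Hypothetical stand-in for df1; "OTHER" and the session labels are made up.
df1 <- data.frame(
  Trap.Type          = factor(c("HR-BA-BS", "HR-BA-BS", "OTHER", "OTHER")),
  StudyAreaVisitNote = factor(c("S1", "S2", "S3", "S4"))
)

BS <- subset(df1, Trap.Type == "HR-BA-BS")
table(BS$StudyAreaVisitNote)   # S3 and S4 still appear, with count 0

# Round trip through a file, standing in for the Excel/CSV export
tmp <- tempfile(fileext = ".csv")
write.csv(BS, tmp, row.names = FALSE)
df2 <- read.csv(tmp, stringsAsFactors = TRUE)
table(df2$StudyAreaVisitNote)  # only S1 and S2

# Dropping unused levels gives the same levels without the round trip
BS <- droplevels(BS)
identical(levels(BS$StudyAreaVisitNote), levels(df2$StudyAreaVisitNote))
```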
Upvotes: 2