Reputation: 9509
I want to extract a set of rows of an existing dataset:
dataset.x <- dataset[(as.character(dataset$type))=="x",]
however when I run
summary(dataset.x$type)
It displays all types which were present in the original dataset. Basically I get a result that says
x 12354235 #the correct itemcount
y 0
z 0
a 0
...
Not only is the presence of 0 elements ugly, but it also messes up any plot of dataset.x because of the hundreds of entries with the value 0.
Upvotes: 0
Views: 1572
Reputation: 49640
Others have explained what is happening and how to fix it; I just want to show why it is a desirable default.
Consider the following sample code:
mydata <- data.frame(
x = factor( rep( c(0:5,0:5), c(0,5,10,20,10,5,5,10,20,10,5,0))),
sex = rep( c('F','M'), each=50 ) )
mydata.males <- mydata[ mydata$sex=='M', ]
mydata.males.dropped <- droplevels(mydata.males)
mydata.females <- mydata[ mydata$sex=='F', ]
mydata.females.dropped <- droplevels(mydata.females)
par(mfcol=c(2,2))
barplot(table(mydata.males$x), main='Male', sub='Default')
barplot(table(mydata.females$x), main='Female', sub='Default')
barplot(table(mydata.males.dropped$x), main='Male', sub='Drop')
barplot(table(mydata.females.dropped$x), main='Female', sub='Drop')
Which produces this plot:
Now, which is the more meaningful comparison, the 2 plots on the left? or the 2 plots on the right?
Instead of dropping unused levels, it may be better to rethink what you are doing. If the main goal is to get the count of the x's, then you can use sum rather than subsetting and getting the summary. And how meaningful can a plot be on a variable that you have already forced to be a single value?
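As a rough sketch of the sum approach: the data frame below is a made-up stand-in for the asker's dataset (only the column name type comes from the question), so treat it as an illustration rather than the actual data.

```r
# Hypothetical stand-in for the asker's data
dataset <- data.frame(type = factor(c("x", "y", "x", "z", "x")))

# Count the rows of type "x" directly, without subsetting first
n_x <- sum(dataset$type == "x")
n_x

# table() gives the counts for every level in one call
table(dataset$type)
```

The comparison dataset$type == "x" yields a logical vector, and sum counts its TRUE values.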
Upvotes: 3
Reputation: 173527
Building on Chase's answer, subsetting and dropping unused levels in factors comes up a lot, so it pays to just create your own function by combining droplevels and subset:
subsetDrop <- function(...){droplevels(subset(...))}
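A possible usage of the helper, on a small made-up data frame (the value column and the sample rows are assumptions for illustration only):

```r
subsetDrop <- function(...) droplevels(subset(...))

# Hypothetical sample data; only the column name `type` is from the question
dataset <- data.frame(type = factor(c("x", "y", "x", "z")), value = 1:4)

# Subset and drop unused levels in one step
dataset.x <- subsetDrop(dataset, type == "x")
levels(dataset.x$type)
```

After the call, only the level that actually occurs in the subset should remain.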
Upvotes: 3
Reputation: 1652
Try
dataset$type <- as.character(dataset$type)
followed by your original code. It's probably just that R is still treating that column as a factor and is keeping all of the information about that factor in the column.
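A minimal sketch of that conversion, using invented sample data (the value column and the rows are assumptions). Note that summary on a character vector reports its length and class rather than counts, so table is used here to see the counts.

```r
# Hypothetical sample data; only the column name `type` is from the question
dataset <- data.frame(type = factor(c("x", "y", "x", "z")), value = 1:4)

# Convert the factor to plain character strings before subsetting
dataset$type <- as.character(dataset$type)
dataset.x <- dataset[dataset$type == "x", ]

# A character column carries no level information, so only "x" is tabulated
table(dataset.x$type)
```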
Upvotes: 1
Reputation: 69151
I'm assuming this is a factor? If so, droplevels() can be used: http://stat.ethz.ch/R-manual/R-patched/library/base/html/droplevels.html
If you add a small reproducible example, it will help others get on the same page and give better advice if this isn't right.
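For instance, a small reproducible example could look something like the following. The data are invented for illustration, and this assumes type really is a factor.

```r
# Hypothetical sample data; only the column name `type` is from the question
dataset <- data.frame(type = factor(c("x", "y", "x", "z")), value = 1:4)

dataset.x <- dataset[dataset$type == "x", ]
summary(dataset.x$type)   # y and z are still listed, with a count of 0

# droplevels() removes the factor levels that no longer occur
dataset.x <- droplevels(dataset.x)
summary(dataset.x$type)   # only x remains
```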
Upvotes: 3