MattLBeck

Reputation: 5831

subset data.frame for ggplot2 bar chart

I have the following data:

    Splice.Pair  proportion
1         AA-AG 0.010909091
2         AA-GC 0.003636364
3         AA-TG 0.003636364
4         AA-TT 0.007272727
5         AC-AC 0.003636364
6         AC-AG 0.003636364
7         AC-GA 0.003636364
8         AC-GG 0.003636364
9         AC-TC 0.003636364
10        AC-TG 0.003636364
11        AC-TT 0.003636364
12        AG-AA 0.010909091
13        AG-AC 0.007272727
14        AG-AG 0.003636364
15        AG-AT 0.003636364
16        AG-CC 0.003636364
17        AG-CT 0.007272727
...       ...   ...

I want to get a barchart visualising the proportion of each splice pair but only for splice pairs that have a proportion over, say, 0.004. I tried the following:

nc.subset <- subset(nc.dat, proportion > 0.004)
qplot(Splice.Pair, proportion, data=nc.subset, geom="bar", xlab="Splice Pair", ylab="Proportion of total non-canonical splice sites") + coord_flip()

But this just gives me a bar chart with all splice pairs on the Y-axis, except that the splice pairs that were filtered out are missing bars.

I have no idea what is happening to allow all categories to still be present :s

Upvotes: 5

Views: 4184

Answers (2)

joran

Reputation: 173577

What's happening is that Splice.Pair is a factor. When you subset your data frame, the factor retains its levels attribute, which still has all of the original levels. You can avoid this kind of problem by simply wrapping your subsetting in droplevels:

nc.subset <- droplevels(subset(nc.dat, proportion > 0.004))
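To see the effect in isolation, here is a minimal sketch using a four-row toy stand-in for nc.dat (the values are copied from the question's table; the intermediate name kept is only for illustration):

```r
# Toy data standing in for nc.dat (values taken from the question's table)
nc.dat <- data.frame(
    Splice.Pair = factor(c("AA-AG", "AA-GC", "AA-TT", "AC-AC")),
    proportion  = c(0.010909091, 0.003636364, 0.007272727, 0.003636364)
)

# Plain subsetting drops rows but keeps every original factor level
kept <- subset(nc.dat, proportion > 0.004)
nrow(kept)                  # 2
nlevels(kept$Splice.Pair)   # 4

# droplevels() discards the levels that no longer occur in the data
nc.subset <- droplevels(kept)
levels(nc.subset$Splice.Pair)   # "AA-AG" "AA-TT"
```

The unused levels are what ggplot2 picks up when it builds the discrete axis, which is why the empty categories still appear.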

More generally, if you dislike this kind of automatic retention of levels with factors, you can set R to store strings as character vectors rather than factors by default by setting:

options(stringsAsFactors = FALSE)

at the beginning of your R session (stringsAsFactors can also be passed as an argument to data.frame). Note that since R 4.0.0, FALSE is already the default.
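As a small illustration of passing the argument to data.frame directly, the sketch below builds a toy data frame (the name dat and its three rows are made up for the example) and shows that a character column carries no levels to be left behind after subsetting:

```r
# Toy data frame; stringsAsFactors = FALSE keeps Splice.Pair as character
dat <- data.frame(
    Splice.Pair = c("AA-AG", "AA-GC", "AA-TT"),
    proportion  = c(0.010909091, 0.003636364, 0.007272727),
    stringsAsFactors = FALSE
)

is.character(dat$Splice.Pair)   # TRUE

# A character column has no levels attribute, so only the
# values that survive the subset remain
kept <- subset(dat, proportion > 0.004)
unique(kept$Splice.Pair)        # "AA-AG" "AA-TT"
```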

EDIT

Regarding the issue of running older versions of R that may lack droplevels, @rcs points out in a comment that the method for a single factor is very simple to implement on your own. The method for data frames is only slightly more complicated:

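function (x, except = NULL, ...) 
{
    # TRUE for each column of x that is a factor
    ix <- vapply(x, is.factor, NA)
    # Leave the columns listed in 'except' untouched
    if (!is.null(except)) 
        ix[except] <- FALSE
    # Calling factor() on a factor rebuilds it without the unused levels
    x[ix] <- lapply(x[ix], factor)
    x
}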

But of course, the best solution is still to upgrade to the latest version of R.
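For reference, the single-factor case amounts to calling factor() on the factor again. The sketch below shows that, and also how the except argument of the built-in data frame method behaves (the toy objects f, g, df and out are made up for the example):

```r
f <- factor(c("AA-AG", "AA-GC", "AA-TT"))

# Subsetting a factor keeps the full set of levels
g <- f[f != "AA-GC"]
levels(g)           # "AA-AG" "AA-GC" "AA-TT"

# Re-creating the factor drops the level that is no longer used
levels(factor(g))   # "AA-AG" "AA-TT"

# With a data frame, 'except' names the columns to leave alone
df  <- data.frame(a = g, b = g)
out <- droplevels(df, except = 2)
nlevels(out$a)      # 2
nlevels(out$b)      # 3
```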

Upvotes: 6

Rahul Premraj

Reputation: 1595

Check whether Splice.Pair is a factor. If so, use droplevels() to remove the levels that are no longer used; that should resolve your problem.

nc.subset <- subset(nc.dat, proportion > 0.004)
nc.subset$Splice.Pair <- droplevels(nc.subset$Splice.Pair)
qplot(Splice.Pair, proportion, data=nc.subset, geom="bar", xlab="Splice Pair", ylab="Proportion of total non-canonical splice sites") + coord_flip()

You may be able to incorporate droplevels into qplot, but that's for you to find out :-)
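One possible way to fold the two steps together is sketched below. It is an assumption-laden example, not the original poster's code: it uses a four-row toy stand-in for nc.dat, the made-up name plot.dat, and ggplot() with geom_col() in place of qplot(), since recent ggplot2 releases deprecate qplot. The plotting part only runs if ggplot2 is installed.

```r
# Toy stand-in for nc.dat (values taken from the question's table)
nc.dat <- data.frame(
    Splice.Pair = factor(c("AA-AG", "AA-GC", "AA-TT", "AC-AC")),
    proportion  = c(0.010909091, 0.003636364, 0.007272727, 0.003636364)
)

# Subset and drop the unused levels in a single expression
plot.dat <- droplevels(subset(nc.dat, proportion > 0.004))

if (requireNamespace("ggplot2", quietly = TRUE)) {
    library(ggplot2)
    # geom_col() draws bars whose heights are the y values themselves
    p <- ggplot(plot.dat, aes(x = Splice.Pair, y = proportion)) +
        geom_col() +
        labs(x = "Splice Pair",
             y = "Proportion of total non-canonical splice sites") +
        coord_flip()
    print(p)
}
```

The droplevels(subset(...)) expression could equally be passed straight to the data argument instead of being stored first.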

Upvotes: 1
