dplyr gives me different answers depending on how I select columns

Question

I may be misunderstanding some of the basics of dplyr, but it appears that R behaves very differently depending on whether you subset columns as one-column data frames or as traditional vectors. Here is an example:

mtcarsdf<-tbl_df(mtcars)

example<-function(x,y) {
  df<-tbl_df(data.frame(x,y))
  df %>% group_by(x) %>% summarise(total=sum(y))
}
# subsetting cyl this way gives a numeric vector
example(mtcars$gear,mtcarsdf$cyl)
# 3 112
# 4 56
# 5 30

# subsetting this way gives a one-column data frame (tbl_df)
example(mtcars$gear,mtcarsdf[,"cyl"])
# 3 198
# 4 198
# 5 198
all(mtcarsdf$cyl==mtcarsdf[,"cyl"])
# TRUE

Since my inputs are technically equal, the fact that I am getting different outputs tells me I am misunderstanding how the two objects behave. Could someone please explain how to improve the example function so that it handles different objects more robustly?

Thanks

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer

First, the items that you are comparing with == are not really the same. This could be identified using all.equal instead of ==:

all.equal(mtcarsdf$cyl, mtcarsdf[, "cyl"])
## [1] "Modes: numeric, list"                           
## [2] "Lengths: 32, 1"                                 
## [3] "names for current but not for target"           
## [4] "Attributes: < target is NULL, current is list >"
## [5] "target is numeric, current is tbl_df"

With that in mind, you should be able to get the behavior you want by using [[ to extract the column instead of [.
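To see why the extraction method matters inside the function, here is a minimal sketch in base R only. It assumes that `drop = FALSE` on a plain data frame is a fair stand-in for how a tbl_df keeps a one-column result, and the names `y_vec` and `y_df` are made up for illustration. It shows that data.frame() keeps the column name of a one-column data frame instead of naming the column after the argument:

```r
# Base R only; drop = FALSE mimics a tbl_df returning a one-column data frame.
x     <- mtcars$gear
y_vec <- mtcars[["cyl"]]                 # plain numeric vector
y_df  <- mtcars[, "cyl", drop = FALSE]   # one-column data frame

# A vector argument is named after the expression that was passed in.
names_vec <- names(data.frame(x, y_vec))

# A one-column data frame keeps its own column name, so no column
# called "y_df" (or "y", inside the function) is created.
names_df <- names(data.frame(x, y_df))

print(names_vec)
print(names_df)
```

In the original function this means the data frame has columns x and cyl, so there is no column y for summarise() to find.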

mtcarsdf <- tbl_df(mtcars)

example<-function(x,y) {
  df<-tbl_df(data.frame(x,y))
  df %>% group_by(x) %>% summarise(total=sum(y))
}

example(mtcars$gear, mtcarsdf[["cyl"]])

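However, a safer approach might be to rename the columns inside your function. When y is a one-column tbl_df, data.frame(x, y) names the second column cyl rather than y, so sum(y) falls back to the function argument y and returns the total of the whole column (198) for every group. Setting the names explicitly avoids this: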

example2 <- function(x, y) {
  df <- tbl_df(setNames(data.frame(x, y), c("x", "y")))
  df %>% group_by(x) %>% summarise(total = sum(y))
}

Then, any of the following should give you the same results.

example2(mtcars$gear, mtcarsdf$cyl)
example2(mtcars$gear, mtcarsdf[["cyl"]])
example2(mtcars$gear, mtcarsdf[, "cyl"])
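As a rough check of the renaming idea without dplyr, the sketch below uses base R only. The helper name `normalise_inputs` is hypothetical, aggregate() stands in for the group_by/summarise pipeline, and `drop = FALSE` is assumed to mimic the one-column tbl_df:

```r
# Hypothetical helper: force the column names regardless of input type.
normalise_inputs <- function(x, y) {
  setNames(data.frame(x, y), c("x", "y"))
}

from_vec <- normalise_inputs(mtcars$gear, mtcars[["cyl"]])
from_df  <- normalise_inputs(mtcars$gear, mtcars[, "cyl", drop = FALSE])

# aggregate() is used here in place of group_by() + summarise().
totals_vec <- aggregate(y ~ x, data = from_vec, FUN = sum)
totals_df  <- aggregate(y ~ x, data = from_df,  FUN = sum)

print(totals_vec)
print(totals_df)
```

Both inputs end up with a column literally named y, so the grouped sums no longer depend on how the column was extracted.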
