ohnoplus

Reputation: 1335

dplyr giving different results with the rowwise operator than looping that function over each row

I have a data frame of taxonomic variables that looks like this (but longer).

taxTest <- structure(list(Kingdom = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Bacteria", class = "factor"), 
Phylum = structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("Bacteroidetes", 
"Proteobacteria"), class = "factor"), Class = structure(c(2L, 
1L, 1L, 1L, 1L), .Label = c("Bacteroidia", "Gammaproteobacteria"
), class = "factor"), Order = structure(c(2L, 1L, 1L, 1L, 
1L), .Label = c("Bacteroidales", "Enterobacteriales"), class = "factor"), 
Family = structure(c(2L, 1L, 3L, 1L, 3L), .Label = c("Bacteroidaceae", 
"Enterobacteriaceae", "Prevotellaceae"), class = "factor"), 
Genus = structure(c(2L, 1L, 3L, 1L, 3L), .Label = c("Bacteroides", 
"Escherichia/Shigella", "Prevotella"), class = "factor"), 
Genus.y = structure(c(NA, 1L, 2L, 1L, 2L), .Label = c("Bacteroides", 
"Prevotella"), class = "factor"), Species = structure(c(1L, 
4L, 2L, 5L, 3L), .Label = c("albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris", 
"copri", "disiens", "dorei", "dorei/vulgatus"), class = "factor")), .Names = c("Kingdom", 
"Phylum", "Class", "Order", "Family", "Genus", "Genus.y", "Species"
), row.names = c("tax1", "tax2", "tax3", "tax4", "tax5"), class = "data.frame")


I want to build a short taxonomic name from this data. The function below is a simplified version of the one I actually run (the real one has to deal with NA values in several of these taxonomic levels), but it fails in the same way.

library(dplyr)

tag_taxon <- function(tvdf){
    species <- tvdf %>% dplyr::select(Species) %>% unlist

    genus2 <- tvdf %>% dplyr::select(Genus, Genus.y) %>% unlist
    genus <- genus2 %>% na.omit %>% .[1]

    #genus <- tvdf %>% dplyr::select(Genus) %>% unlist

    out <- paste(genus, species)

    out
}

If I run this function against each row of the table in a for loop, I get the answer I expect: a genus and species name.

for(i in 1:5){
    print(taxTest %>% .[i,] %>% tag_taxon)
}

[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"

[1] "Bacteroides dorei"

[1] "Prevotella copri"

[1] "Bacteroides dorei/vulgatus"

[1] "Prevotella disiens"

I feel like I should be able to use dplyr to apply this function over each row of the data frame. Unfortunately, this returns counter-intuitive results.

 taxTest %>% rowwise %>% tag_taxon

'Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris' 'Escherichia/Shigella dorei' 'Escherichia/Shigella copri' 'Escherichia/Shigella dorei/vulgatus' 'Escherichia/Shigella disiens'

I thought maybe the apply function might also work here, but this just outright fails with a cryptic error message.

 apply(taxTest, 1, tag_taxon)

Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "character"

Traceback:

  1. apply(taxTest, 1, tag_taxon)
  2. FUN(newX[, i], ...)
  3. tvdf %>% dplyr::select(Species) %>% unlist # at line 4 of file
  4. withVisible(eval(quote(_fseq(_lhs)), env, env))
  5. eval(quote(_fseq(_lhs)), env, env)
  6. eval(quote(_fseq(_lhs)), env, env)
  7. _fseq(_lhs)
  8. freduce(value, _function_list)
  9. function_list[i]
  10. dplyr::select(., Species)
  11. select.default(., Species)
  12. select_(.data, .dots = compat_as_lazy_dots(...))

Any ideas about what is going on here? I can totally solve this problem with a for loop, but I'd rather use dplyr if I can.

Thanks!

Edit: One more thing! I forgot to mention in my original post that if one un-comments the #genus <- tvdf %>% dplyr::select(Genus) %>% unlist line (that is, I don't try to append the species information to the genus information), the dplyr version gives the expected results.

Upvotes: 1

Views: 598

Answers (1)

eipi10

Reputation: 93871

paste is vectorized, so there's no need for a separate function to operate by row. The code below requires Genus and Genus.y to be character rather than factor, so I've done the conversion before running the code.

taxTest[,c("Genus","Genus.y")] = lapply(taxTest[,c("Genus","Genus.y")] , as.character)

taxTest %>% 
  mutate(tag = gsub("NA ", "", paste(Genus, ifelse(Genus.y==Genus, NA, Genus.y), Species)))

The gsub is to remove NA plus the space after it. Here's what the tag column looks like:

  tag
1 Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris
2                                                                                        Bacteroides dorei
3                                                                                         Prevotella copri
4                                                                               Bacteroides dorei/vulgatus
5                                                                                       Prevotella disiens
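If the goal is simply "first non-missing genus, then species", as in the original tag_taxon, dplyr::coalesce is another vectorized option. The following is only a sketch on a small made-up data frame (tax and its values are assumptions for illustration, not the real data), and it assumes the genus columns are already character:

```r
library(dplyr)

# Made-up example data; columns are character, not factor
tax <- data.frame(
  Genus   = c("Escherichia/Shigella", "Bacteroides", NA),
  Genus.y = c(NA, "Bacteroides", "Prevotella"),
  Species = c("coli", "dorei", "copri"),
  stringsAsFactors = FALSE
)

# coalesce() takes, element by element, the first non-NA value
# from Genus and Genus.y; paste() then works on whole columns
tagged <- tax %>%
  mutate(tag = paste(coalesce(Genus, Genus.y), Species))

tagged$tag
```

This avoids the gsub step, because no "NA" string is ever pasted into the tag.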

To see what's going on with your original code, we can add some cat statements to tag_taxon.

tag_taxon <- function(tvdf){
  species <- tvdf %>% dplyr::select(Species) %>% unlist

  genus2 <- tvdf %>% dplyr::select(Genus, Genus.y) %>% unlist

  cat("genus2 = ", genus2,"\n")

  genus <- genus2 %>% na.omit %>% .[1]

  cat("genus = ", genus,"\n")

  #genus <- tvdf %>% dplyr::select(Genus) %>% unlist

  out <- paste(genus, species)

  out
}

for(i in 1:5){
  print(taxTest %>% .[i,] %>% tag_taxon)
}
genus2 =  Escherichia/Shigella NA 
genus =  Escherichia/Shigella
[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"
genus2 =  Bacteroides Bacteroides 
genus =  Bacteroides 
[1] "Bacteroides dorei"
genus2 =  Prevotella Prevotella 
genus =  Prevotella 
[1] "Prevotella copri"
genus2 =  Bacteroides Bacteroides 
genus =  Bacteroides 
[1] "Bacteroides dorei/vulgatus"
genus2 =  Prevotella Prevotella 
genus =  Prevotella 
[1] "Prevotella disiens"

Okay, the for loop is doing what we expect. Now for dplyr::rowwise:

taxTest %>% rowwise %>% tag_taxon
genus2 =  Escherichia/Shigella Bacteroides Prevotella Bacteroides Prevotella NA Bacteroides Prevotella Bacteroides Prevotella 
genus =  Escherichia/Shigella 
[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"
[2] "Escherichia/Shigella dorei"                                                                              
[3] "Escherichia/Shigella copri"                                                                              
[4] "Escherichia/Shigella dorei/vulgatus"                                                                     
[5] "Escherichia/Shigella disiens"

So with rowwise, tag_taxon receives the whole data frame at once, and genus2 is a vector with all the values in Genus followed by all the values in Genus.y (including the NA). Then genus keeps just the first value, and paste recycles it against every species. The reason is that rowwise only changes how dplyr verbs such as mutate, summarise and do evaluate their expressions; it does not make an arbitrary function in the pipe run once per row.
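For rowwise to have an effect, the per-row logic has to sit inside a dplyr verb. A minimal sketch, assuming a small made-up data frame (tax is hypothetical) with character columns, and with the body of tag_taxon written inline in mutate:

```r
library(dplyr)

# Made-up example data; columns are character, not factor
tax <- data.frame(
  Genus   = c("Escherichia/Shigella", "Bacteroides", "Prevotella"),
  Genus.y = c(NA, "Bacteroides", "Prevotella"),
  Species = c("coli", "dorei", "copri"),
  stringsAsFactors = FALSE
)

tagged <- tax %>%
  rowwise() %>%
  # Inside mutate() on a rowwise data frame, Genus, Genus.y and Species
  # each refer to the single value in the current row
  mutate(tag = paste(na.omit(c(Genus, Genus.y))[1], Species)) %>%
  ungroup()

tagged$tag
```

Here na.omit(...)[1] picks the first non-missing genus for each row separately, rather than once for the whole table.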

If you want to use your function as it is, it will work the way you expect with by_row from the purrrlyr package:

library(purrrlyr)

taxTest %>% by_row(tag_taxon)
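Without an extra package, calling the function on one-row slices in base R gives the same per-row behaviour as the for loop. The sketch below uses a made-up data frame, and tag_row is a hypothetical, simplified stand-in for tag_taxon. It also shows why apply(taxTest, 1, tag_taxon) fails: apply converts the data frame to a character matrix and hands each row to the function as a plain character vector, which dplyr::select has no method for.

```r
# Made-up example data; columns are character, not factor
tax <- data.frame(
  Genus   = c("Escherichia/Shigella", "Bacteroides", "Prevotella"),
  Genus.y = c(NA, "Bacteroides", "Prevotella"),
  Species = c("coli", "dorei", "copri"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for tag_taxon: expects a one-row data frame
tag_row <- function(row_df) {
  genus <- na.omit(unlist(row_df[, c("Genus", "Genus.y")]))[1]
  paste(genus, row_df$Species)
}

# Call the function once per row, passing a one-row data frame each time
tags <- vapply(seq_len(nrow(tax)),
               function(i) tag_row(tax[i, ]),
               character(1))

# What apply() passes to its function: a character vector, not a data frame
row_classes <- apply(tax, 1, class)
```

vapply with character(1) also checks that each call returns exactly one string, which would catch a function that accidentally returns a value for every row.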

Upvotes: 1
