Reputation: 1335
I have a data frame of taxonomic variables that looks like this (but longer).
taxTest <- structure(list(Kingdom = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "Bacteria", class = "factor"),
Phylum = structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("Bacteroidetes",
"Proteobacteria"), class = "factor"), Class = structure(c(2L,
1L, 1L, 1L, 1L), .Label = c("Bacteroidia", "Gammaproteobacteria"
), class = "factor"), Order = structure(c(2L, 1L, 1L, 1L,
1L), .Label = c("Bacteroidales", "Enterobacteriales"), class = "factor"),
Family = structure(c(2L, 1L, 3L, 1L, 3L), .Label = c("Bacteroidaceae",
"Enterobacteriaceae", "Prevotellaceae"), class = "factor"),
Genus = structure(c(2L, 1L, 3L, 1L, 3L), .Label = c("Bacteroides",
"Escherichia/Shigella", "Prevotella"), class = "factor"),
Genus.y = structure(c(NA, 1L, 2L, 1L, 2L), .Label = c("Bacteroides",
"Prevotella"), class = "factor"), Species = structure(c(1L,
4L, 2L, 5L, 3L), .Label = c("albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris",
"copri", "disiens", "dorei", "dorei/vulgatus"), class = "factor")), .Names = c("Kingdom",
"Phylum", "Class", "Order", "Family", "Genus", "Genus.y", "Species"
), row.names = c("tax1", "tax2", "tax3", "tax4", "tax5"), class = "data.frame")
I want to come up with a short taxonomic name from this data, so I run a function that is slightly more complicated than this one (it has to deal with NA data in a bunch of these taxonomic levels) but fails in the same way.
library(dplyr)
tag_taxon <- function(tvdf){
  species <- tvdf %>% dplyr::select(Species) %>% unlist
  genus2 <- tvdf %>% dplyr::select(Genus, Genus.y) %>% unlist
  genus <- genus2 %>% na.omit %>% .[1]
  #genus <- tvdf %>% dplyr::select(Genus) %>% unlist
  out <- paste(genus, species)
  out
}
If I run this function against each row of the table, I get the answer I expect: a genus and species name.
for(i in 1:5){
print(taxTest %>% .[i,] %>% tag_taxon)
}
[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"
[1] "Bacteroides dorei"
[1] "Prevotella copri"
[1] "Bacteroides dorei/vulgatus"
[1] "Prevotella disiens"
I feel like I should be able to use dplyr to apply this function over each row of the data frame. Unfortunately, this returns counter-intuitive results.
taxTest %>% rowwise %>% tag_taxon
'Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris' 'Escherichia/Shigella dorei' 'Escherichia/Shigella copri' 'Escherichia/Shigella dorei/vulgatus' 'Escherichia/Shigella disiens'
I thought maybe the apply function might also work here, but this just outright fails with a cryptic error message.
apply(taxTest, 1, tag_taxon)
Error in UseMethod("select_"): no applicable method for 'select_' applied to an object of class "character"
Traceback:
- apply(taxTest, 1, tag_taxon)
- FUN(newX[, i], ...)
- tvdf %>% dplyr::select(Species) %>% unlist # at line 4 of file
- withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
- eval(quote(`_fseq`(`_lhs`)), env, env)
- eval(quote(`_fseq`(`_lhs`)), env, env)
- `_fseq`(`_lhs`)
- freduce(value, `_function_list`)
- function_list[[i]](value)
- dplyr::select(., Species)
- select.default(., Species)
- select_(.data, .dots = compat_as_lazy_dots(...))
Any ideas about what is going on here? I can totally solve this problem with a for loop, but I'd rather use dplyr if I can.
Thanks!
Edit: One more thing! I forgot to mention in my original post that if one un-comments the #genus <- tvdf %>% dplyr::select(Genus) %>% unlist line (that is, if I don't try to combine the Genus.y information with the Genus information), the dplyr pipeline gives the expected results.
Upvotes: 1
Views: 598
Reputation: 93871
paste is vectorized, so there's no need for a separate function to operate by row. The code below requires Genus and Genus.y to be character rather than factor, so I've done the conversion before running the code.
taxTest[,c("Genus","Genus.y")] = lapply(taxTest[,c("Genus","Genus.y")] , as.character)
taxTest %>%
mutate(tag = gsub("NA ", "", paste(Genus, ifelse(Genus.y==Genus, NA, Genus.y), Species)))
The gsub is there to remove each NA plus the space after it. Here's what the tag column looks like:
tag
1 Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris
2 Bacteroides dorei
3 Prevotella copri
4 Bacteroides dorei/vulgatus
5 Prevotella disiens
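If what you actually want is the question's logic (take Genus, and fall back to Genus.y only when Genus is missing), a vectorized sketch with dplyr::coalesce might look like the following. The small tax data frame is a hypothetical stand-in for taxTest with character columns, not the real data, and note that this is not identical to the paste/gsub version above, which also appends Genus.y when it differs from Genus.

```r
library(dplyr)

# Hypothetical miniature of taxTest; columns are character, not factor.
tax <- data.frame(
  Genus   = c("Escherichia/Shigella", NA, "Prevotella"),
  Genus.y = c(NA, "Bacteroides", "Prevotella"),
  Species = c("coli", "dorei", "copri"),
  stringsAsFactors = FALSE
)

# coalesce() returns, element by element, the first non-missing value,
# so no "NA " text is ever pasted in and nothing has to be stripped out.
tagged <- tax %>%
  mutate(tag = paste(coalesce(Genus, Genus.y), Species))

print(tagged$tag)
```

Because no NA ever reaches paste, this also sidesteps the edge case where a real name happens to contain the text "NA " and gsub would remove it.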
To see what's going on with your original code, we can add some cat statements to tag_taxon.
tag_taxon <- function(tvdf){
  species <- tvdf %>% dplyr::select(Species) %>% unlist
  genus2 <- tvdf %>% dplyr::select(Genus, Genus.y) %>% unlist
  cat("genus2 = ", genus2,"\n")
  genus <- genus2 %>% na.omit %>% .[1]
  cat("genus = ", genus,"\n")
  #genus <- tvdf %>% dplyr::select(Genus) %>% unlist
  out <- paste(genus, species)
  out
}
for(i in 1:5){
print(taxTest %>% .[i,] %>% tag_taxon)
}
genus2 = Escherichia/Shigella NA
genus = Escherichia/Shigella
[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"
genus2 = Bacteroides Bacteroides
genus = Bacteroides
[1] "Bacteroides dorei"
genus2 = Prevotella Prevotella
genus = Prevotella
[1] "Prevotella copri"
genus2 = Bacteroides Bacteroides
genus = Bacteroides
[1] "Bacteroides dorei/vulgatus"
genus2 = Prevotella Prevotella
genus = Prevotella
[1] "Prevotella disiens"
Okay, the for loop is doing what we expect. Now for dplyr::rowwise:
taxTest %>% rowwise %>% tag_taxon
genus2 = Escherichia/Shigella Bacteroides Prevotella Bacteroides Prevotella NA Bacteroides Prevotella Bacteroides Prevotella
genus = Escherichia/Shigella
[1] "Escherichia/Shigella albertii/boydii/coli/coli,/dysenteriae/enterica/fergusonii/flexneri/sonnei/vulneris"
[2] "Escherichia/Shigella dorei"
[3] "Escherichia/Shigella copri"
[4] "Escherichia/Shigella dorei/vulgatus"
[5] "Escherichia/Shigella disiens"
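So after rowwise, genus2 is a vector with all the values in Genus followed by all the values in Genus.y (the NA included). na.omit then drops the NA, genus keeps just the first remaining value, and paste recycles that single value against every species. This is not really about non-standard evaluation: the pipe calls tag_taxon once on the whole rowwise data frame, and select returns entire columns regardless of row grouping. rowwise only changes how verbs such as mutate, summarise and do evaluate their expressions.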
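One way to check what rowwise does and does not change is a small experiment like the sketch below. The data frame df and its column names are made up for illustration. It shows that select on a rowwise data frame still returns every row, whereas an expression inside mutate is evaluated one row at a time.

```r
library(dplyr)

# Hypothetical toy data: g and gy play the roles of Genus and Genus.y.
df <- data.frame(
  g  = c("a", NA, "c"),
  gy = c(NA, "b", "c"),
  s  = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# select() ignores the row grouping: all three rows come back at once.
n_all <- nrow(df %>% rowwise() %>% select(g, gy))

# Inside mutate(), g, gy and s are single values for the current row,
# so "first non-NA genus" is computed separately for each row.
res <- df %>%
  rowwise() %>%
  mutate(tag = paste(na.omit(c(g, gy))[1], s)) %>%
  ungroup()

print(n_all)
print(res$tag)
```

So if you want to stay with rowwise, the per-row logic has to live inside a verb like mutate rather than in a function that receives the whole data frame.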
If you want to keep your function, it will work the way you expect with by_row from the purrrlyr package:
library(purrrlyr)
taxTest %>% by_row(tag_taxon)
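If you would rather not add another package, a base R sketch along the following lines should also work. It also suggests why apply failed in the question: apply converts the data frame to a matrix first, so the function receives a plain character vector for each row, and select has no method for that. The tax data frame here is a hypothetical miniature of taxTest, and tag_taxon is a condensed copy of the function from the question.

```r
library(dplyr)

# Hypothetical miniature of taxTest with character columns.
tax <- data.frame(
  Genus   = c("Escherichia/Shigella", NA),
  Genus.y = c(NA, "Bacteroides"),
  Species = c("coli", "dorei"),
  row.names = c("tax1", "tax2"),
  stringsAsFactors = FALSE
)

# Condensed version of the question's function; expects a one-row data frame.
tag_taxon <- function(tvdf) {
  species <- tvdf %>% dplyr::select(Species) %>% unlist
  genus <- tvdf %>% dplyr::select(Genus, Genus.y) %>% unlist %>% na.omit %>% .[1]
  paste(genus, species)
}

# apply() hands FUN one matrix row at a time, i.e. a character vector.
row_class <- apply(tax, 1, function(r) class(r))

# Indexing rows explicitly keeps each piece a one-row data frame.
tags <- vapply(seq_len(nrow(tax)),
               function(i) tag_taxon(tax[i, ]),
               character(1))

print(row_class)
print(tags)
```

This is essentially the for loop from the question written as a single expression, with vapply checking that each call returns exactly one string.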
Upvotes: 1