Reputation: 161

How to fill gaps in series of strings?

I am facing a problem where a package retrieves taxonomic information (on species) that is not always of the same length. Consequently, the function stores the output in a list whose elements contain tables of 2 rows and a varying number of columns (1 row for the taxonomic rank, 1 row for the info itself):

taxo.spA <- data.frame(name=c("Animalia", "Arthropoda", "Chelicerata", 
                                 "Arachnida", "Acari"), 
                       rank=c("Kingdom", "Phylum", "Subphylum", "Class", 
                              "Subclass"))

taxo.spB <- data.frame(name=c("Animalia", "Chordata", "Vertebrata", 
                               "Gnathostomata", "Actinopterygii", "Perciformes", 
                               "Trachinoidei", "Ammodytidae", "Ammodytes", 
                               "Ammodytes tobianus"),
                       rank=c("Kingdom", "Phylum", "Subphylum", "Superclass", 
                              "Class", "Order", "Suborder", "Family", "Genus", 
                              "Species"))

I would like to end up with a table with ranks as columns and names as rows. The main issue is that the taxonomy usually varies in terms of ranks: some taxa are not resolved down to the species level (like this Acari) and, even when they are, the ranks may differ (e.g. absence of a Superclass), so you cannot cbind or rbind those tables (differing numbers of columns or rows).

However, the taxonomic ranks follow a hierarchy, so I have been trying to reconstruct this series of ranks (Kingdom down to Species, or Subspecies). I wonder what the best approach to this is. Is there a package/function that finds the matches between two vectors of strings and the location where the missing elements should be inserted?

For instance:

ranks1 <- c("Kingdom", "Phylum", "Subphylum", "Class")
ranks2 <- c("Kingdom", "Phylum", "Subphylum", "Superclass", "Class", "Order")

The function would identify that Kingdom:Subphylum and Class are in common, and that Subphylum and Class surround Superclass, so that Superclass can be inserted between Subphylum and Class. Finally, it would identify that Order is missing and should come right after Class, on its right side:

"Kingdom", "Phylum", "Subphylum", "Superclass", "Class", "Order"

Ultimately, the function I am writing will build a data.frame with n columns (the longest taxonomy) and S rows (the number of taxa) and fill it with the taxonomic info I have on each taxon, in the correct column, leaving the rest as NAs.

desired.output <- data.frame(rbind(c("Animalia", "Arthropoda", "Chelicerata", 
                                     NA, "Arachnida", "Acari", NA), 
                                   c("Animalia", "Chordata", "Vertebrata", 
                                     "Gnathostomata", "Actinopterygii", NA, 
                                     "Perciformes")))

names(desired.output) <- c("Kingdom", "Phylum", "Subphylum", "Superclass", 
                           "Class", "Subclass", "Order")

I have tried to start with one of the most complete info I have and fill in the gaps comparing with other taxa. I have played with setdiff(), intersect(), %in%; and tried to find what's in common, what belongs to only one of the two strings and rebuild that but I am not sure that's the best way to go?

Any ideas? Suggestions?

N.B. I will keep the dataset as a dataframe (although for now more a matrix) as I will merge it with other datasets later on.

EDITS/ANSWER BELOW

So, first of all, thanks for the help. I drew inspiration from the answers and managed to get this to work.

The main problem was that the tables contained in the list (1) did not have the same number of rows, (2) rows could contain different info (some ranks may be skipped in a taxonomy) making it hard to merge everything inside one single table.

However, the taxonomy has this tree-like hierarchy that I could use to find how those ranks branch together. How I tackled the problem:

I used the organism which had the most resolved info as my reference (the highest number of ranks), then took each vector of ranks and found the differences with this most resolved vector. Then I searched for the position of those missing ranks by looking at which ranks come above and below them in the hierarchy and where those would match in my reference.

Four cases were possible (N.B. the highest rank is on the left, lowest on the right):

no match: I can't place that rank in the taxonomy (yet)
2 matches: I can place the missing info between the two matches in my reference
1 match on the left: I can place it after the match
1 match on the right: I can place it before the match

I looped over the missing ranks and grew the reference vector step by step until it included all possible ranks in the dataset: I used the function append() to add each missing rank after a specific position, defined by the positions of the ranks shared between the reference and the other taxonomies.
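
The loop described above could be sketched roughly as follows. This is only an illustration and not the code actually used: merge_ranks() is a hypothetical helper name, and it assumes that every vector lists its ranks from highest to lowest. When a missing rank has neighbours on both sides, this sketch simply places it directly after its nearest higher neighbour, which is a simplification if the reference already holds other ranks between the two matches.

```r
# Hypothetical helper (name is an assumption): merge the ranks of `new`
# into the reference vector `ref`, keeping the order highest -> lowest.
merge_ranks <- function(ref, new) {
  for (i in seq_along(new)) {
    r <- new[i]
    if (r %in% ref) next                  # rank already known, nothing to do

    # ranks above r (to its left in `new`) that exist in ref, nearest first
    left <- rev(new[seq_len(i - 1)])
    left <- left[left %in% ref]
    # ranks below r (to its right in `new`) that exist in ref, nearest first
    right <- new[-seq_len(i)]
    right <- right[right %in% ref]

    if (length(left) > 0) {
      # match on the left (or on both sides): insert right after it
      ref <- append(ref, r, after = match(left[1], ref))
    } else if (length(right) > 0) {
      # match on the right only: insert just before it
      ref <- append(ref, r, after = match(right[1], ref) - 1)
    } else {
      # no match: the rank cannot be placed (yet)
      warning("cannot place rank: ", r)
    }
  }
  ref
}

ranks1 <- c("Kingdom", "Phylum", "Subphylum", "Class")
ranks2 <- c("Kingdom", "Phylum", "Subphylum", "Superclass", "Class", "Order")
merge_ranks(ranks1, ranks2)
# "Kingdom" "Phylum" "Subphylum" "Superclass" "Class" "Order"
```

With more than two taxonomies, something like Reduce(merge_ranks, list_of_rank_vectors) would apply the same step repeatedly, starting from the first vector in the list.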

Finally, I used this vector as my columns' names for the final table and filled the table with the taxa information (see below). Maybe not the best but should be consistent across taxonomies.
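
The filling step could look something like the sketch below, using the two example taxonomies from the question. The name all_ranks and its hard-coded content are assumptions made for illustration; in practice it would be the merged rank vector built in the previous step.

```r
taxo.spA <- data.frame(name = c("Animalia", "Arthropoda", "Chelicerata",
                                "Arachnida", "Acari"),
                       rank = c("Kingdom", "Phylum", "Subphylum", "Class",
                                "Subclass"))
taxo.spB <- data.frame(name = c("Animalia", "Chordata", "Vertebrata",
                                "Gnathostomata", "Actinopterygii", "Perciformes",
                                "Trachinoidei", "Ammodytidae", "Ammodytes",
                                "Ammodytes tobianus"),
                       rank = c("Kingdom", "Phylum", "Subphylum", "Superclass",
                                "Class", "Order", "Suborder", "Family", "Genus",
                                "Species"))

# Assumed complete, ordered set of ranks (hard-coded here for the example)
all_ranks <- c("Kingdom", "Phylum", "Subphylum", "Superclass", "Class",
               "Subclass", "Order", "Suborder", "Family", "Genus", "Species")

taxa <- list(taxo.spA = taxo.spA, taxo.spB = taxo.spB)

# For each taxon, look up every rank of all_ranks; ranks that are absent
# from that taxon produce NA in the corresponding position
rows <- lapply(taxa, function(x) as.character(x$name)[match(all_ranks, x$rank)])

out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(out) <- all_ranks
out
```

match() returns NA for any rank a taxon does not have, so indexing the names with it leaves the gaps as NA without any explicit looping over columns.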

Thanks a lot! (P.S. Feels nice when it is finally doing what it is supposed to do)

Upvotes: 2

Answers (2)

jay.sf

Reputation: 73212

You could first define a function that transforms your taxo*s into the incomplete end format.

myTransform <- function(x) {
  tr <- t(x[2:1])
  colnames(tr) <- make.names(tr[1, ], unique=TRUE)  # `make.names()` to get unique column names
  return(as.data.frame(t(tr[-1, ])))
}

Then put all taxo*s into a list l, e.g. with mget() if they are loaded into the workspace.

l <- lapply(mget(ls(pattern="taxo")), myTransform)

(This is basically the same as what l <- lapply(list(taxo.spA, taxo.spB), myTransform) does, but it's assumed that you have a whole bunch of taxo*s.)

It makes sense to add an id column to the data frames in the list.

l <- lapply(seq_along(l), function(x) cbind(id=names(l)[x], l[[x]]))

Now run merge() wrapped into Reduce() like so:

out <- Reduce(function(...) merge(..., all=TRUE), l)

Giving

> out
        id  Kingdom     Phylum   Subphylum          Class Subclass
1 taxo.spA Animalia Arthropoda Chelicerata      Arachnida    Acari
2 taxo.spB Animalia   Chordata  Vertebrata Actinopterygii     <NA>
3 taxo.spC Animalia Arthropoda Chelicerata      Arachnida    Acari
     Superclass       Order     Suborder      Family     Genus
1          <NA>        <NA>         <NA>        <NA>      <NA>
2 Gnathostomata Perciformes Trachinoidei Ammodytidae Ammodytes
3          <NA>        <NA>         <NA>        <NA>      <NA>
             Species Subclass.1
1               <NA>       <NA>
2 Ammodytes tobianus       <NA>
3               <NA>  Something

Additional data (to simulate duplicated column)

taxo.spC <- structure(list(name = structure(c(2L, 4L, 5L, 3L, 1L, 6L), .Label = c("Acari", 
"Animalia", "Arachnida", "Arthropoda", "Chelicerata", "Something"
), class = "factor"), rank = structure(c(2L, 3L, 5L, 1L, 4L, 
4L), .Label = c("Class", "Kingdom", "Phylum", "Subclass", "Subphylum"
), class = "factor")), row.names = c(NA, -6L), class = "data.frame")

Upvotes: 1

DS_UNI

Reputation: 2650

How about something like this:

library(dplyr)
# add a column with the name of the taxonomy
taxo.spA$tax <- "taxo.spA"
taxo.spB$tax <- "taxo.spB"

# bind the rows together (an alternative to do.call(rbind, .) would be data.table::rbindlist())
# this would also work if you have more than two taxonomies 
result <- list(taxo.spA, taxo.spB) %>% 
  do.call(rbind, .) %>% 
  reshape2::dcast(tax ~ rank, value.var = "name") 

# choose the columns and the order you want
orderd_classes <- c("Kingdom", "Phylum", "Subphylum", "Superclass", "Class", "Subclass", "Order")
result[orderd_classes]

The result would be:

# Kingdom     Phylum   Subphylum    Superclass          Class Subclass       Order
# Animalia Arthropoda Chelicerata          <NA>      Arachnida    Acari        <NA>
# Animalia   Chordata  Vertebrata Gnathostomata Actinopterygii     <NA> Perciformes

Upvotes: 0
