Reputation: 87

Extract row wise common substring within more than 2 columns dataframe R

I want to compare three methods for microbiology lineage classification. Each result holds the family, genus and species taxonomic ranks, separated by ";", in an R data frame (df).

As a toy example, I have the following df:

df <- data.frame(classifier_a = c("Lachnospiraceae;Blautia;NA", 'Succinivibrionaceae;Succinivibrio;NA', 'NA;NA;NA'), 
                 classifier_b = c('Lachnospiraceae;Blautia;Blautia faecis', 'Succinivibrionaceae;Succinivibrio;NA', 'UGC-10;NA;NA'), 
                 classifier_c= c('Lachnospiraceae;Blautia;Blautia faecis', 'Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens', 'NA;NA;NA'))

df

For each row, I want to find the result that the methods agree on (for this I thought of using the statistical mode). However, this approach does not work as I want, because I obtain the following output:

df$Row_mode <- apply(df[,1:3],1, function(x) {names(which.max(table(factor(x,unique(x)))))})

df <- data.frame(classifier_a = c("Lachnospiraceae;Blautia;NA", 'Succinivibrionaceae;Succinivibrio;NA', 'NA;NA;NA'), 
                 classifier_b = c('Lachnospiraceae;Blautia;Blautia faecis', 'Succinivibrionaceae;Succinivibrio;NA', 'UGC-10;NA;NA'), 
                 classifier_c= c('Lachnospiraceae;Blautia;Blautia faecis', 'Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens', 'NA;NA;NA'),
                 Row_mode=c("Lachnospiraceae;Blautia;Blautia faecis", "Succinivibrionaceae;Succinivibrio;NA","NA;NA;NA"))

In the new Row_mode variable, the first row holds the desired output: Lachnospiraceae;Blautia;Blautia faecis. However, in the second row I do not get the classification from classifier_c, which is the only one that gives the family, genus and species. This is expected, because I used the mode.

Therefore, apart from the mode, I need to take into account how many of the three taxonomic levels are resolved. In this second case I would like to obtain:

Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens instead of Succinivibrionaceae;Succinivibrio;NA

Finally, in the third row, I would like to obtain the result with the fewest NA values, although it is not the statistical mode.

In general terms, I like the statistical mode because I expect the 3 classifier methods to give similar taxonomic lineages, but in each row I want the longest (most informative) lineage from at least one of the three methods compared.

In summary, the desired output is:

df <- data.frame(classifier_a = c("Lachnospiraceae;Blautia;NA", 'Succinivibrionaceae;Succinivibrio;NA', 'NA;NA;NA'), 
                 classifier_b = c('Lachnospiraceae;Blautia;Blautia faecis', 'Succinivibrionaceae;Succinivibrio;NA', 'UGC-10;NA;NA'), 
                 classifier_c= c('Lachnospiraceae;Blautia;Blautia faecis', 'Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens', 'NA;NA;NA'),
                 desired_output=c("Lachnospiraceae;Blautia;Blautia faecis", "Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens","UGC-10;NA;NA"))

Thanks in advance for your help and hints.

Upvotes: 1

Answers (3)

one

Reputation: 3912

Updated answer (to account for multiple minima, using the dataset in the comment):

least_na <- function(x){
  index <- stringr::str_count(x,"NA")
  paste(x[which(index==min(index))],collapse="|")
}
df_2$Row_mode <- apply(df_2,1,least_na)

> df_2$Row_mode
[1] "Lachnospiraceae;Blautia;Blautia faecis|Lachnospiraceae;Escherichia;Escherichia coli"

We can create a user-defined function for this:

least_na <- function(x){
  # which.min() returns the first classifier with the fewest "NA" matches
  x[which.min(stringr::str_count(x,"NA"))]
}

df$Row_mode <- apply(df,1,least_na)

> df$Row_mode
[1] "Lachnospiraceae;Blautia;Blautia faecis"                         
[2] "Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens"
[3] "UGC-10;NA;NA"
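Two details of the functions above may matter on real data: str_count(x, "NA") counts every occurrence of the substring "NA", so a label that happens to contain those letters is counted too, and which.min() keeps only the first of several tied lineages. A minimal base R sketch that counts whole ranks equal to "NA" and breaks ties by frequency could look like the following. The helper names count_na_ranks and most_informative and the Row_best column are hypothetical, not part of the answer above.

```r
# Toy data from the question
df <- data.frame(
  classifier_a = c("Lachnospiraceae;Blautia;NA", "Succinivibrionaceae;Succinivibrio;NA", "NA;NA;NA"),
  classifier_b = c("Lachnospiraceae;Blautia;Blautia faecis", "Succinivibrionaceae;Succinivibrio;NA", "UGC-10;NA;NA"),
  classifier_c = c("Lachnospiraceae;Blautia;Blautia faecis", "Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens", "NA;NA;NA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of ranks that are exactly "NA" in each lineage
count_na_ranks <- function(x) {
  vapply(strsplit(x, ";", fixed = TRUE),
         function(ranks) sum(ranks == "NA"),
         integer(1))
}

# Hypothetical helper: keep the lineages with the fewest "NA" ranks,
# then return the most frequent one among them (the first one on a tie)
most_informative <- function(x) {
  n_na <- count_na_ranks(x)
  best <- x[n_na == min(n_na)]
  names(which.max(table(factor(best, levels = unique(best)))))
}

df$Row_best <- apply(df[, 1:3], 1, most_informative)
df$Row_best
```

On the toy data this should reproduce the desired output from the question; whether frequency is the right tie-breaker depends on how much each classifier is trusted.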

Upvotes: 1

Onyambu

Reputation: 79338

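library(tidyverse)  # dplyr, tidyr, tibble and stringr are all used below

df %>%
  rowid_to_column() %>%
  # one row per taxonomic rank
  separate_rows(everything(), sep = ';') %>%
  # per rank, take the first classifier value that is not "NA"
  mutate(new_col = do.call('coalesce', across(-1, ~na_if(., 'NA'))),
         new_col = replace_na(new_col, 'NA')) %>%
  group_by(rowid) %>%
  # paste the ranks back into one lineage per original row
  summarise(across(everything(), ~str_c(.x, collapse = ';')))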
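As I read it, the pipeline above works rank by rank: for each family, genus and species slot it keeps the first classifier value that is not "NA". The same idea can be sketched in base R as follows, assuming every lineage has the same number of ";"-separated ranks. The names rank_mats and consensus are hypothetical.

```r
# Toy data from the question
df <- data.frame(
  classifier_a = c("Lachnospiraceae;Blautia;NA", "Succinivibrionaceae;Succinivibrio;NA", "NA;NA;NA"),
  classifier_b = c("Lachnospiraceae;Blautia;Blautia faecis", "Succinivibrionaceae;Succinivibrio;NA", "UGC-10;NA;NA"),
  classifier_c = c("Lachnospiraceae;Blautia;Blautia faecis", "Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens", "NA;NA;NA"),
  stringsAsFactors = FALSE
)

# One character matrix per classifier: rows are samples, columns are ranks
# (assumes every lineage splits into the same number of ranks)
rank_mats <- lapply(df, function(col) do.call(rbind, strsplit(col, ";", fixed = TRUE)))

# Start from the first classifier and fill its "NA" ranks
# from the later classifiers, in column order
consensus <- rank_mats[[1]]
for (m in rank_mats[-1]) {
  missing <- consensus == "NA"
  consensus[missing] <- m[missing]
}

# Paste the ranks back into one lineage per row
df$new_col <- apply(consensus, 1, paste, collapse = ";")
df$new_col
```

Because each rank is filled independently, the result can combine ranks from different classifiers, so it is worth checking that a mixed lineage is acceptable for the comparison.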

Upvotes: 0

Luci

Reputation: 51

In your code, you used which.max to find the most common answer among the three classifiers, which is why you get Succinivibrionaceae;Succinivibrio;NA instead of Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens in row 2.

I changed your code to apply which.max to nchar, so that it finds the longest answer among the three classifiers. In your toy example, this corresponds to the one that gives you the family, genus and species.

df <- data.frame(classifier_a = c("Lachnospiraceae;Blautia;NA", 'Succinivibrionaceae;Succinivibrio;NA', 'NA;NA;NA'), 
                 classifier_b = c('Lachnospiraceae;Blautia;Blautia faecis', 'Succinivibrionaceae;Succinivibrio;NA', 'UGC-10;NA;NA'), 
                 classifier_c= c('Lachnospiraceae;Blautia;Blautia faecis', 'Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens', 'NA;NA;NA'))


df$Row_mode <- apply(df, 1, function(x) {x[which.max(nchar(x))]})
df

Which returns:


                          classifier_a                           classifier_b                                                    classifier_c
1           Lachnospiraceae;Blautia;NA Lachnospiraceae;Blautia;Blautia faecis                          Lachnospiraceae;Blautia;Blautia faecis
2 Succinivibrionaceae;Succinivibrio;NA   Succinivibrionaceae;Succinivibrio;NA Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens
3                             NA;NA;NA                           UGC-10;NA;NA                                                        NA;NA;NA
                                                         Row_mode
1                          Lachnospiraceae;Blautia;Blautia faecis
2 Succinivibrionaceae;Succinivibrio;Succinivibrio dextrinosolvens
3                                                    UGC-10;NA;NA
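One caveat: string length is only a proxy for how complete a lineage is. A long family name followed by NA;NA can be longer than a lineage that also resolves the genus. A small sketch of that case is below, with an alternative that counts the ranks that are not "NA". The row x and the names by_length, n_resolved and by_depth are made up for illustration and are not from the question.

```r
# Hypothetical row: a long family-only call versus a shorter family + genus call
x <- c(classifier_a = "Peptostreptococcaceae;NA;NA",
       classifier_b = "Bacillaceae;Bacillus;NA")

# Longest string: selects the family-only call, because the family name is long
by_length <- unname(x[which.max(nchar(x))])

# Alternative: count the ranks that are not "NA" and keep the deepest lineage
n_resolved <- vapply(strsplit(x, ";", fixed = TRUE),
                     function(ranks) sum(ranks != "NA"),
                     integer(1))
by_depth <- unname(x[which.max(n_resolved)])

by_length
by_depth
```

On the toy data in the question both criteria pick the same lineages, so the difference only shows up when taxon names vary a lot in length.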

Upvotes: 2
