MagíBC

Reputation: 77

Compare multiple character columns in an R data frame and create a new column based on a condition

I am trying to automate a process in R, if possible, because otherwise I would have to check about 5000 rows by hand.

Below is a toy example that shows what I would like to do.

I have compared 5 methods that classify reads to species.

Consider for example the first 5 cases:

code <- sprintf("sample %d", 1:5)

Specie_methodA <- c("NA", "NA", "NA", "NA", "Escherichia coli")
Specie_methodB <- c("Methanobrevibacter smithii", "NA", "NA", "Blautia faecis", "NA")
Specie_methodC <- c("", "", "", "Blautia faecis", "")
Specie_methodD <- c("NA", "NA", "CAG-41_sp900066215", "NA", "")
Specie_methodE <- c("", "", "", "", "Campylobacter coli")

table <- data.frame(code, Specie_methodA, Specie_methodB, Specie_methodC, Specie_methodD, Specie_methodE)

For each row, I would like to check whether any method reports a species and, if so, write its name in a new column (desired_output in table2, see the code below). If the 5 methods report two or more different species within a row, I want the string "ERROR" instead. If none of the 5 methods detects a species, the output should be "NA".

So, for the table above, I would like to obtain the following output:

desired_output <- c("Methanobrevibacter smithii", "NA", "CAG-41_sp900066215", "Blautia faecis", "ERROR")
table2 <- data.frame(code, Specie_methodA, Specie_methodB, Specie_methodC, Specie_methodD, Specie_methodE, desired_output)

Upvotes: 1

Views: 118

Answers (1)

one

Reputation: 3902

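We can create a user-defined function. The %>% pipe used below comes from dplyr, so load it first:

library(dplyr)

get_desired_output <- function(specie1, specie2, specie3, specie4, specie5) {
  species <- c(specie1, specie2, specie3, specie4, specie5)
  # remove empty strings and "NA" strings, then drop duplicates
  species <- species[!(species %in% c('NA', ''))] %>% unique()
  # no method reported a species
  if (length(species) == 0) {
    return('NA')
  }
  # the methods disagree on the species
  if (length(species) > 1) {
    return('ERROR')
  }
  # exactly one distinct species was reported
  return(species)
}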
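
To see what the filter-then-deduplicate step does before applying it to the whole table, here is a small base R sketch on a single row. The vector row_values is a made-up example, not part of the question's data:

```r
# Hypothetical row: two methods agree, the others report nothing
row_values <- c("NA", "Blautia faecis", "Blautia faecis", "NA", "")

# Keep only real species names, then collapse repeats
found <- unique(row_values[!(row_values %in% c("NA", ""))])

print(found)          # the single species both methods agree on
print(length(found))  # 1, so this row would get the species name
```

A length of 0 corresponds to the "NA" branch of the function, and a length above 1 to the "ERROR" branch.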

If dplyr >= 1.0.0:

output <- table%>%
  mutate(across(Specie_methodA:Specie_methodE, as.character))%>%
  rowwise()%>%
  mutate(desired_output=get_desired_output(Specie_methodA,Specie_methodB,Specie_methodC,Specie_methodD,Specie_methodE))


If dplyr < 1.0.0:

output <- table%>%
  mutate_at(vars(Specie_methodA:Specie_methodE),as.character)%>%
  rowwise()%>%
  mutate(desired_output=get_desired_output(Specie_methodA,Specie_methodB,Specie_methodC,Specie_methodD,Specie_methodE))

> output
Source: local data frame [5 x 7]
Groups: <by row>

# A tibble: 5 x 7
  code     Specie_methodA  Specie_methodB       Specie_methodC Specie_methodD   Specie_methodE   desired_output      
  <fct>    <chr>           <chr>                <chr>          <chr>            <chr>            <chr>               
1 sample ~ NA              Methanobrevibacter ~ ""             NA               ""               Methanobrevibacter ~
2 sample ~ NA              NA                   ""             NA               ""               NA                  
3 sample ~ NA              NA                   ""             CAG-41_sp900066~ ""               CAG-41_sp900066215  
4 sample ~ NA              Blautia faecis       Blautia faecis NA               ""               Blautia faecis      
5 sample ~ Escherichia co~ NA                   ""             ""               Campylobacter c~ ERROR
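
If you would rather not depend on dplyr, the same row-wise logic can probably be written in base R with apply(). This is only a sketch under a few assumptions: classify_row and method_cols are names I made up, the method columns are picked by the "Specie_method" prefix, and I also treat real NA values as "no species" in case the full data contains them (the toy data only has the string "NA").

```r
# Toy data from the question; stringsAsFactors = FALSE keeps plain character columns
table <- data.frame(
  code = sprintf("sample %d", 1:5),
  Specie_methodA = c("NA", "NA", "NA", "NA", "Escherichia coli"),
  Specie_methodB = c("Methanobrevibacter smithii", "NA", "NA", "Blautia faecis", "NA"),
  Specie_methodC = c("", "", "", "Blautia faecis", ""),
  Specie_methodD = c("NA", "NA", "CAG-41_sp900066215", "NA", ""),
  Specie_methodE = c("", "", "", "", "Campylobacter coli"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: classify one row given as a character vector
classify_row <- function(species) {
  # drop real NA values (assumption), "NA" strings and empty strings, then deduplicate
  found <- unique(species[!is.na(species) & !(species %in% c("NA", ""))])
  if (length(found) == 0) return("NA")    # nothing detected
  if (length(found) > 1) return("ERROR")  # methods disagree
  found                                   # exactly one distinct species
}

# Select the method columns by name prefix and apply the helper to each row
method_cols <- grep("^Specie_method", names(table), value = TRUE)
table$desired_output <- apply(table[method_cols], 1, classify_row)

print(table$desired_output)
```

Selecting the columns by prefix means the helper does not need one argument per method, so adding a sixth method column would not require changing the function.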

Upvotes: 1
