Ömer Coskun

Reputation: 49

Replacing a text pattern with a new text from adjacent column using a loop

I have a problem with my dataset. Here is my data:

df <- data.frame(OTU=c(1,2,3),
             Domain=c("Bacteria", "Bacteria", "Archaea"),
             Phylum= c("Atribacteria", "Proteobacteria", "uncultured archaea"),
             Class =c("JS1", "uncultured bacterium", "uncultured archaea"),
             Order=c("uncultured bacterium", "uncultured", 
                     "Ambiguous_taxa"),
             Family=c("uncultured bacterium", "uncultured", 
                      "Ambiguous_taxa"), stringsAsFactors = FALSE)
df


  OTU   Domain             Phylum                Class                Order               Family
1   1 Bacteria       Atribacteria                  JS1 uncultured bacterium uncultured bacterium
2   2 Bacteria     Proteobacteria uncultured bacterium           uncultured           uncultured
3   3  Archaea uncultured archaea   uncultured archaea       Ambiguous_taxa       Ambiguous_taxa

Summary of my question: I would like to replace every value that starts with "uncultured" or "Ambiguous" with "uncultured" followed by the name from the column to its left. If several adjacent columns contain "uncultured" or "Ambiguous", the name should come from the nearest column to the left that holds a specific name. For example, in the third row the Order column is "Ambiguous_taxa", and Class and Phylum are "uncultured archaea", so the row should take its name from Domain, the first column to the left without "uncultured" or "Ambiguous" in it. Phylum and every column to its right should therefore become "uncultured Archaea". Here is the output table that I want to see:

  OTU   Domain             Phylum                     Class                     Order                    Family
1   1 Bacteria       Atribacteria                       JS1            uncultured JS1            uncultured JS1
2   2 Bacteria     Proteobacteria uncultured Proteobacteria uncultured Proteobacteria uncultured Proteobacteria
3   3  Archaea uncultured Archaea        uncultured Archaea        uncultured Archaea        uncultured Archaea 

I have tried to do that in a for loop but failed. I am getting warnings and nothing changes. I am fairly new to R. What I am trying to do is find the "uncultured" pattern in the Order column using grep and, for every match, replace it with, for example, "uncultured JS1" using the paste function.

Changing_uncultured <- function(DATA){
  for(i in 1:length(DATA$Order)) {
    if(grep("uncultured", DATA$Order)) {
      DATA$Order[i] <- paste("uncultured", DATA$Class[i])
    }
  }
}
Changing_uncultured(DATA=df)

Thanks in advance. Sorry for the edits; it is my fault that I did not consider that the uncultured names can start in any column. The question now reflects the actual data.

Upvotes: 2

Views: 81

Answers (3)

GKi

Reputation: 39747

In base R you can use grepl with sapply to test every column for a match with ^uncultured|^Ambiguous. apply with any then gives the rows that have at least one hit, and apply with which.max gives the first matching column in each row. Then you simply overwrite that column and everything to its right:

df <- data.frame(OTU=c(1,2,3),
             Domain=c("Bacteria", "Bacteria", "Archaea"),
             Phylum= c("Atribacteria", "Proteobacteria", "uncultured archaea"),
             Class =c("JS1", "uncultured bacterium", "uncultured archaea"),
             Order=c("uncultured bacterium", "uncultured", 
                     "Ambiguous_taxa"),
             Family=c("uncultured bacterium", "uncultured", 
                      "Ambiguous_taxa"), stringsAsFactors = FALSE)

t1 <- sapply(df, grepl, pattern="^uncultured|^Ambiguous")
t2 <- apply(t1, 1, any)
t3 <- apply(t1, 1, which.max)

for(i in seq_len(nrow(df))) {
  if(t2[i]) {df[i, t3[i]:ncol(df)]  <- paste("uncultured", df[i, t3[i]-1])}
}
df
#  OTU   Domain             Phylum                     Class                     Order                    Family
#1   1 Bacteria       Atribacteria                       JS1            uncultured JS1            uncultured JS1
#2   2 Bacteria     Proteobacteria uncultured Proteobacteria uncultured Proteobacteria uncultured Proteobacteria
#3   3  Archaea uncultured Archaea        uncultured Archaea        uncultured Archaea        uncultured Archaea
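
The same logic can be wrapped in a function so it can be reused on other tables. This is only a sketch: the name fill_unclassified and its pattern argument are made up for illustration, and it assumes the leftmost column never matches the pattern (rows where it does are left unchanged).

```r
# Hypothetical helper, not part of any package.
fill_unclassified <- function(data, pattern = "^uncultured|^Ambiguous") {
  # Logical matrix: TRUE where a cell starts with "uncultured" or "Ambiguous".
  # cbind() keeps the matrix shape even for a one-row data frame.
  hit <- do.call(cbind, lapply(data, grepl, pattern = pattern))
  for (i in which(rowSums(hit) > 0)) {
    first <- which(hit[i, ])[1]            # first flagged column in row i
    if (first > 1) {                       # need a column to the left
      data[i, first:ncol(data)] <- paste("uncultured", data[i, first - 1])
    }
  }
  data                                     # return the modified copy
}

df <- data.frame(OTU = c(1, 2, 3),
                 Domain = c("Bacteria", "Bacteria", "Archaea"),
                 Phylum = c("Atribacteria", "Proteobacteria", "uncultured archaea"),
                 Class = c("JS1", "uncultured bacterium", "uncultured archaea"),
                 Order = c("uncultured bacterium", "uncultured", "Ambiguous_taxa"),
                 Family = c("uncultured bacterium", "uncultured", "Ambiguous_taxa"),
                 stringsAsFactors = FALSE)

res <- fill_unclassified(df)
res
```

On the question's data this should give the same table as the loop above, and df itself is left untouched because the function works on a copy.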

Answer before the Question-update:

df <- data.frame(OTU=c(1,2,3),
             Domain=c("Bacteria", "Bacteria", "Bacteria"),
             Phylum= c("Atribacteria", "Proteobacteria", "Y"),
             Class =c("JS1", "X", "JS2"),
             Order=c("uncultured bacterium", "uncultured", 
                      "Ambiguous_taxa"),
             Family=c("uncultured bacterium", "uncultured", 
                      "Ambiguous_taxa"), stringsAsFactors = FALSE)

tt  <- apply(sapply(df[,c("Order", "Family")], grepl, pattern="^uncultured|^Ambiguous"), 1, any) #get rows to replace
df[tt, c("Order", "Family")] <- paste("uncultured", df$Class[tt])
df
#  OTU   Domain         Phylum Class          Order         Family
#1   1 Bacteria   Atribacteria   JS1 uncultured JS1 uncultured JS1
#2   2 Bacteria Proteobacteria     X   uncultured X   uncultured X
#3   3 Bacteria              Y   JS2 uncultured JS2 uncultured JS2
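
As for why the loop in the question changes nothing: grep returns the positions of all matches in the whole column, so if() receives a vector instead of a single TRUE/FALSE (a warning in older R versions, an error since R 4.2.0), and the function never returns the modified data frame, so the caller's df stays as it was. A minimal corrected sketch, using the data from this earlier version of the question; the name change_uncultured is only illustrative:

```r
change_uncultured <- function(data) {
  for (i in seq_len(nrow(data))) {
    # grepl() on a single element yields one TRUE/FALSE, which if() requires
    if (grepl("^uncultured|^Ambiguous", data$Order[i])) {
      data$Order[i] <- paste("uncultured", data$Class[i])
    }
  }
  data  # return the modified copy; R does not change the caller's object
}

df <- data.frame(OTU = c(1, 2, 3),
                 Domain = c("Bacteria", "Bacteria", "Bacteria"),
                 Phylum = c("Atribacteria", "Proteobacteria", "Y"),
                 Class = c("JS1", "X", "JS2"),
                 Order = c("uncultured bacterium", "uncultured", "Ambiguous_taxa"),
                 Family = c("uncultured bacterium", "uncultured", "Ambiguous_taxa"),
                 stringsAsFactors = FALSE)

# The result has to be assigned back, otherwise it is discarded
df <- change_uncultured(df)
df$Order
```

This only handles the Order column, as the original loop did; Family would need the same treatment.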

Upvotes: 1

akrun

Reputation: 887981

We can use tidyverse options

library(tidyverse)
df %>% 
  mutate_at(vars(Order, Family), 
          ~ case_when(str_detect(., '^uncultured|^Ambiguous') ~ str_c(
                   'uncultured ', Class), TRUE ~ .))
#  OTU   Domain         Phylum Class          Order         Family
#1   1 Bacteria   Atribacteria   JS1 uncultured JS1 uncultured JS1
#2   2 Bacteria Proteobacteria     X   uncultured X   uncultured X
#3   3 Bacteria              Y   JS2 uncultured JS2 uncultured JS2

data

df <- data.frame(OTU=c(1,2,3),
             Domain=c("Bacteria", "Bacteria", "Bacteria"),
             Phylum= c("Atribacteria", "Proteobacteria", "Y"),
             Class =c("JS1", "X", "JS2"),
             Order=c("uncultured bacterium", "uncultured", 
                     "Ambiguous_taxa"),
             Family=c("uncultured bacterium", "uncultured", 
                      "Ambiguous_taxa"), stringsAsFactors = FALSE)

Upvotes: 1

Ronak Shah

Reputation: 389335

We can use lapply to loop over multiple columns, selected either by index or by name. In each column, find the values that start with "uncultured" or "Ambiguous" and replace them with "uncultured" followed by the Class value from the same row.

cols <- 5:6
#Or
#cols <- c("Order", "Family")

df[cols] <- lapply(df[cols], function(x) {
    inds <- grep("^uncultured|^Ambiguous", x)  
    x[inds] <- paste0("uncultured ", df$Class[inds])
    x
})

df
#  OTU   Domain         Phylum Class          Order         Family
#1   1 Bacteria   Atribacteria   JS1 uncultured JS1 uncultured JS1
#2   2 Bacteria Proteobacteria     X   uncultured X   uncultured X
#3   3 Bacteria              Y   JS2 uncultured JS2 uncultured JS2
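
To see what happens inside the anonymous function, it may help to run the two steps on a single vector. The vectors below are a small made-up example (the fourth, already classified entry is added purely for illustration and is not in the question's data):

```r
ord <- c("uncultured bacterium", "uncultured", "Ambiguous_taxa", "Clostridiales")
cls <- c("JS1", "X", "JS2", "Clostridia")

# grep() returns the positions of the matching elements, not TRUE/FALSE
inds <- grep("^uncultured|^Ambiguous", ord)
inds

# The same positions index both vectors, so each replacement uses the
# class value from the same row; non-matching entries are left alone
ord[inds] <- paste0("uncultured ", cls[inds])
ord
```

Because of the ^ anchors, only values that begin with one of the two words are replaced.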

data

df <- data.frame(OTU=c(1,2,3),
             Domain=c("Bacteria", "Bacteria", "Bacteria"),
             Phylum= c("Atribacteria", "Proteobacteria", "Y"),
             Class =c("JS1", "X", "JS2"),
             Order=c("uncultured bacterium", "uncultured", 
                     "Ambiguous_taxa"),
             Family=c("uncultured bacterium", "uncultured", 
                      "Ambiguous_taxa"), stringsAsFactors = FALSE)

Upvotes: 2
