Loay Jabre

Reputation: 41

Extract string up to a different word in each row - R

I have a data frame with a column of words. I also have a separate character vector of strings (not the same length as the data frame). I'd like to add a new column that matches each word to a string containing it, keeping only the part of the string up to and including that word.

For example, I have this table:

words
apple
plant
banana
animal
fly
ecoli

and these strings of words:

stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana", "eukaryote;animal;dog", "eukaryote;plant;orange", "eukaryote;animal;cat", "eukaryote;insect;fly", "prokaryote;bacterium;ecoli")

and I'd like to get this:

words new_words
apple eukaryote;plant;apple
plant eukaryote;plant
banana eukaryote;plant;banana
animal eukaryote;animal
fly eukaryote;insect;fly
ecoli prokaryote;bacterium;ecoli

I've tried something along the lines of:

df <- data.frame(words = c("apple", "plant", "banana", "animal", "fly", "ecoli"))
df$new_words <- sub(df$words, "", stringlist)

Upvotes: 1

Views: 480

Answers (1)

akrun

Reputation: 887048

Loop over the 'words' column. For each word, get the first matching 'stringlist' value with grep, then use sub to capture everything up to and including the word and replace the whole string with the backreference (\\1) to that captured group.

df$new_words <- sapply(df$words, function(x)
    sub(sprintf("(.*%s).*", x), "\\1",
        grep(x, stringlist, value = TRUE)[1]))

-output

> df
   words                  new_words
1  apple      eukaryote;plant;apple
2  plant            eukaryote;plant
3 banana     eukaryote;plant;banana
4 animal           eukaryote;animal
5    fly       eukaryote;insect;fly
6  ecoli prokaryote;bacterium;ecoli
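
To see what each piece does, here is a sketch of the same steps unrolled for a single word. The variable names `word`, `hit`, `pattern`, and `trimmed` are only illustrative and do not appear in the one-liner above.

```r
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
                "eukaryote;animal;dog", "eukaryote;plant;orange",
                "eukaryote;animal;cat", "eukaryote;insect;fly",
                "prokaryote;bacterium;ecoli")

word <- "plant"

# Step 1: keep the strings that contain the word, then take the first one
hit <- grep(word, stringlist, value = TRUE)[1]

# Step 2: build the regex "(.*plant).*"; the parentheses capture
# everything up to and including the word
pattern <- sprintf("(.*%s).*", word)

# Step 3: replace the whole string with the captured group
trimmed <- sub(pattern, "\\1", hit)

print(hit)      # "eukaryote;plant;apple"
print(trimmed)  # "eukaryote;plant"
```

Because `[1]` takes only the first match, the result for a word that occurs in several strings depends on the order of `stringlist`.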

data

df <- structure(list(words = c("apple", "plant", "banana", "animal", 
"fly", "ecoli")), class = "data.frame", row.names = c(NA, -6L
))

stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
"eukaryote;animal;dog", "eukaryote;plant;orange",
"eukaryote;animal;cat", "eukaryote;insect;fly",
"prokaryote;bacterium;ecoli")
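
The grep/sub approach treats each word as a regular expression and also matches partial fields (a word such as "plant" would match a field such as "plantain"). If that matters for your data, one possible variant, assuming the fields are always separated by ";", is to split each string and compare whole fields. The helper name `trim_to_word` is made up for this sketch.

```r
df <- data.frame(words = c("apple", "plant", "banana", "animal",
                           "fly", "ecoli"),
                 stringsAsFactors = FALSE)

stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
                "eukaryote;animal;dog", "eukaryote;plant;orange",
                "eukaryote;animal;cat", "eukaryote;insect;fly",
                "prokaryote;bacterium;ecoli")

# Hypothetical helper: return the first string truncated at `word`,
# comparing whole ";"-separated fields; NA if no string has the word
trim_to_word <- function(word, strings, sep = ";") {
  for (s in strings) {
    parts <- strsplit(s, sep, fixed = TRUE)[[1]]
    pos <- match(word, parts)  # position of the exact field, or NA
    if (!is.na(pos)) {
      return(paste(parts[seq_len(pos)], collapse = sep))
    }
  }
  NA_character_
}

df$new_words <- vapply(df$words, trim_to_word, character(1),
                       strings = stringlist, USE.NAMES = FALSE)
print(df)
```

Words with no matching field get NA in the new column rather than raising an error.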

Upvotes: 1
