Jamie_R

Reputation: 87

How to index and gsub a string within a dataframe using regex in R

I am working on a text-cleaning pipeline where I want to apply a list of target words and their corresponding replacement words, stored in a dataframe, to a given string (e.g., goats):

goats <- c("goats like apples applesauce. goats like bananas bananasplits. goats like cheese cheesecake.")

I am using a for loop to run down the list of targets and gsub with their corresponding replacements in the specified text (goats). I want the substitution to only catch exact string matches (e.g., banana but not bananasplit). Here's the loop:

goatclean <- goats
for (i in seq_along(swap$target)) {
    goatclean <- gsub(swap$target[i], swap$replace[i], goatclean)
}
print(goatclean)

The output of this loop is: "goats like mary maryauce. goats like linda lindaplits. goats like jane janecake."

I cannot figure out how to use regex so that gsub replaces a target from the dataframe (e.g., 'apples') only when it appears as an isolated word. I get errors when I add \\s+ around the target:

gsub(\\s+(swap$target[i])\\s+, swap$replace[i], goatclean)

Any advice on how to get the following output? "goats like mary applesauce. goats like linda bananasplits. goats like jane cheesecake."

Thanks everyone!

Upvotes: 0

Views: 180

Answers (1)

Ronak Shah

Reputation: 388907

Try using word boundaries (\\b) around the pattern, and build the pattern as a string with paste0:

for (i in seq_along(swap$target)) {
  goatclean <- gsub(paste0('\\b', swap$target[i], '\\b'), swap$replace[i], goatclean)
}
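For reference, here is a self-contained sketch of the same loop. The question does not show the swap dataframe, so the one below is an assumption reconstructed from the outputs quoted in the question (apples -> mary, bananas -> linda, cheese -> jane):

```r
# Input string from the question
goats <- c("goats like apples applesauce. goats like bananas bananasplits. goats like cheese cheesecake.")

# ASSUMPTION: the real swap dataframe is not shown in the question;
# these values are inferred from the outputs quoted there.
swap <- data.frame(
  target  = c("apples", "bananas", "cheese"),
  replace = c("mary", "linda", "jane"),
  stringsAsFactors = FALSE
)

goatclean <- goats
for (i in seq_along(swap$target)) {
  # \\b matches a word boundary, so "apples" matches but "applesauce" does not
  pattern <- paste0("\\b", swap$target[i], "\\b")
  goatclean <- gsub(pattern, swap$replace[i], goatclean)
}
print(goatclean)
```

The attempt in the question errors because the pattern is not a quoted string; gsub needs a character pattern, which is why the pieces are pasted together here. Note that \\b treats each target as a regex, so targets containing characters such as "." or "(" would likely need escaping first.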

Upvotes: 1
