Reputation: 87
I am working on a text-cleaning pipeline in which I want to apply a list of target words and their corresponding replacements, stored in a data frame (swap), to a given string (e.g., goats):
goats <- c("goats like apples applesauce. goats like bananas bananasplits. goats like cheese cheesecake.")
I am using a for loop to run down the list of targets and gsub each one with its corresponding replacement in the specified text (goats). I want the substitution to catch only whole-word matches (e.g., bananas but not bananasplits). Here's the loop:
goatclean <- goats
for (i in seq_along(swap$target)) {
goatclean <- gsub(swap$target[i], swap$replace[i], goatclean)
}
print(goatclean)
The output of this loop is: "goats like mary maryauce. goats like linda lindaplits. goats like jane janecake."
I cannot figure out how to make gsub replace a target from the data frame (e.g., 'apples') only when it appears as an isolated word. I get errors when I add \s+ around the target like this:
gsub(\\s+(swap$target[i])\\s+, swap$replace[i], goatclean)
Any advice on how to get the following output? "goats like mary applesauce. goats like linda bananasplits. goats like jane cheesecake."
Thanks everyone!
Upvotes: 0
Views: 180
Reputation: 388907
Try using word boundaries (\\b) around the pattern:
for (i in seq_along(swap$target)) {
goatclean <- gsub(paste0('\\b', swap$target[i], '\\b'), swap$replace[i], goatclean)
}
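For a self-contained check, here is a minimal sketch of the same loop. The question does not show the swap data frame, so the one below is an assumption reconstructed from the output reported there (apples to mary, bananas to linda, cheese to jane).

```r
# Hypothetical lookup table: `swap` is not shown in the question, so these
# rows are inferred from the output it reports.
swap <- data.frame(
  target  = c("apples", "bananas", "cheese"),
  replace = c("mary", "linda", "jane"),
  stringsAsFactors = FALSE
)

goats <- "goats like apples applesauce. goats like bananas bananasplits. goats like cheese cheesecake."

goatclean <- goats
for (i in seq_along(swap$target)) {
  # \\b matches only at a word boundary, so "apples" inside "applesauce" is skipped
  pattern <- paste0("\\b", swap$target[i], "\\b")
  goatclean <- gsub(pattern, swap$replace[i], goatclean)
}
print(goatclean)
# [1] "goats like mary applesauce. goats like linda bananasplits. goats like jane cheesecake."
```

Unlike \\s+, the \\b anchor does not consume any characters, so the spaces and punctuation around each word are left intact. If a target could contain regex metacharacters (such as . or +), it would likely need escaping before being pasted into the pattern.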
Upvotes: 1