String Editing in R - Taking Out Repetition

Question

I'm working with some character data in R, and some strings have a repeated group such as (foo)(foo) in the middle. Is there any way to automatically find those repetitions and remove them (leaving a single (foo) in the same position)?

I'm wondering if a possible solution is to use strsplit by ), and check if there is any equivalency, and then just reappend the ) back. Would this work?

Ex. string: "abc def (foo)(foo) abc def"

Itamar · Accepted Answer

You could use a Perl-style regular expression substitution within R, as in the following example:

test <- "abc def (foo)(foo) abc def"
gsub('(\\(\\w+\\))\\1','\\1',test,perl=TRUE)
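To illustrate how the pattern works: `(\(\w+\))` captures one parenthesised word, and the backreference `\1` matches an immediate copy of it, so the pair is replaced by a single occurrence. The sketch below is only an illustration, not part of the original answer. It wraps the same idea in a hypothetical helper, `collapse_repeats`, and assumes you may also want to collapse runs of three or more copies, hence `\1+` instead of `\1`.

```r
# Hypothetical helper (name and "+" quantifier are assumptions, not from the answer):
# collapse any run of an immediately repeated "(word)" group into a single copy.
collapse_repeats <- function(x) {
  # (\(\w+\)) captures one "(word)"; \1+ matches one or more further copies of it.
  gsub("(\\(\\w+\\))\\1+", "\\1", x, perl = TRUE)
}

samples <- c(
  "abc def (foo)(foo) abc def",
  "x (bar)(bar)(bar) y",
  "no repeats (foo)(bar) here"
)

print(collapse_repeats(samples))
# [1] "abc def (foo) abc def"      "x (bar) y"
# [3] "no repeats (foo)(bar) here"
```

Because `gsub` is vectorised, the helper can be applied directly to a whole character column. Note that `\w+` only matches letters, digits and underscores, so groups containing spaces or punctuation would need a different character class.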

Alternatively, you can run a Perl one-liner to clean the data beforehand:

echo "abc def (foo)(foo) abc def" | perl -ne 's/(\(\w+\))\1/$1/gi;print'
