Jon

Reputation: 455

Replace dynamic sequences in text string using a vector in R

I'm trying to process a large text field to replace specific two-part text combinations in long strings.

The components I'm searching for have the format Parent.Subtype1 _ Parent.Subtype2 (edited, as the original example was too specific). Whenever this exact order occurs, and for every occurrence, I want to replace it with Parent.Subtype2.

Roughly as in this pseudocode:

# text.seq actually loaded from SQL via RODBC
# data frame is loaded, where text.seq is a field of type chr
text.seq <- ''
text.seq[1] <- 'An.i _ Bo.i _ An.i _ An.c _ Cx.i _ Cx.i _ Cx.c'
text.seq[2] <- 'An.i _ Bo.i _ Dz.c'
text.seq[3] <- 'Cx.c _ Cx.i _ An.i _ An.c'

uniques <- unique(unlist(strsplit(text.seq, ' _ ', fixed = TRUE), use.names = FALSE))
uniques <- uniques[grep("\\.i$", uniques)] # Keep only tokens ending in a literal ".i"
uniques <- sub("\\.i$", "", uniques) # Strip the ".i" suffix to get the PARENT name

uniques
# Returns "An" "Bo" "Cx"

# Need help here
# List function applied to text.seq using uniques variable
# replacing sequence "X.i _ X.c" with "X.c" where X is each of the parents in *uniques* in turn

The desired output would be

# text.seq[1] == 'An.i _ Bo.i _ An.c _ Cx.i _ Cx.c'
# text.seq[2] == 'An.i _ Bo.i _ Dz.c'
# text.seq[3] == 'Cx.c _ Cx.i _ An.c'

I feel I could achieve this with a loop function over each element of the uniques variable, but I'd much rather use an apply function as it feels as though it'd be faster and 'best practice'.

I'd appreciate if anyone can help me structure this apply function as I'm still new to R and struggle with the composition of these.

Thanks

Upvotes: 0

Views: 94

Answers (1)

Andrew Gustar

Reputation: 18425

You can do this with gsub. The regex captures a run of letters followed by ".i _ ", then uses the back-reference \\1 to require the same letters followed by ".c", and replaces the whole match with those letters followed by ".c". Because gsub is vectorised over text.seq, no loop or apply call is needed.

ts <- gsub("([A-Za-z]+)\\.i\\s_\\s\\1\\.c", "\\1.c", text.seq)

ts
[1] "An.i _ Bo.i _ An.c _ Cx.i _ Cx.c" "An.i _ Bo.i _ Dz.c" "Cx.c _ Cx.i _ An.c" 
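If you also want to restrict the replacement to the parents held in a vector (as the question title asks), here is a minimal sketch under a few assumptions that are mine rather than part of the answer above: the parent names contain only letters, so they need no regex escaping; `perl = TRUE` is used for the back-reference; and `\\b` is added so that a parent only matches as a whole token. The names `parents`, `pattern`, `collapsed_any` and `collapsed_listed` are purely illustrative.

```r
# Sample data copied from the question
text.seq <- c(
  "An.i _ Bo.i _ An.i _ An.c _ Cx.i _ Cx.i _ Cx.c",
  "An.i _ Bo.i _ Dz.c",
  "Cx.c _ Cx.i _ An.i _ An.c"
)

# Variant 1: any parent made of letters. \\b anchors the match at token
# edges, so "An" cannot match as the tail of a longer name.
collapsed_any <- gsub("\\b([A-Za-z]+)\\.i _ \\1\\.c\\b", "\\1.c",
                      text.seq, perl = TRUE)

# Variant 2: only parents that appear with a ".i" subtype somewhere.
tokens  <- unique(unlist(strsplit(text.seq, " _ ", fixed = TRUE)))
parents <- sub("\\.i$", "", grep("\\.i$", tokens, value = TRUE))

# Build one alternation, e.g. "\\b(An|Bo|Cx)\\.i _ \\1\\.c\\b".
# Assumption: parent names hold no regex metacharacters.
pattern <- paste0("\\b(", paste(parents, collapse = "|"), ")\\.i _ \\1\\.c\\b")
collapsed_listed <- gsub(pattern, "\\1.c", text.seq, perl = TRUE)

print(collapsed_any)
print(collapsed_listed)
```

Both variants stay fully vectorised: a single gsub call handles every element of text.seq, so there is nothing to gain from wrapping it in sapply or a loop over the parents.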

Upvotes: 1
