Ankhnesmerira

Reputation: 1430

Replace the whole word that starts with a pattern using gsub in R

I'm having trouble with a problem that should be simple to solve. I'd like to replace every whole word in a string that starts with a given pattern.

> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't."

## this is what I want
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."

The best I've come up with so far is this:

# this is what I get, but it's not correct
> gsub("\\<wasn*.\\>", "wasn't", test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't't aware. just wasn't't."

I'm really running out of ideas. I would also be happy with

# second desired output, without the . at the end
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"

Edit: It seems my question was a bit too specific, so I'm adding other test cases. Basically, I don't know what character(s) will follow "wasn", and I would like to convert all of them to "wasn't".

> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"
> test
[1] "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"

#desired output
> output
 [1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"

Upvotes: 2

Views: 6733

Answers (3)

s_baldur

Reputation: 33488

Why not keep it simple and replace any word that starts with wasn with wasn't?

test2 <- paste0(
  "i really wasn aware and i wasnt aware at all. but i wasn't aware. just ",
  "wasn't. this wasn45'e meant to be. it wasn@'re simple"
)
gsub("wasn[^ ]*", "wasn't", test2)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't this wasn't meant to be. it wasn't simple"

If you also need to handle upper case, just add ignore.case = TRUE to gsub().
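
As a small illustration of that option, here is a sketch on a made-up mixed-case string (the string x is an assumption for the example, not taken from the question). Note that the replacement is a literal, so the original capitalization is not preserved:

```r
# Hypothetical input with different capitalizations of "wasn..."
x <- "Wasn aware. WASNT sure. wasn't there"

# Same pattern as above, now matching regardless of case
res <- gsub("wasn[^ ]*", "wasn't", x, ignore.case = TRUE)
print(res)
# [1] "wasn't aware. wasn't sure. wasn't there"
```

If the capitalization of the first letter matters, capturing the matched prefix and reusing it in the replacement is one way to keep it.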

Upvotes: 0

Wiktor Stribiżew

Reputation: 626835

I suggest a solution like this:

test <- c("i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple", "Wasn&^$tt that nice?", "You say wasnmmmt?", "No, he wasn&#t#@$.", "She wasn%#@t##, I know.")
gsub("\\b(wasn)\\S*\\b(?:\\S*(\\p{P})\\B)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
[2] "Wasn't that nice?"                                                                                                          
[3] "You say wasn't?"                                                                                                            
[4] "No, he wasn't."                                                                                                             
[5] "She wasn't, I know." 

See an online R demo.

This solution accounts for cases when wasn* appears at the start of the string or is capitalized, and does not replace the trailing punctuation.

Pattern details

  • \\b - a word boundary
  • (wasn) - Capturing group 1 (later referred to with \\1 in the replacement pattern): a wasn substring (case insensitive due to ignore.case=TRUE)
  • \\S*\\b - any 0+ chars other than whitespace followed with a word boundary
  • (?:\\S*(\\p{P})\\B)? - an optional non-capturing group, matching 1 or 0 occurrences of
    • \\S* - 0+ non-whitespace chars
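    • (\\p{P}) - Capturing group 2 (later referred to with \\2 in the replacement pattern): any single punctuation character (not a symbol! \p{P} is not equal to [:punct:]!) that is not followed by...
    • \\B - ...a letter, digit or _ (it is a non-word boundary pattern, so after a punctuation character it only matches before another non-word character or at the end of the string).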
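
To make the remark about \p{P} versus [:punct:] concrete, here is a minimal sketch that tests a few single characters against both classes with perl=TRUE. The character vector chars is chosen for illustration only:

```r
# A few ASCII characters: the first two are Unicode punctuation,
# the last three are Unicode symbols
chars <- c(",", "@", "$", "^", "+")

# \p{P} matches only characters in the Unicode punctuation category
is_p <- grepl("\\p{P}", chars, perl = TRUE)

# The POSIX class [[:punct:]] also covers the ASCII symbol characters
is_punct <- grepl("[[:punct:]]", chars, perl = TRUE)

print(data.frame(chars, is_p, is_punct))
```

This is why a word followed by something like $ or ^ is not treated as having trailing punctuation by the pattern above.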

For even messier strings (like She wasn%#@t##,$#^ I know.), when the punctuation can be inside other punctuation symbols, you may restrict the punctuation you want to stop at using a custom bracket expression and adding a \S* at the end:

gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)

See the regex demo.
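
As a quick check of the restricted pattern, here is a sketch that applies it to the messier example string mentioned above, plus one of the earlier test strings (the variable name messy is just for this example):

```r
# The messier string from the text, plus one earlier test case
messy <- c("She wasn%#@t##,$#^ I know.", "No, he wasn&#t#@$.")

# Stop only at the listed punctuation marks and drop everything else
# that is glued to the word
fixed <- gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2",
              messy, ignore.case = TRUE, perl = TRUE)
print(fixed)
# [1] "She wasn't, I know." "No, he wasn't."
```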

Upvotes: 1

Onyambu

Reputation: 79228

You can use a negative lookahead, which is available with perl = TRUE. The pattern is wasn(?!')t*

gsub("wasn(?!')t*", "wasn't", test, perl = TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."
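
To see what the lookahead actually matches, here is a small sketch that extracts the matches instead of replacing them, using the original test string from the question. The already-correct "wasn't" occurrences are skipped because "wasn" is followed by an apostrophe there:

```r
test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't."

# List every substring the lookahead pattern matches
hits <- regmatches(test, gregexpr("wasn(?!')t*", test, perl = TRUE))[[1]]
print(hits)
# [1] "wasn"  "wasnt"
```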

or you can do:

gsub("wasn'*t*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."

For the second desired output:

gsub("wasn'*t*[.]?","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"

AFTER THE EDIT:

gsub("wasn[^. ]*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"

Upvotes: 2
