Reputation: 1430
I'm having trouble with a problem that should be simple to solve. I'd like to replace every whole word in a string that starts with a given pattern.
> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't."
## this is what i want
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."
The best I've come up with so far is this:
# this is what I get, but it's not correct
> gsub("\\<wasn*.\\>", "wasn't", test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't't aware. just wasn't't."
I'm really running out of ideas. I would also be happy with
# second desired output without the . at the end
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"
Edit: it seems my question was a bit too specific, so I'm adding other test cases. Basically, I don't know in advance which character(s) will follow "wasn", and I would like to convert every such word to wasn't.
> test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"
> test
[1] "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"
#desired output
> output
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
Upvotes: 2
Views: 6733
Reputation: 33488
Why not keep it simple and replace any word that starts with wasn with wasn't?
test2 <- paste0(
# paste0() adds no separator, so the first piece keeps a trailing space
"i really wasn aware and i wasnt aware at all. but i wasn't aware. just ",
"wasn't. this wasn45'e meant to be. it wasn@'re simple"
)
gsub("wasn[^ ]*", "wasn't", test2)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't this wasn't meant to be. it wasn't simple"
If you also need to handle upper case, you could just add ignore.case = TRUE to gsub().
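As a minimal sketch of the ignore.case suggestion: with a literal replacement the capital letter of a match such as "Wasnt" would be lowercased, so the variant below captures the matched prefix and reuses it through \\1. The sample strings and the names x and out are made up for illustration; only gsub() and its ignore.case argument come from the answer.

```r
# Illustrative input (not from the question): one capitalised case, one messy case
x <- c("Wasnt that nice?", "it wasn@'re simple")

# Capture the matched "wasn"/"Wasn" prefix so its original case survives;
# [^ ]* then consumes everything up to the next space
out <- gsub("(wasn)[^ ]*", "\\1't", x, ignore.case = TRUE)
print(out)
```

Note that [^ ]* also swallows any punctuation attached to the word, which is why the full stop after "wasn't" disappears in the output shown above.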
Upvotes: 0
Reputation: 626835
I suggest a solution like this:
test <- c("i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple", "Wasn&^$tt that nice?", "You say wasnmmmt?", "No, he wasn&#t#@$.", "She wasn%#@t##, I know.")
gsub("\\b(wasn)\\S*\\b(?:\\S*(\\p{P})\\B)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
[2] "Wasn't that nice?"
[3] "You say wasn't?"
[4] "No, he wasn't."
[5] "She wasn't, I know."
See an online R demo.
This solution accounts for cases when wasn* appears at the start of the string or is capitalized, and does not replace the trailing punctuation.
Pattern details
\\b - a word boundary
(wasn) - Capturing group 1 (later referred to with \\1 in the replacement pattern): a wasn substring (case insensitive due to ignore.case=TRUE)
\\S*\\b - any 0+ chars other than whitespace followed with a word boundary
(?:\\S*(\\p{P})\\B)? - an optional non-capturing group, matching 1 or 0 occurrences of
  \\S* - 0+ non-whitespace chars
  (\\p{P}) - Capturing group 2 (later referred to with \\2 in the replacement pattern): any 1 punctuation char (not a symbol! \p{P} is not equal to [:punct:]!) not followed with...
  \\B - ...a letter, digit or _ (it is a non-word boundary pattern).
For even messier strings (like She wasn%#@t##,$#^ I know.), when the punctuation can be inside other punctuation symbols, you may restrict the punctuation you want to stop at using a custom bracket expression and adding a \S* at the end:
gsub("\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?", "\\1't\\2", test, ignore.case=TRUE, perl=TRUE)
See the regex demo.
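To make the bracket-expression variant concrete, here is a small sketch that applies it to the messier string mentioned above. The variable names messy, pattern and cleaned are only for illustration; the pattern and the gsub() arguments are the ones from this answer.

```r
# The "even messier" example string from the explanation above
messy <- "She wasn%#@t##,$#^ I know."

# Group 1 keeps "wasn"; group 2 keeps the first punctuation mark from the
# allowed set [?!.,:;] found in the trailing junk; the final \S* eats the rest
pattern <- "\\b(wasn)\\S*\\b(?:\\S*([?!.,:;])\\S*)?"

cleaned <- gsub(pattern, "\\1't\\2", messy, ignore.case = TRUE, perl = TRUE)
print(cleaned)
# [1] "She wasn't, I know."
```

Here the first \S*\b stops after the t in "%#@t", and the optional group then skips "##", captures the comma, and discards "$#^".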
Upvotes: 1
Reputation: 79228
You can use a negative lookahead, which requires perl = TRUE. The pattern is wasn(?!')t*
gsub("wasn(?!')t*", "wasn't", test, perl = TRUE)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."
or you can do:
gsub("wasn'*t*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't."
For the second desired output:
gsub("wasn'*t*[.]?","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't"
AFTER THE EDIT:
gsub("wasn[^. ]*","wasn't",test)
[1] "i really wasn't aware and i wasn't aware at all. but i wasn't aware. just wasn't. this wasn't meant to be. it wasn't simple"
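The last pattern can be wrapped in a small function and checked against the desired output from the question's edit. This is only a sketch: fix_wasnt is a hypothetical helper name, and the leading \\b (with perl = TRUE) is an added assumption so that "wasn" in the middle of a longer token is left alone.

```r
# Hypothetical helper, not part of the original answer
fix_wasnt <- function(x) {
  # \\b anchors the match at the start of a word (an addition to the answer's
  # pattern); [^. ]* consumes everything up to the next space or full stop
  gsub("\\bwasn[^. ]*", "wasn't", x, perl = TRUE)
}

test <- "i really wasn aware and i wasnt aware at all. but i wasn't aware. just wasn't. this wasn45'e meant to be. it wasn@'re simple"
fixed <- fix_wasnt(test)
print(fixed)

# With the boundary, a token that merely contains "wasn" is untouched
glued <- fix_wasnt("itwasnt")
print(glued)
```

Because the negated class only stops at a space or a full stop, other trailing punctuation such as a comma would still be consumed by the match.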
Upvotes: 2