Reputation: 443
I have a string that I downloaded from the web:
x = "the company 's newly launched cryptocurrency , Libra , hasn 't been contacted by Facebook , according to a report ."
They parsed the string such that: ...In addition, contracted words like (can't) are separated into two parts (ca n't) and punctuation is separated from words (eye level . As her).
I want to restore the string to its normal form, for example:
x = "the company's newly launched cryptocurrency, Libra, hasn't been contacted by Facebook, according to a report."
How do I trim the space before the punctuation?
I have thought about using str_remove_all with a regex:
str_remove_all(x,"\\s[[:punct:]]'")
but it will also remove the punctuation.
Any ideas?
Upvotes: 1
Views: 73
Reputation: 1081
With back referencing:
x <- "the company 's newly launched cryptocurrency , Libra , hasn 't been contacted by Facebook , according to a report ."
gsub("(\\s+)([[:punct:]])", "\\2", x, perl = TRUE)
# [1] "the company's newly launched cryptocurrency, Libra, hasn't been contacted by Facebook, according to a report."
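Since the whitespace is simply dropped, it does not need its own capture group. A minimal sketch of the same idea with only the punctuation captured (the variable name `fixed` is just for illustration):

```r
x <- "the company 's newly launched cryptocurrency , Libra , hasn 't been contacted by Facebook , according to a report ."

# Match one or more whitespace characters followed by a punctuation character,
# and keep only the captured punctuation (\\1) in the replacement.
fixed <- gsub("\\s+([[:punct:]])", "\\1", x, perl = TRUE)
print(fixed)
# [1] "the company's newly launched cryptocurrency, Libra, hasn't been contacted by Facebook, according to a report."
```

Note that this removes the space before any punctuation character, so opening brackets or opening quotes would be glued to the preceding word as well (for example "a ( b )" becomes "a( b)"). If your text contains those, you may want a narrower character class.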
Upvotes: 2
Reputation: 626871
You may use
str_remove_all(x,"\\s+(?=[[:punct:]])")
str_remove_all(x,"\\s+(?=[\\p{S}\\p{P}])")
Or base R:
gsub("\\s+(?=[\\p{S}\\p{P}])", "", x, perl=TRUE)
See the regex demo.
Details
\s+ - 1 or more whitespace chars
(?=[[:punct:]]) - a positive lookahead that matches a location that is immediately followed with a punctuation character.
Please check "R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?" before choosing the variant with [[:punct:]].
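To see why the choice of character class matters, here is a small base R sketch on a made-up string (`y` and the two result names are only for illustration). It compares a lookahead on `\p{P}` alone with one on `[\p{S}\p{P}]`; a `+` is a symbol (`\p{S}`), not punctuation (`\p{P}`), so only the second variant strips the space before it:

```r
y <- "1 + 2 , 3"

# Lookahead on punctuation only: the space before "," is removed,
# the space before "+" is kept because "+" belongs to \p{S}.
punct_only <- gsub("\\s+(?=\\p{P})", "", y, perl = TRUE)
print(punct_only)
# [1] "1 + 2, 3"

# Lookahead on symbols and punctuation: the space before "+" is removed too.
punct_and_symbols <- gsub("\\s+(?=[\\p{S}\\p{P}])", "", y, perl = TRUE)
print(punct_and_symbols)
# [1] "1+ 2, 3"
```

Because the lookahead only asserts what follows and does not consume it, the punctuation itself stays in the string in both cases.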
Upvotes: 2