Gabriel

Reputation: 443

R regex: trimming whitespace before punctuation in a string

I have a string that I downloaded from the web:

x = "the company 's newly launched cryptocurrency , Libra , hasn 't been contacted by Facebook , according to a report ." 

They parsed the string such that: ...In addition, contracted words like (can't) are separated into two parts (ca n't) and punctuation is separated from words (eye level . As her).

I want to restore the string to normal, for example:

x = "the company's newly launched cryptocurrency, Libra, hasn't been contacted by Facebook, according to a report."

How do I trim the space before the punctuation?

I have thought about using str_remove_all with a regex:

str_remove_all(x,"\\s[[:punct:]]'") 

but it will also remove the punctuation.

Any ideas?

Upvotes: 1

Views: 73

Answers (2)

Eyayaw

Reputation: 1081

With back referencing:

x <- "the company 's newly launched cryptocurrency , Libra , hasn 't been contacted by Facebook , according to a report ."

gsub("(\\s+)([[:punct:]])", "\\2", x, perl = TRUE)

# [1] "the company's newly launched cryptocurrency, Libra, hasn't been contacted by Facebook, according to a report."

Upvotes: 2

Wiktor Stribiżew

Reputation: 626871

You may use

str_remove_all(x,"\\s+(?=[[:punct:]])")
str_remove_all(x,"\\s+(?=[\\p{S}\\p{P}])")

Or base R:

gsub("\\s+(?=[\\p{S}\\p{P}])", "", x, perl=TRUE) 

See the regex demo.

Details

  • \s+ - 1 or more whitespace chars
  • (?=[[:punct:]]) - a positive lookahead that matches a location that is immediately followed with a punctuation character.
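
To make the difference concrete, here is a minimal base R sketch (the variable names consumed and kept are only illustrative). It contrasts a pattern that consumes the punctuation with the lookahead variant, which only checks that punctuation follows and therefore leaves it in place:

```r
x <- "the company 's newly launched cryptocurrency , Libra , hasn 't been contacted by Facebook , according to a report ."

# Without a lookahead, the punctuation character is part of the match,
# so it is deleted together with the whitespace.
consumed <- gsub("\\s+[[:punct:]]", "", x, perl = TRUE)

# With the lookahead, only the whitespace is matched and removed;
# the punctuation is asserted but not consumed.
kept <- gsub("\\s+(?=[[:punct:]])", "", x, perl = TRUE)

print(consumed)
print(kept)
# [1] "the company's newly launched cryptocurrency, Libra, hasn't been contacted by Facebook, according to a report."
```

Note that perl = TRUE is needed here, since lookaheads are not supported by R's default regex engine.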

Please check "R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?" before choosing the variant with [[:punct:]].

Upvotes: 2
