Reputation: 11
I would like to understand what goes wrong when I use str_replace_all
to transform the following string:
abc <- "Good Product ...but it's darken the skin tone..why...?"
Before running sentence tokenization with quanteda, I would like to clean it up into something like this:
abc_new <- "Good Product. But it's darken the skin tone. Why?"
I am using the following regex to enable this:
str_replace_all(abc,"\\.{2,15}[a-z]{1}", paste(".", toupper(str_extract_all(str_extract_all(abc,"\\.{2,15}[a-z]{1}"),"[a-z]{1}")[[1]])[[1]], collapse = " "))
However, this returns the wrong result: "Good Product. Cut it's darken the skin tone. Chy...?"
Can someone suggest a solution for this?
Upvotes: 1
Views: 74
Reputation: 4537
It seems you're trying to match a pattern to remove while keeping part of what that pattern matched. In regular expressions, you can use parentheses ()
to capture a portion of the pattern and reuse it in the replacement.
Consider in your case:
abc <- "Good Product ...but it's darken the skin tone..why...?"
step1 <- gsub(" ?\\.+([a-zA-Z])",". \\U\\1",abc,perl=TRUE)
step1
#> [1] "Good Product. But it's darken the skin tone. Why...?"
The matching expression breaks down as:
" ?" #Optionally match a space (to handle the space after "Good Product")
"\\.+" #Match at least one period
"([a-zA-Z])" #Match one letter and remember it
The replacement pattern breaks down as:
". " #Insert a period followed by a space
"\\U" #Convert what follows to uppercase (needs perl=TRUE)...
"\\1" #...namely whatever was matched in the first set of parentheses
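To see the capture group and the case conversion in isolation, here is a minimal sketch on toy strings that are not from the question. It assumes base R's gsub with perl=TRUE, where \\U uppercases the rest of the replacement and \\E ends the conversion. The variable names are illustrative only.

```r
# \\1 sits before \\U, so it is copied unchanged; only \\2 is uppercased.
capitalised <- gsub("(^|\\. )([a-z])", "\\1\\U\\2", "hello. world", perl = TRUE)
print(capitalised)
#> [1] "Hello. World"

# \\E stops the conversion, so the second word keeps its case.
first_upper <- gsub("(\\w+) (\\w+)", "\\U\\1\\E \\2", "hello world", perl = TRUE)
print(first_upper)
#> [1] "HELLO world"
```

Without perl=TRUE, \\U is not treated as a case-conversion marker, which is why the flag appears in the call above.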
Now, this doesn't fix the ellipsis followed by a question mark. A follow-up match can fix this.
step2 = gsub("\\.+([^\\. ])","\\1",step1)
step2
#> [1] "Good Product. But it's darken the skin tone. Why?"
Here we're matching
\\.+ #at least one period
([^\\. ]) #one character that is not a period or a space and remember it
Replacing with
\\1 #The thing we remembered
So, two steps, two fairly generic regular expressions that should extend to other use cases as well.
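As a rough sketch, the two steps can be wrapped in a small helper so they can be applied to a whole vector of reviews before tokenization. The name normalize_ellipses and the second test string are made up for illustration; the two patterns are the ones shown above.

```r
# Hypothetical helper combining the two gsub() steps from this answer.
normalize_ellipses <- function(x) {
  # Step 1: runs of periods before a letter become ". " plus the uppercased letter.
  step1 <- gsub(" ?\\.+([a-zA-Z])", ". \\U\\1", x, perl = TRUE)
  # Step 2: runs of periods directly before any other non-space character are dropped.
  gsub("\\.+([^\\. ])", "\\1", step1)
}

reviews <- c(
  "Good Product ...but it's darken the skin tone..why...?",
  "Nice....really nice...!"
)
cleaned <- normalize_ellipses(reviews)
print(cleaned)
#> [1] "Good Product. But it's darken the skin tone. Why?"
#> [2] "Nice. Really nice!"
```

Because gsub is vectorized over its input, no explicit loop is needed.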
Upvotes: 1
Reputation: 6213
It's really, really difficult to read and understand the replacement code you provided, given how long and nested it is.
I would break the complex pattern down into smaller, traceable steps that are easy to debug. You can do that either by assigning the intermediate results to temporary variables or by using the pipe operator:
library(magrittr)
string <- "Good Product ...but it's darken the skin tone..why...?"
string %>%
gsub("\\.+\\?", "?", .) %>% # Remove full-stops before question marks
gsub("\\.+", ".", .) %>% # Replace all multiple dots with a single one
gsub(" \\.", ".", .) %>% # Remove space before dots
gsub("(\\.)([^ ])", ". \\2", .) %>% # Add a space between the full stop and the next sentence
gsub("(\\.) ([[:alpha:]])", ". \\U\\2", ., perl=TRUE) # Replace the first letter after the full stop with its uppercase form
# [1] "Good Product. But it's darken the skin tone. Why?"
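The temporary-variable variant mentioned above might look like the following sketch, which uses the same five patterns in base R only, so magrittr is not needed. The names s1 to s5 are arbitrary. Printing every stage shows what each substitution contributes.

```r
string <- "Good Product ...but it's darken the skin tone..why...?"

s1 <- gsub("\\.+\\?", "?", string)                             # drop dots before "?"
s2 <- gsub("\\.+", ".", s1)                                    # collapse runs of dots
s3 <- gsub(" \\.", ".", s2)                                    # remove space before a dot
s4 <- gsub("(\\.)([^ ])", ". \\2", s3)                         # add space after a dot
s5 <- gsub("(\\.) ([[:alpha:]])", ". \\U\\2", s4, perl = TRUE) # capitalise next letter

print(c(s1, s2, s3, s4, s5))
#> [1] "Good Product ...but it's darken the skin tone..why?"
#> [2] "Good Product .but it's darken the skin tone.why?"
#> [3] "Good Product.but it's darken the skin tone.why?"
#> [4] "Good Product. but it's darken the skin tone. why?"
#> [5] "Good Product. But it's darken the skin tone. Why?"
```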
Upvotes: 1