AJosh

Reputation: 11

Partial regex results in R

I would like to understand what goes wrong when I use str_replace_all to transform a string:

abc <- "Good Product ...but it's darken the skin tone..why...?"

Before running sentence tokenization with quanteda, I would like to clean it up so that it looks like this:

abc_new <- "Good Product. But it's darken the skin tone. Why?"

I am using the following regex to do this:

str_replace_all(abc,"\\.{2,15}[a-z]{1}", paste(".", toupper(str_extract_all(str_extract_all(abc,"\\.{2,15}[a-z]{1}"),"[a-z]{1}")[[1]])[[1]], collapse = " "))

However, this returns: "Good Product. Cut it's darken the skin tone. Chy...?"

Can someone suggest a solution for this?

Upvotes: 1

Views: 74

Answers (2)

Mark

Reputation: 4537

It seems like you're trying to match a pattern to remove, while keeping part of what that pattern matched. In regular expressions you can wrap a portion of the pattern in () to capture it, and then refer to that captured portion in the replacement.

Consider in your case:

abc <- "Good Product ...but it's darken the skin tone..why...?"
step1 <- gsub(" ?\\.+([a-zA-Z])",". \\U\\1",abc,perl=TRUE)
step1
#> [1] "Good Product. But it's darken the skin tone. Why...?"

The matching expression breaks down as:

 ?         #Optionally match a space (to handle the space after Good Product)
\\.+       #Match at least one period
([a-zA-Z]) #Match one letter and remember it

The replacement pattern breaks down as:

.       #Insert a period followed by a space
\\U     #Insert an uppercase version...
\\1     #...of whatever was matched in the first set of parentheses
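As a small illustration of how \\U behaves (a sketch on a made-up input string, not part of the original problem): with perl=TRUE, \\U uppercases the rest of the replacement, and \\E can be used to end the conversion early.

```r
x <- "hello world"   # made-up example input

# \\U uppercases everything that follows it in the replacement...
gsub("(\\w+) (\\w+)", "\\U\\1 \\2", x, perl = TRUE)
#> [1] "HELLO WORLD"

# ...unless \\E ends the case conversion early.
gsub("(\\w+) (\\w+)", "\\U\\1\\E \\2", x, perl = TRUE)
#> [1] "HELLO world"

# Capturing a single letter limits the effect to that letter.
gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)
#> [1] "Hello World"
```

Without perl=TRUE, these case-conversion escapes are not supported in the replacement.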

This first step doesn't fix the ellipsis followed by a question mark. A follow-up match can handle that.

step2 <- gsub("\\.+([^\\. ])","\\1",step1)
step2
#> [1] "Good Product. But it's darken the skin tone. Why?"

Here we're matching

\\.+      #at least one period
([^\\. ]) #one character that is not a period or a space and remember it

Replacing with

\\1 #The thing we remembered

So, two steps, two fairly generic regular expressions that should extend to other use cases as well.
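To reuse the two steps on a whole vector of strings, they could be wrapped in a small helper. This is only a sketch: the function name tidy_ellipses and the second sample string are made up for illustration.

```r
# Hypothetical helper wrapping the two gsub() steps from this answer.
tidy_ellipses <- function(x) {
  # Step 1: optional space + run of periods + letter -> ". " + uppercase letter
  x <- gsub(" ?\\.+([a-zA-Z])", ". \\U\\1", x, perl = TRUE)
  # Step 2: drop a run of periods that is directly followed by a character
  # other than a period or a space (e.g. "...?" becomes "?")
  gsub("\\.+([^\\. ])", "\\1", x)
}

reviews <- c(
  "Good Product ...but it's darken the skin tone..why...?",
  "Nice smell..too oily"   # made-up second example
)

tidy_ellipses(reviews)
#> [1] "Good Product. But it's darken the skin tone. Why?"
#> [2] "Nice smell. Too oily"
```

One caveat: both patterns also match a single period, so text such as "3.5" or "quanteda.com" would be altered as well ("35" and "quanteda. Com"). If the data contains decimals or URLs, requiring at least two periods (\\.{2,}) in each step may be safer.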

Upvotes: 1

Deena

Reputation: 6213

It's really, really difficult to read and understand the replacement code you provided, given how long and nested it is.

I would try to break the complex pattern down into smaller, traceable ones that are easy to debug. You can do that either by assigning the intermediate results to temporary variables, or by using the pipe operator:

library(magrittr)
string <- "Good Product ...but it's darken the skin tone..why...?"
string %>% 
  gsub("\\.+\\?", "?", .) %>%   # Remove full stops before question marks
  gsub("\\.+", ".", .) %>%      # Replace each run of dots with a single one
  gsub(" \\.", ".", .) %>%      # Remove the space before a dot
  gsub("(\\.)([^ ])", ". \\2", .) %>%  # Add a space between the full stop and the next sentence
  gsub("(\\.) ([[:alpha:]])", ". \\U\\2", ., perl=TRUE) # Uppercase the first letter after the full stop

  # [1] "Good Product. But it's darken the skin tone. Why?"
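For comparison, here is a sketch of the temporary-variable alternative mentioned above. It uses the same five gsub() calls in base R only; the names s1 to s5 are arbitrary, and the comments show each intermediate result so a failing step is easy to spot.

```r
string <- "Good Product ...but it's darken the skin tone..why...?"

s1 <- gsub("\\.+\\?", "?", string)
# "Good Product ...but it's darken the skin tone..why?"
s2 <- gsub("\\.+", ".", s1)
# "Good Product .but it's darken the skin tone.why?"
s3 <- gsub(" \\.", ".", s2)
# "Good Product.but it's darken the skin tone.why?"
s4 <- gsub("(\\.)([^ ])", ". \\2", s3)
# "Good Product. but it's darken the skin tone. why?"
s5 <- gsub("(\\.) ([[:alpha:]])", ". \\U\\2", s4, perl = TRUE)
s5
#> [1] "Good Product. But it's darken the skin tone. Why?"
```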

Upvotes: 1
