Reputation: 357

How to remove duplicate words in a certain pattern from a string in R

I want to remove duplicate words from a vector of strings, but only within each parenthesized group.

a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )

The result I want looks like this:

a
[1]'I (have|has) certain (words|word|worded) certain'
[2]'(You|Youre) (can|cans) do this (works|worked)'
[3]'I (am|are) (sure|surely) you know (what|when) (you|her) should (do|)'

To get this result, I tried the following code:

a = gsub('\\|', " | ",  a)
a = gsub('\\(', "(  ",  a)
a = gsub('\\)', "  )",  a)
a = vapply(strsplit(a, " "), function(x) paste(unique(x), collapse = " "), character(1L))

However, it produced the wrong output:

a    
[1] "I (  have | has ) certain words word worded"                 
[2] "(  You | Youre ) can cans do this works worked"              
[3] "I (  am | are ) sure surely you know what when her should do"

Why did my code remove the parentheses in the latter part of each string? How can I get the result I want?

Upvotes: 3

Answers (3)

Mandar

Reputation: 1769

A longer but more explicit attempt:

a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
       '(You|You|Youre) (can|cans|can) do this (works|works|worked)',
       'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
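trim <- function (x) gsub("^\\s+|\\s+$", "", x)

# collect the rebuilt sentences here
new_a <- character(0)
for (sentence in seq_along(a)) {
  # split on spaces and parentheses, so "(x|y|x)" becomes the token "x|y|x"
  tokens <- trim(unlist(strsplit(a[sentence], "[( )]")))
  newsentence <- character(0)
  for (i in tokens) {
    # split the token on "|" and drop repeated alternatives
    j1 <- unique(trim(unlist(strsplit(gsub("\\|", " ", i), " "))))
    if (length(j1) == 0) {
      next  # empty token left over from adjacent delimiters
    } else if (length(j1) > 1) {
      # several alternatives remain: put the parentheses back
      newsentence <- c(newsentence, paste0("(", paste(j1, collapse = "|"), ")"))
    } else {
      # a single word is kept bare, so "(do|do)" becomes "do", not "(do)"
      newsentence <- c(newsentence, j1[1])
    }
  }
  new_a <- c(new_a, paste(newsentence, collapse = " "))
}
new_a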
# [1] "I (have|has) certain (words|word|worded) certain"                 
# [2] "(You|Youre) (can|cans) do this (works|worked)"                    
# [3] "I (am|are) (sure|surely) you know (what|when) (you|her) should do"

Upvotes: 1

Roman

Reputation: 17648

I would go with the answer above, which is more straightforward, but you can also try:

library(stringi)
library(stringr)
a_new <- gsub("[|]","-",a) # replace "|" with "-" to avoid issues during the replacement later
a1 <- str_extract_all(a_new,"[(](.*?)[)]") # extract the "units"
# some magic using stringi::stri_extract_all_words()
a2 <- unlist(lapply(a1,function(x) unlist(lapply(stri_extract_all_words(x), function(y) paste(unique(y),collapse = "|")))))
# prepare replacement
names(a2) <- unlist(a1)
# replacement and finalization
str_replace_all(a_new, a2)
[1] "I (have|has) certain (words|word|worded) certain"                   
[2] "(You|Youre) (can|cans) do this (works|worked)"                      
[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"

The idea is to extract the words within each pair of brackets as a unit, remove the duplicates within that unit, and then replace the old unit with the updated one.
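The same extract, dedupe, replace idea can probably also be sketched in base R with gregexpr() and regmatches(). This is only an illustration, not the stringr/stringi code above: dedupe_unit and a_dedup are hypothetical names introduced here, and the sample vector is copied from the question.

```r
# Sample input copied from the question
a <- c('I (have|has|have) certain (words|word|worded|word) certain',
       '(You|You|Youre) (can|cans|can) do this (works|works|worked)',
       'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)')

# Hypothetical helper: dedupe the "|"-separated alternatives of one "(...)" unit
dedupe_unit <- function(unit) {
  inner <- substr(unit, 2, nchar(unit) - 1)                 # strip the outer parentheses
  words <- unique(strsplit(inner, "|", fixed = TRUE)[[1]])  # keep first occurrences only
  paste0("(", paste(words, collapse = "|"), ")")
}

m <- gregexpr("\\([^)]*\\)", a)   # locate every parenthesized unit
units <- regmatches(a, m)         # one character vector of units per string

a_dedup <- a
# write the deduplicated units back into their original positions
regmatches(a_dedup, m) <- lapply(units, function(u) {
  vapply(u, dedupe_unit, character(1), USE.NAMES = FALSE)
})
a_dedup
```

Because each unit is deduplicated on its own, words outside the brackets (such as the repeated "certain") are left untouched.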

Upvotes: 2

akrun

Reputation: 887128

We can use gsubfn. The idea is to select the characters inside the brackets: match the opening bracket (\\(, which has to be escaped because ( is a metacharacter), followed by one or more characters that are not a closing bracket ([^)]+), and capture those characters as a group. In the replacement function, we split the captured group (x) on | with strsplit, unlist the list output, keep the unique elements, and paste them back together with the opening bracket in front.

library(gsubfn)
gsubfn("\\(([^)]+)", ~paste0("(", paste(unique(unlist(strsplit(x, 
                "[|]"))), collapse="|")), a)
#[1] "I (have|has) certain (words|word|worded) certain"                   
#[2] "(You|Youre) (can|cans) do this (works|worked)"                      
#[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
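As for why the original attempt lost the later parentheses: a likely explanation is that unique() was applied to all tokens of the whole string at once, so the second "(", ")" and "|", and even the second "certain", count as duplicates and are dropped. A small sketch using the first sample string from the question (the names s, tokens and kept are only illustrative):

```r
s <- "I (have|has|have) certain (words|word|worded|word) certain"

# Same preprocessing as in the question: pad "|", "(" and ")" with spaces
s <- gsub("\\|", " | ", s)
s <- gsub("\\(", "(  ", s)
s <- gsub("\\)", "  )", s)

tokens <- strsplit(s, " ")[[1]]

# Tokens that unique() will discard because they already appeared earlier
tokens[duplicated(tokens)]

# unique() keeps only the first occurrence of each token across the whole string
kept <- unique(tokens)
paste(kept, collapse = " ")
```

Restricting the deduplication to the text captured inside each pair of brackets, as gsubfn does above, avoids this.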

Upvotes: 5
