Regex to match sentences with adjacent and non-adjacent word repetition in R

Question

I have a dataframe with sentences; in some sentences, words get used more than once:

df <- data.frame(Turn = c("well this is what the grumble about do n't they ?",
                          "it 's like being in a play-group , in n it ?",
                          "oh is that that steak i got the other night ?",
                          "well where have the middle sized soda stream bottle gone ?",
                          "this is a half day , right ? needs a full day",
                          "yourself , everybody 'd be changing your hair in n it ?",
                          "cos he finishes at four o'clock on that day anyway .",
                          "no no no i 'm dave and you 're alan .",
                          "yeah , i mean the the film was quite long though",
                          "it had steve martin in it , it 's a comedy",
                          "oh it is a dreary old day in n it ?",
                          "no it 's not mother theresa , it 's saint theresa .",
                          "oh have you seen that face lift job he wants ?",
                          "yeah bolshoi 's right so which one is it then ?"))

I want to match those sentences in which a word, any word, is repeated one or more times.

EDIT 1:

The repeated words **can** be adjacent, but they need not be. That is why Regular Expression For Consecutive Duplicate Words does not answer my question.

I've been modestly successful with this code:

df[grepl("(\\w+\\b\\s)\\1{1,}", df$Turn),]
[1] well this is what the grumble about do n't they ?      
[2] it 's like being in a play-group , in n it ?           
[3] oh is that that steak i got the other night ?          
[4] this is a half day , right ? needs a full day          
[5] yourself , everybody 'd be changing your hair in n it ?
[6] no no no i 'm dave and you 're alan .                  
[7] yeah , i mean the the film was quite long though       
[8] it had steve martin in it , it 's a comedy             
[9] oh it is a dreary old day in n it ?

The success is only modest because some sentences are matched that should not be, e.g., "yourself , everybody 'd be changing your hair in n it ?", while others that should be matched are not, e.g., "no it 's not mother theresa , it 's saint theresa .". How can the code be improved to produce exact matches?

Expected result:

df
                                                         Turn
2                it 's like being in a play-group , in n it ?
3               oh is that that steak i got the other night ?
5               this is a half day , right ? needs a full day
8                       no no no i 'm dave and you 're alan .
9            yeah , i mean the the film was quite long though
10                 it had steve martin in it , it 's a comedy
11                        oh it is a dreary old day in n it ?
12        no it 's not mother theresa , it 's saint theresa .

EDIT 2:

Another question is how to specify the exact number of repetitions. The imperfect regex above matches words that are repeated at least once. If I change the quantifier to {2}, thus looking for a triple occurrence of a word, I get this code and this result:

df[grepl("(\\w+\\b\\s)\\1{2}", df$Turn),]
[1] no no no i 'm dave and you 're alan .         # "no" occurs 3 times

But again the match is imperfect as the expected result would be:

[1] no no no i 'm dave and you 're alan .          # "no" occurs 3 times
[2] it had steve martin in it , it 's a comedy     # "it" occurs 3 times

Any help is much appreciated!

Hsiang Yun Chan · Accepted Answer

Here is an option for specifying how many times a word must occur.

To extract sentences in which the same word occurs 3 times, change the regex to:

(\s?\b\w+\b\s)(.*\1){2}

(\s?\b\w+\b\s) is captured as Group 1:
- \s? : a whitespace character occurs zero times or once.
- \b\w+\b : a whole word (one or more word characters between word boundaries).
- \s : a whitespace character occurs once.

(.*\1) is captured as Group 2:
- .*\1 : any characters, zero or more times, followed by another match of the text captured by Group 1.
- (.*\1){2} : Group 2 matches twice, so the word occurs at least three times in total.

Code

df$Turn[grepl("(\\s?\\b\\w+\\b\\s)(.*\\1){2}", df$Turn, perl = TRUE)]
# [1] "no no no i 'm dave and you 're alan ."     
# [2] "it had steve martin in it , it 's a comedy"
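
As a hedged variant of the pattern above, the backreference itself can be wrapped in word boundaries so that a later occurrence only counts when it is a whole word, and a word at the very end of a sentence is also found. This is a minimal sketch, not the answer's original pattern; the names turns, twice and thrice are made up for illustration, and the sentences are a subset of the question's data.

```r
# Sample sentences taken from the question's data
turns <- c("well this is what the grumble about do n't they ?",
           "yourself , everybody 'd be changing your hair in n it ?",
           "no no no i 'm dave and you 're alan .",
           "it had steve martin in it , it 's a comedy",
           "no it 's not mother theresa , it 's saint theresa .")

# A whole word, followed anywhere later by the same whole word
# (the word occurs at least 2 times, adjacent or not)
twice <- "\\b(\\w+)\\b.*\\b\\1\\b"

# The same idea, requiring at least 3 occurrences of the word
thrice <- "\\b(\\w+)\\b(?:.*\\b\\1\\b){2}"

# perl = TRUE selects the PCRE engine, which supports (?:...) groups
hits_twice  <- turns[grepl(twice,  turns, perl = TRUE)]
hits_thrice <- turns[grepl(thrice, turns, perl = TRUE)]

hits_twice   # the last three sentences
hits_thrice  # the "no no no" and "it ... it , it" sentences
```

Note that both patterns express "at least" a number of occurrences; a regex of this shape does not exclude sentences in which the word occurs more often.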

Alternatively, use strsplit(split = "\\s") to split the sentences into words.
- Use sapply and table to count the occurrences of each word in each list element, and then select the sentences that satisfy the requirement.

Code

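library(magrittr)
# make sure Turn is character, not factor
df$Turn %<>% as.character()
# for each sentence, keep only the word counts that equal 3
s <- strsplit(df$Turn, "\\s") %>% sapply(., function(i) table(i) %>% .[. == 3])
# select the sentences that contain such a word
df$Turn[which(s != 0)]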
# [1] "no no no i 'm dave and you 're alan ."     
# [2] "it had steve martin in it , it 's a comedy"
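
The counting approach can also be written in base R only and wrapped in a small function, which makes it easy to switch between "exactly n" and "at least n" occurrences. This is a sketch under stated assumptions: has_repeated_word is a hypothetical helper name, tokens are split on whitespace, and tokens without any word character (such as "," or "?") are ignored, which the code above does not do.

```r
# Hypothetical helper: TRUE for each sentence in which some token occurs
# `n` times (exactly `n` if exact = TRUE, otherwise at least `n`).
has_repeated_word <- function(x, n = 2, exact = FALSE) {
  tokens <- strsplit(as.character(x), "\\s+")
  vapply(tokens, function(tok) {
    # assumption: drop pure punctuation tokens such as "," or "?"
    tok <- tok[grepl("\\w", tok)]
    if (length(tok) == 0) return(FALSE)
    counts <- table(tok)
    if (exact) any(counts == n) else any(counts >= n)
  }, logical(1))
}

# Sample sentences taken from the question's data
turns <- c("yourself , everybody 'd be changing your hair in n it ?",
           "no no no i 'm dave and you 're alan .",
           "it had steve martin in it , it 's a comedy",
           "no it 's not mother theresa , it 's saint theresa .")

any_repeat   <- has_repeated_word(turns, n = 2)                # at least twice
exactly_tri  <- has_repeated_word(turns, n = 3, exact = TRUE)  # exactly 3 times

turns[any_repeat]
turns[exactly_tri]
```

Because the function returns one logical value per sentence, it can be used directly for subsetting, e.g. df[has_repeated_word(df$Turn), , drop = FALSE].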

Hope this may help you :)
