Chris Ruehlemann

Reputation: 21400

Regex to match sentences with adjacent and non-adjacent word repetition in R

I have a dataframe with sentences; in some sentences, words get used more than once:

df <- data.frame(Turn = c("well this is what the grumble about do n't they ?",
                          "it 's like being in a play-group , in n it ?",
                          "oh is that that steak i got the other night ?",
                          "well where have the middle sized soda stream bottle gone ?",
                          "this is a half day , right ? needs a full day",
                          "yourself , everybody 'd be changing your hair in n it ?",
                          "cos he finishes at four o'clock on that day anyway .",
                          "no no no i 'm dave and you 're alan .",
                          "yeah , i mean the the film was quite long though",
                          "it had steve martin in it , it 's a comedy",
                          "oh it is a dreary old day in n it ?",
                          "no it 's not mother theresa , it 's saint theresa .",
                          "oh have you seen that face lift job he wants ?",
                          "yeah bolshoi 's right so which one is it then ?"))

I want to match those sentences in which a word, any word, occurs more than once.

EDIT 1:

The repeated words can be adjacent, but they need not be. That is why "Regular Expression For Consecutive Duplicate Words" does not answer my question.

I've been modestly successful with this code:

df[grepl("(\\w+\\b\\s)\\1{1,}", df$Turn),]
[1] well this is what the grumble about do n't they ?      
[2] it 's like being in a play-group , in n it ?           
[3] oh is that that steak i got the other night ?          
[4] this is a half day , right ? needs a full day          
[5] yourself , everybody 'd be changing your hair in n it ?
[6] no no no i 'm dave and you 're alan .                  
[7] yeah , i mean the the film was quite long though       
[8] it had steve martin in it , it 's a comedy             
[9] oh it is a dreary old day in n it ?

The success is only modest because some sentences are matched that should not be, e.g., "yourself , everybody 'd be changing your hair in n it ?", while others that should be matched are not, e.g., "no it 's not mother theresa , it 's saint theresa .". How can the code be improved to produce exact matches?

Expected result:

df
                                                         Turn
2                it 's like being in a play-group , in n it ?
3               oh is that that steak i got the other night ?
5               this is a half day , right ? needs a full day
8                       no no no i 'm dave and you 're alan .
9            yeah , i mean the the film was quite long though
10                 it had steve martin in it , it 's a comedy
11                        oh it is a dreary old day in n it ?
12        no it 's not mother theresa , it 's saint theresa .

EDIT 2:

Another question is how to specify the exact number of repetitions. The above, imperfect, regex matches words that are repeated at least once. If I change the quantifier to {2}, thus looking for a triple occurrence of a word, I get this code and this result:

df[grepl("(\\w+\\b\\s)\\1{2}", df$Turn),]
[1] no no no i 'm dave and you 're alan .         # "no" occurs 3 times

But again the match is imperfect as the expected result would be:

[1] no no no i 'm dave and you 're alan .          # "no" occurs 3 times
[2] it had steve martin in it , it 's a comedy     # "it" occurs 3 times

Any help is much appreciated!

Upvotes: 1

Views: 280

Answers (2)

Hsiang Yun Chan

Reputation: 151

Here are two options for specifying the number of repetitions, for example to extract sentences in which the same word occurs 3 times.

  1. Change the regex.

    (\s?\b\w+\b\s)(.*\1){2}

    (\s?\b\w+\b\s) is captured as Group 1:

    • \s? : an optional whitespace character (zero or one).
    • \b\w+\b : a whole word (one or more word characters between word boundaries).
    • \s : exactly one whitespace character.

    (.*\1) is captured as Group 2:

    • (.*\1) : any characters (zero or more), followed by another occurrence of the text matched by Group 1.
    • (.*\1){2} : Group 2 must match twice, so the word occurs at least 3 times in total.

Code

df$Turn[grepl("(\\s?\\b\\w+\\b\\s)(.*\\1){2}", df$Turn, perl = TRUE)]
# [1] "no no no i 'm dave and you 're alan ."     
# [2] "it had steve martin in it , it 's a comedy"

  2. Use strsplit(split = "\\s") to split the sentences into words.
    • Use sapply and table to count how often each word occurs in each sentence, then select the sentences that meet the requirement.

Code

# Make sure the column is character, not factor
df$Turn <- as.character(df$Turn)
# For each sentence, test whether any word occurs exactly 3 times
s <- sapply(strsplit(df$Turn, "\\s"), function(w) any(table(w) == 3))
df$Turn[s]
# [1] "no no no i 'm dave and you 're alan ."     
# [2] "it had steve martin in it , it 's a comedy"
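
A possible refinement of option 1, offered as a sketch rather than as part of the original answer: the pattern in the question can start matching in the middle of a word (the "n " at the end of "in " followed by "n "), which seems to be why "... hair in n it ?" is matched. Anchoring the captured word and every backreference with \b avoids this. Two assumptions to note: \w splits tokens at apostrophes, so "n't" counts as "n" and "t", and the quantifier means "at least", not "exactly". The names turns, at_least_twice and at_least_thrice are made up for the illustration.

```r
# Four sample sentences copied from the question
turns <- c("yourself , everybody 'd be changing your hair in n it ?",
           "no no no i 'm dave and you 're alan .",
           "it had steve martin in it , it 's a comedy",
           "no it 's not mother theresa , it 's saint theresa .")

# \b(\w+)\b captures one whole word; each (?:.*\b\1\b) requires the same
# whole word to occur again later in the sentence, adjacent or not.
at_least_twice  <- "\\b(\\w+)\\b(?:.*\\b\\1\\b){1}"
at_least_thrice <- "\\b(\\w+)\\b(?:.*\\b\\1\\b){2}"

twice_hits  <- grepl(at_least_twice,  turns, perl = TRUE)
thrice_hits <- grepl(at_least_thrice, turns, perl = TRUE)

twice_hits
# [1] FALSE  TRUE  TRUE  TRUE
thrice_hits
# [1] FALSE  TRUE  TRUE FALSE
```

Applied to the full data frame, this would be df$Turn[grepl(at_least_twice, df$Turn, perl = TRUE)].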

Hope this may help you :)

Upvotes: 1

jazzurro

Reputation: 23574

I would rather take a different approach to this task. First, I added an id variable to the original data frame. Then I counted how many times each word appears in each sentence and stored the result in a data frame, mytemp.

library(tidyverse)

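# Add a sentence id to the original data
df <- mutate(df, id = 1:n())

# One row per sentence/word pair, with the word's frequency in that sentence
mytemp <- df %>% 
  mutate(word = strsplit(x = as.character(Turn), split = " ")) %>% 
  unnest(word) %>% 
  count(id, word, name = "frequency", sort = TRUE)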

Using this data frame, it is straightforward to identify the sentences. I subset the data and obtained the id of every sentence in which a word appears exactly three times. Similarly, I obtained the id of every sentence in which a word appears more than once. Finally, I subset the original data using the id numbers in three and twice.

# Sentences containing a word that appears exactly 3 times

three <- filter(mytemp, frequency == 3) %>% 
         pull(id) %>% 
         unique()

# Sentences containing a word that appears more than once

twice <- filter(mytemp, frequency > 1) %>% 
         pull(id) %>% 
         unique()

# Go back to the original data and handle subsetting
filter(df, id %in% three)

  Turn                                          id
  <chr>                                      <int>
1 no no no i 'm dave and you 're alan .          8
2 it had steve martin in it , it 's a comedy    10

filter(df, id %in% twice)

  Turn                                                   id
  <chr>                                               <int>
1 it 's like being in a play-group , in n it ?            2
2 oh is that that steak i got the other night ?           3
3 this is a half day , right ? needs a full day           5
4 no no no i 'm dave and you 're alan .                   8
5 yeah , i mean the the film was quite long though        9
6 it had steve martin in it , it 's a comedy             10
7 oh it is a dreary old day in n it ?                    11
8 no it 's not mother theresa , it 's saint theresa .    12
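
The same counting idea can be sketched in base R, without the tidyverse. This is only an illustration under stated assumptions: has_repeat is a hypothetical helper, not something from the question or the answers, and it drops pure punctuation tokens such as "," and "?" so that they are not counted as repeated words.

```r
# Hypothetical helper: flag sentences in which some token occurs exactly `n`
# times (exact = TRUE) or at least `n` times (exact = FALSE).
has_repeat <- function(x, n = 2, exact = FALSE) {
  tokens <- strsplit(as.character(x), "\\s+")
  vapply(tokens, function(w) {
    # Assumption: tokens without any letter or digit are punctuation
    w <- w[grepl("[[:alnum:]]", w)]
    counts <- table(w)
    if (exact) any(counts == n) else any(counts >= n)
  }, logical(1))
}

# Four sample sentences copied from the question
turns <- c("yourself , everybody 'd be changing your hair in n it ?",
           "no no no i 'm dave and you 're alan .",
           "it had steve martin in it , it 's a comedy",
           "no it 's not mother theresa , it 's saint theresa .")

repeated     <- has_repeat(turns, n = 2)                # at least twice
exactly_three <- has_repeat(turns, n = 3, exact = TRUE) # exactly 3 times

turns[repeated]
turns[exactly_three]
```

With the data frame from the question, df[has_repeat(df$Turn), ] would play the role of the twice filter and df[has_repeat(df$Turn, n = 3, exact = TRUE), ] that of the three filter.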

Upvotes: 1
