I have a dataframe with sentences; in some sentences, words get used more than once:
df <- data.frame(Turn = c("well this is what the grumble about do n't they ?",
"it 's like being in a play-group , in n it ?",
"oh is that that steak i got the other night ?",
"well where have the middle sized soda stream bottle gone ?",
"this is a half day , right ? needs a full day",
"yourself , everybody 'd be changing your hair in n it ?",
"cos he finishes at four o'clock on that day anyway .",
"no no no i 'm dave and you 're alan .",
"yeah , i mean the the film was quite long though",
"it had steve martin in it , it 's a comedy",
"oh it is a dreary old day in n it ?",
"no it 's not mother theresa , it 's saint theresa .",
"oh have you seen that face lift job he wants ?",
"yeah bolshoi 's right so which one is it then ?"))
I want to match those sentences in which any word occurs more than once.
EDIT 1:
The repeated words *can* be adjacent, but they need not be. That is why "Regular Expression For Consecutive Duplicate Words" does not answer my question.
I've been modestly successful with this code:
df[grepl("(\\w+\\b\\s)\\1{1,}", df$Turn),]
[1] well this is what the grumble about do n't they ?
[2] it 's like being in a play-group , in n it ?
[3] oh is that that steak i got the other night ?
[4] this is a half day , right ? needs a full day
[5] yourself , everybody 'd be changing your hair in n it ?
[6] no no no i 'm dave and you 're alan .
[7] yeah , i mean the the film was quite long though
[8] it had steve martin in it , it 's a comedy
[9] oh it is a dreary old day in n it ?
The success is only modest because some sentences are matched that should not be, e.g., "yourself , everybody 'd be changing your hair in n it ?", while others that should be matched are not, e.g., "no it 's not mother theresa , it 's saint theresa .". How can the code be improved to produce exact matches?
Expected result:
df
Turn
2 it 's like being in a play-group , in n it ?
3 oh is that that steak i got the other night ?
5 this is a half day , right ? needs a full day
8 no no no i 'm dave and you 're alan .
9 yeah , i mean the the film was quite long though
10 it had steve martin in it , it 's a comedy
11 oh it is a dreary old day in n it ?
12 no it 's not mother theresa , it 's saint theresa .
EDIT 2:
Another question is how to specify the exact number of repetitions. The imperfect regex above matches words that are repeated at least once. If I change the quantifier to {2}, thus looking for a triple occurrence of a word, I get this code and this result:
df[grepl("(\\w+\\b\\s)\\1{2}", df$Turn),]
[1] no no no i 'm dave and you 're alan . # "no" occurs 3 times
But again the match is imperfect as the expected result would be:
[1] no no no i 'm dave and you 're alan . # "no" occurs 3 times
[2] it had steve martin in it , it 's a comedy # "it" occurs 3 times
Any help is much appreciated!
Upvotes: 1
Views: 280
Reputation: 151
One option for specifying the number of repetitions: to extract sentences in which the same word occurs three times, change the regex to
(\s?\b\w+\b\s)(.*\1){2}
(\s?\b\w+\b\s) : Group 1, a whole word followed by one whitespace character and optionally preceded by one.
(.*\1) : Group 2, any characters (zero or more) followed by another occurrence of the text captured by Group 1.
(.*\1){2} : Group 2 must match twice, so the captured word occurs at least three times in total.
Code
df$Turn[grepl("(\\s?\\b\\w+\\b\\s)(.*\\1){2}", df$Turn, perl = TRUE)]
# [1] "no no no i 'm dave and you 're alan ."
# [2] "it had steve martin in it , it 's a comedy"
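One thing to watch with the pattern above: Group 1 needs a whitespace character after the word, so a word at the very end of a sentence cannot be the final repetition, and the backreference is not anchored at a word start. A minimal sketch of a variant that puts \b around both the captured word and each backreference, assuming "word" means a run of \w characters. The helper name has_repeated_word and the short turns vector (five sentences copied from the question) are my own additions for illustration. Like the regex above, it tests for at least n occurrences, not exactly n.

```r
# Hypothetical helper (not from the answer above): TRUE where some whole
# word occurs at least `n` times in the string.
has_repeated_word <- function(x, n = 2L) {
  stopifnot(n >= 2)
  # \\b(\\w+)\\b captures one whole word; each (?:.*\\b\\1\\b) requires
  # one further whole-word occurrence of it later in the string.
  pattern <- sprintf("\\b(\\w+)\\b(?:.*\\b\\1\\b){%d}", as.integer(n) - 1L)
  grepl(pattern, x, perl = TRUE)
}

# Five sentences copied from the question
turns <- c(
  "well this is what the grumble about do n't they ?",
  "this is a half day , right ? needs a full day",
  "no no no i 'm dave and you 're alan .",
  "it had steve martin in it , it 's a comedy",
  "yourself , everybody 'd be changing your hair in n it ?"
)

has_repeated_word(turns, 2)
# [1] FALSE  TRUE  TRUE  TRUE FALSE
has_repeated_word(turns, 3)
# [1] FALSE FALSE  TRUE  TRUE FALSE
```

With the full data frame this would be used as df[has_repeated_word(df$Turn, 2), , drop = FALSE].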
Another option: use strsplit(split = "\\s") to split the sentences into words, then sapply and table to count the occurrences of each word in each list element, and select the sentences that satisfy the requirement.
Code
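library(magrittr)
df$Turn %<>% as.character()
# Split each sentence on whitespace, tabulate the words, keep counts equal to 3
s <- strsplit(df$Turn, "\\s") %>% sapply(., function(i) table(i) %>% .[. == 3])
# Keep the sentences in which a word occurs exactly 3 times
df$Turn[which(s != 0)]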
# [1] "no no no i 'm dave and you 're alan ."
# [2] "it had steve martin in it , it 's a comedy"
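The same counting idea can be sketched in base R so that it returns one logical value per sentence, which avoids comparing a list against 0 and still works when a sentence has several words with the requested count. The helper name has_word_count, its exact argument, and the choice to ignore punctuation-only tokens are assumptions of mine, not part of the code above. The turns vector holds five sentences copied from the question.

```r
# Hypothetical helper: TRUE where some word occurs exactly `n` times
# (exact = TRUE) or at least `n` times (exact = FALSE) in the sentence.
has_word_count <- function(x, n, exact = TRUE) {
  vapply(strsplit(x, "\\s+"), function(w) {
    # Assumption: tokens without any word character ("," "?" ".") are not words
    w <- w[grepl("\\w", w)]
    counts <- table(w)
    if (exact) any(counts == n) else any(counts >= n)
  }, logical(1))
}

# Five sentences copied from the question
turns <- c(
  "well this is what the grumble about do n't they ?",
  "this is a half day , right ? needs a full day",
  "no no no i 'm dave and you 're alan .",
  "it had steve martin in it , it 's a comedy",
  "yourself , everybody 'd be changing your hair in n it ?"
)

has_word_count(turns, 3)
# [1] FALSE FALSE  TRUE  TRUE FALSE
has_word_count(turns, 2, exact = FALSE)
# [1] FALSE  TRUE  TRUE  TRUE FALSE
```

Because tokens are split on whitespace, "n't" and "'s" count as words here, which matches how the sentences in the question are tokenized.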
Hope this may help you :)
Upvotes: 1
Reputation: 23574
I would take a different approach to this task. First, I added an id variable to the original data frame. Then I counted how many times each word appears in each sentence and stored the result in a data frame, mytemp.
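library(tidyverse)
# Add a sentence id to the original data frame
mutate(df, id = 1:n()) -> df
# Split each sentence into words and count each word within each sentence
df %>%
  mutate(word = strsplit(x = Turn, split = " ")) %>%
  unnest(word) %>%
  count(id, word, name = "frequency", sort = TRUE) -> mytemp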
Using this data frame, it is straightforward to identify the sentences. I subset it to get the id of each sentence that has a word appearing three times, and likewise the id of each sentence that has a word appearing more than once. Finally, I subset the original data using the id numbers in three and twice.
# Search words that appear 3 times
three <- filter(mytemp, frequency == 3) %>%
pull(id) %>%
unique()
# Search words that appear more than once.
twice <- filter(mytemp, frequency > 1) %>%
pull(id) %>%
unique()
# Go back to the original data and handle subsetting
filter(df, id %in% three)
Turn id
<chr> <int>
1 no no no i 'm dave and you 're alan . 8
2 it had steve martin in it , it 's a comedy 10
filter(df, id %in% twice)
Turn id
<chr> <int>
1 it 's like being in a play-group , in n it ? 2
2 oh is that that steak i got the other night ? 3
3 this is a half day , right ? needs a full day 5
4 no no no i 'm dave and you 're alan . 8
5 yeah , i mean the the film was quite long though 9
6 it had steve martin in it , it 's a comedy 10
7 oh it is a dreary old day in n it ? 11
8 no it 's not mother theresa , it 's saint theresa . 12
Upvotes: 1