Keep specific rows of a data frame based on word sequence in R

Question

I have a data frame (df) like the one below. For each ID, I want to go through the values and, when two strings start with the same word, compare them so that only the distinct values are kept.

df <- data.frame(id = c(1,1,2,3,3,4,4,4,4,5), 
                 value = c('australia', 'australia sydney', 'brazil',
                           'australia', 'usa', 'australia sydney', 'australia sydney randwick', 'australia', 'australia sydney circular quay', 'australia sydney'))

I want to compare the first words: if they differ, keep both values; if they are the same, move on to the second words and compare those, and so on. For example, for ID 1 I want to keep only the row with the value 'australia sydney', and for ID 4 I want to keep both 'australia sydney circular quay' and 'australia sydney randwick'. For this example I need to get rows 2:5, 7, 9 and 10.

lroha · Accepted Answer

Based on your edit, you can check within each group whether any entry matches the start of any other entry, and remove the entries that do:

library(tidyverse)

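df %>%
  group_by(id) %>%
  # For each row, test whether its value is the start of any other value
  # in the same id; a group with a single row is always kept.
  filter(!map_lgl(
    seq_along(value),
    ~ any(if (length(value) == 1) FALSE else str_detect(value[-.x], paste0("^", value[.x])))
  ))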

# A tibble: 7 x 2
# Groups:   id [5]
     id value                         
  <dbl> <chr>                         
1     1 australia sydney              
2     2 brazil                        
3     3 australia                     
4     3 usa                           
5     4 australia sydney randwick     
6     4 australia sydney circular quay
7     5 australia sydney
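
The pattern above treats each value as a regular expression and matches on characters, so a value such as 'australia' would also count as the start of a hypothetical 'australian open', and values containing regex metacharacters could misbehave. As a rough sketch of a whole-word alternative in base R (not part of the original answer), the following compares each value plus a trailing space against the other values in the same id. The helper name is_word_prefix is made up for illustration, and the sketch assumes words are separated by single spaces.

```r
# Same data as in the question; stringsAsFactors = FALSE keeps value as character
df <- data.frame(
  id = c(1, 1, 2, 3, 3, 4, 4, 4, 4, 5),
  value = c("australia", "australia sydney", "brazil",
            "australia", "usa", "australia sydney",
            "australia sydney randwick", "australia",
            "australia sydney circular quay", "australia sydney"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each element of x that, followed by a space,
# is the start of some other element of x (a whole-word prefix)
is_word_prefix <- function(x) {
  vapply(seq_along(x), function(i) {
    any(startsWith(x[-i], paste0(x[i], " ")))
  }, logical(1))
}

# Apply the helper within each id, then put the flags back in row order
drop <- unsplit(lapply(split(df$value, df$id), is_word_prefix), df$id)

result <- df[!drop, ]
result
```

On the example data this keeps the rows asked for in the question (2:5, 7, 9 and 10). One difference from the regex version: exact duplicates within an id are both kept here, because a trailing space is required for a match, so you may want to remove duplicates first if that matters.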
