Ashti
Ashti

Reputation: 97

Create an order of rows to keep in r based on column matches

This is a portion of my dataset:

df<-data.frame(
  group=c("A","A","A","A","B","B","C","C","D","D","E","F","G","G"), 
  first_name=c("Linda","Linda","Linda","Linda","Henry","Henry","Hazel","Hazel","Owen","Owen","Ava","Nora","Rose","Rose"),
  first_name_match=c("Linda","Linda", "John","John","Oliver","Oliver","Hazel","Violet","Owen","Owen/Ben","Ava",NA,"Alex/Rose","Alex"))

For each group,

1) If the column first_name and first_name match are equal (exactly the same), keep ONLY those rows and get rid of the other rows.

2) If the column first_name and first_name match are equal (exactly the same), and there is a partial match b/w other rows. By partial match, I mean whether the first_name_match contains part of the first_name then ONLY grab the exact match.

2)If there is no exact match(they are not equal), I would like to keep the rows that partially match. By partially match, I mean whether the first_name_match contains part of the first_name.

3) if there is no match/partial match- keep the rows regardless and flag them for further understanding

Overall, I want to keep the order that have an exact match, if no exact match then partial match and then keep the rows that have no match or are NA as well.

Please see output:

df_final<-data.frame(
  group=c("A","A","B","B","C","D","E","F","G"), 
  first_name=c("Linda","Linda","Henry","Henry","Hazel","Owen","Ava","Nora","Rose"),
  first_name_match=c("Linda","Linda","Oliver","Oliver","Hazel","Owen","Ava",NA,"Alex/Rose"))

Upvotes: 0

Views: 108

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388817

You can write a function to select rows for each group based on condition.

library(dplyr)
library(stringr)

select_rows <- function(x, y) {
   #If any exact match is found return exact match
   if(any(x == y, na.rm = TRUE)) x == y
   #else if partial match is found return partial match
   else if(any(str_detect(y, x), na.rm = TRUE)) str_detect(y, x)
   #If none of the above then return all rows
   else TRUE
}

and apply this function by group.

df %>% group_by(group) %>% filter(select_rows(first_name, first_name_match))


# group first_name first_name_match
#  <chr> <chr>      <chr>           
#1 A     Linda      Linda           
#2 A     Linda      Linda           
#3 B     Henry      Oliver          
#4 B     Henry      Oliver          
#5 C     Hazel      Hazel           
#6 D     Owen       Owen            
#7 E     Ava        Ava             
#8 F     Nora       NA              
#9 G     Rose       Alex/Rose       

Upvotes: 1

Waldi
Waldi

Reputation: 41220

library(data.table)
library(dplyr)
df<-data.frame(
  group=c("A","A","A","A","B","B","C","C","D","D","E","F","G","G"), 
  first_name=c("Linda","Linda","Linda","Linda","Henry","Henry","Hazel","Hazel","Owen","Owen","Ava","Nora","Rose","Rose"),
  first_name_match=c("Linda","Linda", "John","John","Oliver","Oliver","Hazel","Violet","Owen","Owen/Ben","Ava",NA,"Alex/Rose","Alex"))%>% as.data.table()

#1 exact match
df1 <- df[first_name==first_name_match]

#2 partial match
df2 <- df[mapply(grepl,df$first_name,df$first_name_match)] 
df2 <- df2[!df1,on = .(group,first_name)]

#3 no match
df3 <- df[!rbind(df1,df2), on = .(group,first_name)]

#Result
result <- rbind(df1,df2,df3)
result

group first_name first_name_match
1:     A      Linda            Linda
2:     A      Linda            Linda
3:     C      Hazel            Hazel
4:     D       Owen             Owen
5:     E        Ava              Ava
6:     G       Rose        Alex/Rose
7:     B      Henry           Oliver
8:     B      Henry           Oliver
9:     F       Nora             <NA>

Upvotes: 2

Related Questions