Reputation: 97
This is a portion of my dataset:
df<-data.frame(
group=c("A","A","A","A","B","B","C","C","D","D","E","F","G","G"),
first_name=c("Linda","Linda","Linda","Linda","Henry","Henry","Hazel","Hazel","Owen","Owen","Ava","Nora","Rose","Rose"),
first_name_match=c("Linda","Linda", "John","John","Oliver","Oliver","Hazel","Violet","Owen","Owen/Ben","Ava",NA,"Alex/Rose","Alex"))
For each group,
1) If the column first_name and first_name match are equal (exactly the same), keep ONLY those rows and get rid of the other rows.
2) If the column first_name and first_name match are equal (exactly the same), and there is a partial match b/w other rows. By partial match, I mean whether the first_name_match contains part of the first_name then ONLY grab the exact match.
2)If there is no exact match(they are not equal), I would like to keep the rows that partially match. By partially match, I mean whether the first_name_match contains part of the first_name.
3) if there is no match/partial match- keep the rows regardless and flag them for further understanding
Overall, I want to keep the order that have an exact match, if no exact match then partial match and then keep the rows that have no match or are NA as well.
Please see output:
df_final<-data.frame(
group=c("A","A","B","B","C","D","E","F","G"),
first_name=c("Linda","Linda","Henry","Henry","Hazel","Owen","Ava","Nora","Rose"),
first_name_match=c("Linda","Linda","Oliver","Oliver","Hazel","Owen","Ava",NA,"Alex/Rose"))
Upvotes: 0
Views: 108
Reputation: 388817
You can write a function to select rows for each group based on condition.
library(dplyr)
library(stringr)
select_rows <- function(x, y) {
#If any exact match is found return exact match
if(any(x == y, na.rm = TRUE)) x == y
#else if partial match is found return partial match
else if(any(str_detect(y, x), na.rm = TRUE)) str_detect(y, x)
#If none of the above then return all rows
else TRUE
}
and apply this function by group.
df %>% group_by(group) %>% filter(select_rows(first_name, first_name_match))
# group first_name first_name_match
# <chr> <chr> <chr>
#1 A Linda Linda
#2 A Linda Linda
#3 B Henry Oliver
#4 B Henry Oliver
#5 C Hazel Hazel
#6 D Owen Owen
#7 E Ava Ava
#8 F Nora NA
#9 G Rose Alex/Rose
Upvotes: 1
Reputation: 41220
library(data.table)
library(dplyr)
df<-data.frame(
group=c("A","A","A","A","B","B","C","C","D","D","E","F","G","G"),
first_name=c("Linda","Linda","Linda","Linda","Henry","Henry","Hazel","Hazel","Owen","Owen","Ava","Nora","Rose","Rose"),
first_name_match=c("Linda","Linda", "John","John","Oliver","Oliver","Hazel","Violet","Owen","Owen/Ben","Ava",NA,"Alex/Rose","Alex"))%>% as.data.table()
#1 exact match
df1 <- df[first_name==first_name_match]
#2 partial match
df2 <- df[mapply(grepl,df$first_name,df$first_name_match)]
df2 <- df2[!df1,on = .(group,first_name)]
#3 no match
df3 <- df[!rbind(df1,df2), on = .(group,first_name)]
#Result
result <- rbind(df1,df2,df3)
result
group first_name first_name_match
1: A Linda Linda
2: A Linda Linda
3: C Hazel Hazel
4: D Owen Owen
5: E Ava Ava
6: G Rose Alex/Rose
7: B Henry Oliver
8: B Henry Oliver
9: F Nora <NA>
Upvotes: 2