Reputation: 101
Consider the following data.frame:
df <- data.frame(ID = 1:3, Name = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Chayes, Michael Jordan, John DeNero, Ani Adhikari, Jordan, Mia Scher", "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood") , stringsAsFactors = FALSE)
I would like to remove a duplicated first or last name when the full name is also present in the string. If there is no match, the string should be left unchanged. The result should look like the data frame below:
df <- data.frame(ID = 1:3, Name = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Chayes, Michael Jordan, John DeNero, Ani Adhikari, Jordan, Mia Scher", "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood"), UniqueName = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Michael Jordan, John DeNero, Ani Adhikari, Mia Scher", "Nenshad Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood") , stringsAsFactors = FALSE)
Any input would be much appreciated.
Upvotes: 0
Views: 75
Reputation: 3235
Answer
Use grepl to find strings that [1] do not contain a space, and [2] are present in other names.
Code
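df$UniqueName <- sapply(df$Name, function(x) {
  # Split the comma-separated string into individual names
  sn <- unlist(strsplit(x, split = ", ", fixed = TRUE))
  # Drop entries that have no space and also occur inside another entry
  sn2 <- sn[!(!grepl(" ", sn) & sapply(sn, function(y) sum(grepl(y, sn)) > 1))]
  # Rebuild the comma-separated string
  paste(sn2, collapse = ", ")
})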
Rationale
We use sapply since each entry needs a lot of work. We essentially perform 3 steps: [1] split the string with strsplit, [2] subset to keep only those that you want, [3] paste the string back together with paste.
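As a rough illustration, the three steps can be traced on a single entry taken from the question's data. The variable names drop and result are only used for this sketch and are not part of the answer's code.

```r
# One entry from the question's data, used to trace the three steps.
x <- "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie"

# [1] Split the string into individual names.
sn <- unlist(strsplit(x, split = ", ", fixed = TRUE))
# sn is c("Nenshad Bardoliwalla", "Bardoliwalla", "Alex Woodie")

# [2] Flag names without a space that also occur inside another entry,
#     then keep everything that is not flagged.
drop <- !grepl(" ", sn) & sapply(sn, function(y) sum(grepl(y, sn)) > 1)
sn2 <- sn[!drop]

# [3] Paste the remaining names back into one string.
result <- paste(sn2, collapse = ", ")
result
# "Nenshad Bardoliwalla, Alex Woodie"
```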
The reasoning here is that single first or last names do not contain a space, and if they are present in other names then you want to remove them. Hence, we find those that do not have a space (!grepl(" ", sn)) and that are a substring of another entry (sapply(sn, function(y) sum(grepl(y, sn)) > 1)). Then, we remove those using [!( )].
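To see how the two conditions combine, here is a small sketch that computes each one separately. Most names come from the question, but "Madonna" is a hypothetical entry added to show a lone name with no match. The use of vapply and fixed = TRUE is an assumption of this sketch rather than part of the code above: fixed = TRUE makes grepl match each name literally instead of as a regular expression.

```r
# Names from the question plus one hypothetical lone name ("Madonna").
sn <- c("Jennifer Chayes", "Chayes", "Michael Jordan", "Jordan", "Madonna")

# Condition 1: the entry is a single word (no space).
no_space <- !grepl(" ", sn)

# Condition 2: the entry occurs in more than one element of sn,
# i.e. in itself and in at least one other entry.
# fixed = TRUE (an addition in this sketch) matches y literally.
in_other <- vapply(
  sn,
  function(y) sum(grepl(y, sn, fixed = TRUE)) > 1,
  logical(1),
  USE.NAMES = FALSE
)

# Side-by-side view of both conditions and the final decision.
data.frame(name = sn, no_space, in_other, removed = no_space & in_other)

# Keep everything that does not satisfy both conditions.
kept <- sn[!(no_space & in_other)]
kept
# "Jennifer Chayes" "Michael Jordan" "Madonna"
```

Because the check is a substring match, a lone name is also removed when it merely appears inside a longer, unrelated name, so it may be worth checking the results on real data.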
Upvotes: 1