JBH
JBH

Reputation: 101

Remove duplicate first names from a full name in R

Consider the following data.frame:

df <- data.frame(ID = 1:3, Name = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Chayes, Michael Jordan, John DeNero, Ani Adhikari, Jordan, Mia Scher", "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood") , stringsAsFactors = FALSE)

I would like to remove the duplicates First Name/Last Name if the Full Name is available in the String. Also, no changes made to the string if there is no match. The result should be like the data-frame provided below;

df <- data.frame(ID = 1:3, Name = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Chayes, Michael Jordan, John DeNero, Ani Adhikari, Jordan, Mia Scher", "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood"), UniqueName = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Michael Jordan, John DeNero, Ani Adhikari, Mia Scher", "Nenshad Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood") , stringsAsFactors = FALSE)

Any Inputs will be really appreciable.

Upvotes: 0

Views: 75

Answers (1)

slamballais
slamballais

Reputation: 3235

Answer

Use grepl to find strings that [1] do not contain a space, and [2] are present in other names.

Code

df$UniqueName <- sapply(df$Name, function(x) {
  sn <- unlist(strsplit(x, split = ", ", fixed = TRUE))
  sn2 <- sn[!(!grepl(" ", sn) & sapply(sn, function(y) sum(grepl(y, sn)) > 1))]
  paste(sn2, collapse = ", ")
})

Rationale

We use sapply since each entry needs a lot of work. We essentially perform 3 steps: [1] split the string with strsplit, [2] subset to keep only those that you want, [3] paste the string back together with paste.

The reasoning here is that single first or last names do not contain a space, and if they are present in other names then you want to remove them. Hence, we find those that do not have a space (!grepl(" ", sn)) and that are a substring of another entry (sapply(sn, function(y) sum(grepl(y, sn)) > 1)). Then, we remove those using [!( )].

Upvotes: 1

Related Questions