Pavel
Pavel

Reputation: 69

Removing duplicate entries while retaining unique information

I have a dataset with multiple rows, which is a database of contact details. Some entries are repeated several times, but contain different information. Example:

> example
     Title             Name              Email         Phone      
[1,] "Chair"           "Oswald Gruber"   "abc@abc.com" "+33 12345"
[2,] "Respondent"      "Oswald Gruber"   NA            "+44 54321"
[3,] "Comm.mngr"       "Kaspar Villiger" "qwe@rty.com" "+1 123456"
[4,] "Investment mngr" "Markus Urben"    "jkl@jkl.com" NA         
[5,] "Responsible"     "Markus Urben"    "jkl@jkl.com" "+1 33333"

So what I'd like is to have each person mentioned only once, but without losing the info. So, if I merge two first rows in this example, I'd like to retain the email address and both phone numbers, but only one kind of title.

I would appreciate if anybody would suggest any solutions.

Upvotes: 1

Views: 132

Answers (4)

bramtayl
bramtayl

Reputation: 4024

I'd split it out into a relational database. This is assuming that names can serve as a unique key to identify a person.

library(dplyr)

test = 
  data_frame(
    Title = c("Chair", "Respondent", "Comm.mngr", "Investment mngr", "Responsible"),
    Name = c("Oswald Gruber", "Oswald Gruber", "Kaspar Villiger", "Markus Urben", "Markus Urben"),
    Email = c("abc@abc.com", NA, "qwe@rty.com", "jkl@jkl.com", "jkl@jkl.com"),
    Phone = c("+33 12345", "+44 54321", "+1 123456", NA, "+1 33333") )

Person__Title = 
  test %>%
  select(Name, Title) %>%
  distinct %>%
  filter(!is.na(Title))

Person__Phone =
  test %>%
  select(Name, Phone) %>%
  distinct %>%
  filter(!is.na(Phone))

Person__Email = 
  test %>%
  select(Name, Email) %>%
  distinct %>%
  filter(!is.na(Email))

Person = 
  test %>%
  select(Name) %>%
  distinct

Then, if you want to merge it all together, you could do

make_list = function(vector)
  vector %>%
  unique %>%
  na.omit %>%
  paste(collapse = ", ")

merge = 
  Person %>%
  left_join(Person__Email) %>%
  left_join(Person__Phone) %>%
  left_join(Person__Title) %>%
  group_by(Name) %>%
  summarise_each(funs(make_list))

But I wouldn't recommend it. Having multiple pieces of information in the same cell goes against the principles of good data design.

Upvotes: -2

Rorschach
Rorschach

Reputation: 32466

You can group by Name and convert each column to a list.

library(dplyr)
dat %>% group_by(Name) %>%
  summarise_each(funs(list)) -> res

So, it looks like

data.frame(res)
#              Name                        Title                    Email
# 1 Kaspar Villiger                    Comm.mngr              qwe@rty.com
# 2    Markus Urben Investment mngr, Responsible jkl@jkl.com, jkl@jkl.com
# 3   Oswald Gruber            Chair, Respondent          abc@abc.com, NA
#                  Phone
# 1            +1 123456
# 2         NA, +1 33333
# 3 +33 12345, +44 54321

I don't know what further modifications you want, but you could easily remove NA values/duplicates and choose to only keep one Title with a slight change to the code. For example, to remove NA/duplicates, you could use funs(list(unique(na.omit(.)))).

Upvotes: 1

jlhoward
jlhoward

Reputation: 59425

And a data.table solution:

library(data.table)
mrg <- function(x) paste(unique(x[!is.na(x)]),collapse=", ")
setDT(example)[,list(Title=head(Title,1), Email=mrg(Email), Phone=mrg(Phone)), by="Name"]

#               Name           Title       Email                Phone
# 1:   Oswald Gruber           Chair abc@abc.com +33 12345, +44 54321
# 2: Kaspar Villiger       Comm.mngr qwe@rty.com            +1 123456
# 3:    Markus Urben Investment mngr jkl@jkl.com             +1 33333

This returns the first Title in the natural order, which may or may not be what you want. It also consolidates emails properly and removes NAs.

Upvotes: 2

jeanlain
jeanlain

Reputation: 418

Assuming your "example" object is a data.frame (may also work with a matrix). For those columns you want to retain all the information, you could do this (untested for coding typos):

result <- aggregate(example[,c("Phone","Email")], by = list(Name = example[,"Name"]), FUN = paste, sep = ", ")

Thus phone numbers and eMail adresses will be pasted into single columns as string elements, the values being separated by comas). Alternatively,

result <- aggregate(example[,c("Phone","Email")], by = list(Name = example[,"Name"]), FUN = c)

will give 2 columns containing list of vectors of length = the number of entries for a particular Name.

For the columns you want to retain just one value, well you have to specify which you want. If it's always the first value to come, that would be

result2 <- aggregate(example[,"title"], by = list(Name = example[,"Name"]), function(x) x[1])

And then you can

merged <- merge(results, result2)

Upvotes: 1

Related Questions