Reputation: 69
I have a dataset with multiple rows, which is a database of contact details. Some entries are repeated several times, but contain different information. Example:
> example
Title Name Email Phone
[1,] "Chair" "Oswald Gruber" "abc@abc.com" "+33 12345"
[2,] "Respondent" "Oswald Gruber" NA "+44 54321"
[3,] "Comm.mngr" "Kaspar Villiger" "qwe@rty.com" "+1 123456"
[4,] "Investment mngr" "Markus Urben" "jkl@jkl.com" NA
[5,] "Responsible" "Markus Urben" "jkl@jkl.com" "+1 33333"
So what I'd like is to have each person mentioned only once, but without losing the info. So, if I merge two first rows in this example, I'd like to retain the email address and both phone numbers, but only one kind of title.
I would appreciate if anybody would suggest any solutions.
Upvotes: 1
Views: 132
Reputation: 4024
I'd split it out into a relational database. This is assuming that names can serve as a unique key to identify a person.
library(dplyr)
test =
data_frame(
Title = c("Chair", "Respondent", "Comm.mngr", "Investment mngr", "Responsible"),
Name = c("Oswald Gruber", "Oswald Gruber", "Kaspar Villiger", "Markus Urben", "Markus Urben"),
Email = c("abc@abc.com", NA, "qwe@rty.com", "jkl@jkl.com", "jkl@jkl.com"),
Phone = c("+33 12345", "+44 54321", "+1 123456", NA, "+1 33333") )
Person__Title =
test %>%
select(Name, Title) %>%
distinct %>%
filter(!is.na(Title))
Person__Phone =
test %>%
select(Name, Phone) %>%
distinct %>%
filter(!is.na(Phone))
Person__Email =
test %>%
select(Name, Email) %>%
distinct %>%
filter(!is.na(Email))
Person =
test %>%
select(Name) %>%
distinct
Then, if you want to merge it all together, you could do
make_list = function(vector)
vector %>%
unique %>%
na.omit %>%
paste(collapse = ", ")
merge =
Person %>%
left_join(Person__Email) %>%
left_join(Person__Phone) %>%
left_join(Person__Title) %>%
group_by(Name) %>%
summarise_each(funs(make_list))
But I wouldn't recommend it. Having multiple pieces of information in the same cell goes against the principles of good data design.
Upvotes: -2
Reputation: 32466
You can group by Name
and convert each column to a list.
library(dplyr)
dat %>% group_by(Name) %>%
summarise_each(funs(list)) -> res
So, it looks like
data.frame(res)
# Name Title Email
# 1 Kaspar Villiger Comm.mngr qwe@rty.com
# 2 Markus Urben Investment mngr, Responsible jkl@jkl.com, jkl@jkl.com
# 3 Oswald Gruber Chair, Respondent abc@abc.com, NA
# Phone
# 1 +1 123456
# 2 NA, +1 33333
# 3 +33 12345, +44 54321
I don't know what further modifications you want, but you could easily remove NA values/duplicates and choose to only keep one Title with a slight change to the code. For example, to remove NA/duplicates, you could use funs(list(unique(na.omit(.))))
.
Upvotes: 1
Reputation: 59425
And a data.table solution:
library(data.table)
mrg <- function(x) paste(unique(x[!is.na(x)]),collapse=", ")
setDT(example)[,list(Title=head(Title,1), Email=mrg(Email), Phone=mrg(Phone)), by="Name"]
# Name Title Email Phone
# 1: Oswald Gruber Chair abc@abc.com +33 12345, +44 54321
# 2: Kaspar Villiger Comm.mngr qwe@rty.com +1 123456
# 3: Markus Urben Investment mngr jkl@jkl.com +1 33333
This returns the first Title in the natural order, which may or may not be what you want. It also consolidates emails properly and removes NAs.
Upvotes: 2
Reputation: 418
Assuming your "example" object is a data.frame (may also work with a matrix). For those columns you want to retain all the information, you could do this (untested for coding typos):
result <- aggregate(example[,c("Phone","Email")], by = list(Name = example[,"Name"]), FUN = paste, sep = ", ")
Thus phone numbers and eMail adresses will be pasted into single columns as string elements, the values being separated by comas). Alternatively,
result <- aggregate(example[,c("Phone","Email")], by = list(Name = example[,"Name"]), FUN = c)
will give 2 columns containing list of vectors of length = the number of entries for a particular Name.
For the columns you want to retain just one value, well you have to specify which you want. If it's always the first value to come, that would be
result2 <- aggregate(example[,"title"], by = list(Name = example[,"Name"]), function(x) x[1])
And then you can
merged <- merge(results, result2)
Upvotes: 1