Reputation: 77
I have a dataset with User Information. For a specific user I have often multiple rows with more or less complete information. I want to summarize all rows that belong to a customer on the basis of First_Name, Last_Name, Street while keeping all information of the other columns and if there are two unique observation for a specific column I want to collapse them with ",".
This is what the df looks like
First_Name Last_Name Street Column1 Colum2 Colum_n
Mike Smith X abc ab a
Mike Smith X abc ad b
John Smith Y xyz xy n
John Smith Y xyz xm NA
My desired output would be
First_Name Last_Name Street Column1 Colum2 Colum_n
Mike Smith X abc ab,ad a,b
John Smith Y xyz xy,xm n
I would like using dplyr and tried something with
df %>%
group_by(First_Name,Last_Name, Street) %>%
summarise_all(funs())
The problem with that function is that I only had the option of using something like the mean or the first occuring value for a column and this would mean the loss of values. What I would like are columns with all unique values without NA's
Upvotes: 3
Views: 1637
Reputation: 28685
If you want to keep them as a vector, instead of converting them to a single character string, you can do
library(dplyr)
df %>%
group_by(First_Name,Last_Name, Street) %>%
summarise_all(~list(unique(.[!is.na(.)]))) %>%
print.data.frame
# First_Name Last_Name Street Column1 Colum2 Colum_n
# 1 John Smith Y xyz xy, xm n
# 2 Mike Smith X abc ab, ad a, b
or with data.table
library(data.table)
setDT(df)
df[, lapply(.SD, function(x) .(unique(x[!is.na(x)])))
, by = .(First_Name,Last_Name, Street)]
# First_Name Last_Name Street Column1 Colum2 Colum_n
# 1: Mike Smith X abc ab,ad a,b
# 2: John Smith Y xyz xy,xm n
Upvotes: 2
Reputation: 39858
Using tidyverse
:
df %>%
group_by(First_Name, Last_Name, Street) %>%
summarise_all(funs(paste0(unique(.[!is.na(.)]), collapse= ",")))
First_Name Last_Name Street Column1 Colum2 Colum_n
<fct> <fct> <fct> <chr> <chr> <chr>
1 John Smith Y xyz xy,xm n
2 Mike Smith X abc ab,ad a,b
First, it is grouping by "First_Name", "Last_Name" and "Street". Then, it takes all the unique non-NA values and collapses them into one string.
Upvotes: 2
Reputation: 39154
A solution using tidyverse
.
library(tidyverse)
dat2 <- dat %>%
group_by(First_Name, Last_Name, Street) %>%
# Replace NA with ""
mutate_all(funs(replace(., is.na(.), ""))) %>%
# Combine all strings
summarize_all(funs(toString(unique(.)))) %>%
# Replace the strings ended with ", "
mutate_all(funs(str_replace(., ", $", ""))) %>%
ungroup()
dat2
# # A tibble: 2 x 6
# First_Name Last_Name Street Column1 Colum2 Colum_n
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 John Smith Y xyz xy, xm n
# 2 Mike Smith X abc ab, ad a, b
After seeing others answer, I realized that we don't have to deal with NA
and ,
as strings. The following is more efficient.
dat2 <- dat %>%
group_by(First_Name, Last_Name, Street) %>%
# Combine all strings
summarize_all(funs(toString(unique(.[!is.na(.)])))) %>%
ungroup()
dat2
# # A tibble: 2 x 6
# First_Name Last_Name Street Column1 Colum2 Colum_n
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 John Smith Y xyz xy, xm n
# 2 Mike Smith X abc ab, ad a, b
DATA
dat <- read.table(text = 'First_Name Last_Name Street Column1 Colum2 Colum_n
Mike Smith X abc ab a
Mike Smith X abc ad b
John Smith Y xyz xy n
John Smith Y xyz xm NA',
header = TRUE, stringsAsFactors = FALSE)
Upvotes: 2
Reputation: 1736
You can write your own summarization function like
concat_unique <- function(x){paste(unique(x), collapse=',')}
and then apply it using
summarize_all(concat_unique)
Upvotes: 3