Reputation: 7752
I'm a newbie to R and data.table and but I'm trying to collapse a customer data set that takes the following format - although it extends across 90 columns:
frame <- data.frame(
customer_id = c(123, 123, 123),
time = c(1, 2, 3),
rec_type = c('contact', 'appointment', 'sale'),
variable_1 = c('Yes', NA, "Yes"),
variable_2 = c(NA, 'No', NA),
variable_3 = c(NA, NA, 'complete'),
variable_4 = NA, stringsAsFactors = FALSE)
customer_id time rec_type variable_1 variable_2 variable_3 variable_4
123 1 contact Yes NA NA NA
123 2 appointment NA No NA NA
123 3 sale Yes NA complete NA
I asked before - What's the best way to collapse sparse data into single rows in R? - how to collapse the data for each customer into a single row and got two useful answers in data.table and dplyr.
However, those answers couldn't handle multiple values such as the 'rec_type' field or where are multiple instances of the same value variable_1.
I'd like to lapply a function which works across columns and returns a row vector in which each field is either the single value for each field, NA if all column values are blank or 'multiple'
In this case: my output would be
customer_id time rec_type variable_1 variable_2 variable_3 variable_4
123 multiple multiple Yes No complete NA
I worked out how to count the unique values across columns:
unique_values <- function(x){
uniques <- dt[contact_no == x,][,lapply(.SD, uniqueN)]
uniques
}
lapply(dt$contact_no, unique_values)
But couldn't work how to use the results from uniques to return the results I'd like.
Can anyone suggest an approach I can use?
Is there a simpler way of tackling the problem?
Upvotes: 1
Views: 1717
Reputation: 38500
Here is one data.table method.
setDT(frame)[, lapply(.SD, function(x)
{x <- unique(x[!is.na(x)])
if(length(x) == 1) as.character(x)
else if(length(x) == 0) NA_character_
else "multiple"}),
by=customer_id]
The idea is to use lapply
to apply an anonymous function to all variables and construct the function in a manner that returns the desired results. This function strips out NA values and duplicates and then checks the length of the resulting vector. The output of each is cast as a character in order to comply with the possibility of "multiple" occurring for another customer_id.
this returns
customer_id time rec_type variable_1 variable_2 variable_3 variable_4
1: 123 multiple multiple Yes No complete NA
Upvotes: 3