elksie5000
elksie5000

Reputation: 7752

How to collapse rows of data

I'm a newbie to R and data.table and but I'm trying to collapse a customer data set that takes the following format - although it extends across 90 columns:

frame <- data.frame(
  customer_id = c(123, 123, 123),
  time = c(1, 2, 3),
  rec_type = c('contact', 'appointment', 'sale'),
  variable_1 = c('Yes', NA, "Yes"),
  variable_2 = c(NA, 'No', NA),
  variable_3 = c(NA, NA, 'complete'),
  variable_4 = NA, stringsAsFactors = FALSE)



customer_id     time    rec_type     variable_1     variable_2  variable_3 variable_4 
123             1        contact      Yes            NA          NA        NA
123             2        appointment  NA             No          NA        NA
123             3        sale         Yes             NA         complete    NA

I asked before - What's the best way to collapse sparse data into single rows in R? - how to collapse the data for each customer into a single row and got two useful answers in data.table and dplyr.

However, those answers couldn't handle multiple values such as the 'rec_type' field or where are multiple instances of the same value variable_1.

I'd like to lapply a function which works across columns and returns a row vector in which each field is either the single value for each field, NA if all column values are blank or 'multiple'

In this case: my output would be

customer_id     time     rec_type     variable_1     variable_2  variable_3 variable_4
    123         multiple multiple      Yes            No          complete    NA

I worked out how to count the unique values across columns:

unique_values <- function(x){
  uniques <- dt[contact_no == x,][,lapply(.SD, uniqueN)]
  uniques
}

lapply(dt$contact_no, unique_values)

But couldn't work how to use the results from uniques to return the results I'd like.

Can anyone suggest an approach I can use?

Is there a simpler way of tackling the problem?

Upvotes: 1

Views: 1717

Answers (1)

lmo
lmo

Reputation: 38500

Here is one data.table method.

setDT(frame)[, lapply(.SD, function(x)
                      {x <- unique(x[!is.na(x)])
                       if(length(x) == 1) as.character(x)
                       else if(length(x) == 0) NA_character_
                       else "multiple"}),
             by=customer_id]

The idea is to use lapply to apply an anonymous function to all variables and construct the function in a manner that returns the desired results. This function strips out NA values and duplicates and then checks the length of the resulting vector. The output of each is cast as a character in order to comply with the possibility of "multiple" occurring for another customer_id.

this returns

   customer_id     time rec_type variable_1 variable_2 variable_3 variable_4
1:         123 multiple multiple        Yes         No   complete         NA

Upvotes: 3

Related Questions