jacatra
jacatra

Reputation: 527

Combining variables together into a list

Folks,

I'm stumped with the following challenge. I have a data set that looks like this:

BuyerID    Fruit.1     Fruit.2    Fruit.3    Amount.1    Amount.2    Amount.3
879        Banana      Apple                 4           3
765        Strawberry  Apple      Orange     1           2           4
123        Orange      Banana                1           1           1
 11        Strawberry                        3
773        Kiwi        Banana                1           2

What I would like to do is to simplify the data (if possible) and collapse the "Fruit" and "Amount" variables

BuyerID    Fruit                             Amount      Total    Count
879        "Banana" "Apple"                  4  3            7        2
765        "Strawberry" "Apple" "Orange"     1  2  4         7        3
123        "Orange" "Banana"                 1  1  1         3        2
 11        "Strawberry"                      3               3        1
773        "Kiwi" "Banana"                   1  2            3        2

I have tried using c() and rbind() but they do not produce the results that I want - I've tried the tip here: data.frame rows to a list as well but am not too sure whether this is the best way to simplify my data.

This is so that presumably it would be easier for me to deal with fewer variables to count the occurrence of certain items for example (e.g. 60% buyers purchase Banana).

I hope this is doable - am also open to any suggestions. Any solutions appreciated!

Thank you.

Upvotes: 7

Views: 2488

Answers (5)

jaimedash
jaimedash

Reputation: 2743

It didn't exist when the question was asked, but tidyr works well for this.

Reusing data from @mnel's answer,

library(tidyr)
separator <- ' '
DT %>%
  unite(Fruit, grep("Fruit", names(.)), sep = separator) %>%
  unite(Amount, grep("Amount", names(.)), sep = separator)

#   BuyerID                   Fruit  Amount Total Count
# 1     879         Banana Apple NA  4 3 NA     7     2
# 2     765 Strawberry Apple Orange   1 2 4     7     3
# 3     123        Orange Banana NA   1 1 1     3     2
# 4      11        Strawberry NA NA 3 NA NA     3     1
# 5     773          Kiwi Banana NA  1 2 NA     3     2

Upvotes: 0

agstudy
agstudy

Reputation: 121568

Here a solution , with base package. It is like Tyler solution but with a single apply.

res <- apply(DT,1,function(x){
  data.frame(Fruit= paste(na.omit(x[2:4]),collapse=' '),
             Amount = paste(na.omit(x[5:7]),collapse =','),
             Total = sum(as.numeric(na.omit(x[5:7]))),
             Count = length(na.omit(x[2:4])))
})
do.call(rbind,res)
                    Fruit  Amount Total Count
1            Banana Apple    4, 3     7     2
2 Strawberry Apple Orange 1, 2, 4     7     3
3           Orange Banana 1, 1, 1     3     2
4              Strawberry       3     3     1
5             Kiwi Banana    1, 2     3     2

I would also change the index number by a grep, something like this

 Fruit  = gregexpr('Fruit[.][0-9]', colnames(dat)) > 0  
 Amount = gregexpr('Amount[.][0-9]', colnames(dat)) > 0 

 x[2:4] replace by x[which(Fruit)]....

EDIT add some benchmarking.

library(microbenchmark)
library(data.table)
microbenchmark(ag(),mn(), am(), tr())
Unit: milliseconds
  expr       min        lq    median        uq       max
1 ag() 11.584522 12.268140 12.671484 13.317934 109.13419
2 am()  9.776206 10.515576 10.798504 11.437938 137.44867
3 mn()  6.470190  6.805646  6.974797  7.290722  48.68571
4 tr()  1.759771  1.929870  2.026960  2.142066   7.06032

For a small data.frame, Tyler Rinker is the winner!! How I explain this (just a guess)

  1. data:table solution suffer from the use of reshape and generally data.table is faster for big data.
  2. Ag study solution is slower because of the subsetting for each row , not like Tyler solution which subset before using apply.
  3. am solution is slow because of the use of reshape and merge..

Upvotes: 6

mnel
mnel

Reputation: 115390

Attempting to replicate your data, and using data.table

DT  <- data.frame(
  BuyerID = c(879,765,123,11,773), 
  Fruit.1 = c('Banana','Strawberry','Orange','Strawberry','Kiwi'),
  Fruit.2 = c('Apple','Apple','Banana',NA,'Banana'),
  Fruit.3 = c( NA, 'Orange',NA,NA,NA),
  Amount.1 = c(4,1,1,3,1), Amount.2 = c(3,2,1,NA,2), Amount.3 = c(NA,4,1,NA,NA),
  Total = c(7,7,3,3,3), 
  Count = c(2,3,2,1,2), 
  stringsAsFactors = FALSE)

# reshaping to long form and data.table

library(data.table)
DTlong <- data.table(reshape(DT, varying = list(Fruit = 2:4, Amount = 5:7), 
  direction = 'long'))

# create lists (without NA values)
# also adding count and total columns 
# by using <- to save Fruit and Amount for later use

DTlist <- DTlong[, list(Fruit <- list(as.vector(na.omit(Fruit.1))), 
                        Amount <- list(as.vector(na.omit(Amount.1))), 
                        Count  = length(unlist(Fruit)),
                        Total = sum(unlist(Amount))), 
                 by = BuyerID]

  BuyerID                      V1    V2 Count Total
1:     879            Banana,Apple   4,3     2     7
2:     765 Strawberry,Apple,Orange 1,2,4     3     7
3:     123           Orange,Banana 1,1,1     2     3
4:      11              Strawberry     3     1     3
5:     773             Kiwi,Banana   1,2     2     3

@RicardoSaporta edit:

You can skip the reshape step, if youd like, using list(list(c(....)))
This will probably save a fair bit of execution time (the drawback is it adds NAs not blank spaces). However, as @Marius points out, DTlong above is probably easier to work with.

DT <- data.table(DT)
DT[,   Fruit := list(list(c(  Fruit.1,   Fruit.2,   Fruit.3))), by=BuyerID]
DT[, Ammount := list(list(c(Amount.1, Amount.2, Amount.3))), by=BuyerID]

# Or as a single line
DT[,   list(  Fruit = list(c( Fruit.1,  Fruit.2,  Fruit.3)), 
            Ammount = list(c(Amount.1, Amount.2, Amount.3)), 
            Total, Count),  # other columns used
            by = BuyerID]

Upvotes: 11

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193527

Adding to the already existing great answers, here is another (sticking to base R):

with(DT, {
  # Convert to long format
  DTlong <- reshape(DT, direction = "long", 
                    idvar = "BuyerID", varying = 2:ncol(DT))
  # aggregate your fruit columns 
  # You need the `do.call(data.frame, ...)` to convert
  #   the resulting matrix-as-a-column into separate columns
  Agg1 <- do.call(data.frame, 
                  aggregate(Fruit ~ BuyerID, DTlong,
                            function(x) c(Fruit = paste0(x, collapse = " "),
                                          Count = length(x))))
  # aggregate the amount columns
  Agg2 <- aggregate(Amount ~ BuyerID, DTlong, sum)
  # merge the results
  merge(Agg1, Agg2)
})
#   BuyerID             Fruit.Fruit Fruit.Count Amount
# 1      11              Strawberry           1      3
# 2     123           Orange Banana           2      3
# 3     765 Strawberry Apple Orange           3      7
# 4     773             Kiwi Banana           2      3
# 5     879            Banana Apple           2      7

The basic concept is to:

  1. Use reshape to get your data in long form (which is where I think you should stop, actually)
  2. Use two different aggregate commands, one to aggregate your fruit columns, and one to aggregate your amount columns. The formula approach of aggregate takes care of removing NA, but you can specify your desired behavior with the na.action argument.
  3. Use merge to combine the two.

Upvotes: 4

Tyler Rinker
Tyler Rinker

Reputation: 109874

This is a really bad idea but here it is in base data.frame. It works because data.frame is actually a list of equal length vectors. You can force data.frame to store vectors in cells but it takes some hackery. I'd suggest other formats including Marius's suggestion or a list.

DT <- data.frame(
  BuyerID = c(879,765,123,11,773), 
  Fruit.1 = c('Banana','Strawberry','Orange','Strawberry','Kiwi'),
  Fruit.2 = c('Apple','Apple','Banana',NA,'Banana'),
  Fruit.3 = c( NA, 'Orange',NA,NA,NA),
  Amount.1 = c(4,1,1,3,1), Amount.2 = c(3,2,1,NA,2), Amount.3 = c(NA,4,1,NA,NA),
  stringsAsFactors = FALSE)

DT2 <- DT[, 1, drop=FALSE]
DT2$Fruit <- apply(DT[, 2:4], 1, function(x) unlist(na.omit(x)))
DT2$Amount <- apply(DT[, 5:7], 1, function(x) unlist(na.omit(x)))
DT2$Total <- sapply(DT2$Amount, sum)
DT2$Count <- sapply(DT2$Fruit, length)

Yielding:

> DT2
  BuyerID                     Fruit  Amount Total Count
1     879             Banana, Apple    4, 3     7     2
2     765 Strawberry, Apple, Orange 1, 2, 4     7     3
3     123            Orange, Banana 1, 1, 1     3     2
4      11                Strawberry       3     3     1
5     773              Kiwi, Banana    1, 2     3     2

Upvotes: 5

Related Questions