Reputation: 21

Summarize and generate multiple variables in a loop

I am looking for an efficient way to manipulate multiple variables within a data frame. Right now I am using dplyr, but this becomes cumbersome as the number of variables grows. Suppose I have the following data frame, where brd is a car brand, ye is a year, type is the car type, and cy and hp are type characteristics.

brd <-c("BMW","BMW","BMW","Volvo","Volvo", "Volvo","BMW","BMW","BMW","Volvo","Volvo","Volvo")
ye <- c(99,99,99,99,99,99,98,98,98,98,98,98)
type <- c(1,2,3,1,2,3,1,2,3,1,2,3)
cy <- c(1895,1991,1587,2435,2435,1596,1991,1588,1984,1596,1991,1588)
hp <- c(77,110,80,103,103,75,110,77,93,75,110,77)

df <- data.frame(brd, ye, type, cy, hp)
df
     brd ye type   cy  hp
1    BMW 99    1 1895  77
2    BMW 99    2 1991 110
3    BMW 99    3 1587  80
4  Volvo 99    1 2435 103
5  Volvo 99    2 2435 103
6  Volvo 99    3 1596  75
7    BMW 98    1 1991 110
8    BMW 98    2 1588  77
9    BMW 98    3 1984  93
10 Volvo 98    1 1596  75
11 Volvo 98    2 1991 110
12 Volvo 98    3 1588  77

For each year, I want to compute the sum of each product characteristic over all other products of the same brand and add it as a new variable to the data frame. Right now, I am using dplyr like this:

library(dplyr)
df <- df %>% group_by(brd, ye) %>%
  mutate(sumall_cy = sum(cy),
         sumall_hp = sum(hp))

df <- df %>%
  mutate(sumother_cy = sumall_cy-cy,
         sumother_hp = sumall_hp-hp)

So that I get

      brd    ye  type    cy    hp sumall_cy sumall_hp sumother_cy sumother_hp
   <fctr> <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>       <dbl>       <dbl>
1     BMW    99     1  1895    77      5473       267        3578         190
2     BMW    99     2  1991   110      5473       267        3482         157
3     BMW    99     3  1587    80      5473       267        3886         187
4   Volvo    99     1  2435   103      6466       281        4031         178
5   Volvo    99     2  2435   103      6466       281        4031         178
6   Volvo    99     3  1596    75      6466       281        4870         206
7     BMW    98     1  1991   110      5563       280        3572         170
8     BMW    98     2  1588    77      5563       280        3975         203
9     BMW    98     3  1984    93      5563       280        3579         187
10  Volvo    98     1  1596    75      5175       262        3579         187
11  Volvo    98     2  1991   110      5175       262        3184         152
12  Volvo    98     3  1588    77      5175       262        3587         185

Is there a more efficient way? I was thinking about looping like this stata code:

foreach x of varlist hp cy {
    bysort brd ye: egen sumall_`x' = sum(`x')
    gen sumother_`x' = sumall_`x' - `x'
}

Any suggestions would be greatly appreciated.

Upvotes: 2

Answers (2)

Uwe

Reputation: 42592

If there are many type characteristics like cy and hp, I suggest reshaping the data to long format and doing all the similar transformations there. For this purpose, melt() and dcast() from the data.table package are used:

library(data.table)   # CRAN version 1.10.4 used
# coerce to data.table
DT <- data.table(df)
# reshape from wide to long format, 
# specify id.vars because number of measure.vars may change in the future
long <- melt(DT, id.vars = c("brd", "ye", "type"))
# create additional columns
long[, sumall := sum(value), by = .(brd, ye, variable)][, sumother := sumall - value][]
# reshape from long to wide format
dcast(long, brd + ye + type ~ ..., value.var = c("value", "sumall", "sumother"))

      brd ye type value_cy value_hp sumall_cy sumall_hp sumother_cy sumother_hp
 1:   BMW 98    1     1991      110      5563       280        3572         170
 2:   BMW 98    2     1588       77      5563       280        3975         203
 3:   BMW 98    3     1984       93      5563       280        3579         187
 4:   BMW 99    1     1895       77      5473       267        3578         190
 5:   BMW 99    2     1991      110      5473       267        3482         157
 6:   BMW 99    3     1587       80      5473       267        3886         187
 7: Volvo 98    1     1596       75      5175       262        3579         187
 8: Volvo 98    2     1991      110      5175       262        3184         152
 9: Volvo 98    3     1588       77      5175       262        3587         185
10: Volvo 99    1     2435      103      6466       281        4031         178
11: Volvo 99    2     2435      103      6466       281        4031         178
12: Volvo 99    3     1596       75      6466       281        4870         206

If the sumall columns are not required in the final result, they can be removed before reshaping:

dcast(long[, sumall := NULL], brd + ye + type ~ ..., value.var = c("value", "sumother"))

      brd ye type value_cy value_hp sumother_cy sumother_hp
 1:   BMW 98    1     1991      110        3572         170
 2:   BMW 98    2     1588       77        3975         203
 3:   BMW 98    3     1984       93        3579         187
 4:   BMW 99    1     1895       77        3578         190
 5:   BMW 99    2     1991      110        3482         157
 6:   BMW 99    3     1587       80        3886         187
 7: Volvo 98    1     1596       75        3579         187
 8: Volvo 98    2     1991      110        3184         152
 9: Volvo 98    3     1588       77        3587         185
10: Volvo 99    1     2435      103        4031         178
11: Volvo 99    2     2435      103        4031         178
12: Volvo 99    3     1596       75        4870         206
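
For comparison, the same sumother columns can probably also be produced without reshaping, by looping over the columns with lapply() and .SDcols inside a grouped assignment. This is only a minimal sketch of that alternative, not the reshape approach above; the names cols and new_cols are made up for illustration, and the example data is rebuilt so that the block runs on its own.

```r
library(data.table)

# same example data as in the question
DT <- data.table(
  brd  = rep(rep(c("BMW", "Volvo"), each = 3), times = 2),
  ye   = rep(c(99, 98), each = 6),
  type = rep(1:3, times = 4),
  cy   = c(1895, 1991, 1587, 2435, 2435, 1596, 1991, 1588, 1984, 1596, 1991, 1588),
  hp   = c(77, 110, 80, 103, 103, 75, 110, 77, 93, 75, 110, 77)
)

cols     <- c("cy", "hp")               # columns to process (illustrative name)
new_cols <- paste0("sumother_", cols)   # names of the columns to create

# for each brand/year group: group total minus the row's own value
DT[, (new_cols) := lapply(.SD, function(v) sum(v) - v),
   by = .(brd, ye), .SDcols = cols]
DT[]
```

Adding another characteristic then only requires extending cols. The assignment happens by reference, so DT itself is modified and the original row order is kept.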

Upvotes: 1

mt1022

Reputation: 17319

Here is a solution using non-standard evaluation. The group_by operation needs to be done only once, and the approach also works when you have many more columns to process:

library(dplyr)  # 0.7.0
library(rlang)  # required for the `syms` function

varlist <- c('cy', 'hp')

# make a list of quosures, one operation per variable
ops <- sapply(syms(varlist), function(x) quo(sum(UQ(x)) - UQ(x)) )

# set new variable name
names(ops) <- paste('sumother', varlist, sep = '_')

# get results
df %>% group_by(brd, ye) %>% mutate(!!!ops) %>% ungroup()
# # A tibble: 12 x 7
#       brd    ye  type    cy    hp sumother_cy sumother_hp
#    <fctr> <dbl> <dbl> <dbl> <dbl>       <dbl>       <dbl>
#  1    BMW    99     1  1895    77        3578         190
#  2    BMW    99     2  1991   110        3482         157
#  3    BMW    99     3  1587    80        3886         187
#  4  Volvo    99     1  2435   103        4031         178
#  5  Volvo    99     2  2435   103        4031         178
#  6  Volvo    99     3  1596    75        4870         206
#  7    BMW    98     1  1991   110        3572         170
#  8    BMW    98     2  1588    77        3975         203
#  9    BMW    98     3  1984    93        3579         187
# 10  Volvo    98     1  1596    75        3579         187
# 11  Volvo    98     2  1991   110        3184         152
# 12  Volvo    98     3  1588    77        3587         185

To keep the sumall_ columns as well, we could try:

ops <- sapply(syms(varlist), function(x) list(quo(sum(UQ(x))), quo(sum(UQ(x)) - UQ(x))) )
names(ops) <- paste(
    rep(c('sumall', 'sumother'), length(varlist)),
    rep(varlist, each = 2), sep = '_')
df %>% group_by(brd, ye) %>% mutate(!!!ops) %>% ungroup()

# # A tibble: 12 x 9
#       brd    ye  type    cy    hp sumall_cy sumother_cy sumall_hp sumother_hp
#    <fctr> <dbl> <dbl> <dbl> <dbl>     <dbl>       <dbl>     <dbl>       <dbl>
#  1    BMW    99     1  1895    77      5473        3578       267         190
#  2    BMW    99     2  1991   110      5473        3482       267         157
#  3    BMW    99     3  1587    80      5473        3886       267         187
#  4  Volvo    99     1  2435   103      6466        4031       281         178
#  5  Volvo    99     2  2435   103      6466        4031       281         178
#  6  Volvo    99     3  1596    75      6466        4870       281         206
#  7    BMW    98     1  1991   110      5563        3572       280         170
#  8    BMW    98     2  1588    77      5563        3975       280         203
#  9    BMW    98     3  1984    93      5563        3579       280         187
# 10  Volvo    98     1  1596    75      5175        3579       262         187
# 11  Volvo    98     2  1991   110      5175        3184       262         152
# 12  Volvo    98     3  1588    77      5175        3587       262         185
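
With a newer dplyr (1.0.0 or later, which is an assumption here since the code above was written for 0.7.0), the same result can likely be had with across() instead of building quosures by hand. The following is a minimal sketch under that assumption; the result name res is made up for illustration, and the example data is rebuilt so that the block runs on its own.

```r
library(dplyr)  # across() requires dplyr >= 1.0.0

# same example data as in the question
df <- data.frame(
  brd  = rep(rep(c("BMW", "Volvo"), each = 3), times = 2),
  ye   = rep(c(99, 98), each = 6),
  type = rep(1:3, times = 4),
  cy   = c(1895, 1991, 1587, 2435, 2435, 1596, 1991, 1588, 1984, 1596, 1991, 1588),
  hp   = c(77, 110, 80, 103, 103, 75, 110, 77, 93, 75, 110, 77)
)

varlist <- c("cy", "hp")

res <- df %>%
  group_by(brd, ye) %>%
  # apply both functions to every column named in varlist;
  # .names builds the new names, e.g. sumall_cy and sumother_cy
  mutate(across(all_of(varlist),
                list(sumall = ~ sum(.x), sumother = ~ sum(.x) - .x),
                .names = "{.fn}_{.col}")) %>%
  ungroup()
res
```

Dropping the sumall entry from the list leaves only the sumother_ columns.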

Upvotes: 1
