Devin

Reputation: 911

Ways to add multiple columns to data frame using plyr/dplyr/purrr

I often need to mutate a data frame by adding several columns at once using a custom function, preferably with parallelization. Below are the ways I already know how to do this.

Setup

library(plyr)   # load plyr before dplyr so that dplyr's verbs (e.g. mutate) are not masked
library(dplyr)
library(purrr)
library(doMC)
registerDoMC(2)

df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

Suppose that I want two new columns, foocol = x + y and barcol = (x + y) * 100, but that these are actually complex calculations done in a custom function.

Method 1: Add columns separately using rowwise and mutate

foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)

df_out1 <- df %>% rowwise() %>% mutate(foocol = foo(x, y), barcol = bar(x, y))

This is not a good solution since it requires two function calls for each row and two "expensive" calculations of x + y. It's also not parallelized.
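A minimal sketch of how the duplicated work in Method 1 could be avoided, assuming the expensive step can be isolated in one function: mutate() lets a later expression refer to a column created earlier in the same call, so x + y only needs to be computed once. The name expensive_sum is a hypothetical stand-in, and the sketch assumes it is vectorized (otherwise rowwise() would still be needed before mutate()).

```r
library(dplyr)

df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

# Hypothetical stand-in for the expensive calculation (assumed vectorized).
expensive_sum <- function(x, y) x + y

df_out1b <- df %>%
  mutate(
    foocol = expensive_sum(x, y),  # the expensive step runs once
    barcol = foocol * 100          # reuses the column created just above
  )
```

This removes the repeated calculation but does nothing for parallelization.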

Method 2: Trick ddply into rowwise operation

df2 <- df
df2$id <- 1:nrow(df2)

df_out2 <- ddply(df2, .(id), function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  return(cbind(r, foocol, barcol))
}, .parallel = TRUE)

Here I trick ddply into calling a function on each row by splitting on a unique id column I just created. It's clunky, though, and requires maintaining a useless column.
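For comparison, here is a rough sketch of the same row-by-row calculation without the extra id column, using mclapply from base R's parallel package instead of ddply. This is a swapped-in technique rather than anything plyr provides, and it assumes a Unix-alike system, since mc.cores > 1 relies on forking. The name df_out2b is only illustrative.

```r
library(parallel)

df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

# Process one row per task, spread across 2 forked workers.
# Assumption: Unix-alike OS; on Windows mc.cores must be 1.
rows <- mclapply(seq_len(nrow(df)), function(i) {
  r <- df[i, ]                       # one-row data frame
  foocol <- r$x + r$y
  barcol <- foocol * 100
  cbind(r, foocol = foocol, barcol = barcol)
}, mc.cores = 2)

# Stack the one-row results back into a single data frame.
df_out2b <- do.call(rbind, rows)
```

For work this cheap the overhead of forking will likely outweigh any gain; the pattern would only pay off when the per-row calculation is genuinely expensive.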

Method 3: splat

foobar <- function(x, y, ...) {
  foocol <- x + y
  barcol <- foocol * 100
  return(data.frame(x, y, ..., foocol, barcol))
}

df_out3 <- splat(foobar)(df)

I like this solution since you can reference the columns of df by name in the custom function (which can be anonymous if desired) without indexing into a row object. However, this method isn't parallelized.

Method 4: by_row

df_out4 <- df %>% by_row(function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  return(data.frame(foocol = foocol, barcol = barcol))
}, .collate = "cols")

The by_row function from purrr (moved to the separate purrrlyr package in later purrr releases) eliminates the need for the unique id column, but this operation isn't parallelized.

Method 5: pmap_df

df_out5 <- pmap_df(df, foobar)
# or equivalently...
df_out5 <- df %>% pmap_df(foobar)

This is the best option I've found. The pmap family of functions also accepts anonymous functions to apply to the arguments. I believe pmap_df converts df to a list and back, though, so there may be a performance hit.

It's also a bit annoying that I need to reference all the columns I plan on using for calculation in the function definition function(x, y, ...) instead of just function(r) for the row object.
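One possible workaround for the function(x, y, ...) signature, sketched under the assumption that capturing every argument is acceptable: take only ... and collect it into a list, which then behaves much like the row object r. The sketch uses pmap_dfr, the row-binding variant that pmap_df is an alias for; df_out5b is an illustrative name.

```r
library(purrr)

df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

df_out5b <- pmap_dfr(df, function(...) {
  r <- list(...)                     # named list: r$x, r$y, r$z for this row
  foocol <- r$x + r$y
  barcol <- foocol * 100
  # Return the original values plus the two new columns as a one-row data frame.
  data.frame(r, foocol = foocol, barcol = barcol)
})
```

The trade-off is that the function no longer documents which columns it needs, and building a list and a one-row data frame per row adds overhead.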


Am I missing any good or better options? Are there any concerns with the methods I described?

Upvotes: 4

Views: 1644

Answers (1)

the_skua

Reputation: 1291

How about using data.table?

library(data.table)

foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)

dt <- as.data.table(df)

dt[, foocol:=foo(x,y)]
dt[, barcol:=bar(x,y)]

The data.table library is quite fast and has at least some potential for parallelization.
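If the point is to compute x + y only once, a sketch along these lines may be closer to what the question asks: data.table's := accepts a character vector of new column names on the left and a list of values on the right, so both columns can be added in a single call from one shared intermediate. The intermediate name s is illustrative.

```r
library(data.table)

df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
dt <- as.data.table(df)

# Add both columns by reference in one call, computing x + y a single time.
dt[, c("foocol", "barcol") := {
  s <- x + y            # shared "expensive" intermediate
  list(s, s * 100)      # one list element per new column, in order
}]
```

This works on whole columns at once, so it assumes the custom calculation is vectorized.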

Upvotes: 1
