Steven Beaupré

Reputation: 21641

Working with temporary columns (created on-the-fly) more efficiently in a dataframe

Consider the following dataframe:

df <- data.frame(replicate(5, sample(1:10, 10, replace = TRUE)))

If I want to divide each row by its sum (to make a probability distribution), I need to do something like this:

df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)

This really feels inefficient:

  1. Create an rs column
  2. Divide each value by its corresponding row sum
  3. Remove the temporarily created column to clean up the original dataframe.

When working with existing columns, it feels much more natural:

df %>% summarise_each(funs(weighted.mean(., X1)), -X1)

Using dplyr, is there a better way to work with temporary columns (created on-the-fly) than adding them and then removing them after processing?

I'm also interested in how data.table would handle such a task.

Upvotes: 4

Views: 2148

Answers (2)

eddi

Reputation: 49448

As I mentioned in a comment above, I don't think it makes sense to keep that data in either a data.frame or a data.table. If you must, though, the following does it without converting to a matrix, and it illustrates how to create a temporary variable in the data.table j-expression:

library(data.table)

dt = as.data.table(df)

dt[, names(dt) := {sums = Reduce(`+`, .SD); lapply(.SD, '/', sums)}]
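To make the temporary variable easier to see, here is a minimal sketch of the same idea spread over several lines. The fixed seed and the checks at the end are additions for illustration, not part of the question. The point is that sums lives only while the j-expression is evaluated, so there is no column to drop afterwards.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, only so the example is reproducible
df <- data.frame(replicate(5, sample(1:10, 10, replace = TRUE)))
dt <- as.data.table(df)

dt[, names(dt) := {
  sums <- Reduce(`+`, .SD)   # temporary: element-wise sum of all columns
  lapply(.SD, `/`, sums)     # each column divided by the row totals
}]

# The temporary never became a column, and every row now sums to 1.
stopifnot(identical(names(dt), names(df)))
stopifnot(all(abs(rowSums(dt) - 1) < 1e-12))
```

Because := is used without a row filter, the integer columns are replaced wholesale by the new numeric ones.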

Upvotes: 5

Colonel Beauvel

Reputation: 31171

Why not consider base R as well:

as.data.frame(as.matrix(df)/rowSums(df))

Or directly on your data.frame, without the matrix conversion:

df/rowSums(df)
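Why this divides by row: rowSums(df) has one element per row, and when a data.frame is divided by a vector of that length, R recycles the vector down each column, so element i always lines up with row i. A minimal sketch that checks this (the seed, the names rs and probs, and the prop.table() cross-check are additions for illustration):

```r
set.seed(42)  # assumption: fixed seed, only for reproducibility
df <- data.frame(replicate(5, sample(1:10, 10, replace = TRUE)))

rs    <- rowSums(df)   # one sum per row
probs <- df / rs       # rs is recycled down each column

# Still a data.frame with the same columns, and every row sums to 1.
stopifnot(is.data.frame(probs), identical(names(probs), names(df)))
stopifnot(all(abs(rowSums(probs) - 1) < 1e-12))

# Cross-check against base R's prop.table() with margin = 1 (rows).
stopifnot(isTRUE(all.equal(as.matrix(probs),
                           prop.table(as.matrix(df), margin = 1),
                           check.attributes = FALSE)))
```

This relies on the vector having length nrow(df). A vector of length ncol(df) would still be recycled down the columns, not across each row, so it would not give per-column scaling.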

Upvotes: 2
