Working with temporary columns (created on-the-fly) more efficiently in a dataframe

Question

Consider the following dataframe:

df <- data.frame(replicate(5, sample(1:10, 10, replace = TRUE)))

If I want to divide each row by its sum (to make a probability distribution), I need to do something like this:

df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)

This really feels inefficient:

1. Create a temporary rs column holding the row sums.
2. Divide each value by the sum of its row.
3. Remove the temporary column to restore the original shape of the dataframe.

When working with existing columns, it feels much more natural:

df %>% summarise_each(funs(weighted.mean(., X1)), -X1)

Using dplyr, is there a better way to work with temporary columns (created on-the-fly) than adding them and then removing them after processing?

I'm also interested in how data.table would handle such a task.

eddi · Accepted Answer

As I mentioned in a comment above, I don't think it makes sense to keep that data in either a data.frame or a data.table. But if you must, the following does it without converting to a matrix, and it illustrates how to create a temporary variable in the data.table j-expression:

library(data.table)
dt = as.data.table(df)

dt[, names(dt) := {sums = Reduce(`+`, .SD); lapply(.SD, '/', sums)}]
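For anyone who wants to check the result, here is a minimal, self-contained sketch of the same idea. The seed and the names base_res, row_totals and same_result are my own additions for illustration and are not part of the original answer. It spells out that sums lives only inside the braces of j, and compares the output against plain base R division by rowSums(), which is the matrix-style alternative.

```r
library(data.table)

set.seed(1)  # assumed seed, only so the sketch is reproducible
df <- data.frame(replicate(5, sample(1:10, 10, replace = TRUE)))

dt <- as.data.table(df)

# `sums` is a local variable inside the braces of j; it never becomes a column.
dt[, names(dt) := {
  sums <- Reduce(`+`, .SD)   # element-wise sum across columns, i.e. row sums
  lapply(.SD, `/`, sums)     # divide every column by those row sums
}]

# Base R comparison: dividing a data.frame by a vector of length nrow(df)
# recycles the vector down each column, so each row is divided by its own sum.
base_res <- df / rowSums(df)

# Each row of dt should now total 1, up to floating-point error.
row_totals <- rowSums(dt)

# Both approaches should agree on the values.
same_result <- isTRUE(all.equal(as.matrix(dt), as.matrix(base_res),
                                check.attributes = FALSE))
```

The braces are what make the temporary variable possible: j evaluates the whole block and only the last expression (the list returned by lapply) is assigned to the columns named on the left of :=.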
