Reputation: 3200
My question involves summing values across multiple columns of a data frame and creating a new column that holds this sum, using dplyr. The data entries in the columns are binary (0, 1). I am thinking of a row-wise analog of the summarise_each or mutate_each function of dplyr. Below is a minimal example of the data frame:
library(dplyr)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
> df
x1 x2 x3 x4 x5
1 1 1 0 1 1
2 0 1 1 0 1
3 0 NA 0 NA NA
4 NA 1 1 1 1
5 0 1 1 0 1
6 1 0 0 0 1
7 1 NA NA NA NA
8 NA NA NA 0 1
9 0 0 0 0 0
10 1 1 1 1 1
I could use something like:
df <- df %>% mutate(sumrow= x1 + x2 + x3 + x4 + x5)
but this would involve writing out the name of each column, and I have about 50 columns. In addition, the column names change between iterations of the loop in which I want to implement this operation, so I would like to avoid having to give any column names.
How can I do that most efficiently? Any assistance would be greatly appreciated.
Upvotes: 182
Views: 313898
Reputation: 18632
In newer versions of dplyr you can use rowwise() along with c_across to perform row-wise aggregation for functions that do not have specific row-wise variants. If a row-wise variant exists (e.g. rowSums, rowMeans), it should be faster than using rowwise.
Since rowwise() is just a special form of grouping that changes the way verbs work, you'll likely want to pipe the result to ungroup() after doing your row-wise operation.
To select a range by name:
df %>%
rowwise() %>%
mutate(sumrange = sum(c_across(x1:x5), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
To select by type:
df %>%
rowwise() %>%
mutate(sumnumeric = sum(c_across(where(is.numeric)), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
To select by column name:
You can use any number of tidy selection helpers like starts_with, ends_with, contains, etc.
df %>%
rowwise() %>%
mutate(sum_startswithx = sum(c_across(starts_with("x")), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
To select by column index:
df %>%
rowwise() %>%
mutate(sumindex = sum(c_across(c(1:4, 5)), na.rm = T))
# %>% ungroup() # you'll likely want to ungroup after using rowwise()
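To see why the ungroup() step matters, here is a minimal sketch on a made-up three-row data frame. The names small_df, rw, per_row, and overall are illustrative only, and the sketch assumes dplyr 1.0.0 or later (for c_across).

```r
library(dplyr)

# Hypothetical toy data, only for illustration
small_df <- data.frame(x1 = c(1, 0, NA), x2 = c(1, NA, NA))

rw <- small_df %>%
  rowwise() %>%
  mutate(s = sum(c_across(x1:x2), na.rm = TRUE))

# The result is still a rowwise data frame, so later verbs keep working per row
still_rowwise <- inherits(rw, "rowwise_df")

# Without ungroup(): one output row per input row
per_row <- rw %>% summarise(total = sum(s))

# With ungroup(): a single overall total
overall <- rw %>% ungroup() %>% summarise(total = sum(s))
```

Forgetting ungroup() is easy to miss because the mutate output looks the same either way; the difference only shows up in later verbs such as summarise.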
rowwise() will work for any summary function. However, in your specific case a row-wise variant exists (rowSums), so you can do the following, which will be faster:
df %>%
mutate(sumrow = rowSums(pick(x1:x5), na.rm = T))
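Because the question mentions that the column names change between loop iterations, the same rowSums call can also take its columns from a character vector. This is a sketch assuming dplyr 1.1.0 or later (for pick); cols_to_sum is a hypothetical variable standing in for whatever the loop provides.

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# Hypothetical: column names supplied at run time, e.g. by a loop
cols_to_sum <- c("x1", "x3", "x5")

# all_of() tells tidyselect to treat the vector as column names
result <- df %>%
  mutate(sumrow = rowSums(pick(all_of(cols_to_sum)), na.rm = TRUE))
```

all_of() fails loudly if a listed column is missing; any_of() is the lenient counterpart if missing columns should be skipped silently.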
rowwise makes a pipe chain very readable and works fine for smaller data frames. However, it is inefficient.
rowwise versus row-wise variant function
For this example, the row-wise variant rowSums is much faster:
library(microbenchmark)
set.seed(1)
large_df <- slice_sample(df, n = 1E5, replace = T) # 100,000 obs
microbenchmark(
large_df %>%
rowwise() %>%
mutate(sumrange = sum(c_across(x1:x5), na.rm = T)),
large_df %>%
mutate(sumrow = rowSums(pick(x1:x5), na.rm = T)),
times = 10L
)
Unit: milliseconds
min lq mean median uq max neval cld
11108.459801 11464.276501 12144.871171 12295.362251 12690.913301 12918.106801 10 b
6.533301 6.649901 7.633951 7.808201 8.296101 8.693101 10 a
Large data frame without a row-wise variant function
If there isn't a row-wise variant for your function and you have a large data frame, consider a long format, which is more efficient than rowwise. Though there are probably faster non-tidyverse options, here is a tidyverse option (using tidyr::pivot_longer):
library(tidyr)
tidyr_pivot <- function(){
large_df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = starts_with("x")) %>%
group_by(rn) %>%
summarize(std = sd(value, na.rm = T), .groups = "drop") %>%
bind_cols(large_df, .) %>%
select(-rn)
}
dplyr_rowwise <- function(){
large_df %>%
rowwise() %>%
mutate(std = sd(c_across(starts_with("x")), na.rm = T)) %>%
ungroup()
}
microbenchmark(dplyr_rowwise(),
tidyr_pivot(),
times = 10L)
Unit: seconds
expr min lq mean median uq max neval cld
dplyr_rowwise() 12.845572 13.48340 14.182836 14.30476 15.155155 15.409750 10 b
tidyr_pivot() 1.404393 1.56015 1.652546 1.62367 1.757428 1.981293 10 a
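As one example of the faster non-tidyverse options mentioned above, a row-wise standard deviation can be assembled from rowMeans and rowSums. This is only a sketch: row_sd is a hypothetical helper name, it assumes every selected column is numeric, and it was not part of the benchmark above.

```r
# Hypothetical helper: row-wise sample standard deviation, ignoring NAs
row_sd <- function(data) {
  m <- as.matrix(data)
  n <- rowSums(!is.na(m))                     # non-missing values per row
  # The vector of row means recycles down the columns,
  # so each row is centered by its own mean
  centered <- m - rowMeans(m, na.rm = TRUE)
  ss <- rowSums(centered^2, na.rm = TRUE)     # sum of squared deviations
  # sd is undefined for fewer than two values, as with sd()
  ifelse(n > 1, sqrt(ss / (n - 1)), NA_real_)
}

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

std <- row_sd(df)
```

Inside a pipe this could be used as mutate(std = row_sd(pick(starts_with("x")))), again assuming dplyr 1.1.0 or later.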
c_across versus pick
In the particular case of the sum function, pick and c_across give the same output for much of the code above:
sum_pick <- df %>%
rowwise() %>%
mutate(sumrange = sum(pick(x1:x5), na.rm = T))
sum_c_across <- df %>%
rowwise() %>%
mutate(sumrange = sum(c_across(x1:x5), na.rm = T))
all.equal(sum_pick, sum_c_across)
[1] TRUE
The row-wise output of c_across is a vector (hence the c_), while the row-wise output of pick is a 1-row tibble object:
df %>%
rowwise() %>%
mutate(c_across = list(c_across(x1:x5)),
pick = list(pick(x1:x5)),
.keep = "unused") %>%
ungroup()
# A tibble: 10 × 2
c_across pick
<list> <list>
1 <dbl [5]> <tibble [1 × 5]>
2 <dbl [5]> <tibble [1 × 5]>
3 <dbl [5]> <tibble [1 × 5]>
4 <dbl [5]> <tibble [1 × 5]>
5 <dbl [5]> <tibble [1 × 5]>
6 <dbl [5]> <tibble [1 × 5]>
7 <dbl [5]> <tibble [1 × 5]>
8 <dbl [5]> <tibble [1 × 5]>
9 <dbl [5]> <tibble [1 × 5]>
10 <dbl [5]> <tibble [1 × 5]>
The function you want to apply determines which helper you use. As shown above with sum, you can use them nearly interchangeably. However, mean and many other common functions expect a (numeric) vector as their first argument:
class(df[1,])
"data.frame"
sum(df[1,]) # works with data.frame
[1] 4
mean(df[1,]) # does not work with data.frame
[1] NA
Warning message:
In mean.default(df[1, ]) : argument is not numeric or logical: returning NA
class(unname(unlist(df[1,])))
"numeric"
sum(unname(unlist(df[1,]))) # works with numeric vector
[1] 4
mean(unname(unlist(df[1,]))) # works with numeric vector
[1] 0.8
Ignoring the row-wise variant that exists for mean (rowMeans), c_across should be used in this case:
df %>%
rowwise() %>%
mutate(avg = mean(c_across(x1:x5), na.rm = T)) %>%
ungroup()
# A tibble: 10 x 6
x1 x2 x3 x4 x5 avg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 1 1 0.8
2 0 1 1 0 1 0.6
3 0 NA 0 NA NA 0
4 NA 1 1 1 1 1
5 0 1 1 0 1 0.6
6 1 0 0 0 1 0.4
7 1 NA NA NA NA 1
8 NA NA NA 0 1 0.5
9 0 0 0 0 0 0
10 1 1 1 1 1 1
# Does not work
df %>%
rowwise() %>%
mutate(avg = mean(pick(x1:x5), na.rm = T)) %>%
ungroup()
rowSums, rowMeans, etc. can take a numeric data frame as the first argument, which is why they work with pick.
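For completeness, the mean example can also be written with the row-wise variant, which accepts the data frame returned by pick. A short sketch, assuming dplyr 1.1.0 or later; avg_df is an illustrative name.

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# No rowwise() needed: rowMeans works on the whole data frame at once
avg_df <- df %>%
  mutate(avg = rowMeans(pick(x1:x5), na.rm = TRUE))
```

The avg column should match the c_across output shown earlier, without the per-row overhead.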
Upvotes: 83
Reputation: 32642
If legibility is not a concern, but speed is, you can construct the desired x1 + x2 + x3 + x4 + x5 expression using rlang, then evaluate that.
library(dplyr)
library(purrr)
library(rlang)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
# Construct a list of columns to sum, make them symbols
names_symbols <- syms(colnames(df))
# Construct an expression by placing `+` between each symbol
sum_expression <- reduce(names_symbols, ~expr(!!.x + !!.y))
# Evaluate that expression within the mutate
df <- df %>% mutate(sumrow = eval_tidy(!!sum_expression))
# Or, as a one-liner
df <- df %>% mutate(sumrow = eval_tidy(!!reduce(syms(colnames(.)), ~expr(!!.x + !!.y))))
A variant for the benchmark by @Agile Bean
sum.implicit_explicit <- function(){
dataset %>%
mutate(sum.cols = eval_tidy(!!purrr::reduce(syms(cols), ~ expr(!!.x + !!.y))))
}
Performance: very little overhead over explicit, much faster than the normal reduce.
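To make the construction more concrete, this sketch builds the expression for a made-up three-column data frame and inspects it before evaluating it. The names toy, sum_expr, expr_as_text, and out are illustrative. Note that a chain of `+` propagates NA, unlike rowSums(..., na.rm = TRUE).

```r
library(dplyr)
library(purrr)
library(rlang)

# Hypothetical toy data with one missing value
toy <- data.frame(a = c(1, NA), b = c(2, 3), d = c(4, 5))

# Turn the column names into symbols and fold them into one call
sum_expr <- reduce(syms(colnames(toy)), ~ expr(!!.x + !!.y))

# Inspect the generated expression as text
expr_as_text <- deparse(sum_expr)

# Evaluate it inside mutate, following the pattern used above
out <- toy %>% mutate(sumrow = eval_tidy(!!sum_expr))
```

Here expr_as_text is "a + b + d", and the second row of sumrow is NA because that row contains a missing value.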
Upvotes: 0
Reputation: 7752
sum up each row using rowSums (rowwise works for any aggregation, but is slower)
df %>%
replace(is.na(.), 0) %>%
mutate(sum = rowSums(across(where(is.numeric))))
sum down each column
df %>%
summarise(across(everything(), ~ sum(., na.rm = TRUE)))
sum up each row
df %>%
replace(is.na(.), 0) %>%
mutate(sum = rowSums(.[1:5]))
sum down each column using the superseded summarise_all:
df %>%
replace(is.na(.), 0) %>%
summarise_all(sum) # funs() is deprecated; pass the function directly
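If the NA values should stay in the data, the replacement step can be skipped by passing na.rm = TRUE to rowSums instead. A small sketch with the question's data, assuming dplyr 1.1.0 or later (for pick); kept_na is an illustrative name.

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# NAs are ignored in the sum but left untouched in the original columns
kept_na <- df %>%
  mutate(sum = rowSums(pick(where(is.numeric)), na.rm = TRUE))
```

The sums are the same as with replace(is.na(.), 0), but the x columns still show which values were missing.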
Upvotes: 242
Reputation: 7151
As it's difficult to decide among all the interesting answers given by @skd, @LMc, and others, I benchmarked all alternatives which are reasonably long.
The difference from other examples is that I used a larger dataset (10,000 rows) sampled from a real-world dataset (diamonds), so the findings might better reflect the variance of real-world data.
The reproducible benchmarking code is:
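library(dplyr)
library(tidyr) # nest()
library(purrr) # map_dbl(), reduce()
library(ggplot2) # diamonds dataset
set.seed(17)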
dataset <- diamonds %>% sample_n(1e4)
cols <- c("depth", "table", "x", "y", "z")
sum.explicit <- function() {
dataset %>%
mutate(sum.cols = depth + table + x + y + z)
}
sum.rowSums <- function() {
dataset %>%
mutate(sum.cols = rowSums(across(cols)))
}
sum.reduce <- function() {
dataset %>%
mutate(sum.cols = purrr::reduce(select(., cols), `+`))
}
sum.nest <- function() {
dataset %>%
group_by(id = row_number()) %>%
nest(data = cols) %>%
mutate(sum.cols = map_dbl(data, sum))
}
# NOTE: across with rowwise doesn't work with all functions!
sum.across <- function() {
dataset %>%
rowwise() %>%
mutate(sum.cols = sum(across(cols)))
}
sum.c_across <- function() {
dataset %>%
rowwise() %>%
mutate(sum.cols = sum(c_across(cols)))
}
sum.apply <- function() {
dataset %>%
mutate(sum.cols = select(., cols) %>%
apply(1, sum, na.rm = TRUE))
}
bench <- microbenchmark::microbenchmark(
sum.nest(),
sum.across(),
sum.c_across(),
sum.apply(),
sum.explicit(),
sum.reduce(),
sum.rowSums(),
times = 10
)
bench %>% print(order = 'mean', signif = 3)
Unit: microseconds
expr min lq mean median uq max neval
sum.explicit() 796 839 1160 950 1040 3160 10
sum.rowSums() 1430 1450 1770 1650 1800 2980 10
sum.reduce() 1650 1700 2090 2000 2140 3300 10
sum.apply() 9290 9400 9720 9620 9840 11000 10
sum.c_across() 341000 348000 353000 356000 359000 360000 10
sum.nest() 793000 827000 854000 843000 871000 945000 10
sum.across() 4810000 4830000 4880000 4900000 4920000 4940000 10
Visualizing this (without the outlier sum.across) facilitates the comparison:
nest and rowwise/c_across are not recommendable for larger datasets (> 100,000 rows or repeated actions).
rowSums comes close to the explicit sum, but with a little computational overhead.
purrr::reduce is relatively new in the tidyverse (but well known in Python) and, like Reduce in base R, very efficient, thus winning a place among the top 3. Because the explicit form is cumbersome to write, and there are not many vectorized methods other than rowSums/rowMeans and colSums/colMeans, I would recommend applying purrr::reduce for all other functions (e.g. sd).
Upvotes: 9
Reputation: 590
If you want to add a vector to a data frame across its columns or rows, modifying the df itself instead of adding a new column to it, you can use the sweep function:
library(dplyr)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
> df
x1 x2 x3 x4 x5
1 1 1 0 1 1
2 0 1 1 0 1
3 0 NA 0 NA NA
4 NA 1 1 1 1
5 0 1 1 0 1
6 1 0 0 0 1
7 1 NA NA NA NA
8 NA NA NA 0 1
9 0 0 0 0 0
10 1 1 1 1 1
Sum (vector + dataframe) in row-wise order:
vector = 1:5
sweep(df, MARGIN=2, vector, `+`)
x1 x2 x3 x4 x5
1 2 3 3 5 6
2 1 3 4 4 6
3 1 NA 3 NA NA
4 NA 3 4 5 6
5 1 3 4 4 6
6 2 2 3 4 6
7 2 NA NA NA NA
8 NA NA NA 4 6
9 1 2 3 4 5
10 2 3 4 5 6
Sum (vector + dataframe) in column-wise order:
vector <- 1:10
sweep(df, MARGIN=1, vector, `+`)
x1 x2 x3 x4 x5
1 2 2 1 2 2
2 2 3 3 2 3
3 3 NA 3 NA NA
4 NA 5 5 5 5
5 5 6 6 5 6
6 7 6 6 6 7
7 8 NA NA NA NA
8 NA NA NA 8 9
9 9 9 9 9 9
10 11 11 11 11 11
This is the same as writing vector + df.
You can also use sweep with other operators:
sweep(df, MARGIN=2, vector, `-`)
sweep(df, MARGIN=2, vector, `*`)
sweep(df, MARGIN=2, vector, `/`)
sweep(df, MARGIN=2, vector, `^`)
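As a quick check of the two MARGIN settings, here is a sketch on a small made-up data frame. The names toy, by_row, plain, and by_col are illustrative. It compares the MARGIN = 1 result with plain vector + data frame arithmetic and shows what MARGIN = 2 does.

```r
# Hypothetical toy data: 3 rows, 2 columns
toy <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# MARGIN = 1: one value per row. Plain addition gives the same result
# because R recycles the vector down each column.
by_row <- sweep(toy, MARGIN = 1, c(10, 20, 30), `+`)
plain  <- c(10, 20, 30) + toy
same   <- isTRUE(all.equal(as.matrix(by_row), as.matrix(plain)))

# MARGIN = 2: one value per column
by_col <- sweep(toy, MARGIN = 2, c(100, 200), `+`)
```

With MARGIN = 2, column a becomes 101, 102, 103 and column b becomes 204, 205, 206.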
Another way is to use Reduce, which adds the vector column-wise:
vector = 1:5
.df <- list(df, vector)
Reduce('+', .df)
Upvotes: 0
Reputation: 1967
Using reduce() from purrr is slightly faster than rowSums and definitely faster than apply, since you avoid iterating over all the rows and just take advantage of the vectorized operations:
library(purrr)
library(dplyr)
iris %>% mutate(Petal = reduce(select(., starts_with("Petal")), `+`))
See this for timings
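One difference worth keeping in mind is how missing values are handled. The sketch below first checks that reduce() with `+` matches rowSums() on the iris columns, then shows on a made-up data frame (with_na is an illustrative name) that `+` propagates NA while rowSums can drop it.

```r
library(dplyr)
library(purrr)

petal_cols <- select(iris, starts_with("Petal"))

# Column-by-column addition versus the dedicated row-wise function
via_reduce  <- reduce(petal_cols, `+`)
via_rowsums <- unname(rowSums(petal_cols))
same_result <- isTRUE(all.equal(via_reduce, via_rowsums))

# Hypothetical data with a missing value
with_na <- data.frame(a = c(1, NA), b = c(2, 3))
sum_reduce  <- reduce(with_na, `+`)                   # NA in the second row
sum_rowsums <- unname(rowSums(with_na, na.rm = TRUE)) # NA ignored
```

If the data contain NAs and they should count as zero, either replace them first or use rowSums with na.rm = TRUE.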
Upvotes: 33
Reputation: 901
I encounter this problem often, and the easiest way to do this is to use the apply() function within a mutate command.
library(tidyverse)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
df %>%
mutate(sum = select(., x1:x5) %>% apply(1, sum, na.rm=TRUE))
Here you could use whatever you want to select the columns using the standard dplyr tricks (e.g. starts_with() or contains()). By doing all the work within a single mutate command, this action can occur anywhere within a dplyr stream of processing steps. Finally, by using the apply() function, you have the flexibility to use whatever summary you need, including your own purpose-built summarization function.
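As an illustration of that flexibility, this sketch passes a purpose-built function to apply() that counts the non-missing values in each row. The names count_present and n_present are hypothetical.

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

# Hypothetical custom summary: number of non-missing entries in a row
count_present <- function(row_values) sum(!is.na(row_values))

counted <- df %>%
  mutate(n_present = select(., x1:x5) %>% apply(1, count_present))
```

Any function that takes a vector and returns a single value can be swapped in for count_present in the same way.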
Alternatively, if the idea of using a non-tidyverse function is unappealing, then you could gather up the columns, summarize them and finally join the result back to the original data frame.
df <- df %>% mutate( id = 1:n() ) # Need some ID column for this to work
df <- df %>%
group_by(id) %>%
gather('Key', 'value', starts_with('x')) %>%
summarise( Key.Sum = sum(value) ) %>%
left_join( df, . )
Here I used the starts_with() function to select the columns and calculated the sum, and you can do whatever you want with NA values. The downside to this approach is that while it is pretty flexible, it doesn't really fit into a dplyr stream of data cleaning steps.
Upvotes: 26
Reputation: 903
I would use regular expression matching to sum over variables with certain pattern names. For example:
df <- df %>% mutate(sum1 = rowSums(.[grep("x[3-5]", names(.))], na.rm = TRUE),
sum_all = rowSums(.[grep("x", names(.))], na.rm = TRUE))
This way you can create more than one variable as a sum of certain group of variables of your data frame.
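The same pattern-based selection can be sketched with the tidyselect helper matches(), which takes a regular expression. This assumes dplyr 1.1.0 or later (for pick). The anchored patterns are my own choice, so that only columns whose names start with x are matched.

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 0, 0, NA, 0, 1, 1, NA, 0, 1),
  x2 = c(1, 1, NA, 1, 1, 0, NA, NA, 0, 1),
  x3 = c(0, 1, 0, 1, 1, 0, NA, NA, 0, 1),
  x4 = c(1, 0, NA, 1, 0, 0, NA, 0, 0, 1),
  x5 = c(1, 1, NA, 1, 1, 1, NA, 1, 0, 1)
)

summed <- df %>%
  mutate(
    # x3, x4 and x5 only
    sum1 = rowSums(pick(matches("^x[3-5]$")), na.rm = TRUE),
    # every column whose name starts with x followed by a digit
    sum_all = rowSums(pick(matches("^x[0-9]+$")), na.rm = TRUE)
  )
```

Anchoring the pattern with ^ and $ keeps the newly created sum1 column out of the second selection.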
Upvotes: 34
Reputation: 911
If you want to sum certain columns only, I'd use something like this:
library(dplyr)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
df %>% select(x3:x5) %>% rowSums(na.rm=TRUE) -> df$x3x5.total
head(df)
This way you can use dplyr::select's syntax.
Upvotes: 41