rob
rob

Reputation: 47

Recursively sum data frames for matching rows

I would like to combine a set of data frames into a single data frame by summing columns that have matching variables (instead of appending columns).

For example, given

df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))

I want to match by "A" and "B" and sum the values of "x". For this example, I can get the desired result as follows:

library(plyr)
library(dplyr)
# rename columns so that join_all preserves them all:
colnames(df1)[3] <- "x1"
colnames(df2)[3] <- "x2"
colnames(df3)[3] <- "x3"
# join the data frames by matching "A" and "B" values:
res <- join_all(list(df1, df2, df3), by = c("A", "B"), type = "full")
# get the sums and drop superfluous columns:
arrange(res, A, B) %>% 
  rowwise() %>% 
  mutate(x = sum(x1, x2, x3, na.rm = TRUE)) %>% 
  select(A, B, x)

Result:

       A     B     x
   <dbl> <dbl> <dbl>
 1     0     1     6
 2     0     2     3
 3     0     5     5
 4     1     1     9
 5     1     2     5
 6     1     3     7
 7     1     4     3
 8     2     1     7
 9     2     2     2
10     2     4     0
11     2     5     3

A more general solution is

library(dplyr)
# function to get the desired result for two data frames:
my_merge <- function(df1, df2)
{
  m1 <- merge(df1, df2, by = c("A", "B"), all = TRUE)
  m1 <- rowwise(res) %>% 
    mutate(x = sum(x.x, x.y, na.rm = TRUE)) %>%
    select(A, B, x)
  return(m1)
}
l1 <- list(df2, df3) # omit the first data frame
res <- df1 # initial value of the result
for(df in l1) res <- my_merge(res, df) # call the function repeatedly

Is there a more efficient option for combining a large set of data frames? Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).

Upvotes: 1

Views: 197

Answers (2)

Frank
Frank

Reputation: 66819

Ideally it should be recursive (i.e. it's better not to join all data frames into one massive data frame before calculating the sums).

If you're memory constrained and willing to sacrifice speed (vs @akrun's data.table approach), use one table at a time in a loop:

library(data.table)
tabs = c("df1", "df2", "df3")

# enumerate all combos for the results table
# initializing sum to 0
res = CJ(A = 0:2, B = 1:5, x = 0)
# loop over tabs, adding on
for (i in seq_along(tabs)){
  tab = get(tabs[[i]])
  res[tab, on=.(A, B), x := x + i.x][]
  rm(tab)
}

If you need to read tables from disk, change tabs to file names and get to fread or whatever function.

I am skeptical that you can fit all the tables in memory, but cannot also fit an rbind-ed copy of them together.


Similarly (thanks to @akrun's comment), use his approach pairwise:

res = data.table(get(tabs[[1]]))[0L]

for (i in seq_along(tabs)){
  tab = get(tabs[[i]])
  res = rbind(res, tab)[, .(x = sum(x)), by=.(A,B)]
  rm(tab)
}

Upvotes: 2

akrun
akrun

Reputation: 886938

An easier option is to bind the rows of the datasets, then group by the columns of interest and get the summarised output by getting the sum of 'x'

library(tidyverse)
bind_rows(df1, df2, df3) %>% 
        group_by(A, B) %>%
        summarise(x = sum(x))
# A tibble: 11 x 3
# Groups:   A [?]
#       A     B     x
#   <dbl> <dbl> <dbl>
# 1     0     1     6
# 2     0     2     3
# 3     0     5     5
# 4     1     1     9
# 5     1     2     5
# 6     1     3     7
# 7     1     4     3
# 8     2     1     7
# 9     2     2     2
#10     2     4     0
#11     2     5     3

If there are many objects in the global environment with the pattern "df" followed by some digits

mget(ls(pattern= "^df\\d+")) %>%
        bind_rows %>%
        group_by(A, B) %>% 
        summarise(x = sum(x))

As the OP mentioned about memory constraints, if we do the join first and then use rowSums or + with reduce, it would be more efficient

mget(ls(pattern= "^df\\d+")) %>% 
      reduce(full_join, by = c("A", "B")) %>%
      transmute(A, B, x = rowSums(.[3:5], na.rm = TRUE)) %>%
      arrange(A, B)
#   A B x
#1  0 1 6
#2  0 2 3
#3  0 5 5
#4  1 1 9
#5  1 2 5
#6  1 3 7
#7  1 4 3
#8  2 1 7
#9  2 2 2
#10 2 4 0
#11 2 5 3

This could also be done with data.table

library(data.table)
rbindlist(mget(ls(pattern= "^df\\d+")))[, .(x = sum(x)), by = .(A, B)]

Upvotes: 2

Related Questions