David Moore
David Moore

Reputation: 968

Summing Entries in Multiple Unequally-Sized Data Frames With Some (but not All) Rows and Columns the Same

I have the following three data frames:

df1 <- data.frame(A = 1:10, B = 3:12, D = 4:13)
df2 <- data.frame(C = 1:5, B = 4:8, E = 2:6)
df3 <- data.frame(A = 13:10, B = 19:16, F = 1:4)

rownames(df1) <- paste0("row", seq_len(nrow(df1)))
rownames(df2) <- paste0("row", c(1, 3, 5, 7, 11))
rownames(df3) <- paste0("row", c(12, 3, 10, 9))

# > df1
#        A  B  D
# row1   1  3  4
# row2   2  4  5
# row3   3  5  6
# row4   4  6  7
# row5   5  7  8
# row6   6  8  9
# row7   7  9 10
# row8   8 10 11
# row9   9 11 12
# row10 10 12 13

# > df2
#       C B E
# row1  1 4 2
# row3  2 5 3
# row5  3 6 4
# row7  4 7 5
# row11 5 8 6

# > df3
#       A  B F
# row12 13 19 1
# row3  12 18 2
# row10 11 17 3
# row9  10 16 4

When cells from different data frames have the same row and column names, I want to sum their values. In the cases where a cell has no matches (there aren't any other cells with the same row and column name), the final data frame will contain that original value. When a cell with a particular row name and column name combination doesn't exist in any of the original data frames, the final data frame will contain an NA in that position.

The final data frame should look like this data frame:

> df4
       A  B  C  D  E  G
row1   1  7  1  4  2 NA
row2   2  4 NA  5  1 NA
row3  15 28  2  6  5  2
row4   4  6 NA  7  3 NA
row5   5 13  3  8  8 NA
row6   6  8 NA  9  5 NA
row7   7 16  4 10 11 NA
row8   8 10 NA 11  7 NA
row9  19 27 NA 12  8  4
row10 21 29 NA 13  9  3
row11 NA  8  5 NA  6 NA
row12 13 19 NA NA NA  1

I'm imagining something with the Reduce() function that can be used on many data frames at once. Is the first step adding missing rows and columns to existing data frames with NAs in all the cells where values are missing?

Thanks!

Upvotes: 4

Views: 678

Answers (3)

akrun
akrun

Reputation: 887691

Using data.table

library(data.table)
rbindlist(list(setDT(df1, keep.rownames = TRUE), 
           setDT(df2, keep.rownames = TRUE), 
         setDT(df3, keep.rownames = TRUE)), fill = TRUE)[, 
    lapply(.SD, function(x) dplyr::na_if(sum(x, na.rm = TRUE), 0)), rn]

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389175

Here's a dplyr alternative -

  • Put the dataframes in a list
  • make rownames as a separate column from each dataframe
  • Combine them in one dataframe
  • For each rowname sum the values from all the columns. If a column has all the NA's we return NA. A shortcut to do that is to use hablar::sum_.
  • Order the data based on rownames
  • Use column_to_rownames to assign the values back as rownames.
library(dplyr)

df_list = list(df1, df2, df3)

purrr::map_df(df_list, ~.x %>% rownames_to_column('row')) %>%
  group_by(row) %>%
  summarise(across(A:F, hablar::sum_)) %>%
  arrange(order(gtools::mixedorder(row))) %>%
  column_to_rownames('row')

#       A  B  D  C  E  F
#row1   1  7  4  1  2 NA
#row2   2  4  5 NA NA NA
#row3  15 28  6  2  3  2
#row4   4  6  7 NA NA NA
#row5   5 13  8  3  4 NA
#row6   6  8  9 NA NA NA
#row7   7 16 10  4  5 NA
#row8   8 10 11 NA NA NA
#row9  19 27 12 NA NA  4
#row10 21 29 13 NA NA  3
#row11 NA  8 NA  5  6 NA
#row12 13 19 NA NA NA  1

Upvotes: 2

Gregor Thomas
Gregor Thomas

Reputation: 146060

I think this should work. With row AND column names and one data type, I prefer matrices to data frames, but you can convert the final matrix back to a data frame if you need.

# put things in a list
df_list = list(df1, df2, df3)

# get the complete set of row and column names
all_rows = unique(unlist(lapply(df_list, rownames)))
all_cols = unique(unlist(lapply(df_list, colnames)))

# initialize a final matrix to NA
final_mat = matrix(NA, nrow = length(all_rows), ncol = length(all_cols))
rownames(final_mat) = all_rows
colnames(final_mat) = all_cols

# go through each df in the list
for(i in seq_along(df_list)) {
  # set any NAs in the selection to 0
  final_mat[rownames(df_list[[i]]), colnames(df_list[[i]])][is.na(final_mat[rownames(df_list[[i]]), colnames(df_list[[i]])])] = 0
  # add the data frame to the selection
  final_mat[rownames(df_list[[i]]), colnames(df_list[[i]])] = final_mat[rownames(df_list[[i]]), colnames(df_list[[i]])] + as.matrix(df_list[[i]])
}

final_mat
#        A  B  D  C  E  F
# row1   1  7  4  1  2 NA
# row2   2  4  5 NA NA NA
# row3  15 28  6  2  3  2
# row4   4  6  7 NA NA NA
# row5   5 13  8  3  4 NA
# row6   6  8  9 NA NA NA
# row7   7 16 10  4  5 NA
# row8   8 10 11 NA NA NA
# row9  19 27 12 NA NA  4
# row10 21 29 13 NA NA  3
# row11 NA  8 NA  5  6 NA
# row12 13 19 NA NA NA  1

Upvotes: 4

Related Questions