Reputation: 968
I have the following three data frames:
df1 <- data.frame(A = 1:10, B = 3:12, D = 4:13)
df2 <- data.frame(C = 1:5, B = 4:8, E = 2:6)
df3 <- data.frame(A = 13:10, B = 19:16, F = 1:4)
rownames(df1) <- paste0("row", seq_len(nrow(df1)))
rownames(df2) <- paste0("row", c(1, 3, 5, 7, 11))
rownames(df3) <- paste0("row", c(12, 3, 10, 9))
# > df1
# A B D
# row1 1 3 4
# row2 2 4 5
# row3 3 5 6
# row4 4 6 7
# row5 5 7 8
# row6 6 8 9
# row7 7 9 10
# row8 8 10 11
# row9 9 11 12
# row10 10 12 13
# > df2
# C B E
# row1 1 4 2
# row3 2 5 3
# row5 3 6 4
# row7 4 7 5
# row11 5 8 6
# > df3
# A B F
# row12 13 19 1
# row3 12 18 2
# row10 11 17 3
# row9 10 16 4
When cells from different data frames have the same row and column names, I want to sum their values. In the cases where a cell has no matches (there aren't any other cells with the same row and column name), the final data frame will contain that original value. When a cell with a particular row name and column name combination doesn't exist in any of the original data frames, the final data frame will contain an NA
in that position.
The final data frame should look like this data frame:
> df4
A B C D E G
row1 1 7 1 4 2 NA
row2 2 4 NA 5 1 NA
row3 15 28 2 6 5 2
row4 4 6 NA 7 3 NA
row5 5 13 3 8 8 NA
row6 6 8 NA 9 5 NA
row7 7 16 4 10 11 NA
row8 8 10 NA 11 7 NA
row9 19 27 NA 12 8 4
row10 21 29 NA 13 9 3
row11 NA 8 5 NA 6 NA
row12 13 19 NA NA NA 1
I'm imagining something with the Reduce()
function that can be used on many data frames at once. Is the first step adding missing rows and columns to existing data frames with NA
s in all the cells where values are missing?
Thanks!
Upvotes: 4
Views: 678
Reputation: 887691
Using data.table
library(data.table)
rbindlist(list(setDT(df1, keep.rownames = TRUE),
setDT(df2, keep.rownames = TRUE),
setDT(df3, keep.rownames = TRUE)), fill = TRUE)[,
lapply(.SD, function(x) dplyr::na_if(sum(x, na.rm = TRUE), 0)), rn]
Upvotes: 1
Reputation: 389175
Here's a dplyr
alternative -
sum
the values from all the columns. If a column has all the NA
's we return NA
. A shortcut to do that is to use hablar::sum_
.column_to_rownames
to assign the values back as rownames.library(dplyr)
df_list = list(df1, df2, df3)
purrr::map_df(df_list, ~.x %>% rownames_to_column('row')) %>%
group_by(row) %>%
summarise(across(A:F, hablar::sum_)) %>%
arrange(order(gtools::mixedorder(row))) %>%
column_to_rownames('row')
# A B D C E F
#row1 1 7 4 1 2 NA
#row2 2 4 5 NA NA NA
#row3 15 28 6 2 3 2
#row4 4 6 7 NA NA NA
#row5 5 13 8 3 4 NA
#row6 6 8 9 NA NA NA
#row7 7 16 10 4 5 NA
#row8 8 10 11 NA NA NA
#row9 19 27 12 NA NA 4
#row10 21 29 13 NA NA 3
#row11 NA 8 NA 5 6 NA
#row12 13 19 NA NA NA 1
Upvotes: 2
Reputation: 146060
I think this should work. With row AND column names and one data type, I prefer matrices to data frames, but you can convert the final matrix back to a data frame if you need.
# put things in a list
df_list = list(df1, df2, df3)
# get the complete set of row and column names
all_rows = unique(unlist(lapply(df_list, rownames)))
all_cols = unique(unlist(lapply(df_list, colnames)))
# initialize a final matrix to NA
final_mat = matrix(NA, nrow = length(all_rows), ncol = length(all_cols))
rownames(final_mat) = all_rows
colnames(final_mat) = all_cols
# go through each df in the list
for(i in seq_along(df_list)) {
# set any NAs in the selection to 0
final_mat[rownames(df_list[[i]]), colnames(df_list[[i]])][is.na(final_mat[rownames(df_list[[i]]), colnames(df_list[[i]])])] = 0
# add the data frame to the selection
final_mat[rownames(df_list[[i]]), colnames(df_list[[i]])] = final_mat[rownames(df_list[[i]]), colnames(df_list[[i]])] + as.matrix(df_list[[i]])
}
final_mat
# A B D C E F
# row1 1 7 4 1 2 NA
# row2 2 4 5 NA NA NA
# row3 15 28 6 2 3 2
# row4 4 6 7 NA NA NA
# row5 5 13 8 3 4 NA
# row6 6 8 9 NA NA NA
# row7 7 16 10 4 5 NA
# row8 8 10 11 NA NA NA
# row9 19 27 12 NA NA 4
# row10 21 29 13 NA NA 3
# row11 NA 8 NA 5 6 NA
# row12 13 19 NA NA NA 1
Upvotes: 4