Red_24
Red_24

Reputation: 15

Merging dataframes into dataframe format by matching rows AND columns

Recently I have started using R again. I am having problems with the following problem:

I have created a dataframe format in which I want to merge smaller dataframes into, based on both the columns and rows. I want to keep the dimensions of the large format dataframe and insert the data of the smaller dataframes. The tables that I am working with are quite big, but I have created some example data.

The large dataframe, which is an empty format table:

formattable <- data.frame("Agriculture" = c(NA,NA,NA,NA,NA,NA,NA,NA),"Mining" = c(NA,NA,NA,NA,NA,NA,NA,NA),
                "Industry" = c(NA,NA,NA,NA,NA,NA,NA,NA),"Services" = c(NA,NA,NA,NA,NA,NA,NA,NA), "Households" = c(NA,NA,NA,NA,NA,NA,NA,NA), 
                row.names = c("N01 water","N02 solar", "P01 coal", "P02 natural gas", "P03 diesel", "R01 waste", "R02 residuals","R03 losses" ) )

The smaller dataframes that hold data:

data1 <- data.frame("Agriculture" = c(5,7,NA),"Mining" = c(5,6,NA),
                 "Services" = c(4,8,NA), "Households" = c(1,11,NA), 
                 row.names = c("P01 coal", "P02 natural gas", "P03 diesel") )

data2 <- data.frame("Agriculture" = c(3,2),"Mining" = c(NA,5)
                          ,"Services" = c(5,7), 
                          row.names = c("N01 water","N02 solar") )

data3 <- data.frame("Agriculture" = c(3,6,5),"Mining" = c(NA,0,5),
                          "Industry" = c(5,NA,NA),"Services" = c(NA,NA,NA), "Households" = c(7,8,9), 
                          row.names = c("R01 waste", "R02 residuals","R03 losses" ) )

As an end product, I want one large dataframe that holds all the data from the smaller dataframes. The combinations of rows and columns that are empty/zero should still be visible with the value "0". Can anyone help me with this?

Upvotes: 1

Views: 213

Answers (2)

harre
harre

Reputation: 7297

Another approach could potentially be to skip the larger data frame and use bind_rows and replace_na on the smaller:

library(dplyr)

bind_rows(data1, data2, data3) |> 
  mutate(across(everything(), ~ replace_na(., 0)))

Output:

                Agriculture Mining Services Households Industry
P01 coal                  5      5        4          1        0
P02 natural gas           7      6        8         11        0
P03 diesel                0      0        0          0        0
N01 water                 3      0        5          0        0
N02 solar                 2      5        7          0        0
R01 waste                 3      0        0          7        5
R02 residuals             6      0        0          8        0
R03 losses                5      5        0          9        0

Update: The long approach

To offer yet another alternative approach, it might come (very) handy for you to work in longer formats instead until the very end.

In the example below I make the formattable redundant by using expandgrid (you could use pivot_longer on your current table too) and you could update the values by a doing a simple join (or use an experimental rows_*-function, here rows_update). As you might have guessed from SamR's answer rownames are not the best for these tasks, so I'll keep them as an ID instead of using column_to_rownames().

expand.grid(ID = c("N01 water","N02 solar", "P01 coal", "P02 natural gas", "P03 diesel", "R01 waste", "R02 residuals","R03 losses"),
            name = c("Agriculture", "Mining", "Services", "Households", "Industry")) |>
  left_join(bind_rows(data1 |> rownames_to_column("ID") |> pivot_longer(-ID), 
                      data2 |> rownames_to_column("ID") |> pivot_longer(-ID),
                      data3 |> rownames_to_column("ID") |> pivot_longer(-ID))) |>
  mutate(value = replace_na(value, 0)) |>
  pivot_wider(ID)

That being said, what is smart will depend on your real data.

Upvotes: 1

SamR
SamR

Reputation: 20502

I would do this one with tidyverse functions:

library(dplyr)
library(tibble)
library(tidyr)

formattable  |>
    rownames_to_column()  |> 
    rows_patch(rownames_to_column(data1), by = "rowname")  |>
    rows_patch(rownames_to_column(data2), by = "rowname")  |>
    rows_patch(rownames_to_column(data3), by = "rowname")  |>
    mutate(across(
        Agriculture:Households, \(col) replace_na(col, 0)
        )
    ) |> 
    column_to_rownames()

#                 Agriculture Mining Industry Services Households
# N01 water                 3      0        0        5          0
# N02 solar                 2      5        0        7          0
# P01 coal                  5      5        0        4          1
# P02 natural gas           7      6        0        8         11
# P03 diesel                0      0        0        0          0
# R01 waste                 3      0        5        0          7
# R02 residuals             6      0        0        0          8
# R03 losses                5      5        0        0          9

EDIT: Using Reduce

If you are using more than three data frames to update the original, you might want to put them in a list and use Reduce to avoid having to repeat yourself:

# Note use of `rownames_to_column()` on first data frame
l  <- list(rownames_to_column(formattable), data1, data2, data3)

Reduce(\(x, y) rows_patch(
    x, 
    rownames_to_column(y), 
    by = "rowname"), 
    l)  |>
    mutate(across(
        Agriculture:Households, \(col) replace_na(col, 0)
        )
    ) |> 
    column_to_rownames()

#                 Agriculture Mining Industry Services Households
# N01 water                 3      0        0        5          0
# N02 solar                 2      5        0        7          0
# P01 coal                  5      5        0        4          1
# P02 natural gas           7      6        0        8         11
# P03 diesel                0      0        0        0          0
# R01 waste                 3      0        5        0          7
# R02 residuals             6      0        0        0          8
# R03 losses                5      5        0        0          9

You will need at least dplyr v.1.0.0 for rows_patch(). From the docs:

rows_update() modifies existing rows (like UPDATE). Key values in y must be unique, and, by default, key values in y must exist in x.

rows_patch() works like rows_update() but only overwrites NA values.

Upvotes: 1

Related Questions