chuckmorris
chuckmorris

Reputation: 1

Sum up two dataframes partly

I have two dataframes. For some rows of df1, there's a matching row in df2. Now some columns of df1 should be manipulated so that they contain the sum of their own value and the equivalent value out of df2.

In the following example, columns 'count1' and 'count2' should be summed up, but not column 'type'.

df1 <- data.frame(id = c("one_a", "two_a", "three_a", "four_a"), type = c(8,7,6,5), count1 = c(1,2,1,NA), count2 = c(NA,0,1,0), id_df2 = c("one", "two", "three", "four"))
df2 <- data.frame(id = c("one", "two", "four"), type = c(8,7,5), count1 = c(0,1,1), count2 = c(0,0,1))
result <- data.frame(id = c("one_a", "two_a", "three_a", "four_a"), type = c(8,7,6,5), count1 = c(1,3,1,1), count2 = c(0,0,1,1))

> df1
       id type count1 count2 id_df2
1   one_a    8      1     NA     one
2   two_a    7      2      0     two
3 three_a    6      1      1   three
4  four_a    5     NA      0    four

> df2
    id type count1 count2
1  one    8      0      0
2  two    7      1      0
3 four    5      1      1

> result
       id type count1 count2
1   one_a    8      1      0
2   two_a    7      3      0
3 three_a    6      1      1
4  four_a    5      1      1

There have been similar questions and I tried to find a solution by splitting the dataframes apart and merging them afterwards. I just wondered if there's a more elegant way to do this. My original dataset has about 300 columns, so I'm looking for a solution which is scalable.

Thanks in advance chuckmorris

Upvotes: 0

Views: 73

Answers (2)

DeduciveR
DeduciveR

Reputation: 1702

slightly less elegant, but works nonetheless:

result_2 <- df2 %>% 
  mutate(id = paste0(id, "_a")) %>%
  bind_rows(df1) %>% 
  select(-id_df2) %>% 
  replace(., is.na(.), 0) %>%
  group_by(id) %>% 
  summarise(count1 = sum(count1), count2 = sum(count2), type = max(type)) %>% 
  mutate(id_df2 = as.factor(id)) %>% 
  select(c(id_df2, type, count1, count2), -id)

Upvotes: 0

arg0naut91
arg0naut91

Reputation: 14764

You could do:

library(dplyr)

df1 %>% select(-id_df2) %>%
  bind_rows(df2) %>%
  mutate(id = gsub("_.*", "", id)) %>%
  replace(., is.na(.), 0) %>%
  group_by(id, type) %>%
  summarise_at(vars(contains("count")), funs(sum))

Where the output is:

# A tibble: 4 x 4
# Groups:   id [?]
  id     type count1 count2
  <chr> <dbl>  <dbl>  <dbl>
1 four      5      1      1
2 one       8      1      0
3 three     6      1      1
4 two       7      3      0

But also:

df1 %>% select(-id_df2) %>%
  bind_rows(df2) %>%
  mutate(id = ifelse(grepl("_", id), id, paste0(id, "_a"))) %>%
  replace(., is.na(.), 0) %>%
  group_by(id, type) %>%
  summarise_at(vars(contains("count")), funs(sum))

If you're interested in keeping the _a part.

Another approach would be with a join, transforming to long, and then spreading back, like:

library(tidyverse)

df1 %>% 
  left_join(df2, by = c("id_df2" = "id")) %>%
  gather(var, val, -id) %>%
  mutate(var = gsub("\\..*", "", var)) %>%
  distinct(id, var, val) %>%
  filter(!var == "id_df2") %>%
  group_by(id, var) %>%
  summarise(val = sum(as.numeric(val), na.rm = T)) %>%
  spread(var, val) 

Giving:

# A tibble: 4 x 4
# Groups:   id [4]
  id      count1 count2  type
  <fct>    <dbl>  <dbl> <dbl>
1 four_a       1      1     5
2 one_a        1      0     8
3 three_a      1      1     6
4 two_a        3      0     7

This may be of interest if _a ending has a special purpose, e.g. there are also groups with _b, _c etc. (the above method would fail in this case).

Upvotes: 1

Related Questions