Reputation: 1
I have two dataframes. For some rows of df1, there's a matching row in df2. Now some columns of df1 should be manipulated so that they contain the sum of their own value and the equivalent value out of df2.
In the following example, columns 'count1' and 'count2' should be summed up, but not column 'type'.
df1 <- data.frame(id = c("one_a", "two_a", "three_a", "four_a"), type = c(8,7,6,5), count1 = c(1,2,1,NA), count2 = c(NA,0,1,0), id_df2 = c("one", "two", "three", "four"))
df2 <- data.frame(id = c("one", "two", "four"), type = c(8,7,5), count1 = c(0,1,1), count2 = c(0,0,1))
result <- data.frame(id = c("one_a", "two_a", "three_a", "four_a"), type = c(8,7,6,5), count1 = c(1,3,1,1), count2 = c(0,0,1,1))
> df1
id type count1 count2 id_df2
1 one_a 8 1 NA one
2 two_a 7 2 0 two
3 three_a 6 1 1 three
4 four_a 5 NA 0 four
> df2
id type count1 count2
1 one 8 0 0
2 two 7 1 0
3 four 5 1 1
> result
id type count1 count2
1 one_a 8 1 0
2 two_a 7 3 0
3 three_a 6 1 1
4 four_a 5 1 1
There have been similar questions and I tried to find a solution by splitting the dataframes apart and merging them afterwards. I just wondered if there's a more elegant way to do this. My original dataset has about 300 columns, so I'm looking for a solution which is scalable.
Thanks in advance chuckmorris
Upvotes: 0
Views: 73
Reputation: 1702
slightly less elegant, but works nonetheless:
result_2 <- df2 %>%
mutate(id = paste0(id, "_a")) %>%
bind_rows(df1) %>%
select(-id_df2) %>%
replace(., is.na(.), 0) %>%
group_by(id) %>%
summarise(count1 = sum(count1), count2 = sum(count2), type = max(type)) %>%
mutate(id_df2 = as.factor(id)) %>%
select(c(id_df2, type, count1, count2), -id)
Upvotes: 0
Reputation: 14764
You could do:
library(dplyr)
df1 %>% select(-id_df2) %>%
bind_rows(df2) %>%
mutate(id = gsub("_.*", "", id)) %>%
replace(., is.na(.), 0) %>%
group_by(id, type) %>%
summarise_at(vars(contains("count")), funs(sum))
Where the output is:
# A tibble: 4 x 4
# Groups: id [?]
id type count1 count2
<chr> <dbl> <dbl> <dbl>
1 four 5 1 1
2 one 8 1 0
3 three 6 1 1
4 two 7 3 0
But also:
df1 %>% select(-id_df2) %>%
bind_rows(df2) %>%
mutate(id = ifelse(grepl("_", id), id, paste0(id, "_a"))) %>%
replace(., is.na(.), 0) %>%
group_by(id, type) %>%
summarise_at(vars(contains("count")), funs(sum))
If you're interested in keeping the _a
part.
Another approach would be with a join, transforming to long, and then spreading back, like:
library(tidyverse)
df1 %>%
left_join(df2, by = c("id_df2" = "id")) %>%
gather(var, val, -id) %>%
mutate(var = gsub("\\..*", "", var)) %>%
distinct(id, var, val) %>%
filter(!var == "id_df2") %>%
group_by(id, var) %>%
summarise(val = sum(as.numeric(val), na.rm = T)) %>%
spread(var, val)
Giving:
# A tibble: 4 x 4
# Groups: id [4]
id count1 count2 type
<fct> <dbl> <dbl> <dbl>
1 four_a 1 1 5
2 one_a 1 0 8
3 three_a 1 1 6
4 two_a 3 0 7
This may be of interest if _a
ending has a special purpose, e.g. there are also groups with _b
, _c
etc. (the above method would fail in this case).
Upvotes: 1