Reputation: 873
I have a data frame that I'm trying to merge with another, and I'm having some issues that I believe trace back to the fact that each observation is an incident rather than a cumulative amount. I have the data frame below where each row is an individual observation, and I'll then use the week
and the code
to merge it with another data frame based on the same variables week
and code
.
data frame a
has each row as a specific observation, but I need it to become a cumulative observation/row by code/date. I'm completely stumped.
date count code week
<date> <dbl> <dbl> <dbl>
1 2020-06-07 4 13309 23
2 2020-06-07 5 13309 23
3 2020-07-12 6 18099 28
4 2020-07-12 8 18099 28
needs to become
date count code week
<date> <dbl> <dbl> <dbl>
1 2020-06-07 9 13309 23
2 2020-07-12 14 18099 28
Then, it will be able to be merged with data frame b
date color name code week
<date> <char> <char> <dbl> <dbl>
1 2020-06-07 Blue A 13309 23
1 2020-06-07 Yellow B 13309 23
1 2020-06-07 Purple D 13309 23
3 2020-07-12 Yellow A 18099 28
3 2020-07-12 Blue E 18099 28
and the end result will be
date color name code week count
<date> <char> <char> <dbl> <dbl> <dbl>
1 2020-06-07 Blue A 13309 23 9
1 2020-06-07 Yellow B 13309 23 9
1 2020-06-07 Purple D 13309 23 9
3 2020-07-12 Yellow A 18099 28 14
3 2020-07-12 Blue E 18099 28 14
I originally used the code below to do this, but it completely blew up my data frame. My dimensions went from dim(a) == (209807, 86)
to dim(merged) == (1367029, 89)
. I tried multiple types of joins (right, left, inner, etc.) but all of them still blew up the data frame (varied by a few 100 or so observations, but still resulted in well over a million rows). That's why I'm thinking the issue is due to a
being each observation vs a summary observation for a specific code on a specific date.
merged <- right_join(x = b,
y = a,
by = c("code" = "code",
"week" = "week"))
Upvotes: 1
Views: 80
Reputation: 887571
Using join by data.table
library(data.table)
setDT(a)[, .(count = sum(count)), .(date, code, week)][b,
on = .(date, code, week)]
# date code week count color name
#1: 2020-06-07 13309 23 9 Blue A
#2: 2020-06-07 13309 23 9 Yellow B
#3: 2020-06-07 13309 23 9 Purple D
#4: 2020-07-12 18099 28 14 Yellow A
#5: 2020-07-12 18099 28 14 Blue E
Upvotes: 0
Reputation: 350
You were close with the call to right_join()
but before that, the first thing you should do is to total the first data frame so you have a sum of the counts. You could do it manually with summarize()
but an easier way is using a nested data frame and creating a new column that sums the counts.
Let me know if you need clarification?
Next time please post your data in a format that is easier to copy and paste into R.
library(tidyverse)
dfA <- tribble(~date, ~count, ~code, ~week,
"2020-06-07", 4, 13309, 23,
"2020-06-07", 5, 13309, 23,
"2020-07-12", 6, 18099, 28,
"2020-07-12", 8, 18099, 28)
dfB <- tribble(~date, ~color, ~name, ~code, ~week,
"2020-06-07", "Blue", "A", 13309, 23,
"2020-06-07", "Yellow", "B", 13309, 23,
"2020-06-07", "Purple", "D", 13309, 23,
"2020-07-12", "Yellow", "A", 18099, 28,
"2020-07-12", "Blue", "E", 18099, 28)
total_counts <- dfA %>%
group_by(date, code, week) %>%
nest() %>%
mutate(count= map_dbl(data, sum)) %>%
select(date, count, code, week) %>%
right_join(dfB) %>%
select(date, color, name, code, week, count)
#> Joining, by = c("date", "code", "week")
total_counts
#> # A tibble: 5 x 6
#> # Groups: date, code, week [2]
#> date color name code week count
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2020-06-07 Blue A 13309 23 9
#> 2 2020-06-07 Yellow B 13309 23 9
#> 3 2020-06-07 Purple D 13309 23 9
#> 4 2020-07-12 Yellow A 18099 28 14
#> 5 2020-07-12 Blue E 18099 28 14
Created on 2020-10-12 by the reprex package (v0.3.0)
Upvotes: 0
Reputation: 160647
a %>%
group_by(date, code, week) %>%
summarize(count = sum(count)) %>%
ungroup() %>%
left_join(b, ., by = c("date", "code", "week"))
# date color name code week count
# 1 2020-06-07 Blue A 13309 23 9
# 2 2020-06-07 Yellow B 13309 23 9
# 3 2020-06-07 Purple D 13309 23 9
# 4 2020-07-12 Yellow A 18099 28 14
# 5 2020-07-12 Blue E 18099 28 14
Upvotes: 1