Reputation: 353
I have a dataframe that has launch weeks for products across markets. Here is a snapshot of the dataframe.
Prod_ID Market_Name START_WEEK
11044913000 PHOENIX, AZ 1397
11044913000 WEST TEX/NEW MEX 1206
11159402003 PORTLAND,OR 1188
11159402003 SEATTLE/TACOMA 1188
11159402003 SPOKANE 1195
11159410010 PORTLAND,OR 1186
11159410010 SALT LAKE CITY 1190
11159410010 SEATTLE/TACOMA 1186
11159410010 SPOKANE 1187
11159410010 WEST TEX/NEW MEX 1197
11159410014 PORTLAND,OR 1198
11159410014 SEATTLE/TACOMA 1239
I would like to create another dataframe which will give me for each Prod_ID
, cumulative totals of number of markets a product has been launched in on a weekly basis for first 6 weeks. For the above snippet of data, the output should like something like this.
Prod_ID Week1 Week2 Week3 Week4 Week5 Week6
11044913000 1 1 1 1 1 1
11159402003 2 2 2 2 2 2
11159410010 2 3 3 3 4 4
11159410014 1 1 1 1 1 1
For ease of displaying, I have shown the output only till Week 6, but I need to track till Week 12 for my need. Week is denoted by a 4 digit number in my dataset and is not in date format. Please note that not all products have the same starting week, so I need to infer the earliest week for a Prod_ID
from the START_WEEK
variable. And then identify the next 6 weeks to generate the total number of markets launched in each week.
Any help to do this is appreciated.
Upvotes: 0
Views: 413
Reputation: 3629
I think I understand your problem. Here is my shot. There are several phases to this solution.
The first step is to calculate the cumulative sum of markets for the weeks and the week number for each Prod_ID
since they opened. This is done with the following code chunk.
df1 <- df %>%
group_by(Prod_ID, START_WEEK) %>%
count() %>%
arrange(Prod_ID, START_WEEK) %>%
ungroup() %>%
group_by(Prod_ID) %>%
mutate(tot_market = cumsum(n)) %>%
ungroup() %>%
group_by(Prod_ID) %>%
mutate(min_START_WEEK = min(START_WEEK)) %>%
mutate(week = START_WEEK - min_START_WEEK + 1)
df1
# # A tibble: 10 x 6
# # Groups: Prod_ID [4]
# Prod_ID START_WEEK n tot_market min_START_WEEK week
# <dbl> <int> <int> <int> <dbl> <dbl>
# 1 11044913000. 1206 1 1 1206. 1.
# 2 11044913000. 1397 1 2 1206. 192.
# 3 11159402003. 1188 2 2 1188. 1.
# 4 11159402003. 1195 1 3 1188. 8.
# 5 11159410010. 1186 2 2 1186. 1.
# 6 11159410010. 1187 1 3 1186. 2.
# 7 11159410010. 1190 1 4 1186. 5.
# 8 11159410010. 1197 1 5 1186. 12.
# 9 11159410014. 1198 1 1 1198. 1.
# 10 11159410014. 1239 1 2 1198. 42.
The second phase is to expand the week
and Prod_ID
to the maximum number of weeks in week
.
df2 <- expand.grid(min(df1$week):max(df1$week), unique(df1$Prod_ID))
colnames(df2) <- c("week", "Prod_ID")
The third phase is done by merging df1
and df2
and using zoo::locf
to fill the NA
's in tot_market
(total market) by Prod_ID
with the preceding value.
df2 %>% left_join(df1) %>% select(-START_WEEK, -n, -min_START_WEEK) %>%
group_by(Prod_ID) %>%
arrange(Prod_ID, week) %>%
mutate(tot_market = zoo::na.locf(tot_market)) %>%
spread(week, tot_market) %>%
ungroup() %>%
mutate_at(vars(Prod_ID), as.character) %>%
rename_if(is.integer, function(x) paste0("Week", x))
# # A tibble: 4 x 193
# Prod_ID Week1 Week2 Week3 Week4 Week5 Week6 Week7 Week8 Week9 Week10 Week11
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 11044913000 1 1 1 1 1 1 1 1 1 1 1
# 2 11159402003 2 2 2 2 2 2 2 3 3 3 3
# 3 11159410010 2 3 3 3 4 4 4 4 4 4 4
# 4 11159410014 1 1 1 1 1 1 1 1 1 1 1
# # ... with 181 more variables
Upvotes: 1