Reputation: 153
I have a subset of my dataframe:
df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804,
2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204))
where 2017010104 corresponds to the date 01/01/2017 and so on. I am trying to label these different quad_ids sequentially with reference to the year. So for example I am trying to get the output:
df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804,
2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204),
Snum = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1, 2, 3, 4))
where you can see for retailer_id = 2, store_id = 167, the weeks for the year 2017 are labeled 1-14 and then when the week begins with 2018 it starts counting sequentially from 1 again until it will reach a week that starts with 2019 within this grouping.
I tried:
DT <- data.table(df)
DT[, Snum := seq_len(.N), by = list(retailer_id, store_id)]
However, this is not labeling sequentially by year, instead it is labelling sequentially by store_id. Is there a way to fix this? (this example code is only showing two retailers and two stores, whereas my actual dataframe and hundreds of different retailers and stores)
Upvotes: 1
Views: 138
Reputation: 79164
We could use str_match
from stringr
package together with regex '^[[:digit:]]{4}'
to match for the first four digits:
library(dplyr)
library(stringr)
df %>%
group_by(Snum = str_match(quad_id, '^[[:digit:]]{4}')) %>%
mutate(Snum = row_number())
output:
retailer_id store_id quad_id Snum
<dbl> <dbl> <dbl> <int>
1 1 166 2017010104 1
2 1 166 2017012904 2
3 2 167 2017010104 3
4 2 167 2017012904 4
5 2 167 2017022604 5
6 2 167 2017032604 6
7 2 167 2017042304 7
8 2 167 2017052104 8
9 2 167 2017061804 9
10 2 167 2017071604 10
11 2 167 2017081304 11
12 2 167 2017091004 12
13 2 167 2017100804 13
14 2 167 2017110504 14
15 2 167 2017120304 15
16 2 167 2017123104 16
17 2 167 2018012804 1
18 2 167 2018022504 2
19 2 167 2018032504 3
20 2 167 2018042204 4
Upvotes: 0
Reputation: 61
Here's a solution using tidyverse
df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804,
2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204))
library(tidyverse)
getYear = function(x) {
x %>%
str_extract("^\\d{4}") %>%
as.integer() %>%
return()
}
tmp = df %>%
mutate(year = getYear(quad_id)) %>%
group_by(year, retailer_id, store_id) %>%
mutate(Snum = 1:n())
> tmp
# A tibble: 20 x 5
# Groups: year, retailer_id, store_id [3]
retailer_id store_id quad_id year Snum
<dbl> <dbl> <dbl> <int> <int>
1 1 166 2017010104 2017 1
2 1 166 2017012904 2017 2
3 2 167 2017010104 2017 1
4 2 167 2017012904 2017 2
5 2 167 2017022604 2017 3
6 2 167 2017032604 2017 4
7 2 167 2017042304 2017 5
8 2 167 2017052104 2017 6
9 2 167 2017061804 2017 7
10 2 167 2017071604 2017 8
11 2 167 2017081304 2017 9
12 2 167 2017091004 2017 10
13 2 167 2017100804 2017 11
14 2 167 2017110504 2017 12
15 2 167 2017120304 2017 13
16 2 167 2017123104 2017 14
17 2 167 2018012804 2018 1
18 2 167 2018022504 2018 2
19 2 167 2018032504 2018 3
20 2 167 2018042204 2018 4
Note that if your data isn't sorted by retailer_id
, store_id
and year
that would cause an issue.
Upvotes: 1