Jane Miller
Jane Miller

Reputation: 153

Trying to label sequentially within groups of dataframe R

I have a subset of my dataframe:

df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
                store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
                quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804, 
                            2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204))

where 2017010104 corresponds to the date 01/01/2017 and so on. I am trying to label these different quad_ids sequentially with reference to the year. So for example I am trying to get the output:

df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
                store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
                quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804, 
                            2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204),
                Snum = c(1,  2,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 1, 2, 3, 4))

where you can see for retailer_id = 2, store_id = 167, the weeks for the year 2017 are labeled 1-14 and then when the week begins with 2018 it starts counting sequentially from 1 again until it will reach a week that starts with 2019 within this grouping.

I tried:

DT <- data.table(df)

DT[, Snum := seq_len(.N), by = list(retailer_id, store_id)]

However, this is not labeling sequentially by year, instead it is labelling sequentially by store_id. Is there a way to fix this? (this example code is only showing two retailers and two stores, whereas my actual dataframe and hundreds of different retailers and stores)

Upvotes: 1

Views: 138

Answers (2)

TarJae
TarJae

Reputation: 79164

We could use str_match from stringr package together with regex '^[[:digit:]]{4}' to match for the first four digits:

library(dplyr)
library(stringr)
df %>% 
    group_by(Snum = str_match(quad_id, '^[[:digit:]]{4}')) %>% 
    mutate(Snum = row_number())

output:

retailer_id store_id    quad_id  Snum
         <dbl>    <dbl>      <dbl> <int>
 1           1      166 2017010104     1
 2           1      166 2017012904     2
 3           2      167 2017010104     3
 4           2      167 2017012904     4
 5           2      167 2017022604     5
 6           2      167 2017032604     6
 7           2      167 2017042304     7
 8           2      167 2017052104     8
 9           2      167 2017061804     9
10           2      167 2017071604    10
11           2      167 2017081304    11
12           2      167 2017091004    12
13           2      167 2017100804    13
14           2      167 2017110504    14
15           2      167 2017120304    15
16           2      167 2017123104    16
17           2      167 2018012804     1
18           2      167 2018022504     2
19           2      167 2018032504     3
20           2      167 2018042204     4

Upvotes: 0

Jonathan
Jonathan

Reputation: 61

Here's a solution using tidyverse

df = data.frame(retailer_id = c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
                store_id = c(166, 166, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167, 167),
                quad_id = c(2017010104, 2017012904, 2017010104, 2017012904, 2017022604, 2017032604 ,2017042304, 2017052104, 2017061804, 
                            2017071604, 2017081304, 2017091004, 2017100804, 2017110504, 2017120304, 2017123104, 2018012804, 2018022504, 2018032504, 2018042204))

library(tidyverse)

getYear = function(x) {
  x %>% 
    str_extract("^\\d{4}") %>% 
    as.integer() %>% 
    return()
}

tmp = df %>% 
  mutate(year = getYear(quad_id)) %>% 
  group_by(year, retailer_id, store_id) %>% 
  mutate(Snum = 1:n())

> tmp
# A tibble: 20 x 5
# Groups:   year, retailer_id, store_id [3]
   retailer_id store_id    quad_id  year  Snum
         <dbl>    <dbl>      <dbl> <int> <int>
 1           1      166 2017010104  2017     1
 2           1      166 2017012904  2017     2
 3           2      167 2017010104  2017     1
 4           2      167 2017012904  2017     2
 5           2      167 2017022604  2017     3
 6           2      167 2017032604  2017     4
 7           2      167 2017042304  2017     5
 8           2      167 2017052104  2017     6
 9           2      167 2017061804  2017     7
10           2      167 2017071604  2017     8
11           2      167 2017081304  2017     9
12           2      167 2017091004  2017    10
13           2      167 2017100804  2017    11
14           2      167 2017110504  2017    12
15           2      167 2017120304  2017    13
16           2      167 2017123104  2017    14
17           2      167 2018012804  2018     1
18           2      167 2018022504  2018     2
19           2      167 2018032504  2018     3
20           2      167 2018042204  2018     4

Note that if your data isn't sorted by retailer_id, store_id and year that would cause an issue.

Upvotes: 1

Related Questions