user3571389
user3571389

Reputation: 335

Generating unique group id based on combination of multiple columns in R

A sample of my dataset looks like this:

fact_code line_code date        style   buyer smv
1004      100401    2013-02-02  TS-334  2     7.15 
1004      100401    2013-02-03  TS-334  2     7.15
1008      100801    2013-04-12  SST     8     8.58
1008      100801    2013-04-13  SST     8     8.58
...       ...       ...         ...     ...   ...
1008      100801    2013-07-31  SST     2     9.20
1008      100801    2013-08-01  SST     2     9.20
...       ...       ...         ...     ...   ... 
1008      100802    2013-08-01  SST     2     9.20
1008      100802    2013-08-02  SST     2     9.20
...       ...       ...         ...     ...   ...
1008      100802    2013-08-12  SST     2     9.20

Variables fact_code, line_code are factory and line code respectively. style is the garment style produced by a given line_code on a certain date for a given buyer.

I aim to create a variable called spell number for each garment (for eg. T-shirts) style, spell_num, that is produced by a factory line. The variable and the data would look something like this:

fact_code line_code date        style   buyer smv   spell_num
1004      100401    2013-02-02  TS-334  2     7.15  TS-334_1
1004      100401    2013-02-03  TS-334  2     7.15  TS-334_1
1008      100801    2013-04-12  SST     8     8.58  SST_1
1008      100801    2013-04-13  SST     8     8.58  SST_1
...       ...       ...         ...     ...   ...   ...
1008      100801    2013-07-31  SST     2     9.20  SST_1  
1008      100801    2013-08-01  SST     2     9.20  SST_1
...       ...       ...         ...     ...   ...   ... 
1008      100802    2013-08-01  SST     2     9.20  SST_2
1008      100802    2013-08-02  SST     2     9.20  SST_2
...       ...       ...         ...     ...   ...   ... 
1008      100802    2013-08-12  SST     2     9.20  SST_2

The purpose of the variable, spell_num is to indicate the current spell number of each garment style that is produced by factory line. So, for example, TS-334_1 indicates first spell of production for the garment style TS-334. Similarly, SST_1 indicates the first spell for the style SST, whereas SST_2 indicates its second spell of production. Therefore, whenever a different line produces a style, it indicates a new spell for that same style. The underscore partitions the style name on its left and spell number/indicator on its right.

To this end, this is the code that I have tried among others:

mydf <- mydf %>%
  dplyr::group_by(line_code, style) %>%
  dplyr::mutate(spell_num = paste(style, as.integer(row_number() == 1L), sep = "_"))

where mydf is the name of the pertinent data frame.

And below is the resultant output:

fact_code line_code date        style   buyer smv   spell_num
1004      100401    2013-02-02  TS-334  2     7.15  TS-334_1
1004      100401    2013-02-03  TS-334  2     7.15  TS-334_0
1008      100801    2013-04-12  SST     8     8.58  SST_1
1008      100801    2013-04-13  SST     8     8.58  SST_0
...       ...       ...         ...     ...   ...   ...
1008      100801    2013-07-31  SST     2     9.20  SST_0  
1008      100801    2013-08-01  SST     2     9.20  SST_0
...       ...       ...         ...     ...   ...   ... 
1008      100802    2013-08-01  SST     2     9.20  SST_1
1008      100802    2013-08-02  SST     2     9.20  SST_0
...       ...       ...         ...     ...   ...   ... 
1008      100802    2013-08-12  SST     2     9.20  SST_0

As can be seen, the above R code, however, is not yielding the expected values for the variable, spell_num. For example, instead of the values SST_1, SST_0,...,SST_0 for style SST produced by line 100801 from 2013-04-13 till 2013-08-01, I want it to be rather SST_1, SST_1,...,SST_1, which shows that the style is currently is in its first spell of production.

The spell_num changes to SST_2 i.e. the second spell of production for style SST, when it is produced by a different factory line, for example, 100802.

Therefore, my question is how can I get the expected values for the variable spell_num given the above dataset and description?

Any help would be appreciated.

Upvotes: 0

Views: 100

Answers (2)

David Klotz
David Klotz

Reputation: 2431

library(tidyverse)

mydf %>% nest(-style, -line_code) %>% 
  group_by(style) %>% 
  mutate(index = row_number()) %>% 
  unnest %>% 
  mutate(spell_num = paste(style, index, sep = "_"))

Upvotes: 1

Gregor Thomas
Gregor Thomas

Reputation: 145755

With data.table.

setDT(mydf) # convert to data.table
setkey(mydf, style, linecode) # set key to ensure sorting
mydf[, spell_num := paste(style, rleid(line_code), sep = "_"), by = .(style)]

Upvotes: 3

Related Questions