HSJ

Reputation: 841

How to renumber group id sequentially in R?

I am trying to renumber a grouped unique id sequentially using dplyr in R. The data frame has six columns, as below.

## Load package if necessary
library(tidyverse)

## Set data frame
df <- data.frame(
    hid=c(10001,10001,10001,10001,10002,10002,10002,10002,10002,
          10003,10003,10003,10003,10003,10003,10004,10004,10004,10004,10004),
    mid=c(1,2,3,4,1,2,3,4,5,1,2,3,4,5,6,1,2,3,4,5),
    tmc=c(010,01010,0,01020,010,010,010,010,010,010,010,010,0,010,010,010,0,01010,010,01010),
    thc=c(010,01010,0,02030,010,020,020,020,030,010,010,010,0,020,030,010,0,02020,030,04040),
    mdc=c(000,01010,0,02020,000,010,010,010,010,000,000,010,0,010,020,000,0,02020,010,01010),
    itc=c(010,01010,0,02020,020,020,020,020,020,010,010,010,0,020,020,010,0,02020,020,02020)
    )

Each row is given an id based on its combination of four columns: tmc, thc, mdc and itc.

## Add unique id grouped by tmc, thc, mdc and itc
df.id <- df %>% mutate(id=as.numeric(interaction(tmc,thc,mdc,itc)))

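As this does not give sequential ids, I need to renumber them, but I could not find a solution for that. The conditions (as shown in the id.desired column below) are:

  1. id is 0 when tmc, thc, mdc and itc are all 0.
  2. Every other combination is numbered 1, 2, 3, ... in order of first appearance.

The script below shows what I am currently doing. id is the temporary id obtained from the interaction function, but I need the sequential id shown in the id.desired column.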

## Replace unique id sequentially
## IT DOES NOT GIVE DESIRED OUTPUT
# df.id %>% group_by(id) %>% mutate(id2=seq_along(id))

## Desired id is shown in `id.desired`
## `id` is the ones obtained from `interaction` function, which are not set sequentially
     hid   mid   tmc   thc   mdc   itc    id   id.desired
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
 1 10001     1    10    10     0    10   166     1
 2 10001     2  1010  1010  1010  1010   595     2
 3 10001     3     0     0     0     0     1     0
 4 10001     4  1020  2030  2020  2020   796     3
 5 10002     1    10    10     0    20   326     4
 6 10002     2    10    20    10    20   362     5
 7 10002     3    10    20    10    20   362     5
 8 10002     4    10    20    10    20   362     5
 9 10002     5    10    30    10    20   366     6
10 10003     1    10    10     0    10   166     1
11 10003     2    10    10     0    10   166     1
12 10003     3    10    10    10    10   198     7
13 10003     4     0     0     0     0     1     0
14 10003     5    10    20    10    20   362     5
15 10003     6    10    30    20    20   398     8
16 10004     1    10    10     0    10   166     1
17 10004     2     0     0     0     0     1     0
18 10004     3  1010  2020  2020  2020   791     9
19 10004     4    10    30    10    20   366     6
20 10004     5  1010  4040  1010  2020   767    10

Any suggestions? I prefer to use dplyr in this operation.

I received some suggestions on a previous question, "How to renumber result of intersection/group_indices in R?", but the structure is not the same in this case (the dummy field does not exist in the current data frame).

Upvotes: 2

Views: 2999

Answers (3)

www

Reputation: 39154

A solution using the tidyverse. Note that I did not use the interaction function. Instead, I used the group_indices function from dplyr to create the group index, converted it to a factor, and set the levels by order of first occurrence in the column. Rows where all four columns are 0 are filtered out first and then joined back with id = 0. df2 is the final output.

library(tidyverse)

df2 <- df %>%
  filter_at(vars(tmc, thc, mdc, itc), any_vars(. != 0)) %>%
  mutate(id = group_indices(., tmc, thc, mdc, itc)) %>%
  mutate(id = as.numeric(factor(id, levels = unique(id)))) %>%
  left_join(df, ., by = names(df)) %>%
  replace_na(list(id = 0))
df2
#      hid mid  tmc  thc  mdc  itc id
# 1  10001   1   10   10    0   10  1
# 2  10001   2 1010 1010 1010 1010  2
# 3  10001   3    0    0    0    0  0
# 4  10001   4 1020 2030 2020 2020  3
# 5  10002   1   10   10    0   20  4
# 6  10002   2   10   20   10   20  5
# 7  10002   3   10   20   10   20  5
# 8  10002   4   10   20   10   20  5
# 9  10002   5   10   30   10   20  6
# 10 10003   1   10   10    0   10  1
# 11 10003   2   10   10    0   10  1
# 12 10003   3   10   10   10   10  7
# 13 10003   4    0    0    0    0  0
# 14 10003   5   10   20   10   20  5
# 15 10003   6   10   30   20   20  8
# 16 10004   1   10   10    0   10  1
# 17 10004   2    0    0    0    0  0
# 18 10004   3 1010 2020 2020 2020  9
# 19 10004   4   10   30   10   20  6
# 20 10004   5 1010 4040 1010 2020 10
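
In newer dplyr releases, filter_at() is superseded and passing columns to group_indices() is deprecated, so here is a sketch of the same first-appearance numbering without the join. It is not part of the original approach: the helper columns all_zero and key are hypothetical names, and it assumes that pasting the four codes together gives a distinct key for each combination (which holds for the values in the question).

```r
library(dplyr)

# Same data as in the question (R parses 010 as 10, so leading zeros are dropped).
df <- data.frame(
  hid = rep(10001:10004, times = c(4, 5, 6, 5)),
  mid = c(1:4, 1:5, 1:6, 1:5),
  tmc = c(10, 1010, 0, 1020, 10, 10, 10, 10, 10, 10, 10, 10, 0, 10, 10, 10, 0, 1010, 10, 1010),
  thc = c(10, 1010, 0, 2030, 10, 20, 20, 20, 30, 10, 10, 10, 0, 20, 30, 10, 0, 2020, 30, 4040),
  mdc = c(0, 1010, 0, 2020, 0, 10, 10, 10, 10, 0, 0, 10, 0, 10, 20, 0, 0, 2020, 10, 1010),
  itc = c(10, 1010, 0, 2020, 20, 20, 20, 20, 20, 10, 10, 10, 0, 20, 20, 10, 0, 2020, 20, 2020)
)

df2 <- df %>%
  mutate(
    # TRUE for rows that should get id = 0
    all_zero = tmc == 0 & thc == 0 & mdc == 0 & itc == 0,
    # one string per combination of the four grouping columns
    key = paste(tmc, thc, mdc, itc, sep = "_"),
    # position of each key among the non-zero keys, in order of first appearance;
    # all-zero rows match nothing (NA) and are replaced by 0
    id = if_else(all_zero, 0L, match(key, unique(key[!all_zero])))
  ) %>%
  select(-all_zero, -key)

df2$id
```

For the question's data this should reproduce the id column shown above.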

Upvotes: 3

Rachit Kinger

Reputation: 361

(edited mutate based on your clarification in comments)
Here are the two things I tried to do:

  1. To ensure that id = 0 when certain variables are 0, I used if_else in the mutate function with the specific conditions you specified.
  2. To get id.desired I used dense_rank() function.

Here is the code based on the dataset you shared:

df %>% 
   mutate(id = if_else(tmc == 0 & thc == 0  & mdc == 0 & itc == 0, 0,
                       as.numeric(interaction(tmc, thc, mdc, itc, lex.order = TRUE)))) %>% 
   mutate(id.desired = dense_rank(id) - 1)

The output looks like this:

     hid mid  tmc  thc  mdc  itc  id id.desired
1  10001   1   10   10    0   10 227          1
2  10001   2 1010 1010 1010 1010 519          7
3  10001   3    0    0    0    0   0          0
4  10001   4 1020 2030 2020 2020 775         10
5  10002   1   10   10    0   20 228          2
6  10002   2   10   20   10   20 258          4
7  10002   3   10   20   10   20 258          4
8  10002   4   10   20   10   20 258          4
9  10002   5   10   30   10   20 283          5
10 10003   1   10   10    0   10 227          1
11 10003   2   10   10    0   10 227          1
12 10003   3   10   10   10   10 232          3
13 10003   4    0    0    0    0   0          0
14 10003   5   10   20   10   20 258          4
15 10003   6   10   30   20   20 288          6
16 10004   1   10   10    0   10 227          1
17 10004   2    0    0    0    0   0          0
18 10004   3 1010 2020 2020 2020 550          8
19 10004   4   10   30   10   20 283          5
20 10004   5 1010 4040 1010 2020 595          9
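
Note that dense_rank() numbers the groups by the size of id, not by where they first appear, which is why id.desired above differs from the numbering in the question. If order of first appearance is what matters, a possible variation is to replace dense_rank() with match() against the unique non-zero ids. This is only a sketch on a short hypothetical vector of ids; first_seen and id_seq are illustrative names.

```r
# Hypothetical ids, with 0 marking the rows where all four columns are 0.
id <- c(227, 519, 0, 775, 228, 258, 258, 227, 0)

# Non-zero ids in the order they first occur.
first_seen <- unique(id[id != 0])

# Position of each id in that list; rows with id == 0 stay 0.
id_seq <- ifelse(id == 0, 0L, match(id, first_seen))
id_seq
# 1 2 0 3 4 5 5 1 0
```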

Upvotes: 0

MihaiV

Reputation: 157

Not sure how to interpret the id.desired column but here is an example based on the two conditions and using data.table:

 require(data.table)   
 df = data.table(df)   
 df[tmc != 0 & thc != 0 & mdc != 0 & itc != 0, ID := 1:.N, by = .(tmc, thc, mdc, itc)]
 df[is.na(ID), ID := 0]
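
The code above gives a running counter within each group. If a single id per group is wanted instead, numbered in order of first appearance, one option in data.table is the .GRP symbol. The following is a sketch on a small hypothetical table (dt is an illustrative name), and it assumes that only rows where all four columns are 0 should get ID 0.

```r
library(data.table)

# Toy data with the same four grouping columns as the question.
dt <- data.table(
  tmc = c(10, 1010, 0, 10, 10, 10),
  thc = c(10, 1010, 0, 20, 10, 20),
  mdc = c(0, 1010, 0, 10, 0, 10),
  itc = c(10, 1010, 0, 20, 10, 20)
)

# .GRP is the group counter; with by = groups are numbered in order of first appearance.
# Rows where all four columns are 0 are skipped and stay NA for now.
dt[!(tmc == 0 & thc == 0 & mdc == 0 & itc == 0), ID := .GRP, by = .(tmc, thc, mdc, itc)]

# Give the skipped all-zero rows ID 0.
dt[is.na(ID), ID := 0L]
dt$ID
# 1 2 0 3 1 3
```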

Upvotes: 0
