Recode variables within a group in R

Question

For a research project, I need to change the coding of sample IDs (variable: Samp_ID) that come from different articles (variable: Art_ID). My data is in long format. In the original data set, the sample IDs were coded as an ascending number across all coded articles. If an article uses the same sample multiple times, the sample ID is the same in multiple rows. However, if different samples were used, the sample ID differs within the same "group" of Art_ID. The data set looks like this:

df_original <- read.table(text=
"Art_ID   Samp_ID         
1         1                
2         2           
2         2          
2         2         
3         3 
4         4 
4         5          
5         6      
6         7
7         8  
7         8
7         8  
7         9   
7         9   
7         9
8         10", header=TRUE)

However, I would like to have the sample ID coded with an ascending number within each article. Thus, if only one sample has been used, each row for this article should be coded as 1. If two different samples have been used within one article, the rows using the first sample should be coded as 1 and the rows using the second sample should be coded as 2 (as in Art_ID == 4 in df_new). Finally, I aim to create a variable that is the combination of Art_ID and Samp_ID.

I would like the new data set to look like this:

df_new <- read.table(text=
"Art_ID   Samp_ID   Art_Samp_ID      
1         1         1_1       
2         1         2_1  
2         1         2_1 
2         1         2_1
3         1         3_1
4         1         4_1
4         2         4_2        
5         1         5_1     
6         1         6_1
7         1         7_1 
7         1         7_1
7         1         7_1
7         2         7_2  
7         2         7_2
7         2         7_2
8         1         8_1", header=TRUE)

To create the variable Art_Samp_ID, I would use this code:

df_new$Art_Samp_ID <- as.factor(paste(df_new$Art_ID, df_new$Samp_ID, sep = "_"))

Does anyone know how to recode Samp_ID most efficiently (e.g., by using the tidyverse)? I would be grateful for any advice!

Ronak Shah · Accepted Answer

You can use dense_rank() within each Art_ID group to recode Samp_ID, and unite() to create Art_Samp_ID.

library(dplyr)
library(tidyr)

df_original %>%
  group_by(Art_ID) %>%
  # A few other options that give the same Samp_ID here:
  #   Samp_ID = match(Samp_ID, unique(Samp_ID))
  #   Samp_ID = as.integer(factor(Samp_ID))
  mutate(Samp_ID = dense_rank(Samp_ID)) %>%
  ungroup() %>%
  # remove = FALSE keeps Art_ID and Samp_ID alongside the new column
  unite(Art_Samp_ID, Art_ID, Samp_ID, remove = FALSE)

#   Art_Samp_ID Art_ID Samp_ID
#   <chr>        <int>   <int>
# 1 1_1              1       1
# 2 2_1              2       1
# 3 2_1              2       1
# 4 2_1              2       1
# 5 3_1              3       1
# 6 4_1              4       1
# 7 4_2              4       2
# 8 5_1              5       1
# 9 6_1              6       1
#10 7_1              7       1
#11 7_1              7       1
#12 7_1              7       1
#13 7_2              7       2
#14 7_2              7       2
#15 7_2              7       2
#16 8_1              8       1
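
The options mentioned in the answer agree on df_original because sample IDs there ascend within each article. They can differ when that is not the case: dense_rank() and as.integer(factor()) number samples by sorted value, while match(x, unique(x)) numbers them by order of first appearance. The following is a minimal base R sketch of that difference, assuming a hypothetical toy data frame (df_toy, not from the question) and using ave() in place of group_by() + mutate():

```r
# Hypothetical toy data (not from the question): in article 1 the larger
# sample ID (5) appears before the smaller one (3).
df_toy <- data.frame(
  Art_ID  = c(1, 1, 1, 2),
  Samp_ID = c(5, 3, 5, 7)
)

# Number samples by order of first appearance within each article.
df_toy$by_appearance <- ave(
  df_toy$Samp_ID, df_toy$Art_ID,
  FUN = function(x) match(x, unique(x))
)
# by_appearance: 1 2 1 1

# Number samples by sorted value within each article
# (the same ordering that dense_rank() produces).
df_toy$by_value <- ave(
  df_toy$Samp_ID, df_toy$Art_ID,
  FUN = function(x) as.integer(factor(x))
)
# by_value: 2 1 2 1

# Combined key built with paste(), as in the question.
df_toy$Art_Samp_ID <- paste(df_toy$Art_ID, df_toy$by_appearance, sep = "_")

df_toy
```

Which numbering is preferable depends on whether "first sample" should mean the first one encountered in the rows or the one with the smallest original ID.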
