Reputation: 125
For a research project, I need to change the coding of sample IDs (variable: Samp_ID) that stem from different articles (variable: Art_ID). My data is in long format. In the original data set, the sample IDs were coded as an ascending number across all coded articles. If an article uses the same sample multiple times, the sample ID is the same in multiple rows. However, if different samples were used, the sample ID differs within the same "group" of Art_ID. The data set looks like this:
df_original <- read.table(text=
"Art_ID Samp_ID
1 1
2 2
2 2
2 2
3 3
4 4
4 5
5 6
6 7
7 8
7 8
7 8
7 9
7 9
7 9
8 10", header=TRUE)
However, I would like to have the sample ID coded as an ascending number within each article. Thus, if only one sample has been used, each row for this article should be coded as 1. If two different samples have been used within one article, the rows using the first sample should be coded as 1 and the rows using the second sample should be coded as 2 (as in Art_ID == 4 in df_new). Finally, I aim to create a variable that is the combination of Art_ID and Samp_ID.
I would like the new data set to look like this:
df_new <- read.table(text=
"Art_ID Samp_ID Art_Samp_ID
1 1 1_1
2 1 2_1
2 1 2_1
2 1 2_1
3 1 3_1
4 1 4_1
4 2 4_2
5 1 5_1
6 1 6_1
7 1 7_1
7 1 7_1
7 1 7_1
7 2 7_2
7 2 7_2
7 2 7_2
8 1 8_1", header=TRUE)
To create the variable Art_Samp_ID, I would use this code:
df_new$Art_Samp_ID <- as.factor(paste(df_new$Art_ID, df_new$Samp_ID, sep = "_"))
Does anyone know how to recode Samp_ID most efficiently (e.g., by using tidyverse)? I am grateful for any advice!
Upvotes: 1
Views: 599
Reputation: 16998
A slightly different method using dplyr:
library(dplyr)
df_original %>%
  group_by(Art_ID) %>%
  mutate(Samp_ID = 1 + cumsum(Samp_ID != lag(Samp_ID, default = first(Samp_ID))),
         Art_Samp_ID = paste(Art_ID, Samp_ID, sep = "_")) %>%
  ungroup()
returns
# A tibble: 16 x 3
Art_ID Samp_ID Art_Samp_ID
<int> <dbl> <chr>
1 1 1 1_1
2 2 1 2_1
3 2 1 2_1
4 2 1 2_1
5 3 1 3_1
6 4 1 4_1
7 4 2 4_2
8 5 1 5_1
9 6 1 6_1
10 7 1 7_1
11 7 1 7_1
12 7 1 7_1
13 7 2 7_2
14 7 2 7_2
15 7 2 7_2
16 8 1 8_1
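One caveat worth checking before relying on this: the cumsum/lag expression counts changes between consecutive rows, so it appears to assume that all rows of one sample sit next to each other within an article. The sketch below uses a made-up data frame (df_check and the column names by_run and by_value are hypothetical, not from the question) in which a sample reappears after a different one, and compares the result with a match()-based numbering.

```r
library(dplyr)

# Hypothetical data: within article 1, sample 1 reappears after sample 2
df_check <- data.frame(Art_ID = c(1, 1, 1), Samp_ID = c(1, 2, 1))

res <- df_check %>%
  group_by(Art_ID) %>%
  mutate(
    # counts runs: a new number each time the value differs from the previous row
    by_run = 1 + cumsum(Samp_ID != lag(Samp_ID, default = first(Samp_ID))),
    # numbers distinct values in order of first appearance
    by_value = match(Samp_ID, unique(Samp_ID))
  ) %>%
  ungroup()

res$by_run    # 1 2 3
res$by_value  # 1 2 1
```

On the data in the question both columns would agree, because rows of the same sample are contiguous there. If that is not guaranteed, sorting by Art_ID and Samp_ID first, or using a value-based numbering, is probably the safer choice.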
Upvotes: 2
Reputation: 389325
You can use dense_rank for Samp_ID and unite to create Art_Samp_ID.
library(dplyr)
library(tidyr)
df_original %>%
  group_by(Art_ID) %>%
  mutate(Samp_ID = dense_rank(Samp_ID)) %>%
  # A few other options to get Samp_ID would be
  #   Samp_ID = match(Samp_ID, unique(Samp_ID))
  #   Samp_ID = as.integer(factor(Samp_ID))
  ungroup() %>%
  unite(Art_Samp_ID, Art_ID, Samp_ID, remove = FALSE)
# Art_Samp_ID Art_ID Samp_ID
# <chr> <int> <int>
# 1 1_1 1 1
# 2 2_1 2 1
# 3 2_1 2 1
# 4 2_1 2 1
# 5 3_1 3 1
# 6 4_1 4 1
# 7 4_2 4 2
# 8 5_1 5 1
# 9 6_1 6 1
#10 7_1 7 1
#11 7_1 7 1
#12 7_1 7 1
#13 7_2 7 2
#14 7_2 7 2
#15 7_2 7 2
#16 8_1 8 1
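If loading dplyr and tidyr is not an option, the match() alternative mentioned in the comments can be sketched in base R with ave(), which applies a function within each group. This is only an illustration under the assumption that the data match the question's df_original (rebuilt here with data.frame() so the block is self-contained); df_base is a hypothetical name.

```r
# Same values as df_original in the question
df_original <- data.frame(
  Art_ID  = c(1, 2, 2, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 7, 7, 8),
  Samp_ID = c(1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 8, 8, 9, 9, 9, 10)
)

df_base <- df_original

# Within each article, number the distinct sample IDs by first appearance
df_base$Samp_ID <- ave(
  df_base$Samp_ID, df_base$Art_ID,
  FUN = function(x) match(x, unique(x))
)

# Combine article ID and recoded sample ID
df_base$Art_Samp_ID <- paste(df_base$Art_ID, df_base$Samp_ID, sep = "_")
```

Note that match(x, unique(x)) numbers samples in order of first appearance, whereas dense_rank() and as.integer(factor(x)) number them by sorted value. The two orderings coincide here because the original IDs ascend within each article.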
Upvotes: 2