Reputation: 81
I have a data frame in R, shown below:
  bacteria sample
1 A        HM_001
2 B        HM_001_HM_001
3 C        A2_HM_001
4 D        A2_HM_001_HM_001
5 E        HM_002
6 F        HM_002_HM_002
7 G        A2_HM_002
8 H        A2_HM_002_HM_002
I want to remove the duplicated substrings in the sample column
so that the final output looks like this:
  bacteria sample
1 A        HM_001
2 B        HM_001
3 C        A2_HM_001
4 D        A2_HM_001
5 E        HM_002
6 F        HM_002
7 G        A2_HM_002
8 H        A2_HM_002
Upvotes: 1
Views: 202
Reputation: 887028
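You can do this with a regex backreference in gsub: capture the token (uppercase letters, an underscore, then digits) and replace the token plus its immediate repeat with a single copy.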
df1$sample_new <- with(df1, gsub("([A-Z]+_\\d+)_?\\1+", "\\1", sample))
Output:
df1
#  bacteria sample           sample_new
#1 A        HM_001           HM_001
#2 B        HM_001_HM_001    HM_001
#3 C        A2_HM_001        A2_HM_001
#4 D        A2_HM_001_HM_001 A2_HM_001
#5 E        HM_002           HM_002
#6 F        HM_002_HM_002    HM_002
#7 G        A2_HM_002        A2_HM_002
#8 H        A2_HM_002_HM_002 A2_HM_002
Data:
df1 <- structure(list(bacteria = c("A", "B", "C", "D", "E", "F", "G",
"H"), sample = c("HM_001", "HM_001_HM_001", "A2_HM_001", "A2_HM_001_HM_001",
"HM_002", "HM_002_HM_002", "A2_HM_002", "A2_HM_002_HM_002")),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
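To make the pattern easier to adapt, here is a small sketch that breaks it into its parts and runs it on the sample values from the question. The `dedupe_token` helper at the end is hypothetical and not part of the answer above; it assumes repeats are always separated by an underscore and uses `perl = TRUE` so that runs of three or more copies also collapse.

```r
# Sample values copied from the question
sample <- c("HM_001", "HM_001_HM_001", "A2_HM_001", "A2_HM_001_HM_001",
            "HM_002", "HM_002_HM_002", "A2_HM_002", "A2_HM_002_HM_002")

# ([A-Z]+_\\d+)  capture a token such as "HM_001" (uppercase letters, "_", digits)
# _?             an optional underscore between the token and its repeat
# \\1+           one or more further copies of the captured token
pattern <- "([A-Z]+_\\d+)_?\\1+"

# Replace the token plus its repeat with the captured token alone
cleaned <- gsub(pattern, "\\1", sample)
cleaned

# Hypothetical helper (an assumption, not from the answer): collapses any
# number of underscore-separated repeats of the same token
dedupe_token <- function(x) {
  gsub("([A-Z]+_\\d+)(_\\1)+", "\\1", x, perl = TRUE)
}

dedupe_token("A2_HM_001_HM_001_HM_001")
# [1] "A2_HM_001"
```

The "A2_" prefix survives because `[A-Z]+` must be followed directly by an underscore, so the match can only start at "HM". If your identifiers contain lowercase letters or other separators, the character classes would need adjusting.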
Upvotes: 1