Reputation: 49
I am not very experienced in R and I am struggling with a problem. I want to replace all the underscores that come before the "S11" pattern with dashes ("-"). S11 is just an example; the number varies in my table (S29, S30, and so on). Here is the code I am using, which fails:
foo <- c("H2_2months_S11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_", "H2_2months_with_acetate_S101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_", "Formate_3months_S99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_")
Sample <- gsub(pattern="*(_S)", replacement="-", x=foo)
Getting:
[1] "H2_2months-11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[2] "H2_2months_with_acetate-101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[3] "Formate_3months-99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
I also don't want the "S" to be deleted along with the underscore. I use "_S[0-9]" as the matching criterion, and every underscore up to and including the one in "_S" should be changed to "-".
Also, please recommend a good website where I can learn the "codes or signs" used in this function. Thanks in advance.
Expected output:
[1] "H2-2months-S11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[2] "H2-2months-with-acetate-S101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[3] "Formate-3months-S99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
Upvotes: 1
Views: 44
Reputation: 230
This matches "_S11" and captures "S11" in a group, then replaces the match with "-" followed by the captured group.
Sample <- gsub("_(S[0-9]+)", "-\\1", foo)
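To make the capture group concrete, here is a small sketch on shortened versions of the question's strings (the shortened names and the variable step1 are only for illustration). The pattern is written with [0-9]+ so that the + quantifies the digit class. Note that a single call like this changes only the underscore directly in front of the sample number; underscores earlier in the name are left as they are.

```r
foo <- c("H2_2months_S11_L001_R1_001", "Formate_3months_S99_L001_R1_001")

# "_(S[0-9]+)" matches an underscore followed by "S" and one or more digits.
# The parentheses capture "S<digits>", and "\\1" puts that text back after the dash.
step1 <- gsub("_(S[0-9]+)", "-\\1", foo)
print(step1)
# [1] "H2_2months-S11_L001_R1_001"      "Formate_3months-S99_L001_R1_001"
```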
Excellent place to learn more regex: https://www.regular-expressions.info/quickstart.html
Excellent place to test regex with explanations of the matching: https://regexr.com/
Edit: Thanks RLave, didn't realise it could be any digits after the S. Updated answer.
Upvotes: 1
Reputation: 8364
This should work.
Basically we divide the job into parts: first we match "_(S[0-9]+)" and turn its underscore into "-", then we split the resulting string at "-", then we use gsub to fix all the "_" in the part before the match.
foo <- c("H2_2months_S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_")
foo <- gsub(pattern="_(S[0-9]+)", replacement="-\\1", x=foo)
#foo
#[1] "H2_2months-S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
Then we split:
split <- unlist(strsplit(foo, "-")) # split using the new "-"
#split
#[1] "H2_2months"
#[2] "S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
Now we can use a simple gsub on everything except the last element in split.
split_1 <- split[-length(split)] # fix all the "_" before the match (exclude the last)
split_1 <- gsub("_", "-", split_1)
Then we paste the results:
paste0(split_1, "-", split[length(split)]) # paste back together
#[1] "H2-2months-S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
Here in a function and with another example:
foo <- c("H2_2months_abc_456_S123_L001_R1_001")
my_foo <- function(s) {
  # mark the underscore before "S<digits>" with a dash
  s <- gsub(pattern="_(S[0-9]+)", replacement="-\\1", x=s)
  split <- unlist(strsplit(s, "-"))
  split_1 <- split[-length(split)]   # everything before the match
  split_1 <- gsub("_", "-", split_1)
  paste0(split_1, "-", split[length(split)])
}
my_foo(foo)
#[1] "H2-2months-abc-456-S123_L001_R1_001"
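Note that my_foo is written for one string at a time: because it calls unlist(strsplit(...)), passing the whole vector from the question would mix the pieces of different strings. A possible vectorised variant is sketched below. It is not part of the answer above; the helper name dash_before_sample is made up for illustration, and it assumes the input has no NA values and that the first "_S<digits>" in each string is the sample number.

```r
# Hypothetical helper: turn every "_" into "-" up to and including the
# underscore in front of the first "S<digits>" token of each string.
dash_before_sample <- function(x) {
  pos <- regexpr("_S[0-9]+", x)   # start of the first "_S<digits>", -1 if absent
  hit <- pos > 0                  # strings without a match are left untouched
  head <- substr(x[hit], 1, pos[hit])                   # prefix, including that "_"
  tail <- substr(x[hit], pos[hit] + 1, nchar(x[hit]))   # "S<digits>" and the rest
  x[hit] <- paste0(chartr("_", "-", head), tail)        # dash only the prefix
  x
}

foo <- c("H2_2months_S11_L001_R1_001",
         "H2_2months_with_acetate_S101_L001_R1_001",
         "Formate_3months_S99_L001_R1_001")
dash_before_sample(foo)
# [1] "H2-2months-S11_L001_R1_001"
# [2] "H2-2months-with-acetate-S101_L001_R1_001"
# [3] "Formate-3months-S99_L001_R1_001"
```

Using regexpr plus substr avoids splitting on "-", so a name that already contains a dash after the sample number is not cut in the wrong place.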
Upvotes: 1