Ömer Coskun

Reputation: 49

Replacing given characters with new ones before a defined pattern using gsub

I am fairly new to R and I am struggling with a problem. I want to replace every underscore that comes before the "S11" pattern with a dash ("-"). The number after the S varies in my table (for example S29 or S30). Here is the code I am using, which does not do what I want:

foo <- c("H2_2months_S11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_", "H2_2months_with_acetate_S101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_", "Formate_3months_S99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_")
Sample <- gsub(pattern="*(_S)", replacement="-", x=foo)

Getting:
[1] "H2_2months-11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[2] "H2_2months_with_acetate-101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[3] "Formate_3months-99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"

I also don't want the "S" to be deleted along with the underscore. I use "_S[0-9]" as the matching criterion, and every underscore up to and including the one in front of the "S" should be changed to "-".

Also, please recommend a good website where I can learn the "codes or signs" used in this function. Thanks in advance.

Expected output:
[1] "H2-2months-S11_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[2] "H2-2months-with-acetate-S101_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"
[3] "Formate-3months-S99_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"

Upvotes: 1


Answers (2)

Tom Haddow

Reputation: 230

This matches "_S11" and captures "S11" in a group, then replaces the whole match with a "-" followed by the captured group "S11".

Sample <- gsub("_(S[0-9]+)", "-\\1", foo)
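
As a rough illustration of what the pieces of such a pattern do (the `ids` vector and its values are made up for this sketch): `[0-9]+` means one or more digits, the parentheses capture whatever they enclose, and `\\1` in the replacement puts the captured text back.

```r
# Hypothetical sample names, only for illustration
ids <- c("a_b_S7_x", "a_b_S123_x")

# "_" followed by "S" and one or more digits; the parentheses capture "S<digits>"
pattern <- "_(S[0-9]+)"

# "\\1" re-inserts the captured group, so only the underscore is swapped for "-"
res <- sub(pattern, "-\\1", ids)

# res[1] is "a_b-S7_x" and res[2] is "a_b-S123_x"
```

Note that only the underscore directly in front of the S changes here; the underscores earlier in the string are left as they are.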

Excellent place to learn more regex: https://www.regular-expressions.info/quickstart.html

Excellent place to test regex with explanations of the matching: https://regexr.com/

Edit: Thanks RLave, didn't realise it could be any digits after the S. Updated answer.

Upvotes: 1

RLave

Reputation: 8364

This should work.

Basically we divide the job into three steps: first we match "_(S[0-9]+)" and replace its underscore with "-", then we split the resulting string at that "-", and finally we use gsub to replace all the "_" in the part before the split.

foo <- c("H2_2months_S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_")
foo <- gsub(pattern="_(S[0-9]+)", replacement="-\\1", x=foo)
#foo
#[1] "H2_2months-S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"

Then we split:

split <- unlist(strsplit(foo, "-")) # split using the new "-"
#split
#[1] "H2_2months"                                                     
#[2] "S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"

Now we can use simple gsub on everything except the last element in split.

split_1 <- split[-length(split)] # fix all the "_" before the match (exclude the last)
split_1 <- gsub("_", "-", split_1)

Then we paste the results:

paste0(split_1, "-", split[length(split)]) # paste back together
#[1] "H2-2months-S123_L001_R1_001_(paired)_trimmed_(paired)_contig_940_[cov=11]_"

Here it is wrapped in a function, with another example:

foo <- c("H2_2months_abc_456_S123_L001_R1_001")

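my_foo <- function(s) {
  # mark the underscore in front of the sample ID (e.g. "_S123") with a "-"
  s <- gsub(pattern="_(S[0-9]+)", replacement="-\\1", x=s)
  # split at that "-": the part before it vs. the sample ID and the rest
  split <- unlist(strsplit(s, "-"))

  # replace the remaining "_" in the part before the sample ID
  split_1 <- split[-length(split)]
  split_1 <- gsub("_", "-", split_1)

  # glue the two parts back together
  paste0(split_1, "-", split[length(split)])
}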

my_foo(foo)
#[1] "H2-2months-abc-456-S123_L001_R1_001"
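
If the input is a whole vector of names, or a name might already contain a "-", a variant that skips the split step could look like the sketch below. It is only one possible approach: `dash_before_sample` is a made-up name, the example strings are shortened for illustration, and it assumes the first "_S<digits>" in each string is the sample ID.

```r
# Hypothetical helper: replace every "_" up to and including the one
# in front of the first "S<digits>" with "-", leaving the rest untouched.
dash_before_sample <- function(x) {
  # position of the first "_S<digits>" in each string, -1 if there is none
  pos <- as.integer(regexpr("_S[0-9]+", x))
  hit <- !is.na(pos) & pos > 0

  # part up to and including that underscore, and everything after it
  prefix <- substr(x[hit], 1, pos[hit])
  rest <- substr(x[hit], pos[hit] + 1, nchar(x[hit]))

  # only the prefix gets its underscores replaced
  x[hit] <- paste0(gsub("_", "-", prefix, fixed = TRUE), rest)
  x
}

# Shortened, made-up inputs for illustration
bar <- c("H2_2months_S11_L001_R1_001",
         "H2_2months_with_acetate_S101_L001",
         "no_sample_id")
out <- dash_before_sample(bar)
# out[1] is "H2-2months-S11_L001_R1_001"
# out[2] is "H2-2months-with-acetate-S101_L001"
# out[3] is unchanged, because it has no "_S<digits>" part
```

Strings without a match are returned as they are, which may or may not be what you want for your table.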

Upvotes: 1
