Reputation: 3577
I have 5 million sequences (probes, to be specific) like the ones below, and I need to extract the name from each string.
The names here are 1007_s_at:123:381, 1007_s_at:128:385, and so on.
I am using lapply, but it is taking too much time, and I have several other similar files. Can you suggest a faster way to do this?
nm = c(
"probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
"probe:HG-Focus:1007_s_at:128:385; Interrogation_Position=3615; Antisense;",
"probe:HG-Focus:1007_s_at:133:441; Interrogation_Position=3786; Antisense;",
"probe:HG-Focus:1007_s_at:142:13; Interrogation_Position=3878; Antisense;",
"probe:HG-Focus:1007_s_at:156:191; Interrogation_Position=3443; Antisense;",
"probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;")
extractProbe <- function(x) sub("probe:", "", strsplit(x, ";", fixed=TRUE)[[1]][1], ignore.case=TRUE)
pr = lapply(nm, extractProbe)
Desired output:
1007_s_at:123:381
1007_s_at:128:385
1007_s_at:133:441
1007_s_at:142:13
1007_s_at:156:191
1007_s_at:244:391
Upvotes: 1
Views: 3217
Reputation: 89057
Using regular expressions:
sub("probe:(.*?):(.*?);.*$", "\\2", nm, perl = TRUE)
A bit of explanation:
. means "any character".
.* means "any number of characters".
.*? means "any number of characters, but do not be greedy".
\\1, \\2, etc. refer to what the first, second, etc. pair of parentheses captured.
$ means end of the line (or string).
So here, the pattern matches the whole line and captures two things via the two (.*?): the HG-Focus (or other) thing you don't want as \\1, and your id as \\2. By setting the replacement to \\2, we are effectively replacing the whole string with your id.
I now realize it was not necessary to capture the first thing, so this would work just as well:
sub("probe:.*?:(.*?);.*$", "\\1", nm, perl = TRUE)
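To make the capture groups concrete, here is a small sketch on two of the strings from the question. It relies on sub being vectorized over its input, so the whole character vector can be passed in one call with no lapply. The " | " separator in the first call is only there to display both groups, and the third pattern (negated character classes instead of lazy quantifiers) is an alternative I am assuming is equivalent for strings shaped like these; it is not part of the answer above and I have not timed it on 5 million rows.

```r
nm <- c(
  "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
  "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;"
)

# Show what each capture group holds: \\1 is the array name, \\2 is the id
both <- sub("probe:(.*?):(.*?);.*$", "\\1 | \\2", nm, perl = TRUE)

# Keep only the id, using a single capture group
ids <- sub("probe:.*?:(.*?);.*$", "\\1", nm, perl = TRUE)

# Assumed-equivalent alternative: [^:]* stops at the first colon and
# [^;]* stops at the first semicolon, so no lazy quantifier is needed
ids2 <- sub("^probe:[^:]*:([^;]*);.*$", "\\1", nm)

print(both)
print(ids)
print(identical(ids, ids2))
```

All three calls take the full vector at once, which is where the speed-up over calling a function per element should come from.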
Upvotes: 7
Reputation: 109874
A roundabout technique:
sapply(strsplit(sapply(strsplit(nm, "e:"), "[[", 2), ";"), "[[", 1)
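Unrolled into steps, the one-liner looks roughly like the sketch below. The intermediate names (parts, after_probe, before_semicolon) are mine, and I have swapped sapply for vapply and added fixed = TRUE as a matter of taste. Note that splitting on "e:" assumes that "probe:" is the only place that substring occurs, and that the result still carries the array prefix (HG-Focus: or HTABC:), so one more step is needed to get the bare id.

```r
nm <- c(
  "probe:HG-Focus:1007_s_at:123:381; Interrogation_Position=3570; Antisense;",
  "probe:HTABC:1007_s_at:244:391; Interrogation_Position=3793; Antisense;"
)

# Step 1: split on the "e:" that ends "probe:" and keep the second piece
parts <- strsplit(nm, "e:", fixed = TRUE)
after_probe <- vapply(parts, `[[`, character(1), 2)

# Step 2: split on ";" and keep the first piece
before_semicolon <- vapply(strsplit(after_probe, ";", fixed = TRUE),
                           `[[`, character(1), 1)

# Extra step (not in the one-liner): drop the array name up to the first colon
ids <- sub("^[^:]*:", "", before_semicolon)

print(before_semicolon)
print(ids)
```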
Upvotes: 1