Reputation: 157
I have a column with these type of names:
sp_O00168_PLM_HUMAM
sp_Q8N1D5_CA158_HUMAN
sp_Q15818_NPTX1_HUMAN
tr_Q6FGH5_Q6FGH5_HUMAN
sp_Q9UJ99_CAD22_HUMAN
I want to remove everything before, and including, the second _ and everything after, and including, the third _.
I do not which to remove based on number of characters, since this is not a fixed number.
The output should be:
PLM
CA158
NPTX1
Q6FGH5
CAD22
I have played around with these, but don't quite get it right..
library(stringer)
str_sub(x,-6,-1)
Upvotes: 2
Views: 571
Reputation: 81693
Another possible regular expression is this:
sub("^(?:.+_){2}(.+?)_.+", "\\1", vec)
# [1] "PLM" "CA158" "NPTX1" "Q6FGH5" "CAD22"
where vec
is your vector of strings.
A visual explanation:
Upvotes: 2
Reputation: 545676
That’s not really a subset in programming terminology1, it’s a substring. In order to extract partial strings, you’d usually use regular expressions (pretty much regardless of language); in R, this is accessible via sub
and other related functions:
pattern = '^.*_.*_([^_]*)_.*$'
result = sub(pattern, '\\1', strings)
1 Aside: taking a subset is, as the name says, a set operation, and sets are defined by having no duplicate elements and there’s no particular order to the elements. A string by contrast is a sequence which is a very different concept.
Upvotes: 2
Reputation: 4335
> gsub(".*_.*_(.*)_.*", "\\1", "sp_O00168_PLM_HUMAM")
[1] "PLM"
Upvotes: 0