I have this vector called Identifier:
c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b",
"NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)
I'd like to extract the OA from each element.
I've tried:
gsub(".*\\.(.*)\\..*", "\\1", Identifier)
Basically, I'd like to extract the text between the second and third periods. If there are only two periods (NC.1.OA), I'd like to extract everything after the second period.
Upvotes: 1
Views: 1067
Reputation: 11490
regmatches(Identifier, gregexpr("OA", Identifier))
Wrap the call in unlist() if you need a vector:
unlist(
regmatches(Identifier, gregexpr("OA", Identifier))
)
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
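The call above matches the literal text "OA", so it only works when the third segment is known in advance. As a rough sketch of a position-based variant (assuming the third segment is always present, and using a pattern that is not part of the original answer), regexpr with perl = TRUE and \\K can return whatever sits after the second period:

```r
Identifier <- c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b",
                "NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4")

# Skip two "segment + period" pairs, then \\K drops them from the reported
# match so that only the third segment is returned.
m <- regexpr("^(?:[^.]+\\.){2}\\K[^.]+", Identifier, perl = TRUE)
third <- regmatches(Identifier, m)
third
```

Note that regmatches silently drops elements with no match, so the result can be shorter than the input if some identifiers have fewer than three segments.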
Upvotes: 0
Reputation: 23109
We could try stringr too:
Identifier = c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b",
"NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)
library(stringr)
str_extract(Identifier, ".OA.")
# [1] NA ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA."
str_extract(Identifier, "OA")
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
gsub('\\.', '', str_extract(Identifier, ".OA.?"))
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
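In the patterns above the unescaped . matches any character, and "OA" is hard-coded. If the goal is the third period-separated segment regardless of its content, one possible stringr sketch (my own variant, assuming every identifier has at least three segments) splits on a literal period instead:

```r
library(stringr)

Identifier <- c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b",
                "NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4")

# fixed(".") treats the period literally; n = 4 keeps everything after the
# third period together in the fourth column.
parts <- str_split_fixed(Identifier, fixed("."), 4)
third <- parts[, 3]
third
```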
Upvotes: 0
Reputation: 522171
Here is an alternative to sub using strsplit with sapply:
sapply(Identifier, function(x) unlist(strsplit(x, "\\."))[3])
NC.1.OA NC.1.OA.0 NC.1.OA.1 NC.1.OA.1.a NC.1.OA.1.b NC.1.OA.1.c
"OA" "OA" "OA" "OA" "OA" "OA"
NC.1.OA.2 NC.1.OA.2.0 NC.1.OA.3 NC.1.OA.4
"OA" "OA" "OA" "OA"
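If the names in the output are unwanted, a small variation (a sketch, not part of the original answer) relies on strsplit already being vectorized and uses vapply to pick the third piece from each element:

```r
Identifier <- c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b",
                "NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4")

# fixed = TRUE splits on a literal period, so no regex escaping is needed.
parts <- strsplit(Identifier, ".", fixed = TRUE)

# vapply checks that each result is a single string; an element with fewer
# than three pieces yields NA.
third <- vapply(parts, `[`, character(1), 3)
third
```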
Upvotes: 1
Reputation: 371019
Repeat (non-periods, followed by a period) twice, then capture non-periods, and the substring you want is in that captured group:
Identifier = c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b",
"NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)
gsub("(?:[^.]+\\.){2}([^.]+).*", "\\1", Identifier)
Output:
[1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
To elaborate, (?:[^.]+\\.) is a non-capturing group that matches one or more non-period characters followed by a single period. The {2} after the group means that the preceding token (the group) must match exactly twice, that is, "non-periods, followed by a period, followed by non-periods, followed by a period". Then the final ([^.]+) matches as many non-period characters as it can after the second period, thereby capturing the text between the second period and the third period (or the end of the string). The trailing .* consumes the rest of the string, so gsub replaces the whole input with the captured group.
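To see why the pattern from the question behaves differently, here is a small comparison on a made-up sample (the ids vector, including "NC.2.NBT.3", is hypothetical and only for illustration). I pass perl = TRUE so both patterns run under the same engine, and anchor the second pattern with ^ and $, which is an optional tightening rather than something the answer requires:

```r
ids <- c("NC.1.OA", "NC.1.OA.1.a", "NC.2.NBT.3")  # hypothetical sample

# Pattern from the question: the leading .* is greedy, so the capture ends up
# between the LAST two periods rather than after the second one.
attempt <- sub(".*\\.(.*)\\..*", "\\1", ids, perl = TRUE)
attempt
# [1] "1"   "1"   "NBT"

# Counting segments from the start pins the capture to the third segment.
third <- sub("^(?:[^.]+\\.){2}([^.]+).*$", "\\1", ids, perl = TRUE)
third
# [1] "OA"  "OA"  "NBT"
```

Since each string can match at most once from the start, sub and gsub give the same result here.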
Upvotes: 3