user8248672

Extract out letters between 2nd period and 3rd period in R

I have this vector called Identifier:

c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b", 
"NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)

I'd like to extract the OA part from each element.

I've tried:

gsub(".*\\.(.*)\\..*", "\\1", Identifier)

Basically, I'd like to extract the text between the second and third periods. If there are only two periods (as in NC.1.OA), I'd like everything after the second period.

Upvotes: 1

Views: 1067

Answers (4)

Andre Elrico

Reputation: 11490

regmatches(Identifier, gregexpr("OA", Identifier))

Wrap the call in unlist() if you need a vector:

unlist(
    regmatches(Identifier, gregexpr("OA", Identifier))
)
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"

Upvotes: 0

Sandipan Dey

Reputation: 23109

We could try stringr too:

Identifier = c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b", 
               "NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)
library(stringr)
str_extract(Identifier, ".OA.")
# [1] NA     ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA."
str_extract(Identifier, "OA")
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
gsub('\\.', '', str_extract(Identifier, ".OA.?"))
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"

Upvotes: 0

Tim Biegeleisen

Reputation: 522171

Here is an alternative to sub, using strsplit with sapply:

sapply(Identifier, function(x) unlist(strsplit(x, "\\."))[3])

NC.1.OA   NC.1.OA.0   NC.1.OA.1 NC.1.OA.1.a NC.1.OA.1.b NC.1.OA.1.c 
    "OA"        "OA"        "OA"        "OA"        "OA"        "OA" 
NC.1.OA.2 NC.1.OA.2.0   NC.1.OA.3   NC.1.OA.4 
    "OA"        "OA"        "OA"        "OA" 
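If the element names in the result are not wanted, the same split-and-index idea can be sketched with vapply over the split pieces. This is only an illustrative variant, not part of the original answer; the names parts and third are made up for the example, and only a subset of the vector is used to keep it short.

```r
Identifier <- c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1.a")

# Split on literal periods; fixed = TRUE means "." is not treated as a regex
parts <- strsplit(Identifier, ".", fixed = TRUE)

# Take the third piece of each element; vapply over the list returns
# a plain (unnamed) character vector and checks each result is one string
third <- vapply(parts, function(p) p[3], character(1))
third
# [1] "OA" "OA" "OA"
```

An element with fewer than three pieces would give NA here rather than an error.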

Upvotes: 1

CertainPerformance

Reputation: 371019

Repeat (non-periods, followed by a period) twice, then capture non-periods, and the substring you want is in that captured group:

Identifier = c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b", 
"NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)
gsub("(?:[^.]+\\.){2}([^.]+).*", "\\1", Identifier)

Output:

[1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"

To elaborate, (?:[^.]+\\.) is a non-capturing group that matches one or more non-period characters followed by a single period. The {2} after the group repeats it twice, so the pattern consumes non-periods, a period, non-periods, and a second period. The capturing group ([^.]+) then matches as many non-period characters as it can after the second period, which is the text between the second and third periods (or up to the end of the string if there is no third period). The trailing .* consumes the rest of the string, so the whole input is replaced by \\1, the captured group.
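For comparison, here is a small sketch of why the pattern from the question behaves differently. Its leading .* is greedy, so it runs as far right as it can and the capture ends up between the last two periods rather than the second and third. The anchored sub call below is an assumed variant of the pattern above (sub instead of gsub, explicit ^ and $, and perl = TRUE), not something taken from the original answer; the variable names are made up for the example.

```r
Identifier <- c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1.a")

# Pattern from the question: the greedy leading .* swallows as much as possible,
# so the group captures the text between the LAST two periods
attempt <- gsub(".*\\.(.*)\\..*", "\\1", Identifier)
attempt
# [1] "1"  "OA" "1"

# Anchored variant: skip exactly two "non-periods + period" runs from the start,
# capture the next run of non-periods, and discard the rest of the string
anchored <- sub("^(?:[^.]+\\.){2}([^.]+).*$", "\\1", Identifier, perl = TRUE)
anchored
# [1] "OA" "OA" "OA"
```

Since the pattern matches the whole string at most once, sub is enough here; gsub gives the same result.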

Upvotes: 3
