Reputation: 23
I have a long string that contains one or more occurrences of a keyword, "Realm" in this case. I have been using gsub, but that only returns the word after the last occurrence of the keyword.
The string could look like:
...Attributes \r\n Realm - Afrotropical \r\n IUCN Ecosystem -- Terrestrial biome...
or
...Attributes \r\n Realm - Afrotropical \r\n Realm - Neotropical \r\n . IUCN Ecosystem -- Terrestrial biome...
I have been using the function:
Realm_fun<-function(x){gsub('^.*Realm -\\s*|\\s*IUCN Ecosystem.*$', '', x)}
and then using lapply to run it on all of the strings.
What can I do to get Afrotropical for the first string and Afrotropical, Neotropical for the second?
Upvotes: 2
Views: 74
Reputation: 79228
I do not know exactly what format you need the words in, but one idea is:
regmatches(st, gregexpr("Realm - \\K(\\w+)",st,perl = TRUE))
[[1]]
[1] "Afrotropical"
[[2]]
[1] "Afrotropical" "Neotropical"
If you do not want to split the words:
trimws(gsub("(?m)^.*Realm - (\\w+)|((?!Realm).)*$","\\1 ",st,perl=TRUE))
[1] "Afrotropical" "Afrotropical \nNeotropical"
gsub("(?m)^(?:(?!Real).)*$|[\r\n]|.*Realm - ","",st, perl = TRUE)
[1] "Afrotropical " "Afrotropical Neotropical "
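As an alternative to the gsub calls, the list returned by regmatches can be collapsed into one comma-separated string per input, which is close to the Afrotropical , Neotropical format asked for. This is only a minimal sketch: it assumes st is the two-element vector from the question, and the names matches and realms are purely illustrative.

```r
# Sample input, copied from the question
st <- c(
  "...Attributes \r\n Realm - Afrotropical \r\n IUCN Ecosystem -- Terrestrial biome...",
  "...Attributes \r\n Realm - Afrotropical \r\n Realm - Neotropical \r\n . IUCN Ecosystem -- Terrestrial biome..."
)

# Extract every word that follows "Realm - " (one character vector per string)
matches <- regmatches(st, gregexpr("Realm - \\K\\w+", st, perl = TRUE))

# Collapse each vector into a single comma-separated string
realms <- vapply(matches, paste, character(1), collapse = ", ")
# realms is c("Afrotropical", "Afrotropical, Neotropical")
```

A string with no Realm entry would yield an empty string here, since paste with collapse returns "" for zero-length input.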
Upvotes: 1
Reputation: 151
Use the base functions gregexpr and regmatches.
library(magrittr)
test<-c("...Attributes \r\n Realm - Afrotropical \r\n IUCN Ecosystem -- Terrestrial biome...","...Attributes \r\n Realm - Afrotropical \r\n Realm - Neotropical \r\n . IUCN Ecosystem -- Terrestrial biome...")
test %>% gregexpr("(?<=(Realm - ))[a-zA-Z]+", ., perl = TRUE) %>% regmatches(x = test)
#[[1]]
#[1] "Afrotropical"
#[[2]]
#[1] "Afrotropical" "Neotropical"
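If you would rather not load magrittr, the same extraction can probably be written as a plain nested call. A sketch using the same input and pattern; the name realm_words is just illustrative.

```r
# Same sample input as above
test <- c(
  "...Attributes \r\n Realm - Afrotropical \r\n IUCN Ecosystem -- Terrestrial biome...",
  "...Attributes \r\n Realm - Afrotropical \r\n Realm - Neotropical \r\n . IUCN Ecosystem -- Terrestrial biome..."
)

# gregexpr finds all matches; regmatches pulls the matched substrings out
realm_words <- regmatches(
  test,
  gregexpr("(?<=(Realm - ))[a-zA-Z]+", test, perl = TRUE)
)
# realm_words[[1]] is "Afrotropical"
# realm_words[[2]] is c("Afrotropical", "Neotropical")
```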
Upvotes: 0
Reputation: 11128
You can try this:
stringi::stri_extract_last(st, regex='(?<=Realm - )(\\w+)')
stri_extract_last returns the last match of the regex in each string. The positive lookbehind (?<=Realm - ) matches a word only when it is immediately preceded by Realm - , without including that prefix in the result. Here the matching words are Afrotropical and Neotropical.
If you want every match rather than only the last one, use stri_extract_all:
stringi::stri_extract_all(st, regex='(?<=Realm - )(\\w+)')
Input:
st <- c("...Attributes \r\n Realm - Afrotropical \r\n IUCN Ecosystem -- Terrestrial biome...",
"...Attributes \r\n Realm - Afrotropical \r\n Realm - Neotropical \r\n . IUCN Ecosystem -- Terrestrial biome..."
)
Output:
> stringi::stri_extract_last(st, regex='(?<=Realm - )(\\w+)')
[1] "Afrotropical" "Neotropical"
> stringi::stri_extract_all(st, regex='(?<=Realm - )(\\w+)')
[[1]]
[1] "Afrotropical"
[[2]]
[1] "Afrotropical" "Neotropical"
Upvotes: 0