dblo

Reputation: 23

Extract a string of words between multiple specific words in R

I have a long string that contains one or more occurrences of a keyword ("Realm" in this case). I have been using gsub, but it only returns the word after the last occurrence of the keyword.

The string could look like:

...Attributes \r\n Realm - Afrotropical \r\n IUCN Ecosystem -- Terrestrial biome...

or

...Attributes \r\n Realm - Afrotropical \r\n Realm - Neotropical \r\n . IUCN Ecosystem -- Terrestrial biome...

I have been using the function:

Realm_fun<-function(x){gsub('^.*Realm -\\s*|\\s*IUCN Ecosystem.*$', '', x)}

Then I use lapply to run it on all of the strings.

What can I do to get Afrotropical for the first string and Afrotropical, Neotropical for the second?

Upvotes: 2

Views: 74

Answers (3)

Onyambu

Reputation: 79228

I do not know exactly what format you need the result in, but one idea is:

  regmatches(st, gregexpr("Realm - \\K(\\w+)",st,perl = TRUE))
[[1]]
[1] "Afrotropical"

[[2]]
[1] "Afrotropical" "Neotropical" 
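If one comma-separated string per input is wanted (as the question suggests), the list returned above can be collapsed afterwards. This is only a minimal sketch: the sample vector st is copied from the question, and the name realm_str is made up for illustration.

```r
# Sample input taken from the question (exact whitespace is an assumption)
st <- c(
  "...Attributes \r\n Realm - Afrotropical \r\n IUCN Ecosystem -- Terrestrial biome...",
  "...Attributes \r\n Realm - Afrotropical \r\n Realm - Neotropical \r\n . IUCN Ecosystem -- Terrestrial biome..."
)

# One character vector of matches per input string
realms <- regmatches(st, gregexpr("Realm - \\K\\w+", st, perl = TRUE))

# Collapse each vector of matches into a single string
realm_str <- vapply(realms, paste, character(1), collapse = ", ")
realm_str
```

vapply is used instead of sapply so that the result is guaranteed to be a character vector with one element per input.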

If you do not want to split the words:

trimws(gsub("(?m)^.*Realm - (\\w+)|((?!Realm).)*$","\\1 ",st,perl=TRUE))

[1] "Afrotropical"                "Afrotropical  \nNeotropical"

gsub("(?m)^(?:(?!Real).)*$|[\r\n]|.*Realm - ","",st, perl = TRUE)

[1] "Afrotropical  "              "Afrotropical  Neotropical  "

Upvotes: 1

Hsiang Yun Chan

Reputation: 151

Use the base functions gregexpr and regmatches (magrittr is only needed for the pipe).

library(magrittr)

test<-c("...Attributes  \r\n      Realm - Afrotropical  \r\n  IUCN Ecosystem -- Terrestrial biome...","...Attributes  \r\n      Realm - Afrotropical  \r\n   Realm - Neotropical  \r\n .  IUCN Ecosystem -- Terrestrial biome...")

test %>% gregexpr("(?<=(Realm - ))[a-zA-Z]+", ., perl = TRUE) %>% regmatches(x = test)

#[[1]]
#[1] "Afrotropical"

#[[2]]
#[1] "Afrotropical" "Neotropical"
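The same lookbehind can also be wrapped in a small function as a drop-in replacement for the Realm_fun in the question, without the pipe. This is a rough sketch, not a definitive implementation: the name extract_realms, the sep argument, and the choice to return NA when a string has no "Realm - " entry are all assumptions made for illustration.

```r
# Hypothetical helper: collects every word that follows "Realm - " and
# returns one string per input, or NA when the keyword does not occur.
extract_realms <- function(x, sep = ", ") {
  hits <- regmatches(x, gregexpr("(?<=Realm - )[a-zA-Z]+", x, perl = TRUE))
  vapply(
    hits,
    function(h) if (length(h) == 0) NA_character_ else paste(h, collapse = sep),
    character(1)
  )
}

# Example inputs modelled on the question; the third has no keyword
test <- c(
  "...Attributes  \r\n      Realm - Afrotropical  \r\n  IUCN Ecosystem -- Terrestrial biome...",
  "...Attributes  \r\n      Realm - Afrotropical  \r\n   Realm - Neotropical  \r\n .  IUCN Ecosystem -- Terrestrial biome...",
  "...Attributes  \r\n  IUCN Ecosystem -- Terrestrial biome..."
)

result <- extract_realms(test)
result
```

Because the function is vectorised over x, it can be called on the whole vector directly and does not need lapply.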

Upvotes: 0

PKumar

Reputation: 11128

You can try this:

stringi::stri_extract_last(st, regex='(?<=Realm - )(\\w+)')

stri_extract_last returns the last match of the regex in each string. The positive lookbehind (?<=Realm - ) matches a word only when it is immediately preceded by "Realm - ", which is the case for both Afrotropical and Neotropical here.

If you want to extract all matches instead of only the last one, use stri_extract_all:

stringi::stri_extract_all(st, regex='(?<=Realm - )(\\w+)')

Input:

st <- c("...Attributes  \r\n      Realm - Afrotropical  \r\n  IUCN Ecosystem -- Terrestrial biome...", 
"...Attributes  \r\n      Realm - Afrotropical  \r\n   Realm - Neotropical  \r\n .  IUCN Ecosystem -- Terrestrial biome..."
)

Output:

> stringi::stri_extract_last(st, regex='(?<=Realm - )(\\w+)')
[1] "Afrotropical" "Neotropical" 


> stringi::stri_extract_all(st, regex='(?<=Realm - )(\\w+)')
[[1]]
[1] "Afrotropical"

[[2]]
[1] "Afrotropical" "Neotropical" 

Upvotes: 0
