Reputation: 538
I am trying to extract a certain number of words after a particular string.
library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
For example, to extract the 4 words following "source", I learnt from another question to use this code:
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4}'))
This works very well. However, if I try to select 8 words instead, it does not recognize the "/" and returns NA for the first string.
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){8}'))
The question is: is there a regex to include the special characters (or bypass them), so I can still extract the needed words? I noticed that the same happens with other characters (eg - ) or with double white space.
The expected output for 8 words should be something like this:
from animal origin as Vitamin A / all-trans-Retinol
It does not matter whether it counts the / and - as words or not, as I can always increase the quantifier (in my case, I do not mind extracting more than I need).
Thank you.
Upvotes: 3
Views: 1663
Reputation: 626738
You may rely on the \S shorthand character class, which matches any non-whitespace char:
(?<=source:\s)\S+(?:\s+\S+){3,7}\b
See the regex demo. Details:
(?<=source:\s) - a location immediately preceded with source: and a whitespace
\S+ - one or more non-whitespace chars
(?:\s+\S+){3,7} - three to seven occurrences of 1+ whitespace and then 1+ non-whitespace chars
\b - a word boundary.
See the R demo online:
library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / alltrans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
stringr::str_extract(x$end, "(?<=source:\\s)\\S+(?:\\s+\\S+){3,7}\\b")
Output:
[1] "from animal origin as Vitamin A / alltrans-Retinol"
[2] "Eggs, liver, certain fish species such as sardines"
[3] "Leafy green vegetables such as spinach; egg yolks"
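If an exact token count is wanted rather than a range, the same pattern can be built from a number. This is a minimal sketch in base R, not part of the answer above: the helper name `extract_tokens` and its `key` argument are hypothetical, and `key` is assumed to contain no regex metacharacters.

```r
# Hypothetical helper: extract exactly n whitespace-separated tokens after `key`.
# Assumes `key` is followed by a single whitespace char and has no regex metacharacters.
extract_tokens <- function(x, n, key = "source:") {
  # One \S+ plus (n - 1) repeats of "\s+\S+" gives n tokens in total
  pat <- sprintf("(?<=%s\\s)\\S+(?:\\s+\\S+){%d}\\b", key, n - 1L)
  m <- regexpr(pat, x, perl = TRUE)      # perl = TRUE is needed for the lookbehind
  out <- rep(NA_character_, length(x))   # NA where the pattern does not match
  hit <- m != -1L
  out[hit] <- regmatches(x, m)
  out
}

res <- extract_tokens(
  c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general",
    "source: too short"),
  n = 8
)
res
# [1] "from animal origin as Vitamin A / all-trans-Retinol"
# [2] NA
```

Because tokens are separated by `\s+`, a double space between words does not break the match, and `/` or `-` simply count as (part of) a token.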
Upvotes: 2
Reputation: 886998
We can specify the range with {4,8}:
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4,8}'))
Or, if it needs to be a specific number, loop over those numbers:
pat <- sprintf('(?<=source:\\s)(\\w+,?\\s){%d}', c(8, 4))
Then extract the words with each pattern and coalesce the results:
library(dplyr)
do.call(coalesce, lapply(pat, function(y) trimws(stringr::str_extract(x$end, y))))
#[1] "from animal origin as"
#[2] "Eggs, liver, certain fish species such as sardines,"
#[3] "Leafy green vegetables such"
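For comparison, the `{4,8}` range pattern from the start of this answer returns a different number of words per row, because the greedy quantifier stops at the first token that is not `\w+,?\s`. This is a small base R sketch on the question's data, using `regexpr`/`regmatches` in place of `str_extract` (an assumption made only to keep the example dependency-free):

```r
end <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

pat <- "(?<=source:\\s)(\\w+,?\\s){4,8}"
m <- regexpr(pat, end, perl = TRUE)   # every row matches here, so no NA handling is needed
res <- trimws(regmatches(end, m))
res
# [1] "from animal origin as Vitamin A"
# [2] "Eggs, liver, certain fish species such as sardines,"
# [3] "Leafy green vegetables such as"

# Number of words actually captured per row
n_words <- lengths(strsplit(res, "\\s+"))
n_words
# [1] 6 8 5
```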
Upvotes: 1
Reputation: 101179
Here is a base R option using regmatches + gsub
lapply(regmatches(u <- gsub(".*?source:\\s+?","",x$end),gregexpr("\\w+",u)),`[`,1:4)
which gives
[[1]]
[1] "from" "animal" "origin" "as"
[[2]]
[1] "Eggs" "liver" "certain" "fish"
[[3]]
[1] "Leafy" "green" "vegetables" "such"
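If a single string per row is preferred over a list of word vectors, the words can be pasted back together. This is a sketch under the same idea, using `sub(..., perl = TRUE)` and `head()` as assumed substitutions for the `gsub` and `[` calls above. Note that punctuation such as commas is dropped, since only `\w+` runs are kept.

```r
end <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

# Drop everything up to and including "source:" and the whitespace after it
u <- sub("^.*?source:\\s+", "", end, perl = TRUE)

# All runs of word characters in each remaining string
words <- regmatches(u, gregexpr("\\w+", u))

# Keep the first 4 words per row; head() avoids NA padding when fewer are available
first4 <- vapply(words, function(w) paste(head(w, 4), collapse = " "), character(1))
first4
# [1] "from animal origin as"
# [2] "Eggs liver certain fish"
# [3] "Leafy green vegetables such"
```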
Upvotes: 1
Reputation: 20889
The question is: is there a regex to include the special characters (or bypass them), so I can still extract the needed words? I noticed that the same happens with other characters (eg - ) or with double white space.
Side note: This is kind of an XY problem (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem)
Your problem is not that the regex is not working. Your problem is that the regex IS working, but you expect something different. You use it to select the next 8 words after a certain string, but there are only 6 words before a non-word (/), so your pattern simply has no match.
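The count of 6 can be checked directly. This is a minimal base R sketch that lets the question's token pattern repeat as often as it can on the first string and then counts the captured words:

```r
s <- "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;"

# Greedily match as many "word + optional comma + whitespace" tokens as possible
m <- regmatches(s, regexpr("(?<=source:\\s)(\\w+,?\\s)+", s, perl = TRUE))
m
# [1] "from animal origin as Vitamin A "

# The run stops at "/", so only 6 tokens are available and {8} cannot match
n_tokens <- lengths(strsplit(trimws(m), "\\s+"))
n_tokens
# [1] 6
```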
So, before anyone can provide an "answer" to your question, you should first rework the question:
WHAT is your exact expectation?
The solution of akrun would match anything of 4-8 words, but I doubt that is what you really need.
Upvotes: 1