Bahi8482

Reputation: 538

Extract certain number of words or special characters after a string in R

I am trying to extract a certain number of words after a particular string.

library(stringr)

x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))

For example, to extract the 4 words following "source", I learned from another question to use this code:

trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4}'))

This works well. However, if I try to select 8 words instead, it does not recognize the "/" and returns NA for the first string.

trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){8}'))

The question is: is there a regex that includes the special characters (or bypasses them), so I can still extract the needed words? I noticed that the same happens with other characters (e.g. "-") or with double whitespace.

The expected output for 8 words should be something like this:

from animal origin as Vitamin A / all-trans-Retinol  

It does not matter whether / and - are counted as words, since I can always increase the quantifier (in my case, I do not mind extracting more than I need).

thank you

Upvotes: 3

Views: 1663

Answers (4)

Wiktor Stribiżew

Reputation: 626738

You may rely on the \S shorthand character class, which matches any non-whitespace char:

(?<=source:\s)\S+(?:\s+\S+){3,7}\b

See the regex demo. Details:

  • (?<=source:\s) - a location immediately preceded with source: and a whitespace
  • \S+ - one or more non-whitespace chars
  • (?:\s+\S+){3,7} - three to seven occurrences of 1+ whitespace and then 1+ non-whitespace chars
  • \b - a word boundary.

See the R demo online:

library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / alltrans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
stringr::str_extract(x$end, "(?<=source:\\s)\\S+(?:\\s+\\S+){3,7}\\b")

Output:

[1] "from animal origin as Vitamin A / alltrans-Retinol"
[2] "Eggs, liver, certain fish species such as sardines"
[3] "Leafy green vegetables such as spinach; egg yolks" 
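
If an exact token count is wanted rather than a range, the same \S idea can be wrapped in a small function. The following is a minimal sketch, not part of the original answer: extract_tokens and end_text are hypothetical names, it uses base R's regexpr with perl = TRUE instead of stringr, and it runs on the strings from the question. It leaves out the trailing \b, because with an exact count a final token made only of non-word characters (such as "/") would make \b fail and the result would be NA; trailing , ; : . are stripped afterwards instead.

```r
# Sample strings from the question.
end_text <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

# Hypothetical helper: the first `n` whitespace-delimited tokens after "source: ".
extract_tokens <- function(text, n) {
  stopifnot(length(n) == 1, n >= 1)
  # One token, then exactly n - 1 further "whitespace + token" groups.
  pattern <- sprintf("(?<=source:\\s)\\S+(?:\\s+\\S+){%d}", as.integer(n - 1))
  m <- regexpr(pattern, text, perl = TRUE)
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(text))
  out[hit] <- regmatches(text, m)
  # Drop punctuation left hanging on the last token.
  sub("[,;:.]+$", "", out)
}

extract_tokens(end_text, 8)
# [1] "from animal origin as Vitamin A / all-trans-Retinol"
# [2] "Eggs, liver, certain fish species such as sardines"
# [3] "Leafy green vegetables such as spinach; egg yolks"
```

A string with fewer than n tokens after "source: " yields NA, so the count can be lowered or a range used when that matters.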

Upvotes: 2

akrun

Reputation: 886998

We can specify a range with {4,8}:

trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4,8}'))

Or, if a specific number is needed, loop over those numbers:

pat <- sprintf('(?<=source:\\s)(\\w+,?\\s){%d}', c(8, 4))

Then extract the words with each pattern and coalesce the results:

library(dplyr)
do.call(coalesce, lapply(pat, function(y) trimws(stringr::str_extract(x$end, y))))
#[1] "from animal origin as"     
#[2] "Eggs, liver, certain fish species such as sardines,"
#[3] "Leafy green vegetables such"  
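
For comparison, the {4,8} range from the start of this answer can be run on its own. This is a sketch in base R, with regexpr and perl = TRUE standing in for stringr::str_extract; end_text and range_words are names introduced here for illustration. The range stops at the first token that does not fit \w+,?\s, so it returns between 4 and 8 words rather than always 8.

```r
# Sample strings from the question.
end_text <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

pattern <- "(?<=source:\\s)(\\w+,?\\s){4,8}"
m <- regexpr(pattern, end_text, perl = TRUE)

# All three strings match here, so regmatches() returns one value per string.
range_words <- trimws(regmatches(end_text, m))
range_words
# [1] "from animal origin as Vitamin A"
# [2] "Eggs, liver, certain fish species such as sardines,"
# [3] "Leafy green vegetables such as"
```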

Upvotes: 1

ThomasIsCoding

Reputation: 101179

Here is a base R option using regmatches + gsub

u <- gsub(".*?source:\\s+?", "", x$end)
lapply(regmatches(u, gregexpr("\\w+", u)), `[`, 1:4)

which gives

[[1]]
[1] "from"   "animal" "origin" "as"

[[2]]
[1] "Eggs"    "liver"   "certain" "fish"

[[3]]
[1] "Leafy"      "green"      "vegetables" "such"
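
Because \w+ skips "/" and splits "all-trans-Retinol" at the hyphens, a variant that splits on whitespace keeps such tokens whole. A minimal sketch, assuming every string contains "source:" (otherwise sub() leaves the string unchanged); end_text, rest, tokens and first8 are names introduced here for illustration:

```r
# Sample strings from the question.
end_text <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

# Remove everything up to and including "source:" and the whitespace after it.
rest <- sub(".*?source:\\s+", "", end_text, perl = TRUE)

# Split on runs of whitespace, so double spaces do not create empty tokens.
tokens <- strsplit(rest, "\\s+")

# Keep at most the first 8 tokens of each string and glue them back together.
first8 <- vapply(tokens, function(tok) paste(head(tok, 8), collapse = " "), character(1))
first8
# [1] "from animal origin as Vitamin A / all-trans-Retinol:"
# [2] "Eggs, liver, certain fish species such as sardines,"
# [3] "Leafy green vegetables such as spinach; egg yolks;"
```

Unlike a fixed {8} quantifier, head() simply returns fewer tokens when a string is short, so no element becomes NA.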

Upvotes: 1

dognose

Reputation: 20889

From the question: "The question is: is there a regex to include the special characters (or bypass them), so I can still extract the needed words? I noticed that the same happens with other characters (eg - ) or with double white space."

Side note: this is something of an XY problem (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem)

Your problem is not that the regex is not working. The regex is working, but you expect something different. You use it to select the next 8 words after a certain string, but there are only 6 words before a non-word (/), so your pattern simply does not match.
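
This can be checked by counting how many consecutive tokens of the form \w+,?\s follow "source: " in each string. A rough sketch in base R, with end_text, matched and n_matched as names introduced here for illustration:

```r
# Sample strings from the question.
end_text <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

# Match as many "word, optional comma, whitespace" tokens as possible.
m <- regexpr("(?<=source:\\s)(\\w+,?\\s)*", end_text, perl = TRUE)
matched <- regmatches(end_text, m)

# Count the words in each matched run.
n_matched <- lengths(strsplit(trimws(matched), "\\s+"))
n_matched
# [1]  6 13  5
```

Only 6 and 5 tokens fit the pattern in the first and third strings, so {8} cannot succeed there, while {4} can.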

So, before anyone can give an "answer" to your question, you should first restate it:

WHAT is your exact expectation?

akrun's solution would match any run of 4 to 8 words, but I doubt that is what you really need.

Upvotes: 1
