Reputation: 1852
Using Python regex, I'm trying to scrape some Behat scenarios. Here is my regex on regex101: https://regex101.com/r/EGdK3O/1 (the pattern is Scenario:([\s\S]*?)(And|When|Then|Given)).
The current version of my code is items = re.findall(r'Scenario:([\s\S]*?)(And|When|Then|Given|#)', contents, re.MULTILINE). This works, except when one of these keywords appears inside the scenario description itself.
What I'm having trouble figuring out is how to match (And|When|Then|Given) only when the keyword is the first word on a new line. Even better would be if it could also match when the line is indented with a tab or any number of spaces.
The ultimate goal here is to get the Scenario description but not the steps.
Upvotes: 0
Views: 216
Reputation: 163632
You could match Scenario: followed by a capturing group that first matches the rest of that line (everything up to, but not including, the newline).
Inside that group, use a repeated non-capturing group to keep matching the following lines as long as they do not start with 1+ tabs or spaces followed by one of And, When, Then or Given.
After the capturing group, match the line that does start with 1+ tabs or spaces followed by one of those keywords.
\bScenario:(.*(?:\r?\n(?![ \t]+(And|[WT]hen|Given)).*)*)\r?\n[ \t]+(?:And|[WT]hen|Given)
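As a rough sketch of how this pattern could be used from Python: the feature text below is made up for illustration, and the helper name scenario_descriptions is hypothetical. It uses re.finditer and group(1) rather than re.findall, because the pattern contains a second capturing group inside the lookahead, so findall would return tuples instead of plain strings.

```python
import re

# Pattern from the answer, unchanged. Group 1 is the scenario description;
# the group inside the negative lookahead never holds a value.
PATTERN = re.compile(
    r'\bScenario:(.*(?:\r?\n(?![ \t]+(And|[WT]hen|Given)).*)*)'
    r'\r?\n[ \t]+(?:And|[WT]hen|Given)'
)

# Made-up feature text (an assumption for this example). Note the quoted
# "Then" inside a step and the description that spans two lines.
SAMPLE = (
    "Feature: Login\n"
    "\n"
    "  Scenario: User logs in\n"
    "    with valid credentials\n"
    "    Given I am on the login page\n"
    "    When I type \"Then\" into the search box\n"
    "    Then I see the dashboard\n"
    "\n"
    "  Scenario: User logs out\n"
    "    Given I am logged in\n"
    "    When I click logout\n"
)


def scenario_descriptions(contents):
    """Return each scenario description with whitespace collapsed."""
    return [" ".join(m.group(1).split()) for m in PATTERN.finditer(contents)]


descriptions = scenario_descriptions(SAMPLE)
print(descriptions)
```

Because the pattern requires 1+ spaces or tabs before the keyword, steps that start at column 0 would not end the description; changing [ \t]+ to [ \t]* in both places is one way to allow that, if your files need it.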
\bScenario:
Match Scenario: prepended by a word boundary
(
Capture group 1
.*
Match any char except a newline 0+ times
(?:
Non capturing group
\r?\n
Match a newline
(?!
Negative lookahead, assert that what is on the right is not
[ \t]+(And|[WT]hen|Given)
Match 1+ spaces or tabs and 1 of the options
).*
Close the lookahead and match 0+ times any char except a newline
)*
Close the non capturing group and repeat 0+ times
)
Close capture group 1
\r?\n[ \t]+
Match a newline and 1+ spaces or tabs
(?:And|[WT]hen|Given)
Match any of the listed options
Upvotes: 1
Reputation: 24802
Even though you might end up with a very complex regex that parses the Behat language, this is a typical case of 'I had one problem, I used a regex, now I have two problems'.
Instead of losing your mind trying to solve this with a regex, you would be better off using a library that can read and parse the Behat language.
The reason is that regular expressions are great for simple string-parsing problems (working with the tokens of a language). Parsing a complex language is more abstract, even if extended regexes can technically do it. You need to look not only at the tokens (the words), but also at the grammar (the syntax and its meaning).
A typical issue (the one you're facing) is a word that has a different meaning depending on the context, and a grammar is there to help with exactly that. And even if you figure out the first step of extracting the scenarios, you're likely to hit a similar issue when you look inside each scenario.
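To make the context problem concrete, here is a small sketch that runs the pattern from the question against a made-up scenario whose description happens to contain the word Given. The sample text is an assumption for illustration only.

```python
import re

# The pattern from the question, unchanged.
NAIVE = re.compile(r'Scenario:([\s\S]*?)(And|When|Then|Given|#)')

# Made-up feature text: the description itself contains "Given".
text = (
    "Scenario: Discount is applied when Given a coupon\n"
    "    Given I have a coupon\n"
    "    Then the total is reduced\n"
)

match = NAIVE.search(text)
description = match.group(1).strip()
keyword = match.group(2)

# The lazy match stops at the first keyword it sees, which here is the
# "Given" inside the description rather than the first step.
print(repr(description), keyword)
```

The description is cut off in the middle of the sentence, because the regex has no notion of where in the line a keyword is allowed to act as a step.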
So that's why you need a full-blown parser… But writing a parser is not easy (the most complex part being writing the grammar). So if you're lucky, someone else has done it for you!
And you are lucky! According to the Behat documentation, the language it uses is called Gherkin. With some googling, I found at least one Python package that understands that language: cucumber/gherkin-python, which has now moved to the cucumber/cucumber repository.
The snippet to use the parser is the following:
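from gherkin.parser import Parser
from gherkin.pickles.compiler import compile
# Parse the feature text into a structured document (the Gherkin AST)
parser = Parser()
gherkin_document = parser.parse("Feature: ...")
# Compile the document into "pickles", a flattened form of the scenarios
pickles = compile(gherkin_document)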
Then you'll get a structured data output which you'll be able to navigate through easily in python.
Upvotes: 2