Using Regex to select multiple sentence patterns - issue with grouping?

Question

I'm having trouble with a Regex statement that I want to use in R to extract full matches of a pattern from a data frame.

I have 11 sentence patterns and I want to select only the records that fully match one of these patterns from my data frame, using a single Regex (I've been able to get this to work with multiple Regexes, but it's a real hassle). Any help on how to simplify this would be much appreciated.

These are my sentences:

A change to headings 0101 through 0106 from any other chapter.
A change to subheadings 0712.20 through 0712.39 from any other chapter.
A change to heading 0903 from any other chapter.
A change to subheading 1806.20 from any other heading.
A change to subheading 1207.99 from any other chapter.
A change to heading 4302 from any other heading.
A change to subheading 4105.10 from heading 4102 or any other chapter.
A change to subheading 4105.30 from heading 4102, subheading 4105.10 or any other chapter.
A change to subheading 4106.21 from subheading 4103.10 or any other chapter.
A change to subheading 4106.22 from subheadings 4103.10 or 4106.21 or any other chapter.
A change to tariff item 7304.41.30 from subheading 7304.49 or any other chapter.

This is the Regex I have now. It returns partial matches as well as full matches (this is where I'm stuck), so I end up getting records I don't want from my data frame in addition to these sentences (I know this is messy, it's just an example).

^A change to (?:headings|heading|subheadings|subheading|tariff item) (?:\d+\S\d+\S\d+|\d+\S\d+) (?:through \d+\S\d+ from any other chapter.|from any other chapter.|from any other heading.|)|from heading \d+\S\d+ or any other chapter.|from (?:heading|subheading|subheadings) \d+\S\d+|, subheading \d+\S\d+ or any other chapter| or any other chapter.| or \d+\S\d+

This is as far as I can get with a Regex that still matches the start of all 11 sentences. I'm having a problem continuing to group cleanly after this:

^A change to (?:tariff item|headings|heading|subheading|subheadings) (?:\d+\S\d+|\d+\S\d+\S\d+|\d+\S\d+) (?:from|through)

Wiktor Stribiżew · Accepted Answer

You may use

rx <- "A\\s+change\\s+to\\s+(?:(?:sub)?headings?|tariff\\s+item)\\s+\\d[0-9.]*(?:\\s+through\\s+\\d[0-9.]*)?\\s+from(?:(?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+(?:\\s+or\\s+\\d[0-9.]*)*\\s+or)?\\s+any\\s+other\\s+(?:heading|chapter)\\."

See the regex demo. Note that \s+ matches 1 or more whitespace chars, and will match even if the number and type of whitespace between the words is not constant.
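
To illustrate the point about \s+, here is a minimal sketch using a shortened pattern (rx_ws and x are names made up for this example, and the pattern is not the full regex from the answer). perl = TRUE is used here only so that the pattern is interpreted by PCRE.

```r
# Shortened pattern for illustration only; not the full regex from the answer.
rx_ws <- "A\\s+change\\s+to\\s+heading\\s+\\d+"

x <- c(
  "A change to heading 0903",      # single spaces
  "A  change\tto   heading 0903",  # doubled spaces and a tab
  "A changeto heading 0903"        # whitespace missing between "change" and "to"
)

# \s+ accepts any run of one or more whitespace chars, but not zero of them.
ws_ok <- grepl(rx_ws, x, perl = TRUE)
print(ws_ok)
```

The first two strings should be reported as matches and the third should not, since \s+ requires at least one whitespace char.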

Details

A\s+change\s+to\s+ - A change to substring
(?:(?:sub)?headings?|tariff\s+item) - subheading, subheadings, heading, headings, tariff item substrings
\s+\d[0-9.]* - 1+ whitespaces, 1 digit and 0 or more digits or .
(?:\s+through\s+\d[0-9.]*)? - an optional sequence of:
- \s+ - 1+ whitespaces
- through - through
- \s+ - 1+ whitespaces
- \d[0-9.]* - 1 digit and 0 or more digits or .
\s+from - 1+ whitespaces and from
(?:(?:,?\s+(?:sub)?headings?\s+\d[0-9.]*)+(?:\s+or\s+\d[0-9.]*)*\s+or)? - an optional sequence of:
- (?:,?\s+(?:sub)?headings?\s+\d[0-9.]*)+ - 1 or more sequences of:
  - ,? - an optional ,
  - \s+ - 1+ whitespaces
  - (?:sub)?headings? - an optional sub, then heading and then an optional s
  - \s+ - 1+ whitespaces
  - \d[0-9.]* - a digit and then 0+ digits or . chars
- (?:\s+or\s+\d[0-9.]*)* - 0 or more sequences of:
  - \s+ - 1+ whitespaces
  - or\s+\d[0-9.]* - or, 1+ whitespaces, a digit and then 0+ digits or . chars
- \s+or - 1+ whitespaces and or
\s+any\s+other\s+(?:heading|chapter)\. - any other heading. or any other chapter.
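
If the single long string is hard to maintain, the same pattern can be assembled from the pieces described above. This is only a sketch: num, unit, src and rx_parts are names introduced for this example, and perl = TRUE is my own choice for the check at the end.

```r
# Named building blocks; the names are illustrative, not part of the answer.
num  <- "\\d[0-9.]*"                            # e.g. 0101, 4105.10, 7304.41.30
unit <- "(?:(?:sub)?headings?|tariff\\s+item)"  # what is being changed
src  <- "(?:sub)?headings?"                     # units allowed after "from"

rx_parts <- paste0(
  "A\\s+change\\s+to\\s+", unit, "\\s+", num,
  "(?:\\s+through\\s+", num, ")?",                # optional "through <number>"
  "\\s+from",
  "(?:(?:,?\\s+", src, "\\s+", num, ")+",         # 1+ "[,] (sub)heading(s) <number>"
  "(?:\\s+or\\s+", num, ")*\\s+or)?",             # 0+ "or <number>", then "or"
  "\\s+any\\s+other\\s+(?:heading|chapter)\\."
)

# Quick check against one of the longer sentences from the question.
s <- "A change to subheading 4106.22 from subheadings 4103.10 or 4106.21 or any other chapter."
parts_ok <- grepl(rx_parts, s, perl = TRUE)
print(parts_ok)
```

The assembled string is character-for-character the same regex as the one-liner, so either form can be passed to gregexpr or grepl.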

All 11 matches are returned in this online R demo:

text <- "A change to headings 0101 through 0106 from any other chapter.
A change to subheadings 0712.20 through 0712.39 from any other chapter.
A change to heading 0903 from any other chapter.
A change to subheading 1806.20 from any other heading.
A change to subheading 1207.99 from any other chapter.
A change to heading 4302 from any other heading.
A change to subheading 4105.10 from heading 4102 or any other chapter.
A change to subheading 4105.30 from heading 4102, subheading 4105.10 or any other chapter.
A change to subheading 4106.21 from subheading 4103.10 or any other chapter.
A change to subheading 4106.22 from subheadings 4103.10 or 4106.21 or any other chapter.
A change to tariff item 7304.41.30 from subheading 7304.49 or any other chapter."
rx <- "A\\s+change\\s+to\\s+(?:(?:sub)?headings?|tariff\\s+item)\\s+\\d[0-9.]*(?:\\s+through\\s+\\d[0-9.]*)?\\s+from(?:(?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+(?:\\s+or\\s+\\d[0-9.]*)*\\s+or)?\\s+any\\s+other\\s+(?:heading|chapter)\\."
regmatches(text, gregexpr(rx, text))
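
Since the question is about keeping only the data frame records that match in full, one possible approach is to anchor the pattern with ^ and $ and subset with grepl. This is a sketch under assumptions: the data frame df and its column rule are hypothetical, as the question does not show the real column names, and perl = TRUE is my own choice.

```r
# Hypothetical data frame; `df` and `rule` are illustrative names.
df <- data.frame(
  rule = c(
    "A change to heading 0903 from any other chapter.",
    "A change to subheading 4105.30 from heading 4102, subheading 4105.10 or any other chapter.",
    "A change to heading 0903 from any other chapter. See note 2.",
    "No change in tariff classification is required."
  ),
  stringsAsFactors = FALSE
)

rx <- "A\\s+change\\s+to\\s+(?:(?:sub)?headings?|tariff\\s+item)\\s+\\d[0-9.]*(?:\\s+through\\s+\\d[0-9.]*)?\\s+from(?:(?:,?\\s+(?:sub)?headings?\\s+\\d[0-9.]*)+(?:\\s+or\\s+\\d[0-9.]*)*\\s+or)?\\s+any\\s+other\\s+(?:heading|chapter)\\."

# ^ and $ force the pattern to span the whole string,
# so rows that only contain a matching sentence are dropped.
full_rx <- paste0("^", rx, "$")
keep <- grepl(full_rx, df$rule, perl = TRUE)

matched <- df[keep, , drop = FALSE]
print(matched)
```

Without the anchors, the third row would also be kept, because it contains a matching sentence followed by extra text.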
