Rikal

Reputation: 225

Scrape all date of death expressions from obituary text

I need a regex that matches a sentence with the following pattern:
1st part is an occurrence of one of the keywords (e.g. passed, died).
2nd part is the date in that same sentence.
3rd part is that the date must come before the sentence delimiter (dot/full stop).

Example: Worth Scattergood (Dee) Lea passed on Thursday, July 28, 2022, Worth Scattergood (Dee) Lea passed away unexpectedly at age 88 with her three daughters at her side. Dee was born on April 26, 1934, in Radnor, Pennsylvania.

Here I need the result: July 28, 2022

But it should not find any result in the following text, because the date is in a different sentence from the keyword:
Worth Scattergood (Dee) Lea passed on Thursday. Dee was born on April 26, 1934, in Radnor, Pennsylvania.

I tried the following expression, but it is wrong because it matches into the second sentence:

(passed|died)(.*?)(\w+)\d{1,2},?\s?\d{4}

Upvotes: 2

Views: 89

Answers (2)

Wiktor Stribiżew

Reputation: 626845

You can use

\b(?:passed|died)\b[^.?!]*?\b(\w+\s*\d{1,2},\s?\d{4})(?!\d)


Details

  • \b(?:passed|died)\b - a word boundary, a non-capturing group matching either passed or died (as whole words) and a word boundary
  • [^.?!]*? - zero or more chars other than ., ! and ? as few as possible
  • \b - a word boundary
  • (\w+\s*\d{1,2},\s?\d{4}) - Group 1: one or more word chars, zero or more whitespaces, one or two digits, comma, an optional whitespace, and four digits
  • (?!\d) - no digit immediately on the right is allowed.
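As a quick check, here is a minimal Python sketch that runs the pattern above against the two samples from the question. The question does not name a language, so Python's `re` module is an assumption, and the names `DEATH_DATE` and `find_death_dates` are made up for illustration.

```python
import re

# Pattern from the answer above, unchanged.
DEATH_DATE = re.compile(
    r"\b(?:passed|died)\b[^.?!]*?\b(\w+\s*\d{1,2},\s?\d{4})(?!\d)"
)


def find_death_dates(text):
    """Return every Group 1 capture (the date) found in text."""
    # With exactly one capturing group, findall returns the group contents.
    return DEATH_DATE.findall(text)


with_date = (
    "Worth Scattergood (Dee) Lea passed on Thursday, July 28, 2022, "
    "Worth Scattergood (Dee) Lea passed away unexpectedly at age 88 "
    "with her three daughters at her side. "
    "Dee was born on April 26, 1934, in Radnor, Pennsylvania."
)
without_date = (
    "Worth Scattergood (Dee) Lea passed on Thursday. "
    "Dee was born on April 26, 1934, in Radnor, Pennsylvania."
)

print(find_death_dates(with_date))     # ['July 28, 2022']
print(find_death_dates(without_date))  # []
```

The second sample yields nothing because `[^.?!]*?` cannot step over the full stop after "Thursday", so the birth date is never reached.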

Upvotes: 0

anubhava

Reputation: 785156

You can match the keywords passed or died and then allow up to 3 space-separated substrings before matching the date:

\b(?>passed|died)(?>\h+\S+){0,3}\h+\K\w+\h+\d{1,2},\h*\d{4}\b


Explanation:

  • (?>...): Atomic group
  • \b: Word boundary
  • (?>passed|died): Match passed or died
  • (?>\h+\S+){0,3}: Match 0 to 3 space-separated substrings
  • \h+: Match 1+ horizontal whitespaces
  • \K: Reset the start of the reported match, so everything matched so far is left out of the result
  • \w+: Match month name
  • \h+: Match 1+ horizontal whitespaces
  • \d{1,2}: Match the day of the month, 1 or 2 digits
  • ,\h*: Match comma followed by 0 or more horizontal whitespaces
  • \d{4}\b: Match 4 digit year followed by word boundary
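The pattern above relies on `\h` and `\K`, which are PCRE features that Python's `re` module does not support. The following is only a rough Python port under stated assumptions, not the original pattern: `[ \t]` stands in for `\h` (so only spaces and tabs count), a capturing group replaces `\K`, and plain non-capturing groups replace the atomic groups. The names `DEATH_DATE_PORT` and `find_death_dates_port` are hypothetical.

```python
import re

# Approximate port of the PCRE pattern above (assumptions: [ \t] for \h,
# a capturing group instead of \K, (?:...) instead of atomic groups).
DEATH_DATE_PORT = re.compile(
    r"\b(?:passed|died)"                   # keyword
    r"(?:[ \t]+\S+){0,3}"                  # up to 3 space-separated substrings
    r"[ \t]+"                              # whitespace before the date
    r"(\w+[ \t]+\d{1,2},[ \t]*\d{4})\b"    # Group 1: month, day, year
)


def find_death_dates_port(text):
    """Return the captured date for every keyword followed closely by a date."""
    return DEATH_DATE_PORT.findall(text)


with_date = (
    "Worth Scattergood (Dee) Lea passed on Thursday, July 28, 2022, "
    "Worth Scattergood (Dee) Lea passed away unexpectedly at age 88 "
    "with her three daughters at her side. "
    "Dee was born on April 26, 1934, in Radnor, Pennsylvania."
)
without_date = (
    "Worth Scattergood (Dee) Lea passed on Thursday. "
    "Dee was born on April 26, 1934, in Radnor, Pennsylvania."
)

print(find_death_dates_port(with_date))     # ['July 28, 2022']
print(find_death_dates_port(without_date))  # []
```

One design point to keep in mind: `\S+` can consume a full stop, so this approach limits how far the date may be from the keyword by counting substrings rather than by stopping at the end of the sentence. A date that starts within three substrings of the keyword can still be picked up from the next sentence.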

Upvotes: 4
