DataMiner_NLP

Reputation: 121

Matching an entire sentence containing words even if the sentence spans multiple lines

I am trying to match each entire sentence in a document that contains certain words, even if the sentence spans multiple lines.

My current attempt only captures the sentence if it does not continue onto the next line:

^.*\b(dog|cat|bird)\b.*\.

I am using ECMAScript (JavaScript) regular expressions.

Upvotes: 3

Views: 78

Answers (1)

Ryszard Czech

Reputation: 18611

If the input is not expected to contain abbreviations (a period inside a sentence, as in "e.g.", would end the match early), use:

/\b[^?!.]*?\b(dog|cat|bird)\b[^?!.]*[.?!]/gi

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  [^?!.]*?                 any character except: '?', '!', '.', so
                           line breaks are matched too (0 or more
                           times (matching as few as possible))
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    dog                      'dog'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    cat                      'cat'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    bird                     'bird'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  [^?!.]*                  any character except: '?', '!', '.', so
                           line breaks are matched too (0 or more
                           times (matching as many as possible))
--------------------------------------------------------------------------------
  [.?!]                    any character of: '.', '?', '!'
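
USAGE SKETCH

A minimal sketch of how the pattern might be applied in JavaScript. The sample text and the variable names (`text`, `sentences`, `flattened`) are made up for illustration and are not from the question. The pattern works across lines because a negated character class such as `[^?!.]` also matches newline characters, whereas `.` in the original attempt does not.

```javascript
// Pattern from the answer: a sentence is everything between sentence
// terminators (., ?, !) that contains one of the keywords.
const sentenceWithKeyword = /\b[^?!.]*?\b(dog|cat|bird)\b[^?!.]*[.?!]/gi;

// Hypothetical input: the second sentence wraps onto a new line.
const text =
  "It rained all day. The dog ran\nacross the yard! Is that a cat? Nothing else happened.";

// With the g flag, match() returns the full matches only (no capture groups),
// or null when nothing matches.
const sentences = text.match(sentenceWithKeyword) || [];

// Optional: collapse line breaks and repeated whitespace for display.
const flattened = sentences.map((s) => s.replace(/\s+/g, " "));

console.log(flattened);
// [ 'The dog ran across the yard!', 'Is that a cat?' ]
```

Each match starts at the first word of the sentence, because the leading `\b` skips the whitespace that follows the previous terminator.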

Upvotes: 1
