Coding4Fun

Reputation: 91

Regex: How to ignore dots in connected words

To analyze a log file, I need to extract exception types using Python and a regex.
The exception types always contain the substring "Exception".
The problem is that the substring "Exception" is not always at the end of the name.
Moreover, the exception types contain an unknown number of dots.

Expected behaviour:

Input
"08-01-2021: There is a System.InvalidCalculationException - System reboots"
"09-01-2021: SuperSystem recognised a System.IO.WritingException ask user what to do next"
"10-01-2021: Oh no, not again an InternalException.NullReference.NonCritical.User we should fix it!"

Output
"System.InvalidCalculationException"
"System.IO.WritingException"
"InternalException.NullReference.NonCritical.User"

What does the regex need to look like?
I have tried "\w+[.]\w+[.]*Exception" for the exception types that end with "Exception".
But what if an exception type contains even more dots and "Exception" is not at the end?

Upvotes: 3

Views: 1021

Answers (3)

Mahdi Akhi

Reputation: 11

Based on what you wrote, it can be said that every exception is a string of letters and dots.

I think this can solve your problem: "([A-Z][a-z]*.).([^\s]+)"

Upvotes: 0

Jim Simson

Reputation: 2862

How about:

[^\s]*Exception[^\s]*

The pattern above matches any run of non-whitespace characters that contains the word "Exception", including whatever comes directly before or after it.

[^\s]* matches any character that is not (^) whitespace (\s), zero to unlimited times (*).
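
As a rough sketch of how this pattern behaves, here it is run against the three sample lines from the question. The last line ("Caught a System.IOException, retrying") is a made-up input, not from the question, added to show one limitation: because [^\s]* accepts any non-whitespace character, punctuation attached to the name is captured too.

```python
import re

# Pattern from this answer: any run of non-whitespace that contains "Exception".
PATTERN = re.compile(r"[^\s]*Exception[^\s]*")

# Sample lines taken from the question.
lines = [
    "08-01-2021: There is a System.InvalidCalculationException - System reboots",
    "09-01-2021: SuperSystem recognised a System.IO.WritingException ask user what to do next",
    "10-01-2021: Oh no, not again an InternalException.NullReference.NonCritical.User we should fix it!",
]

matches = [PATTERN.findall(line) for line in lines]
for found in matches:
    print(found)
# ['System.InvalidCalculationException']
# ['System.IO.WritingException']
# ['InternalException.NullReference.NonCritical.User']

# Hypothetical input (not from the question): the trailing comma is
# not whitespace, so it ends up inside the match.
punctuated = PATTERN.findall("Caught a System.IOException, retrying")
print(punctuated)
# ['System.IOException,']
```

If the log can contain punctuation directly after an exception name, the result may need to be stripped afterwards, or a stricter pattern used.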

Upvotes: 1

Wiktor Stribiżew

Reputation: 626845

You can use

\b(?:[A-Za-z]+\.)*[A-Za-z]*Exception(?:\.[A-Za-z]+)*\b
\b(?:\w+\.)*\w*Exception(?:\.\w+)*\b

Details (for the first pattern):

  • \b - a word boundary
  • (?:[A-Za-z]+\.)* - zero or more occurrences of one or more letters followed with a dot
  • [A-Za-z]* - zero or more letters
  • Exception - a string Exception
  • (?:\.[A-Za-z]+)* - zero or more repetitions of a dot and then one or more letters
  • \b - a word boundary.

In the second pattern, \w matches any letter, digit, or underscore.

Python usage:

re.findall(r'\b(?:\w+\.)*\w*Exception(?:\.\w+)*\b', text)
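
For reference, a minimal self-contained sketch of that call, applied to the three sample lines from the question joined into one string. The names extract_exception_types and log_text are illustrative only and are not part of the question or of the re module.

```python
import re

# The \w-based pattern from this answer, compiled once for reuse.
EXCEPTION_RE = re.compile(r"\b(?:\w+\.)*\w*Exception(?:\.\w+)*\b")


def extract_exception_types(text):
    """Return every dotted name in text that contains 'Exception'."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return EXCEPTION_RE.findall(text)


# Sample lines taken from the question, joined as they would appear in a log.
log_text = "\n".join([
    "08-01-2021: There is a System.InvalidCalculationException - System reboots",
    "09-01-2021: SuperSystem recognised a System.IO.WritingException ask user what to do next",
    "10-01-2021: Oh no, not again an InternalException.NullReference.NonCritical.User we should fix it!",
])

for name in extract_exception_types(log_text):
    print(name)
# System.InvalidCalculationException
# System.IO.WritingException
# InternalException.NullReference.NonCritical.User
```

Because the pattern only allows word characters and dots, it stops at the first space or punctuation mark after the name.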

Upvotes: 1