Raj De Inno

Reputation: 1224

How to split a sentence into multiple sentences based on multiple regex conditions?

I have the sentences below. I need to split each one into multiple sentences wherever it contains a dot or a matched word.

Sentence 1: There was an error while trying to serialize parameter http://uri.org/:Message. The InnerException message with data contract name 'enumStatus:' is not expected.

Expected result:

sentences = 1. There was an error while trying to serialize parameter http://uri.org/:Message.
            2. The InnerException message with data contract name 'enumStatus:' is not expected.
                        

Sentence 2: ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1

Expected result:

sentences = 1. ORA-01756: quoted string not properly terminated
            2. ORA-06512: at module1, line 48
            3. ORA-06512: at line 1
                        

I am using the regex below to split the sentences.

 sentences = re.split(r'(?<=\w\.)\s|ORA-[0-9]{1,8}', input)
 

The first case works fine: the text is split after any word followed by a dot. The second case is also split, but I have 2 issues.

  1. It removes the entire matched word (the 'ORA-' code), but I need to keep it.
  2. I am getting 4 sentences instead of 3:
    1. (the first is empty because the string starts with ORA-)
    2. quoted string not properly terminated
    3. at module1, line 48
    4. at line 1

I need 3 sentences in this case.
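
For reference, here is a minimal reproduction of the second case. The variable name text is only for illustration; the pattern is the one shown above. As far as I understand, re.split discards whatever the pattern matches, and because the string begins with a match, the first element comes back empty:

```python
import re

# Same pattern as in the question; "text" is an illustrative variable name.
pattern = r'(?<=\w\.)\s|ORA-[0-9]{1,8}'
text = ("ORA-01756: quoted string not properly terminated "
        "ORA-06512: at module1, line 48 ORA-06512: at line 1")

parts = re.split(pattern, text)

# The three ORA codes are consumed by the split, so none of them survives,
# and the match at position 0 produces an empty leading element.
print(len(parts))
print(repr(parts[0]))
print(any("ORA-" in p for p in parts))
```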

Any help would be really appreciated.

Upvotes: 1

Views: 126

Answers (2)

vks

Reputation: 67968

(?<=\w\.)\s|(ORA-[0-9]{1,8})

You can try this regex and replace each match with \n\1, so that every piece starts on its own line and the ORA code is kept.

See demo.

https://regex101.com/r/8yvUuZ/1/

import re

regex = r"(?<=\w\.)\s|(ORA-[0-9]{1,8})"

test_str = ("ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1\n"
    "There was an error while trying to serialize parameter http://uri.org/:Message. The InnerException message with data contract name 'enumStatus:' is not expected.")

subst = "\\n\\1"

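# count=0 replaces every match; set count to limit the number of replacements.
# count and flags are passed by keyword (passing them positionally is deprecated since Python 3.13).
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)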

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
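
If a list is needed rather than one newline-separated string, one possible follow-up (a sketch, not part of the demo above) is to split the substituted text on newlines and drop the blanks. The helper name split_via_sub is made up for illustration, and this assumes Python 3.5+, where a reference to an unmatched group in the replacement expands to an empty string.

```python
import re

REGEX = r"(?<=\w\.)\s|(ORA-[0-9]{1,8})"


def split_via_sub(text):
    """Hypothetical helper: substitute as in the answer, then build a list."""
    # Put a newline in front of every ORA code, and replace the whitespace
    # after "word." with a newline (group 1 is empty in that branch).
    marked = re.sub(REGEX, r"\n\1", text)
    # A leading ORA code yields an empty first line, and pieces can keep a
    # trailing space, so strip each line and skip the empty ones.
    return [line.strip() for line in marked.splitlines() if line.strip()]


ora = ("ORA-01756: quoted string not properly terminated "
       "ORA-06512: at module1, line 48 ORA-06512: at line 1")
for sentence in split_via_sub(ora):
    print(sentence)
```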

Upvotes: 0

anubhava

Reputation: 785256

You may use this regex for splitting:

\s+(?=ORA-\d+)|(?<=\.)\s+(?=[A-Z])

RegEx Details:

  • \s+(?=ORA-\d+): Match 1+ whitespace if that is followed by ORA- and 1+ digits
  • |: OR
  • (?<=\.)\s+(?=[A-Z]): Match 1+ whitespace if that is preceded by a dot and followed by an uppercase letter

Code:

import re
arr = ["There was an error while trying to serialize parameter http://uri.org/:Message. The InnerException message with data contract name 'enumStatus:' is not expected.", "ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1"]

rx = re.compile(r'\s+(?=\bORA-\d+)|(?<=\.)\s+(?=[A-Z])')
for text in arr:
    print(rx.split(text))

Output:

['There was an error while trying to serialize parameter http://uri.org/:Message.', "The InnerException message with data contract name 'enumStatus:' is not expected."]
['ORA-01756: quoted string not properly terminated', 'ORA-06512: at module1, line 48', 'ORA-06512: at line 1']
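
As an optional variation, the same pattern can be written with re.VERBOSE so that the breakdown from the details section sits next to the code. This is only a sketch: the names SPLIT_RX and split_messages are hypothetical, and the text.strip() call is an extra assumption to guard against stray leading or trailing whitespace.

```python
import re

# Verbose form of the splitting pattern; whitespace and "#" comments inside
# the pattern are ignored by the regex engine.
SPLIT_RX = re.compile(
    r"""
    \s+(?=ORA-\d+)          # 1+ whitespace followed by ORA- and 1+ digits
    |                       # OR
    (?<=\.)\s+(?=[A-Z])     # 1+ whitespace after a dot and before an uppercase letter
    """,
    re.VERBOSE,
)


def split_messages(text):
    """Hypothetical helper: split one message string into sentences."""
    # Only the whitespace is consumed; the ORA codes and the dots sit inside
    # lookarounds, so they stay attached to their sentences.
    return SPLIT_RX.split(text.strip())


ora_text = ("ORA-01756: quoted string not properly terminated "
            "ORA-06512: at module1, line 48 ORA-06512: at line 1")
print(split_messages(ora_text))
```

Because the split points are whitespace only, a string that starts with ORA- no longer produces an empty first element.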

Upvotes: 1
