Reputation: 1224
I have the sentences below. I need to split each of them into multiple sentences wherever it contains a dot or a matched word.
Sentence 1: There was an error while trying to serialize parameter http://uri.org/:Message. The InnerException message with data contract name 'enumStatus:' is not expected.
Expected result:
sentences = 1. There was an error while trying to serialize parameter http://uri.org/:Message.
2. The InnerException message with data contract name 'enumStatus:' is not expected.
Sentence 2: ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1
Expected result:
sentences = 1. ORA-01756: quoted string not properly terminated
2. ORA-06512: at module1, line 48
3. ORA-06512: at line 1
I am using the regex below to split the sentences.
sentences = re.split(r'(?<=\w\.)\s|ORA-[0-9]{1,8}', input)
The first case works fine: the text is split at the whitespace after any word followed by a dot. In the second case the sentence does get split, but the result is not what I expect.
I need 3 sentences in this case.
Any help would be really appreciated.
Upvotes: 1
Views: 126
Reputation: 67968
(?<=\w\.)\s|(ORA-[0-9]{1,8})
You can try this and replace with \n\1.
See demo.
https://regex101.com/r/8yvUuZ/1/
import re

# Group 1 captures the ORA code so the substitution can put it back after the newline.
regex = r"(?<=\w\.)\s|(ORA-[0-9]{1,8})"
test_str = ("ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1\n"
            "There was an error while trying to serialize parameter http://uri.org/:Message. The InnerException message with data contract name 'enumStatus:' is not expected.")
# When the whitespace alternative matches, group 1 is unmatched and is replaced
# by an empty string (Python 3.5+), so only the newline is inserted.
subst = "\\n\\1"
# count=0 replaces every match; pass a positive number to limit the replacements.
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
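If a list of sentences is needed rather than one string, the substituted text can be split on the inserted newlines. This is only a minimal sketch, assuming Python 3.5+; the helper name split_sentences is made up for illustration and is not part of the answer above.

```python
import re

# Same pattern as above: whitespace after "word." OR a captured ORA code.
REGEX = re.compile(r"(?<=\w\.)\s|(ORA-[0-9]{1,8})")

def split_sentences(text):
    """Hypothetical helper: mark the split points with newlines, then split on them."""
    # Insert a newline before each ORA code and in place of the whitespace after "word."
    marked = REGEX.sub(r"\n\1", text)
    # Drop empty pieces (e.g. the one before a leading ORA code) and stray spaces.
    return [part.strip() for part in marked.split("\n") if part.strip()]

ora = "ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1"
for sentence in split_sentences(ora):
    print(sentence)
```

The strip() calls are there because the substitution leaves a trailing space before each inserted newline.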
Upvotes: 0
Reputation: 785256
You may use this regex for splitting:
\s+(?=ORA-\d+)|(?<=\.)\s+(?=[A-Z])
RegEx Details:
\s+(?=ORA-\d+): Match 1+ whitespace if that is followed by ORA- and 1+ digits
|: OR
(?<=\.)\s+(?=[A-Z]): Match 1+ whitespace if that is preceded by a dot and followed by an uppercase letter
Code:
import re
arr = ["There was an error while trying to serialize parameter http://uri.org/:Message. The InnerException message with data contract name 'enumStatus:' is not expected.", "ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1"]
rx = re.compile(r'\s+(?=\bORA-\d+)|(?<=\.)\s+(?=[A-Z])')
for i in arr: print (rx.split(i))
Output:
['There was an error while trying to serialize parameter http://uri.org/:Message.', "The InnerException message with data contract name 'enumStatus:' is not expected."]
['ORA-01756: quoted string not properly terminated', 'ORA-06512: at module1, line 48', 'ORA-06512: at line 1']
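For comparison, here is a small sketch of why the pattern from the question loses the ORA codes while the lookahead version keeps them: re.split discards whatever the pattern matches, and a lookahead matches nothing itself. The variable names consumed and kept are only illustrative.

```python
import re

text = "ORA-01756: quoted string not properly terminated ORA-06512: at module1, line 48 ORA-06512: at line 1"

# Pattern from the question: the ORA code is part of the match, so split() throws it away
# and also yields an empty first element for the code at the start of the string.
consumed = re.split(r'(?<=\w\.)\s|ORA-[0-9]{1,8}', text)

# Lookahead version: only the whitespace is matched, so the code stays in the next piece.
kept = re.split(r'\s+(?=ORA-\d+)|(?<=\.)\s+(?=[A-Z])', text)

print(consumed)
print(kept)
```

The first list has four pieces with the codes stripped out, the second has the three full sentences.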
Upvotes: 1