Muhammad Waleed

Reputation: 92

Regular expression for extracting multiple patterns

I have a string like this:

string="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""

tokens=re.findall('(.*)\r\n(.*?:)(.*?])',string)

Output

 ('Claim Status', '[Primary Status:', ' Paidup to Rebilled]')
 ('General Info.', '[PA Number:', ' R180126187]')
 ('Claim Insurance: Modified', '[Ins. Mode:', ' Primary]')

Wanted output:

 ('Claim Status', 'Primary Status:Paidup to Rebilled')
 ('General Info.', 'PA Number:R180126187')
 ('Claim Insurance: Modified', 'Ins. Mode:Primary','ICN: ########', 'Id: #########')

Upvotes: 0

Views: 207

Answers (2)

Wiktor Stribiżew

Reputation: 626804

You may achieve what you need with a solution like this:

import re
s="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
res = []
for m in re.finditer(r'^(.+)(?:\r?\n\s*\[(.+)])?\r?$', s, re.M):
    t = []
    t.append(m.group(1).strip())
    if m.group(2):
        t.extend([x.strip() for x in m.group(2).strip().split('], [') if ':' in x])
    res.append(tuple(t))
print(res)

Output:

[('Claim Status', 'Primary Status: Paidup to Rebilled'), ('General Info.', 'PA Number: #######'), ('Claim Insurance: Modified', 'Ins. Mode: Primary', 'ICN: #######', 'Id: ########')]

The ^(.+)(?:\r?\n\s*\[(.+)])?\r?$ regex matches two consecutive lines, the second of which is optional (because of the (?:...)? optional non-capturing group). The first line is captured into Group 1, and the following line (which starts with [ and ends with ]) is captured into Group 2. Note that \r?$ is necessary because, in multiline mode, $ only matches before a newline, not before a carriage return.

The Group 1 value is added to a temporary list. The contents of Group 2 are then split on ], [ and only the items that contain a : are added to the temporary list. If you are not sure about the amount of whitespace between the bracketed items, you may split with re.split(r']\s*,\s*\[', m.group(2)) instead.
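As a rough sketch of the whitespace-tolerant variant mentioned above, the same loop can be wrapped in a function that uses re.split instead of str.split. The function name parse_claims and the sample values (123, 456) are made up for illustration and are not from the question.

```python
import re

def parse_claims(text):
    """Return one tuple per heading line: (heading, 'key: value', ...)."""
    res = []
    for m in re.finditer(r'^(.+)(?:\r?\n\s*\[(.+)])?\r?$', text, re.M):
        # Group 1 may carry a trailing \r, so strip it.
        items = [m.group(1).strip()]
        if m.group(2):
            # Split on "]", optional whitespace, ",", optional whitespace, "[".
            parts = re.split(r']\s*,\s*\[', m.group(2))
            # Keep only the "key: value" items.
            items.extend(p.strip() for p in parts if ':' in p)
        res.append(tuple(items))
    return res

# Hypothetical input with irregular spacing between the bracketed items.
sample = ("Claim Insurance: Modified\r\n"
          "[Ins. Mode: Primary],[Corrected Claim Checked] ,  [ICN: 123], [Id: 456]")
print(parse_claims(sample))
```

With str.split('], [') the irregular separators in this sample would not be recognized, which is the only reason to prefer re.split here.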

Upvotes: 2

alexis

Reputation: 50190

You are getting three elements per result because your pattern contains three capturing groups. Rewrite your regexp like this to combine the second and third groups:

re.findall('(.*)\r\n((?:.*?:)(?:.*?]))',string)

A group delimited by (?:...) (instead of (...)) is "non-capturing", i.e. it doesn't count as a match target for \1 etc., and it does not get "seen" by re.findall. I have made your second and third groups non-capturing, and added a single capturing (regular) group around them.
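To make the difference concrete, here is a small comparison of the two patterns on the first two lines of the sample string. The variable names are made up for illustration. The last step, stripping the brackets with str.strip, is an extra assumption about the desired output and not part of the pattern above. Note that this approach still returns only the first bracketed item on a line.

```python
import re

text = "Claim Status\r\n[Primary Status: Paidup to Rebilled]"

# Three capturing groups: findall returns 3-tuples.
three_groups = re.findall('(.*)\r\n(.*?:)(.*?])', text)

# Two capturing groups: the inner (?:...) groups are not reported separately.
two_groups = re.findall('(.*)\r\n((?:.*?:)(?:.*?]))', text)

# Optional cleanup (assumption): drop the surrounding brackets.
cleaned = [(head, item.strip('[]')) for head, item in two_groups]

print(three_groups)
print(two_groups)
print(cleaned)
```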

Upvotes: 0
