Reputation: 92
I have a string like this:
string="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
tokens=re.findall('(.*)\r\n(.*?:)(.*?])',string)
Output
('Claim Status', '[Primary Status:', ' Paidup to Rebilled]')
('General Info.', '[PA Number:', ' R180126187]')
('Claim Insurance: Modified', '[Ins. Mode:', ' Primary]')
Wanted output:
('Claim Status', 'Primary Status:Paidup to Rebilled')
('General Info.', 'PA Number:R180126187')
('Claim Insurance: Modified', 'Ins. Mode:Primary','ICN: ########', 'Id: #########')
Upvotes: 0
Views: 207
Reputation: 626804
You may achieve what you need with a solution like this:
import re
s="""Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\nGeneral Info.\r\n[PA Number: #######]\r\nClaim Insurance: Modified\r\n[Ins. Mode: Primary], [Corrected Claim Checked], [ICN: #######], [Id: ########]"""
res = []
for m in re.finditer(r'^(.+)(?:\r?\n\s*\[(.+)])?\r?$', s, re.M):
    t = []
    t.append(m.group(1).strip())
    if m.group(2):
        t.extend([x.strip() for x in m.group(2).strip().split('], [') if ':' in x])
    res.append(tuple(t))
print(res)
See the Python online demo. Output:
[('Claim Status', 'Primary Status: Paidup to Rebilled'), ('General Info.', 'PA Number: #######'), ('Claim Insurance: Modified', 'Ins. Mode: Primary', 'ICN: #######', 'Id: ########')]
With the ^(.+)(?:\r?\n\s*\[(.+)])?\r?$ regex, you match two consecutive lines, the second being optional (due to the (?:...)? optional non-capturing group). The first line is captured into Group 1, and the following one (which starts with [ and ends with ]) is captured into Group 2 without the outer brackets. (Note that \r?$ is necessary since, in multiline mode, $ only matches before a newline and not before a carriage return.) The Group 1 value is added to a temporary list, then the contents of Group 2 are split on ], [ (if you are not sure about the amount of whitespace, you may use re.split(r']\s*,\s*\[', m.group(2))), and only the items that contain a : are added to the temporary list.
Upvotes: 2
Reputation: 50190
You are getting three elements per result because your pattern contains three capturing groups, and re.findall returns one element per capturing group. Rewrite your regexp like this to combine the second and third groups into one:
re.findall('(.*)\r\n((?:.*?:)(?:.*?]))',string)
A group delimited by (?:...) (instead of (...)) is "non-capturing", i.e. it doesn't count as a match target for \1 etc., and it does not get "seen" by re.findall. I have made both your groups non-capturing, and added a single capturing (regular) group around them.
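To see the difference between the two patterns side by side, here is a small sketch. It uses a shortened version of the question's string, and the variable names three_groups and two_groups are only for illustration.

```python
import re

# Shortened sample (first two records of the question's string)
s = ("Claim Status\r\n[Primary Status: Paidup to Rebilled]\r\n"
     "General Info.\r\n[PA Number: #######]")

# Original pattern: three capturing groups -> 3-tuples
three_groups = re.findall('(.*)\r\n(.*?:)(.*?])', s)

# Rewritten pattern: the inner groups are non-capturing and one outer
# capturing group wraps them -> 2-tuples
two_groups = re.findall('(.*)\r\n((?:.*?:)(?:.*?]))', s)

print(three_groups)
print(two_groups)
```

Note that the second element of each 2-tuple still includes the surrounding square brackets, and only the first bracketed item after each heading is matched, so further processing would be needed to reach the exact wanted output.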
Upvotes: 0