user9510048

Regular expression skips values if no match

I am using the following regular expression to extract data from a file. It works fine as long as the data I am extracting contains all three elements of the regex. If even one of them is missing, the regex skips the whole record. How do I change this behavior so that a record with a missing value is not skipped, and the missing value is filled with 0 or null instead?

bC_NUMBER = 1
bS_ID = 1
bTRANSACTION_AMOUNT = 1
rC_NUMBER = r"number:\s(\d+\*+\d+).*?"
rS_ID = r"ID:\s*(\d*).*?"
rT_ID = r"ATM:\s(\w+).*?"
rT_AMOUNT = r"Total cash dispensed:\s*([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+).*?"

regex = rC_NUMBER * bC_NUMBER + rS_ID * bS_ID + rT_AMOUNT * bTRANSACTION_AMOUNT

Example output:

[('99280*********8823', '182', '40000', 'MGA'), ('99280*********8823', '182', '40000', 'MGA')]

Desired output:

[('99280*********8823', '182', '40000', 'MGA'), ('6700*********8823', '177', 'null or 0', 'null or 0'), ('99280*********8823', '182', '40000', 'MGA')]

Upvotes: 0

Views: 117

Answers (1)

Wiktor Stribiżew

Reputation: 627327

You can use a regex like

(?s)Card number:\s(\d+\*+\d+)(?:(?!Card number:).)*?ID:\s*(\d*)(?:(?:(?!Card number:).)*?Total cash dispensed:\s*([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+))?

See the regex demo.

NOTE: 1) the .*? is turned into a (?:(?!Card number:).)*? tempered greedy token, which matches any char that does not start a new Card number: sequence, so a match cannot run over into the next record; 2) the last part is now optional, (?:(?:(?!Card number:).)*?Total cash dispensed:\s*([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+))?; and 3) I am using (?s) (in code, re.S or re.DOTALL) so that . matches any char, including line break chars.
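To make the difference concrete, here is a minimal sketch that runs the pattern from the question and the pattern above against the same input. The sample text is an assumption (one record per line, with the amount missing from the second record), since the real file layout is not shown in the question; the names sample, strict, tolerant and TEMPERED are illustrative only.

```python
import re

# Hypothetical sample: one record per line; the second record has no amount.
sample = (
    "Card number: 99280*********8823 ID: 182 Total cash dispensed: 40000 MGA\n"
    "Card number: 6700*********8823 ID: 177\n"
    "Card number: 99280*********8823 ID: 182 Total cash dispensed: 40000 MGA\n"
)

# Pattern from the question: every part is mandatory, so a record
# without "Total cash dispensed:" cannot match at all.
strict = re.compile(
    r"number:\s(\d+\*+\d+).*?"
    r"ID:\s*(\d*).*?"
    r"Total cash dispensed:\s*([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+).*?"
)

# Tempered greedy token: consume chars lazily, but never step onto
# the start of the next "Card number:" record.
TEMPERED = r"(?:(?!Card number:).)*?"

# Pattern from the answer: the amount part is wrapped in an optional group.
tolerant = re.compile(
    r"Card number:\s(\d+\*+\d+)"
    + TEMPERED + r"ID:\s*(\d*)"
    + r"(?:" + TEMPERED
    + r"Total cash dispensed:\s*([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+))?",
    re.S,
)

strict_rows = strict.findall(sample)
tolerant_rows = tolerant.findall(sample)

print(len(strict_rows), strict_rows)
print(len(tolerant_rows), tolerant_rows)
```

On this sample the strict pattern returns two tuples and drops the 6700 record, while the tolerant pattern returns three, with empty strings in the two amount positions of the incomplete record.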

See the Python demo:

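import re

test_str = "YOUR_STRING_HERE"

# Flags: 1 keeps a sub-pattern, 0 drops it (a string times 0 is "")
bC_NUMBER = 1
bS_ID = 1
bTRANSACTION_AMOUNT = 1

rC_NUMBER = r"Card number:\s(\d+\*+\d+)"
# The tempered greedy token keeps each lazy scan inside the current record
rS_ID = r"(?:(?!Card number:).)*?ID:\s*(\d*)"
rT_ID = r"(?:(?!Card number:).)*?ATM:\s(\w+)"  # defined, but not part of the combined pattern below
# Optional group: a missing amount yields empty strings instead of a skipped record
rT_AMOUNT = r"(?:(?:(?!Card number:).)*?Total cash dispensed:\s*([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+))?"

regex = rC_NUMBER * bC_NUMBER + rS_ID * bS_ID + rT_AMOUNT * bTRANSACTION_AMOUNT
print(re.findall(regex, test_str, re.S))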

Output:

[('99280*********8823', '182', '40000', 'MGA'), ('6700*********8823', '177', '', ''), ('99280*********8823', '182', '40000', 'MGA')]
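Note that re.findall returns an empty string for a capturing group that did not take part in the match, which is why the missing amount shows up as '' rather than 0 or null. If a placeholder is needed, one option is a small post-processing step. The sketch below is only an illustration: fill_missing is a hypothetical helper name, and the rows are copied from the output above.

```python
def fill_missing(rows, default=None):
    """Replace empty-string fields (unmatched optional groups) with a default."""
    return [
        tuple(value if value != "" else default for value in row)
        for row in rows
    ]


# Rows as returned by re.findall in the output above.
rows = [
    ("99280*********8823", "182", "40000", "MGA"),
    ("6700*********8823", "177", "", ""),
    ("99280*********8823", "182", "40000", "MGA"),
]

filled = fill_missing(rows, default="0")
print(filled[1])
```

With the default left as None, the gaps become None instead of "0". One caveat: the ID group is (\d*), which can also match an empty string, so an empty ID would be replaced in the same way.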

Upvotes: 1
