Reputation: 356
I have a string that contains the information for several accounts. (The example shows a single line; my real input is a text file with many such lines, so my code has an outer loop over the lines of the file.) I need to split each account onto its own line. The code below works, but I assume there is a more efficient or cleaner way to do it. I am just starting to learn regex.
import re
import pandas as pd
allAccounts = []
example = '02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42 02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73'
rex = '[0-9]{1,2}-[0-9]{1,7}-[0-9]{1,2}'
accounts = re.findall(rex, example)
for account in accounts:
    example = example.replace(account, f'||{account}')
example = [account.replace(' ', '|').split('|') for account in example.split('||')][1:]
allAccounts += example
df = pd.DataFrame(allAccounts)
df
From the regex portion of the code, I want it to return:
['02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42', ' 02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73']
# or
'||02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42 ||02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73'
The code returns the following df which is what ultimately I want:
0 1 2 3 4 5 6 7 8
0 02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42
1 02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73 None
But I feel there is a better way to use regex than what I am doing. From the docs it seems that re.sub should do it, but it only replaces the first account number it finds, and it replaces the account number itself instead of adding the '||' separator in front of it.
Update:
The following gets close to what I want, but I am not sure why the first item in the list is ''.
example = '02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42 02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73'
rex = re.compile('(?=[0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9])')
re.split(rex, example)
outputs:
['',
'02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42 ',
'02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73']
Upvotes: 0
Views: 103
Reputation: 163197
Instead of using split, you can match the values:
\b\d\d-\d{7}-\d\d\b.*?(?=\s*\b\d\d-\d{7}-\d\d\b.*?|$)
Explanation
\b\d\d-\d{7}-\d\d\b
Match the pattern 2 digits - 7 digits - 2 digits, using a quantifier for the middle group
.*?
Match any character, as few times as possible
(?=\s*\b\d\d-\d{7}-\d\d\b.*?|$)
Positive lookahead that asserts either the digits pattern to the right, or the end of the string so that the last occurrence is also matched
Example
import re
pattern = r"\b\d\d-\d{7}-\d\d\b.*?(?=\s*\b\d\d-\d{7}-\d\d\b.*?|$)"
s = "02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42 02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73"
print(re.findall(pattern, s))
Output
['02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42', '02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73']
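Since the question mentions a text file with many such lines, the matching approach could be wrapped in a small helper, roughly as sketched below. The names `parse_accounts`, `ACCOUNT` and `RECORD` are made up for illustration, `io.StringIO` stands in for the real file, and the trailing `.*?` inside the lookahead is left out because it does not change what is matched.

```python
import io
import re

# Account number: 2 digits - 7 digits - 2 digits (same pattern as above).
ACCOUNT = r"\b\d\d-\d{7}-\d\d\b"
# One record: an account number, then as little as possible up to the
# next account number or the end of the line.
RECORD = re.compile(ACCOUNT + r".*?(?=\s*" + ACCOUNT + r"|$)")

def parse_accounts(lines):
    """Return one list of whitespace-separated fields per account (hypothetical helper)."""
    rows = []
    for line in lines:
        for record in RECORD.findall(line.rstrip("\n")):
            rows.append(record.split())
    return rows

# Stand-in for open("accounts.txt"); the file name is an assumption.
sample = io.StringIO(
    "02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42 "
    "02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73\n"
)
rows = parse_accounts(sample)
for row in rows:
    print(row)
```

The resulting list of lists can be passed to pd.DataFrame(rows) as in the question. Note that splitting on whitespace also breaks the street address into separate fields, just as the original code does.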
If you must use split:
import re
pattern = r"\b(?=\d\d-\d{7}-\d\d\b)"
s = "02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42 02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73"
result = [m.strip() for m in re.split(pattern, s) if m]
print(result)
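As for why re.split returns '' as the first item: the lookahead is a zero-width match, and it also succeeds at index 0, where the first account number starts. The split therefore happens before any text, leaving an empty string in front. The short sketch below (with a shortened sample string, and assuming Python 3.7 or later, where re.split splits on empty matches) shows the match positions and why the `if m` filter is needed.

```python
import re

# Shortened sample string, for illustration only.
s = "02-0015800-00 NAME1 0.42 02-0023901-01 NAME2 -5.73"
boundary = re.compile(r"\b(?=\d\d-\d{7}-\d\d\b)")

# Positions where the zero-width pattern matches; the first one is index 0.
starts = [m.start() for m in boundary.finditer(s)]
print(starts)

# Splitting at index 0 leaves an empty string before the first record.
parts = boundary.split(s)
print(parts)

# Dropping empty items and stripping the trailing space gives clean records.
cleaned = [p.strip() for p in parts if p]
print(cleaned)
```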
Upvotes: 1