Reputation: 162
I have a text file from which I'm trying to pull names and birth dates using a regex. The wall I've hit is that a record can span multiple lines, and my regex doesn't grab those. The format of the data I want is always:
last name, middle name (sometimes), first name, f. DD-MM-YYYY
This is my RegEx:
if re.findall(r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d', line):
It doesn't match when there is a line break before the date, as here:
Smith, John,
f. 25-12-1990
and it gets only the first part when the date itself is split, as in these two cases:
Smith, John, f. 25-12-
1990
Smith, John, f. 25-
12-1990
Here's the full code:
import re
import pandas as pd
a_list = []
f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")
txt = f.readlines()
for k, line in enumerate(txt):
    if re.findall(r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d', line):
        a_list.append((k, line))
print(a_list)
#df1 = pd.DataFrame(a_list)
#df1.to_csv('C:/Users/me/Desktop/outputs.csv', index=False)
f.close()
Upvotes: 0
Views: 116
Reputation: 19430
You are iterating over the lines of the file and passing only one line at a time to findall. The regex can only work on what you give it, so it can't match a record whose second half sits on a line you didn't pass in. Since \s also matches newlines, the pattern works as it is once you search the whole file's contents at once:
import re
a_list = []
with open("/Users/me/Desktop/scrape.txt", encoding="utf8") as f:
    txt = f.read()
print(re.findall(r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d', txt))
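To see the difference without touching the file, here is a small sketch that runs the same pattern both ways on an in-memory string. The sample string is just the three cases from the question glued together, and the names PATTERN, sample, per_line and whole_text are mine, used only for illustration.

```python
import re

# The pattern from the question, unchanged.
PATTERN = r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d'

# Assumed stand-in for the file contents: the three cases from the question.
sample = (
    "Smith, John,\n"
    "f. 25-12-1990\n"
    "Smith, John, f. 25-12-\n"
    "1990\n"
    "Smith, John, f. 25-\n"
    "12-1990\n"
)

# Line by line: no single line holds a complete record, so nothing matches.
per_line = [line for line in sample.splitlines() if re.findall(PATTERN, line)]

# Whole text: \s* also consumes the newlines, so every record is found.
whole_text = re.findall(PATTERN, sample)

print(per_line)
print(whole_text)
```

Note that each match keeps its embedded line break, and that `\w+,` only picks up the one name directly in front of `f.`, so you will probably want to clean the matches up afterwards.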
Upvotes: 1
Reputation: 144
Your regex seems to work. You can check it online here: https://regex101.com/r/yWrCig/1. It matches all 3 cases.
As for your code, use it like this:
res = re.findall(r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d', line)
if res:
    ...
Where 'res' is the list of matched strings.
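If you run the pattern over text that contains line breaks, the matched strings will contain them too. One possible way to get clean values is to add capture groups and strip the whitespace from each part. This is only a sketch under my own assumptions: the extended pattern (which also collects every comma-separated name before `f.`), the sample string, and the names RECORD and records are not from the question, and the line number is the line where the match starts, counted from 1.

```python
import re

# Assumed stand-in for the file contents: the three cases from the question.
sample = (
    "Smith, John,\n"
    "f. 25-12-1990\n"
    "Smith, John, f. 25-12-\n"
    "1990\n"
    "Smith, John, f. 25-\n"
    "12-1990\n"
)

# Hypothetical extension of the original pattern with capture groups:
# group 1 = all "name," parts, groups 2-4 = day, month, year.
RECORD = re.compile(
    r'((?:\w+,\s*)+)'
    r'f\s*\.\s*'
    r'(\d\s*\d)\s*-\s*(\d\s*\d)\s*-\s*(\d\s*\d\s*\d\s*\d)'
)

records = []
for m in RECORD.finditer(sample):
    # Split the names on commas and drop the empty piece after the last comma.
    names = [n for n in re.split(r',\s*', m.group(1)) if n]
    # Remove any whitespace (including newlines) inside the date parts.
    day, month, year = (re.sub(r'\s+', '', part) for part in m.group(2, 3, 4))
    # 1-based line number on which the match starts.
    line_no = sample.count("\n", 0, m.start()) + 1
    records.append((line_no, ", ".join(names), f"{day}-{month}-{year}"))

print(records)
```

The resulting list of tuples can be passed straight to pd.DataFrame if you still want the CSV output.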
Upvotes: 0