pam_param

Reputation: 162

RegEx to span newlines

I have a text file from which I'm trying to pull names and birth dates using a RegEx. The problem is that a record can span multiple lines, and my RegEx does not match those. The format of the data I want is always:

last name, middle name (sometimes), first name, f. DD-MM-YYYY

This is my RegEx:

if re.findall(r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d', line):

This doesn't match a record that is broken across lines like the one below:

Smith, John,

f. 25-12-1990

or it gets only the first part of the ones below:

Smith, John, f. 25-12-

1990

Smith, John, f. 25-

12-1990

Here's the full code:

import re
import pandas as pd

a_list = []

f = open("/Users/me/Desktop/scrape.txt", "r", encoding="utf8")
txt = f.readlines()

for k, line in enumerate(txt):
    if re.findall(r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d', line):
        a_list.append((k, line))
print(a_list)


#df1 = pd.DataFrame(a_list)
#df1.to_csv('C:/Users/me/Desktop/outputs.csv', index=False)

f.close()

Data example: (image not included)

Upvotes: 0

Views: 116

Answers (2)

Tomerikoo

Reputation: 19430

You are iterating over the lines of the file and passing only one line at a time to findall. The regex can only match against the text you give it, so it cannot match a record that continues on a line it never sees. You will have to search the whole file at once:

import re

a_list = []

with open("/Users/me/Desktop/scrape.txt", encoding="utf8") as f:
    txt = f.read()

    print(re.findall(r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d', txt))
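
If you still want the line number that the original loop recorded with enumerate, it can probably be recovered from the match position. The following is only a sketch under some assumptions: the pattern is the one from the question with two capture groups added, the function name find_records and the sample string are made up for illustration, and whitespace inside the date is stripped so that a date split across lines comes back as DD-MM-YYYY.

```python
import re

# Pattern from the question, with two capture groups added (an assumption
# made for this sketch): group 1 is the name right before "f.", group 2 is
# the date, which may contain whitespace or newlines.
PATTERN = re.compile(
    r'(\w+),\s*f\s*\.\s*(\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d)'
)


def find_records(text):
    """Return (line_number, name, date) for every match in the whole text."""
    records = []
    for m in PATTERN.finditer(text):
        # 1-based number of the line on which the match starts
        line_no = text.count("\n", 0, m.start()) + 1
        # Drop any whitespace (including newlines) that split the date
        date = re.sub(r'\s+', '', m.group(2))
        records.append((line_no, m.group(1), date))
    return records


# Hypothetical sample shaped like the data in the question
sample = (
    "Smith, John,\n"
    "\n"
    "f. 25-12-1990\n"
    "Doe, Jane, f. 01-02-\n"
    "\n"
    "1985\n"
)

print(find_records(sample))
```

Note that `\w+,` only captures the single word directly before "f.", exactly as the original pattern does, so the last name is not part of the match.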

Upvotes: 1

Natan

Reputation: 144

Your regex seems to be working. First, you can check it online here: https://regex101.com/r/yWrCig/1 It matches 3 cases.

As for your code, use it like this:

res = re.findall(r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d', line)
if res:
    ...

Where 'res' is the list of matched strings.
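
As a rough illustration of why the pattern can match in an online tester yet find nothing in a per-line loop, the sketch below runs it both ways on a made-up three-line sample (the lines list is an assumption, shaped like the first example in the question). findall returns an empty list when nothing matches, which is falsy in an if test.

```python
import re

# Pattern copied unchanged from the question
PATTERN = r'\w+,\s*f\s*\.\s*\d\s*\d\s*-\s*\d\s*\d\s*-\s*\d\s*\d\s*\d\s*\d'

# Hypothetical record split across lines, as readlines() would return it
lines = ["Smith, John,\n", "\n", "f. 25-12-1990\n"]

# One call per line: no single line contains the whole record
per_line = [re.findall(PATTERN, line) for line in lines]

# One call on the joined text: \s* can now cross the line breaks
whole = re.findall(PATTERN, "".join(lines))

print(per_line)  # [[], [], []]
print(whole)     # ['John,\n\nf. 25-12-1990']
```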

Upvotes: 0
