Edmon

Reputation: 4872

Regex in Python that matches a pattern spread over multiple lines

I am extracting records from a file in which the information of interest is spread over three or more lines. The information is in sequence and follows a reasonable pattern, but it can have some boilerplate text in between.

Since this is a text file converted from PDF, it is also possible that a page number or some other simple control element appears in between.

The pattern consists of:
starting line: last name and first name separated by a comma, and nothing else
next line: two long numbers (>= 7 digits) followed by two dates
last line of interest: a 4-digit number followed by a date

The pattern of interest looks like this:

LAST NAME   ,FIRST NAME
... nothing or possibly some junk text
   999999999  9999999  MM/DD/YY  MM/DD/YY   junk text
... nothing or possibly some junk text
   9999    MM/DD/YY   junk
I dont care

My target text by default looks something like:

SOME IRRELEVANT TEXT 
DOE       ,JOHN
             200000002   100000070     04/04/13   12/12/12  XYZ IJK ABC     SOMETHING SOMETHING  
             0999   12/22/12    0   1   0   SOMETHING ELSE
MORE OF SOMETHING ELSE

but it is possible to encounter something in between so it would look like:

SOME IRRELEVANT TEXT 
DOE       ,JOHN
Page 13     Header
             200000002   100000070     04/04/13   12/12/12  XYZ IJK ABC     SOMETHING SOMETHING  
             0999   12/22/12    0   1   0   SOMETHING ELSE
MORE OF SOMETHING ELSE

I don't really need to validate much here, since I know that this pattern will occur as a substring, but with possible insertions.

So far, I have been catching these elements with the following three regular expressions:

(([A-Z]+\s+)+,[A-Z]+)
(\d{7,}\s+\d{7,}\s+(\d{2}/\d{2}/\d{2}\s+){2})
(\d{4}\s+\d{2}/\d{2}/\d{2})

but I would like to extract all of the data of interest together.

Is that possible and if so, how?

Upvotes: 1

Views: 192

Answers (2)

woemler

Reputation: 7169

This should pull all instances of the desired substrings from the larger string x. The re.S flag lets . match newlines, so junk between the lines of interest is skipped:

re.findall(r'([A-Z]+\s+,[A-Z]+).+?(\d+\s+\d+\s+\d{2}/\d{2}/\d{2}\s+\d{2}/\d{2}/\d{2}).+?(\d+\s+\d{2}/\d{2}/\d{2})', x, re.S)

The resulting list of tuples can be stitched together if needed to get a list of desired substrings with the junk text removed.
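
As a rough sketch of that stitching step, the snippet below runs a variant of the expression on the sample from the question (with the "Page 13" line in between) and joins each tuple into one whitespace-normalized string. The names text, record_re and records are made up for illustration, and the tighter quantifiers (\d{7,}, \d{4}, \s* before the comma) are an assumption based on the question's description, not part of the expression above. Note that because the gaps are matched lazily across lines, a name line with no numbers after it would be paired with the numbers of a later record.

```python
import re

# Sample copied from the question, including the "Page 13" interruption.
text = """SOME IRRELEVANT TEXT
DOE       ,JOHN
Page 13     Header
             200000002   100000070     04/04/13   12/12/12  XYZ IJK ABC     SOMETHING SOMETHING
             0999   12/22/12    0   1   0   SOMETHING ELSE
MORE OF SOMETHING ELSE
"""

# re.S lets "." cross line breaks; re.X allows the layout and comments below.
record_re = re.compile(r"""
    ([A-Z]+\s*,[A-Z]+)                                         # last name , first name
    .+?                                                        # junk, possibly spanning lines
    (\d{7,}\s+\d{7,}\s+\d{2}/\d{2}/\d{2}\s+\d{2}/\d{2}/\d{2})  # two long numbers, two dates
    .+?                                                        # more junk
    \b(\d{4}\s+\d{2}/\d{2}/\d{2})                              # 4-digit number and a date
""", re.S | re.X)

# Stitch each tuple of groups into one string and collapse runs of whitespace.
records = [" ".join(" ".join(groups).split()) for groups in record_re.findall(text)]
print(records)
# ['DOE ,JOHN 200000002 100000070 04/04/13 12/12/12 0999 12/22/12']
```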

Upvotes: 0

Mridul Augustine

Reputation: 311

Here I have put the regular expressions in a list and try to match them one after the other, line by line. Is this what you were looking for?

import re

# One pattern per line of interest, tried in order. The lazy ".*?" keeps the
# leading digits of the first number from being swallowed by the prefix.
regexpList = [re.compile(r"(([A-Z]+\s+)+,[A-Z]+)"),
              re.compile(r"^.*?(\d{7,}\s+\d{7,}\s+(\d{2}/\d{2}/\d{2}\s+){2})"),
              re.compile(r"^.*?(\d{4}\s+\d{2}/\d{2}/\d{2}).*")]

i = 0  # index of the pattern we are currently waiting for
with open("C:\\Users\\mridulp\\Desktop\\temp\\file1.txt") as f:
    for l in f:
        mObj = regexpList[i].match(l)
        if mObj:
            print(mObj.group(1))
            i = i + 1
        if i > 2:
            i = 0  # full record seen, start looking for the next name
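
One possible refinement of this line-by-line idea, sketched under assumptions: the function below works on any iterable of lines, returns the three pieces as a tuple, and starts over whenever a new name line shows up before the record is complete. The names NAME, NUMS, LAST and extract_records are hypothetical, and the patterns are guesses based on the layout described in the question.

```python
import re

# Hypothetical patterns, one per line of interest.
NAME = re.compile(r"^\s*([A-Z]+(?:\s+[A-Z]+)*\s*,[A-Z]+)\s*$")
NUMS = re.compile(r"(\d{7,}\s+\d{7,}\s+\d{2}/\d{2}/\d{2}\s+\d{2}/\d{2}/\d{2})")
LAST = re.compile(r"\b(\d{4}\s+\d{2}/\d{2}/\d{2})")

def extract_records(lines):
    """Yield (name, numbers, last) tuples, skipping junk lines in between."""
    name = nums = None
    for line in lines:
        m = NAME.match(line)
        if m:
            # A new name line always restarts the record.
            name, nums = m.group(1), None
            continue
        if name and nums is None:
            # Waiting for the line with two long numbers and two dates.
            m = NUMS.search(line)
            if m:
                nums = m.group(1)
            continue
        if name and nums:
            # Waiting for the line with the 4-digit number and a date.
            m = LAST.search(line)
            if m:
                yield name, nums, m.group(1)
                name = nums = None

sample = [
    "ROE ,JANE\n",                      # incomplete record, gets discarded
    "DOE       ,JOHN\n",
    "Page 13     Header\n",
    "   200000002   100000070     04/04/13   12/12/12  XYZ\n",
    "   0999   12/22/12    0   1   0\n",
]
found = list(extract_records(sample))
print(found)
```

Passing an open file object instead of the sample list should work the same way, since the function only iterates over lines.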

Upvotes: 0
