Reputation: 4872
I am extracting records from a file in which the information of interest is spread over three or more lines. The information is in sequence and follows a reasonable pattern, but it can have some boilerplate text in between.
Since this is a text file converted from PDF it is also possible that there is a page number or some other simple control elements in between.
Pattern consists of:
starting line: last name and first name separated by comma, and nothing else
next line will have two long numbers (>=7 digits) followed by two dates
last line of interest will have 4-digit number followed by a date
The pattern of interest looks like this:
LAST NAME ,FIRST NAME ... nothing or possibly some junk text 999999999 9999999 MM/DD/YY MM/DD/YY junk text ... nothing or possibly some junk text 9999 MM/DD/YY junk I dont care
My target text by default looks something like:
SOME IRRELEVANT TEXT DOE ,JOHN 200000002 100000070 04/04/13 12/12/12 XYZ IJK ABC SOMETHING SOMETHING 0999 12/22/12 0 1 0 SOMETHING ELSE MORE OF SOMETHING ELSE
but it is possible to encounter something in between so it would look like:
SOME IRRELEVANT TEXT DOE ,JOHN Page 13 Header 200000002 100000070 04/04/13 12/12/12 XYZ IJK ABC SOMETHING SOMETHING 0999 12/22/12 0 1 0 SOMETHING ELSE MORE OF SOMETHING ELSE
I don't really need to validate much here. I know that this pattern will occur as a substring, but with possible insertions.
So far, I have been catching the three lines with the following three regular expressions:
(([A-Z]+\s+)+,[A-Z]+)
(\d{7,}\s+\d{7,}\s+(\d{2}/\d{2}/\d{2}\s+){2})
(\d{4}\s+\d{2}/\d{2}/\d{2})
but I would like to extract the whole data of interest.
Is that possible and if so, how?
Upvotes: 1
Views: 192
Reputation: 7169
This should pull all instances of the desired substrings from the larger string for you:
re.findall(r'([A-Z]+\s+,[A-Z]+).+?(\d+\s+\d+\s+\d{2}\/\d{2}\/\d{2}\s+\d{2}\/\d{2}\/\d{2}).+?(\d+\s+\d{2}\/\d{2}\/\d{2})', x, re.S)
The resulting list of tuples can be stitched together if needed to get a list of desired substrings with the junk text removed.
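As a minimal sketch of that stitching step, here is the pattern above (with the redundant backslashes before `/` dropped) applied to the sample text from the question. The line breaks in `sample` are a guess at the original layout, and `sample`, `pattern` and `records` are names made up for illustration:

```python
import re

# Sample text from the question; the line breaks are assumed.
sample = """SOME IRRELEVANT TEXT
DOE ,JOHN
Page 13 Header
200000002 100000070 04/04/13 12/12/12 XYZ IJK
ABC SOMETHING SOMETHING
0999 12/22/12 0 1 0 SOMETHING ELSE
MORE OF SOMETHING ELSE
"""

# Three capture groups separated by lazy ".+?" gaps; re.S lets "." cross newlines.
pattern = re.compile(
    r"([A-Z]+\s+,[A-Z]+).+?"
    r"(\d+\s+\d+\s+\d{2}/\d{2}/\d{2}\s+\d{2}/\d{2}/\d{2}).+?"
    r"(\d+\s+\d{2}/\d{2}/\d{2})",
    re.S,
)

# findall returns one tuple of three groups per record; join each into one string.
records = [" ".join(parts) for parts in pattern.findall(sample)]
print(records)
```

Note that the name group here only allows a single-word last name, unlike the `([A-Z]+\s+)+` form in the question.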
Upvotes: 0
Reputation: 311
Here I have added the regular expressions to a list and tried to find a match for each one in turn. Is this what you were looking for?
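import re

# Patterns for the three lines of interest, in the order they must appear.
# The leading ".*?" is lazy so that the first long number is not truncated.
regexp_list = [
    re.compile(r"(([A-Z]+\s+)+,[A-Z]+)"),
    re.compile(r"^.*?(\d{7,}\s+\d{7,}\s+(\d{2}/\d{2}/\d{2}\s+){2})"),
    re.compile(r"^.*?(\d{4}\s+\d{2}/\d{2}/\d{2}).*"),
]

i = 0  # index of the pattern we are currently waiting for
with open("C:\\Users\\mridulp\\Desktop\\temp\\file1.txt") as f:
    for line in f:
        m_obj = regexp_list[i].match(line)
        if m_obj:
            print(m_obj.group(1))
            # Move on to the next pattern, wrapping around after the third.
            i = (i + 1) % len(regexp_list)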
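For illustration, the same line-by-line idea can be wrapped in a function that collects one tuple per record instead of printing. This is only a sketch under a few assumptions: `PATTERNS` and `extract_records` are made-up names, it uses `search` with `\b` rather than `match` with a leading `^.*`, and it assumes the three lines of a record always appear in order:

```python
import re

# One pattern per line of interest, in the order they are expected.
PATTERNS = [
    # Name line: last name(s), comma, first name, and nothing else.
    re.compile(r"^\s*((?:[A-Z]+\s+)+,[A-Z]+)\s*$"),
    # Two long numbers (7+ digits) followed by two dates.
    re.compile(r"\b(\d{7,}\s+\d{7,}\s+\d{2}/\d{2}/\d{2}\s+\d{2}/\d{2}/\d{2})"),
    # A 4-digit number followed by a date.
    re.compile(r"\b(\d{4}\s+\d{2}/\d{2}/\d{2})"),
]

def extract_records(lines):
    """Return a list of (name, numbers_and_dates, code_and_date) tuples."""
    records, current = [], []
    for line in lines:
        # len(current) is the index of the pattern we are waiting for.
        m = PATTERNS[len(current)].search(line)
        if m:
            current.append(m.group(1))
            if len(current) == len(PATTERNS):
                records.append(tuple(current))
                current = []
    return records
```

Lines that match none of the expected patterns (page headers, junk text) are simply skipped, so `extract_records(open(path))` would work on a file object as well as on a list of strings.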
Upvotes: 0