Reputation: 25
I have a very large text file that was created by converting a PDF to text format using the XPDF 'pdftotext' utility. The document contains abandoned property data that I'd like to be able to parse into a database-friendly format. The file is formatted as follows:
DOE JANE H STATE 1994 0002 SAVINGS 52 33.99 0
EMPLOYEES ACCOUNTS
CREDIT UNION
SOMECITY ZZ 12345
In general, all of the records look like that. There are some that span one additional line (having an extra space before the city / state / zip line.
To make this even better, the text was not converted uniformly. Each page of the PDF converted to a "page" of the text file, having different column widths. In addition, the data might sometimes look like this:
DOE JANE H OR SOME OTHER OWNER STATE 1994 0002 SAVINGS 52 33.99 0
EMPLOYEES ACCOUNTS
CREDIT UNION
SOMECITY ZZ 12345
Or, alternatively, like this:
DOE STATE 1994 0002 SAVINGS 52 33.99 0
JANE EMPLOYEES ACCOUNTS
H CREDIT UNION
SOMECITY ZZ 12345
So, is there anything I can do about this? Any help is appreciated.
UPDATE:
To clarify - I've tried to parse this using regular expressions and Perl, but with no luck (I was not able to figure out a decent way to deal with chunks of 5 lines at a time, or varied chunks as the records sometimes span 6 lines). In my best attempt, I was able to parse out only the first line of every record, which only contains partial information (some or all of the name, the first line of the holder, the report year, property type, etc...). Ideally, I'd like to be able to get all of the information for a record, regardless of the different formats. The programming language does not matter, so any appropriate tool for the job will do. I was hoping to get some input on different options of parsing a file like this, perhaps different tools or even applications that deal with issues like this.
Upvotes: 0
Views: 693
Reputation: 392050
Here's a start in Python. Something similar can probably be done in perl.
col_start_pat = re.compile( r'\s+\s(?=\S)' )
for row in ( r.rstrip() for r in data.splitlines() ):
if not row: continue
def matches( row ):
offset= 0
for m in col_start_pat.finditer( row ):
yield offset, row[offset:m.end(0)].rstrip()
offset= m.end(0)
yield offset, row[offset:].rstrip()
columns= list( matches( row ) )
if not columns: continue
print( columns )
It doesn't do the whole job. It merely identifies the "fields" in each line as a sequence of tuples that look like this.
[(0, 'DOE JANE H'), (50, 'STATE'), (64, '1994'), (73, '0002'), (84, 'SAVINGS'), (98, '52'), (112, '33.99'), (120, '0')]
[(0, ''), (50, 'EMPLOYEES'), (84, 'ACCOUNTS')]
[(0, ''), (50, 'CREDIT UNION')]
[(0, ''), (17, 'SOMECITY'), (31, 'ZZ'), (39, '12345')]
[(0, 'DOE JANE H'), (16, 'OR SOME OTHER OWNER'), (46, 'STATE'), (60, '1994'), (69, '0002'), (80, 'SAVINGS'), (94, '52'), (108, '33.99'), (116, '0')]
[(0, ''), (46, 'EMPLOYEES'), (80, 'ACCOUNTS')]
[(0, ''), (46, 'CREDIT UNION')]
[(0, ''), (16, 'SOMECITY'), (30, 'ZZ'), (38, '12345')]
[(0, 'DOE'), (41, 'STATE'), (55, '1994'), (64, '0002'), (75, 'SAVINGS'), (89, '52'), (103, '33.99'), (111, '0')]
[(0, 'JANE'), (41, 'EMPLOYEES'), (75, 'ACCOUNTS')]
[(0, 'H'), (41, 'CREDIT UNION')]
[(0, ''), (12, 'SOMECITY'), (26, 'ZZ'), (34, '12345')]
The next part is to use the header column numbers to identify the format and then collapse the rows into a single object.
Expanding on that...
pat1= [0, 50, 64, 73, 84, 98, 112, 120]
pat2= [0, 16, 46, 60, 69, 80, 94, 108, 116]
pat3= [0, 41, 55, 64, 75, 89, 103, 111]
def columns( data ):
col_start_pat = re.compile( r'\s+\s(?=\S)' )
for row in ( r.rstrip() for r in data.splitlines() ):
if not row: continue
def matches( row ):
offset= 0
for m in col_start_pat.finditer( row ):
yield offset, row[offset:m.end(0)].rstrip()
offset= m.end(0)
yield offset, row[offset:].rstrip()
columns= list( matches( row ) )
yield columns
def row_groups( data ):
columns_iter= columns(data)
record= []
for parsed in columns_iter:
offsets= [ c[0] for c in parsed ]
if offsets == pat1:
if record: yield( format, record )
format= 1
record= []
elif offsets == pat2:
if record: yield( format, record )
format= 2
record= []
elif offsets == pat3:
if record: yield( format, record )
format= 3
record= []
record.extend( parsed )
yield( format, record )
for record in row_groups( data ):
print( record )
Gets you the following:
(1, [(0, 'DOE JANE H'), (50, 'STATE'), (64, '1994'), (73, '0002'), (84, 'SAVINGS'), (98, '52'), (112, '33.99'), (120, '0'), (0, ''), (50, 'EMPLOYEES'), (84, 'ACCOUNTS'), (0, ''), (50, 'CREDIT UNION'), (0, ''), (17, 'SOMECITY'), (31, 'ZZ'), (39, '12345')])
(2, [(0, 'DOE JANE H'), (16, 'OR SOME OTHER OWNER'), (46, 'STATE'), (60, '1994'), (69, '0002'), (80, 'SAVINGS'), (94, '52'), (108, '33.99'), (116, '0'), (0, ''), (46, 'EMPLOYEES'), (80, 'ACCOUNTS'), (0, ''), (46, 'CREDIT UNION'), (0, ''), (16, 'SOMECITY'), (30, 'ZZ'), (38, '12345')])
(3, [(0, 'DOE'), (41, 'STATE'), (55, '1994'), (64, '0002'), (75, 'SAVINGS'), (89, '52'), (103, '33.99'), (111, '0'), (0, 'JANE'), (41, 'EMPLOYEES'), (75, 'ACCOUNTS'), (0, 'H'), (41, 'CREDIT UNION'), (0, ''), (12, 'SOMECITY'), (26, 'ZZ'), (34, '12345')])
Which has all of the information; it just needs some clever assembly.
Upvotes: 2