Parsing Text Files in Irregular Formats

Question

I have a very large text file that was created by converting a PDF to text format using the XPDF 'pdftotext' utility. The document contains abandoned property data that I'd like to be able to parse into a database-friendly format. The file is formatted as follows:

DOE JANE H                                        STATE         1994     0002       SAVINGS       52            33.99   0
                                                  EMPLOYEES                         ACCOUNTS
                                                  CREDIT UNION

                 SOMECITY      ZZ      12345

In general, all of the records look like that. There are some that span one additional line (having an extra space before the city / state / zip line.

To make this even better, the text was not converted uniformly. Each page of the PDF converted to a "page" of the text file, having different column widths. In addition, the data might sometimes look like this:

DOE JANE H      OR SOME OTHER OWNER           STATE         1994     0002       SAVINGS       52            33.99   0
                                              EMPLOYEES                         ACCOUNTS
                                              CREDIT UNION

                SOMECITY      ZZ      12345

Or, alternatively, like this:

DOE                                      STATE         1994     0002       SAVINGS       52            33.99   0
JANE                                     EMPLOYEES                         ACCOUNTS
H                                        CREDIT UNION

            SOMECITY      ZZ      12345

So, is there anything I can do about this? Any help is appreciated.

UPDATE:

To clarify - I've tried to parse this using regular expressions and Perl, but with no luck (I was not able to figure out a decent way to deal with chunks of 5 lines at a time, or varied chunks as the records sometimes span 6 lines). In my best attempt, I was able to parse out only the first line of every record, which only contains partial information (some or all of the name, the first line of the holder, the report year, property type, etc...). Ideally, I'd like to be able to get all of the information for a record, regardless of the different formats. The programming language does not matter, so any appropriate tool for the job will do. I was hoping to get some input on different options of parsing a file like this, perhaps different tools or even applications that deal with issues like this.

S.Lott · Accepted Answer

Here's a start in Python. Something similar can probably be done in perl.

col_start_pat = re.compile( r'\s+\s(?=\S)' )
for row in ( r.rstrip() for r in data.splitlines() ):
    if not row: continue
    def matches( row ):
        offset= 0
        for m in col_start_pat.finditer( row ):
            yield offset, row[offset:m.end(0)].rstrip()
            offset= m.end(0)
        yield offset, row[offset:].rstrip()
    columns= list( matches( row ) )
    if not columns: continue
    print( columns )

It doesn't do the whole job. It merely identifies the "fields" in each line as a sequence of tuples that look like this.

[(0, 'DOE JANE H'), (50, 'STATE'), (64, '1994'), (73, '0002'), (84, 'SAVINGS'), (98, '52'), (112, '33.99'), (120, '0')]
[(0, ''), (50, 'EMPLOYEES'), (84, 'ACCOUNTS')]
[(0, ''), (50, 'CREDIT UNION')]
[(0, ''), (17, 'SOMECITY'), (31, 'ZZ'), (39, '12345')]
[(0, 'DOE JANE H'), (16, 'OR SOME OTHER OWNER'), (46, 'STATE'), (60, '1994'), (69, '0002'), (80, 'SAVINGS'), (94, '52'), (108, '33.99'), (116, '0')]
[(0, ''), (46, 'EMPLOYEES'), (80, 'ACCOUNTS')]
[(0, ''), (46, 'CREDIT UNION')]
[(0, ''), (16, 'SOMECITY'), (30, 'ZZ'), (38, '12345')]
[(0, 'DOE'), (41, 'STATE'), (55, '1994'), (64, '0002'), (75, 'SAVINGS'), (89, '52'), (103, '33.99'), (111, '0')]
[(0, 'JANE'), (41, 'EMPLOYEES'), (75, 'ACCOUNTS')]
[(0, 'H'), (41, 'CREDIT UNION')]
[(0, ''), (12, 'SOMECITY'), (26, 'ZZ'), (34, '12345')]

The next part is to use the header column numbers to identify the format and then collapse the rows into a single object.

Expanding on that...

pat1= [0, 50, 64, 73, 84, 98, 112, 120]
pat2= [0, 16, 46, 60, 69, 80, 94, 108, 116]
pat3= [0, 41, 55, 64, 75, 89, 103, 111]

def columns( data ):
    col_start_pat = re.compile( r'\s+\s(?=\S)' )
    for row in ( r.rstrip() for r in data.splitlines() ):
        if not row: continue
        def matches( row ):
            offset= 0
            for m in col_start_pat.finditer( row ):
                yield offset, row[offset:m.end(0)].rstrip()
                offset= m.end(0)
            yield offset, row[offset:].rstrip()
        columns= list( matches( row ) )
        yield columns

def row_groups( data ):
    columns_iter= columns(data)
    record= []
    for parsed in columns_iter:
        offsets= [ c[0] for c in parsed ]
        if offsets == pat1:
            if record: yield( format, record )
            format= 1
            record= []
        elif offsets == pat2:
            if record: yield( format, record )
            format= 2
            record= []
        elif offsets == pat3:
            if record: yield( format, record )
            format= 3
            record= []
        record.extend( parsed )
    yield( format, record )

for record in row_groups( data ):
    print( record )

Gets you the following:

(1, [(0, 'DOE JANE H'), (50, 'STATE'), (64, '1994'), (73, '0002'), (84, 'SAVINGS'), (98, '52'), (112, '33.99'), (120, '0'), (0, ''), (50, 'EMPLOYEES'), (84, 'ACCOUNTS'), (0, ''), (50, 'CREDIT UNION'), (0, ''), (17, 'SOMECITY'), (31, 'ZZ'), (39, '12345')])
(2, [(0, 'DOE JANE H'), (16, 'OR SOME OTHER OWNER'), (46, 'STATE'), (60, '1994'), (69, '0002'), (80, 'SAVINGS'), (94, '52'), (108, '33.99'), (116, '0'), (0, ''), (46, 'EMPLOYEES'), (80, 'ACCOUNTS'), (0, ''), (46, 'CREDIT UNION'), (0, ''), (16, 'SOMECITY'), (30, 'ZZ'), (38, '12345')])
(3, [(0, 'DOE'), (41, 'STATE'), (55, '1994'), (64, '0002'), (75, 'SAVINGS'), (89, '52'), (103, '33.99'), (111, '0'), (0, 'JANE'), (41, 'EMPLOYEES'), (75, 'ACCOUNTS'), (0, 'H'), (41, 'CREDIT UNION'), (0, ''), (12, 'SOMECITY'), (26, 'ZZ'), (34, '12345')])

Which has all of the information; it just needs some clever assembly.

Parsing Text Files in Irregular Formats

Answers (1)

Related Questions