Reputation: 1219

Issues parsing textual data

I have a text file that I want to parse and output to a csv. The data comes from SSURGO soil metadata (USDA). I have converted the pdf that describes the database tables, column names, and descriptions into a text file and would like to convert to a csv.

The column values are:

Table Physical Name:
Table Label:
Column Physical name:
Column Label:
Desc:

Sometimes the Column physical name and column label do not include a description. Example input file is:

Table Physical Name: chaashto
Table Label: Horizon AASHTO
Column Physical Name: aashtocl 
Column Label:   AASHTO
desc: Some description here.

Not every column name and label has a corresponding desc: or Table Physical Name and Label.

I have parsed the data to have one row for each column Physical Name and Label using the following code:

count = 0
cnt = 0 
out = open("output_test.txt", "wb")
value1,value2,value3,value4,value5 = "","","","","" 
for i,line in enumerate(lines):

    key, value = line.strip().split(':',1)
    if key == 'Table Physical Name':
        value1 = value.strip()
    if key == 'Table Label':
        value2 = value.strip()
    if key == 'Column Physical Name':
        value3 = value.strip()
    if key == 'Column Label':
        value4 = value.strip()
        if not lines[i+1].strip().startswith('desc'):
            count = 1
    if key == 'desc':
        value5 = value.strip()
        count = 1
    if count == 1:
        cnt += 1
        record = value1 + ";" +value2 + ";" + value3 + ";" + value4 + ";" + value5 +";\n"
        print cnt, record
        out.write(record)
        count = 0
        value3 = " "
        value4 = " "
        value5 = " "
    out.close()

(Pardon the sloppy-looking code, just learning Python).

At any rate, the issue is that sometimes when a line is output it does not include the Column Physical Name and inserts a \r at the end of the line. The subsequent line then includes the key "Column Physical Name: "Name of Column"

This is how the output should look:

chfrags;Horizon Fragments;chfragskey;Chorizon Fragments Key;A non-connotative string of characters used to uniquely identify a record in the Horizon Fragments table.;

This is how the "bad" output looks:

coforprodo;Component Forest Productivity - Other; ;Low
Column Physical Name: fprod_r; ;

In the above example, "fprod_r" should be between Other; and ";Low" and the "\r" after ;Low should not be there.

The maddening thing is that when I subset the input file, the places where the errors occur no longer occur. So the code processes the text file correctly.

Any help would be appreciated.

Upvotes: 0

Issues parsing textual data

Answers (2)

Related Questions