Reputation: 1219
I have a text file that I want to parse and output to a csv. The data comes from SSURGO soil metadata (USDA). I have converted the pdf that describes the database tables, column names, and descriptions into a text file and would like to convert to a csv.
The column values are:
Table Physical Name:
Table Label:
Column Physical name:
Column Label:
Desc:
Sometimes the Column physical name and column label do not include a description. Example input file is:
Table Physical Name: chaashto
Table Label: Horizon AASHTO
Column Physical Name: aashtocl
Column Label: AASHTO
desc: Some description here.
Not every column name and label has a corresponding desc: or Table Physical Name and Label.
I have parsed the data to have one row for each column Physical Name and Label using the following code:
count = 0
cnt = 0
out = open("output_test.txt", "wb")
value1,value2,value3,value4,value5 = "","","","",""
for i,line in enumerate(lines):
key, value = line.strip().split(':',1)
if key == 'Table Physical Name':
value1 = value.strip()
if key == 'Table Label':
value2 = value.strip()
if key == 'Column Physical Name':
value3 = value.strip()
if key == 'Column Label':
value4 = value.strip()
if not lines[i+1].strip().startswith('desc'):
count = 1
if key == 'desc':
value5 = value.strip()
count = 1
if count == 1:
cnt += 1
record = value1 + ";" +value2 + ";" + value3 + ";" + value4 + ";" + value5 +";\n"
print cnt, record
out.write(record)
count = 0
value3 = " "
value4 = " "
value5 = " "
out.close()
(Pardon the sloppy-looking code, just learning Python).
At any rate, the issue is that sometimes when a line is output it does not include the Column Physical Name and inserts a \r at the end of the line. The subsequent line then includes the key "Column Physical Name: "Name of Column"
This is how the output should look:
chfrags;Horizon Fragments;chfragskey;Chorizon Fragments Key;A non-connotative string of characters used to uniquely identify a record in the Horizon Fragments table.;
This is how the "bad" output looks:
coforprodo;Component Forest Productivity - Other; ;Low
Column Physical Name: fprod_r; ;
In the above example, "fprod_r" should be between Other; and ";Low" and the "\r" after ;Low should not be there.
The maddening thing is that when I subset the input file, the places where the errors occur no longer occur. So the code processes the text file correctly.
Any help would be appreciated.
Upvotes: 0
Views: 58
Reputation: 1219
All, I apologize. There were "invisible" \r that were not showing up when i was using notepad++. Not sure why, but when I finally deleted them, the output is correct. Not sure why the \r were not carried over into the subset file. Oh well.
Upvotes: 1
Reputation: 15306
There's no need to do all the CSV ETL by hand. Python has a module csv
for that which you can leverage, and which should help to eliminate the sort of edge cases you're encountering.
You might try something like this:
import csv
with open('foo.csv', 'wb') as f:
writer = csv.writer(f, delimiter=',',
quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerow(('Table Physical Name', 'Table Label', 'Column Physical Name', 'Column Label', 'Desc'))
writer.writerow( ('chfrags', 'Horizon Fragments', 'chfragskey', 'Chorizon Fragments Key',
'A non-connotative string of characters used to uniquely identify a record in the Horizon Fragments table.')
)
Which outputs this:
"Table Physical Name","Table Label","Column Physical Name","Column Label","Desc"
"chfrags","Horizon Fragments","chfragskey","Chorizon Fragments Key","A non-connotative string of characters used to uniquely identify a record in the Horizon Fragments table."
You may need to tweak this code somewhat to meet your exact data needs (i.e. the delimiter, the type of quoting, and the exact details of the row tuples) but this should provide the general template, using the data you mentioned above, for you to use.
Upvotes: 1