RolfBly
RolfBly

Reputation: 3862

Transform tagged to CSV

A file like this:

fieldname_A: eggs
fieldname_B: bacon 
~EOR~
..
..
~EOR~
fieldname_A: eggs
fieldname_C: spam
fieldname_Z: baked beans
~EOR~

in total, 40 different field names, about 10,000 records (EOR is End Of Record) with a variable set of fields in them. Most field content is short, some is very long (over 1000 characters).

I'm looking for a nice, clean, pythonic way to transform this to CSV, or rather, to a Google spreadsheet. (The data needs to be checked and corrected. Several people will do this. A google doc is the only fast way I know for them to work on the same data without causing version conflicts or syncing problems.)

Dictwriter seems like a cumbersome way, because many records (ie rows in a the spreadsheet) have fields in them that are empty.

The form of the input file is more XML-ish, so I'm thinking to transform it to XML, import that into Excel, and then upload it as Google spreadsheet.

Any ideas?

Upvotes: 0

Views: 66

Answers (1)

ATOzTOA
ATOzTOA

Reputation: 35950

Use this one...

infile = "d.txt"
data = open(infile, "r").readlines()

dataDict = []
columns = []

rowData = {}
# Create a dictionary list
for line in data:
    if not line.strip():
        continue
    if "~EOR~" in line:
        keys = rowData.keys()
        dataDict.append(rowData)
        columns = list(set(columns) | set(keys))
        rowData = {}
    else:  
        cell = line.split(":", 1);
        print cell
        rowData[cell[0].strip().strip('"').strip("'")] = cell[1].strip().strip('"').strip("'")


# Write dictionary list to file
outfile = "d.csv"

with open(outfile, "w") as fp:

    columns.sort()

    for key in columns:
        fp.write(key + ", ")

    fp.write("\n")

    for data in dataDict:
        for key in columns:
            if key in data:
                fp.write(data[key] + ",")
            else:
                fp.write(",")
        fp.write("\n")

Input:

fieldname_A: eggs
fieldname_B: bacon 
~EOR~
fieldname_A: eggs
fieldname_C: spam
fieldname_Z: baked beans
~EOR~

Output:

fieldname_A, fieldname_B, fieldname_C, fieldname_Z, 
eggs,bacon,,,
eggs,,spam,baked beans,

Upvotes: 1

Related Questions