Python 3.5 csv parsing with special characters

Question

I have a file that I'd like to parse into a csv. The file is an export file, and arrives to me in the following format (line by line, each file containing thousands of lines):

03:30:30  08:30:30 [15 August 2015] productid:123456789 manuf:987654321 case:12 pallet:1234 id:12 code:1234  12345  123  12

I'd like to get the data into a csv file as follows:

local time, GMT time, date, product id, manuf id, case, pallet, id, code, company id, location, secondary code
03:30:30, 08:30:30, 15 August 2015, 123456789, 987654321, 12, 1234, 12, 1234, 12345, 123, 12

I've successfully done this, but using the wrong tools I believe. I've been using lines like below:

import fileinput
for line in fileinput.FileInput("file",inplace=1):
    line = line.replace(":",",")

import fileinput
for line in fileinput.FileInput("file",inplace=1):
    line = line.replace("case"," ")

which gives me something like..

l h, l min, ls, gmt h, gmt m, gmt s, date, product id, manuf id, case, pallet, id, code, company id, location, secondary code
03,30,30,08,30,30,15 August 2015, 123456789, 987654321, 12, 1234, 12, 1234, 12345, 123, 12

The problems are: I have to use successive lines to parse each character and word into csv (descriptor words, colons, brackets, etc..) and it takes a long time over a large data-set. Also I'm saving it all to a secondary file instead of writing in place. By replacing the colons with commas, my times get broken down into separate columns. I've spent a few days playing with different options and arrived at regex, but as a complete newb to python have yet to come up with a solution. It may be simpler to rebuild the times from comma separated blocks into the correct format, but I'm at a loss. Please assist. Thanks in advance.

Edit:

I've tried to implement Sparkeandshine as well as Julian's versions of a workable solution. I came closest to getting Sparkeandshie's version to work, but am only able to iterate one line; the one actually in the solution, not the entire file. I spent the last 24 trying different iterations of both solutions, to no avail. Here is where I am:

#!/usr/bin/env python

import csv
import os

inputFileName = 'test.txt'
outputFileName = 'finished.csv'

with open(inputFileName, newline='') as inFile, open(outputFileName,    'w', newline='') as outfile:
    r = csv.reader(inFile)
    w = csv.writer(outfile)

    line = '03:30:30 08:30:30 [15 August 2015] productid:123456789 manuf:987654321 case:12 pallet:1234 id:12 code:1234 12345 123 12'
str_list = line.split()

new_list = [str_list[0],
        str_list[1],
        ' '.join([item.strip('[]') for item in str_list[2:5]]), # '[15', 'August', '2015]'
        str_list[6].split(':')[1],
        str_list[7].split(':')[1],
        str_list[8].split(':')[1],
        str_list[9].split(':')[1],
        str_list[10].split(':')[1],
        str_list[12],
        str_list[13]
        ]

with open(inputFileName, newline='') as inFile, open(outputFileName, 'w', newline='') as outfile:
    r = csv.reader(inFile)
    w = csv.writer(outfile)

    for row in r:
        w.writerow(new_list)

Edit: Sparkandshines below solution works great for set lines, but I have found that some of my data has lines of varying lengths. For example, sometimes a few lines will be short a few "columns" of data, or the last three columns may repeat twice. To get around this problem I have been using "try" and "except error, continue" for the possible lengths. With this method I get duplicate lines in my final file; is there a better way to use Sparkandshine's solution with lines of varying length? ...or is it easier to just find duplicate lines (by checking / comparing two/three columns) and removing them?

SparkAndShine · Accepted Answer

Process each line using,

line = '03:30:30 08:30:30 [15 August 2015] productid:123456789 manuf:987654321 case:12 pallet:1234 id:12 code:1234 12345 123 12'
str_list = line.split()

new_list = [str_list[0],
            str_list[1],
            ' '.join([item.strip('[]') for item in str_list[2:5]]), # '[15', 'August', '2015]'
            str_list[6].split(':')[1],
            str_list[7].split(':')[1],
            str_list[8].split(':')[1],
            str_list[9].split(':')[1],
            str_list[10].split(':')[1],
            str_list[12],
            str_list[13]
            ]

print(new_list)
# Output
['03:30:30', '08:30:30', '15 August 2015', '987654321', '12', '1234', '12', '1234', '123', '12']

Write to a csv file with,

with open(filename, 'w') as f:
    writer = csv.writer(f)

    # write the file header
    fieldnames = ['local time', 'GMT time', 'date', 'product id', 'manuf id', 
            'case', 'pallet', 'id', 'code', 'company id', 'location', 'secondary code']
    writer.writerow(fieldnames)

    # process each line
    for line in lines:
        new_list = do_something()

        writer.writerow(new_list) # write to the file

Or process all lines and save the results to a list of lists, then wirte to a csv file with writerows,

writer.writerows(lists)

The full source code,

#!/usr/bin/env python
import csv

inputFileName = 'test.txt'
outputFileName = 'finished.csv'

with open(outputFileName, 'w') as outfile:
    w = csv.writer(outfile)

    # write the file header
    fieldnames = ['local time', 'GMT time', 'date', 'product id', 'manuf id', 
            'case', 'pallet', 'id', 'code', 'company id', 'location', 'secondary code']
    writer.writerow(fieldnames)


    # process each line
    with open(inputFileName, 'r') as inFile:
        for line in inFile:
            str_list = line.rstrip().split()
            new_list = [str_list[0],
                        str_list[1],
                        ' '.join([item.strip('[]') for item in str_list[2:5]]), # '[15', 'August', '2015]'
                        str_list[6].split(':')[1],
                        str_list[7].split(':')[1],
                        str_list[8].split(':')[1],
                        str_list[9].split(':')[1],
                        str_list[10].split(':')[1],
                        str_list[12],
                        str_list[13]]

            w.writerow(new_list) # write to the file

Python 3.5 csv parsing with special characters

Answers (2)

Related Questions