Reputation: 15
I have a file that I'd like to parse into a csv. The file is an export file, and arrives to me in the following format (line by line, each file containing thousands of lines):
03:30:30 08:30:30 [15 August 2015] productid:123456789 manuf:987654321 case:12 pallet:1234 id:12 code:1234 12345 123 12
I'd like to get the data into a csv file as follows:
local time, GMT time, date, product id, manuf id, case, pallet, id, code, company id, location, secondary code
03:30:30, 08:30:30, 15 August 2015, 123456789, 987654321, 12, 1234, 12, 1234, 12345, 123, 12
I've successfully done this, but using the wrong tools I believe. I've been using lines like below:
import fileinput
for line in fileinput.FileInput("file",inplace=1):
line = line.replace(":",",")
import fileinput
for line in fileinput.FileInput("file",inplace=1):
line = line.replace("case"," ")
which gives me something like..
l h, l min, ls, gmt h, gmt m, gmt s, date, product id, manuf id, case, pallet, id, code, company id, location, secondary code
03,30,30,08,30,30,15 August 2015, 123456789, 987654321, 12, 1234, 12, 1234, 12345, 123, 12
The problems are: I have to use successive lines to parse each character and word into csv (descriptor words, colons, brackets, etc..) and it takes a long time over a large data-set. Also I'm saving it all to a secondary file instead of writing in place. By replacing the colons with commas, my times get broken down into separate columns. I've spent a few days playing with different options and arrived at regex, but as a complete newb to python have yet to come up with a solution. It may be simpler to rebuild the times from comma separated blocks into the correct format, but I'm at a loss. Please assist. Thanks in advance.
Edit:
I've tried to implement Sparkeandshine as well as Julian's versions of a workable solution. I came closest to getting Sparkeandshie's version to work, but am only able to iterate one line; the one actually in the solution, not the entire file. I spent the last 24 trying different iterations of both solutions, to no avail. Here is where I am:
#!/usr/bin/env python
import csv
import os
inputFileName = 'test.txt'
outputFileName = 'finished.csv'
with open(inputFileName, newline='') as inFile, open(outputFileName, 'w', newline='') as outfile:
r = csv.reader(inFile)
w = csv.writer(outfile)
line = '03:30:30 08:30:30 [15 August 2015] productid:123456789 manuf:987654321 case:12 pallet:1234 id:12 code:1234 12345 123 12'
str_list = line.split()
new_list = [str_list[0],
str_list[1],
' '.join([item.strip('[]') for item in str_list[2:5]]), # '[15', 'August', '2015]'
str_list[6].split(':')[1],
str_list[7].split(':')[1],
str_list[8].split(':')[1],
str_list[9].split(':')[1],
str_list[10].split(':')[1],
str_list[12],
str_list[13]
]
with open(inputFileName, newline='') as inFile, open(outputFileName, 'w', newline='') as outfile:
r = csv.reader(inFile)
w = csv.writer(outfile)
for row in r:
w.writerow(new_list)
Edit: Sparkandshines below solution works great for set lines, but I have found that some of my data has lines of varying lengths. For example, sometimes a few lines will be short a few "columns" of data, or the last three columns may repeat twice. To get around this problem I have been using "try" and "except error, continue" for the possible lengths. With this method I get duplicate lines in my final file; is there a better way to use Sparkandshine's solution with lines of varying length? ...or is it easier to just find duplicate lines (by checking / comparing two/three columns) and removing them?
Upvotes: 1
Views: 550
Reputation: 18017
Process each line using,
line = '03:30:30 08:30:30 [15 August 2015] productid:123456789 manuf:987654321 case:12 pallet:1234 id:12 code:1234 12345 123 12'
str_list = line.split()
new_list = [str_list[0],
str_list[1],
' '.join([item.strip('[]') for item in str_list[2:5]]), # '[15', 'August', '2015]'
str_list[6].split(':')[1],
str_list[7].split(':')[1],
str_list[8].split(':')[1],
str_list[9].split(':')[1],
str_list[10].split(':')[1],
str_list[12],
str_list[13]
]
print(new_list)
# Output
['03:30:30', '08:30:30', '15 August 2015', '987654321', '12', '1234', '12', '1234', '123', '12']
Write to a csv file with,
with open(filename, 'w') as f:
writer = csv.writer(f)
# write the file header
fieldnames = ['local time', 'GMT time', 'date', 'product id', 'manuf id',
'case', 'pallet', 'id', 'code', 'company id', 'location', 'secondary code']
writer.writerow(fieldnames)
# process each line
for line in lines:
new_list = do_something()
writer.writerow(new_list) # write to the file
Or process all lines and save the results to a list of lists, then wirte to a csv file with writerows
,
writer.writerows(lists)
The full source code,
#!/usr/bin/env python
import csv
inputFileName = 'test.txt'
outputFileName = 'finished.csv'
with open(outputFileName, 'w') as outfile:
w = csv.writer(outfile)
# write the file header
fieldnames = ['local time', 'GMT time', 'date', 'product id', 'manuf id',
'case', 'pallet', 'id', 'code', 'company id', 'location', 'secondary code']
writer.writerow(fieldnames)
# process each line
with open(inputFileName, 'r') as inFile:
for line in inFile:
str_list = line.rstrip().split()
new_list = [str_list[0],
str_list[1],
' '.join([item.strip('[]') for item in str_list[2:5]]), # '[15', 'August', '2015]'
str_list[6].split(':')[1],
str_list[7].split(':')[1],
str_list[8].split(':')[1],
str_list[9].split(':')[1],
str_list[10].split(':')[1],
str_list[12],
str_list[13]]
w.writerow(new_list) # write to the file
Upvotes: 1
Reputation: 5822
If you have a known structure of your input files, you don't need to use regular expressions, as they are slower. For your case, try using something similar to the code below:
import csv
with open('destination.csv', 'w', newline='') as csvfile:
csv_writer = csv.writer(csvfile, delimiter=',')
... # get your data here, the next code is for one line of data
values = line.split()
values[2] = '{} {} {}'.format(values[2][1:], values[3], values[4][:-1]) # Get rid of square brackets
for idx in range(5, 11):
values[idx] = values[idx].split(':')[1] # get the number values
values.pop(3); # remove separated month name
values.pop(3); # remove separated year
csv_writer.writerow(values)
You are editing only the data that requires it and then write the whole row to the csv.
Upvotes: 0