Reputation: 57
I have a folder with multiple files, each containing lines with a varying number of columns. I want to go through the directory, open each file, loop through each line, and write the line to a new CSV file based on the number of columns in that line. I want to end up with a single big CSV for all lines with 14 columns, another big CSV for all lines with 18 columns, and a last CSV with all the remaining lines.
Here's what I have so far.
import pandas as pd
import glob
import os
import csv

path = r'C:\Users\Vladimir\Documents\projects\ETLassig\W3SVC2'
all_files = glob.glob(os.path.join(path, "*.log"))

for file in all_files:
    for line in file:
        if len(line.split()) == 14:
            with open('c14.csv', 'wb') as csvfile:
                csvwriter = csv.writer(csvfile, delimiter=' ')
                csvwriter.writerow([line])
        elif len(line.split()) == 18:
            with open('c14.csv', 'wb') as csvfile:
                csvwriter = csv.writer(csvfile, delimiter=' ')
                csvwriter.writerow([line])
            #open 18.csv
        else:
            with open('misc.csv', 'wb') as csvfile:
                csvwriter = csv.writer(csvfile, delimiter=' ')
                csvwriter.writerow([line])

print(c14.csv)
Can anyone offer any feedback on how to approach this?
Upvotes: 0
Views: 348
Reputation: 25023
First, note that you can copy the lines as-is from the input files to the output files; there is no need for the CSV machinery.
That said, I propose using a dictionary of file objects together with the dictionary get method, which lets you specify a default value.
files = {14: open('14.csv', 'w'),
         18: open('18.csv', 'w')}
other = open('other.csv', 'w')

for file in all_files:
    for line in open(file):
        llen = len(line.split())
        target = files.get(llen, other)  # fall back to the catch-all file
        target.write(line)
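One thing the snippet above leaves out is closing the output files when it is done. A minimal sketch of one way to handle that with contextlib.ExitStack could look like this (same file names as above, and it assumes all_files is the list of log files built in the question):

import contextlib

# ExitStack closes every file registered with enter_context when the block ends
with contextlib.ExitStack() as stack:
    files = {14: stack.enter_context(open('14.csv', 'w')),
             18: stack.enter_context(open('18.csv', 'w'))}
    other = stack.enter_context(open('other.csv', 'w'))
    for file in all_files:
        with open(file) as src:
            for line in src:
                target = files.get(len(line.split()), other)
                target.write(line)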
If you have to process some millions of records, note that counting separators is almost twice as fast as splitting:
In [20]: a = 'a '*20
In [21]: %timeit len(a.split())
599 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [22]: %timeit a.count(' ')+1
328 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
so you should replace the loops above with
for file in all_files:
    for line in open(file):
        fields_count = line.count(' ') + 1
        target = files.get(fields_count, other)
        target.write(line)
I say "should" because, even though we are talking about nanoseconds, the file write itself is in the same ballpark
In [23]: f = open('dele000', 'w')
In [24]: %timeit f.write(a)
508 ns ± 154 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
as the splitting/counting, so the difference is not lost in the write time.
Upvotes: 0
Reputation: 11073
You can collect each line's columns as a list of lists:
l = []
for file in your_files:  # your list of file paths
    with open(file, 'r') as f:
        for line in f.readlines():
            l.append(line.split(" "))
Now you have a list of lists, so just sort it by the length of the sublists and then write it out to a new file:
l.sort(key=len)

with open(outputfile, 'w') as f:
    # Write lines here as you want
    pass
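If the goal is the three files from the question (14 columns, 18 columns, everything else), one possible way to finish the writing step is the following sketch; it assumes the list l built above, and the output file names are just examples:

import csv

# group the collected rows by their column count and write the three CSVs
with open('14.csv', 'w', newline='') as f14, \
     open('18.csv', 'w', newline='') as f18, \
     open('misc.csv', 'w', newline='') as fmisc:
    writers = {14: csv.writer(f14), 18: csv.writer(f18)}
    misc_writer = csv.writer(fmisc)
    for row in l:
        row = [field.strip() for field in row]  # drop the newline kept by split(" ")
        writers.get(len(row), misc_writer).writerow(row)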
Upvotes: 5