Reputation: 2335
I have a simple script that reads a ~2 GB data file, extracts the columns I need, and writes them as columns to another file for later processing. I ran it last night and it took close to nine hours to complete. I ran the two sections separately and determined that the portion that writes the data to the new file is the problem. Can anyone point out why it is so slow the way I have written it, and suggest a better method?
Sample of the data being read in:
26980300000000 26980300000000 39 13456502685696 1543 0
26980300000001 26980300000000 38 13282082553856 1523 0.01
26980300000002 26980300000000 37 13465223692288 1544 0.03
26980300000003 26980300000000 36 13290803560448 1524 0.05
26980300000004 26980300000000 35 9514610851840 1091 0.06
26980300000005 26980300000000 34 9575657897984 1098 0.08
26980300000006 26980300000000 33 8494254129152 974 0.1
26980300000007 26980300000000 32 8520417148928 977 0.12
26980300000008 26980300000000 31 8302391459840 952 0.14
26980300000009 26980300000000 30 8232623931392 944 0.16
Code
from itertools import islice

F = r'C:\Users\mass_red.csv'

def filesave(TID, M, R):
    X = str(TID)
    Y = str(M)
    Z = str(R)
    w = open(r'C:\Users\Outfiles\acc1_out3.txt', 'a')
    w.write(X)
    w.write('\t')
    w.write(Y)
    w.write('\t')
    w.write(Z)
    w.write('\n')
    w.close()
    return()

N = 47000000
f = open(F)
f.readline()  # skip the first line
nlines = islice(f, N)
for line in nlines:
    if line != '':
        line = line.strip()
        line = line.replace(',', ' ')
        columns = line.split()
        tid = int(columns[1])
        m = float(columns[3])
        r = float(columns[5])
        filesave(tid, m, r)
Upvotes: 0
Views: 166
Reputation: 414089
Here's a simplified but complete version of your code:
#!/usr/bin/env python
from __future__ import print_function
from itertools import islice

nlines_limit = 47000000
with open(r'C:\Users\mass_red.csv') as input_file, \
     open(r'C:\Users\Outfiles\acc1_out3.txt', 'w') as output_file:
    next(input_file)  # skip line
    for line in islice(input_file, nlines_limit):
        columns = line.split()
        try:
            tid = int(columns[1])
            m = float(columns[3])
            r = float(columns[5])
        except (ValueError, IndexError):
            pass  # skip invalid lines
        else:
            print(tid, m, r, sep='\t', file=output_file)
I don't see any commas in your sample input, so I've removed line.replace(',', ' ') from the code.
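If the real file does contain commas (the .csv extension suggests it might), the parsing step could accept both separators. The following is only a sketch under that assumption; parse_line is a hypothetical helper name, not part of the code above, and it returns None for lines that cannot be parsed so the caller can skip them.

```python
import re

def parse_line(line):
    """Return (tid, m, r) from columns 1, 3 and 5, or None if the line is invalid.

    Assumption: fields are separated by commas, whitespace, or a mix of both.
    """
    columns = re.split(r'[,\s]+', line.strip())
    try:
        return int(columns[1]), float(columns[3]), float(columns[5])
    except (ValueError, IndexError):
        return None  # blank, short, or non-numeric line

print(parse_line("26980300000001 26980300000000 38 13282082553856 1523 0.01"))
print(parse_line("26980300000001,26980300000000,38,13282082553856,1523,0.01"))
print(parse_line(""))
```

Inside the loop this would replace the try/except block: call parse_line(line) and write the result only when it is not None.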
Upvotes: 1
Reputation: 19144
In modern Python, most file handling should use the with statement: the file is visibly opened once, in the header, and it is closed automatically when the block exits. Here is a general template for line processing:
inp = r'C:\Users\mass_red.csv'
out = r'C:\Users\Outfiles\acc1_out3.txt'
with open(inp) as fi, open(out, 'a') as fo:
    for line in fi:
        ...
        if keep:
            ...
            fo.write(whatever)
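As an illustration, the template might be filled in for the question's data roughly as follows. This is a sketch, not a drop-in replacement: extract_columns is a hypothetical name, in-memory io.StringIO objects stand in for the real files so the example runs anywhere, and it assumes the columns can be copied as text without converting them to int or float.

```python
import io

def extract_columns(fi, fo):
    """Copy columns 1, 3 and 5 of each usable line from fi to fo, tab-separated."""
    kept = 0
    for line in fi:
        columns = line.split()
        if len(columns) < 6:
            continue  # skip blank or short lines
        # Assumption: no numeric validation is needed, so the text is copied as-is.
        fo.write('\t'.join((columns[1], columns[3], columns[5])) + '\n')
        kept += 1
    return kept

sample = io.StringIO(
    "26980300000000 26980300000000 39 13456502685696 1543 0\n"
    "\n"
    "26980300000001 26980300000000 38 13282082553856 1523 0.01\n"
)
out = io.StringIO()
kept = extract_columns(sample, out)
print(kept, "lines kept")
print(out.getvalue(), end='')
```

With real files, the call would sit inside the with block from the template: extract_columns(fi, fo). Skipping the int/float round trip also keeps the values exactly as they appear in the input.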
Upvotes: 1
Reputation: 9609
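You open and close the output file once for every line you write, which for this input can mean up to 47 million times. Open it once at the beginning and close it once at the end.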
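A small timing sketch can make the difference visible. The function names, the row count, and the temporary files below are illustrative assumptions; the actual timings depend on the operating system and disk, so none are quoted here.

```python
import os
import tempfile
import time

def write_reopening(path, rows):
    """Open and close the file for every row, as in the question."""
    for tid, m, r in rows:
        with open(path, 'a') as w:
            w.write(f'{tid}\t{m}\t{r}\n')

def write_once(path, rows):
    """Open the file once and write every row through the same handle."""
    with open(path, 'w') as w:
        for tid, m, r in rows:
            w.write(f'{tid}\t{m}\t{r}\n')

# Made-up rows in the shape (tid, m, r); far fewer than the real 47 million.
rows = [(26980300000000, float(i), i / 100) for i in range(2000)]

with tempfile.TemporaryDirectory() as tmp:
    slow_path = os.path.join(tmp, 'slow.txt')
    fast_path = os.path.join(tmp, 'fast.txt')

    start = time.perf_counter()
    write_reopening(slow_path, rows)
    slow_seconds = time.perf_counter() - start

    start = time.perf_counter()
    write_once(fast_path, rows)
    fast_seconds = time.perf_counter() - start

    # Both approaches should produce byte-for-byte identical output.
    with open(slow_path) as a, open(fast_path) as b:
        same_output = a.read() == b.read()

print(f'reopen per row: {slow_seconds:.4f} s')
print(f'open once:      {fast_seconds:.4f} s')
print('identical output:', same_output)
```

The output is the same either way; only the number of open and close calls changes.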
Upvotes: 2