Stripers247

Reputation: 2335

Looking for a way to speed up the write to file portion of my Python code

I have some simple code that reads in a ~2 GB data file, extracts the columns of data that I need, and then writes that data as columns to another file for later processing. I ran the code last night and it took close to nine hours to complete. I ran the two sections separately and determined that the portion that writes the data to a new file is the problem. Can anyone point out why it is so slow the way I have written it, and suggest a better method?

sample of data being read in

26980300000000  26980300000000  39  13456502685696  1543    0
26980300000001  26980300000000  38  13282082553856  1523    0.01
26980300000002  26980300000000  37  13465223692288  1544    0.03
26980300000003  26980300000000  36  13290803560448  1524    0.05
26980300000004  26980300000000  35  9514610851840   1091    0.06
26980300000005  26980300000000  34  9575657897984   1098    0.08
26980300000006  26980300000000  33  8494254129152   974     0.1
26980300000007  26980300000000  32  8520417148928   977     0.12
26980300000008  26980300000000  31  8302391459840   952     0.14
26980300000009  26980300000000  30  8232623931392   944     0.16

Code

from itertools import islice

F = r'C:\Users\mass_red.csv'

def filesave(TID,M,R):     
  X = str(TID)
  Y = str(M)
  Z = str(R) 
  w = open(r'C:\Users\Outfiles\acc1_out3.txt','a')
  w.write(X)
  w.write('\t')
  w.write(Y)
  w.write('\t')
  w.write(Z)
  w.write('\n')
  w.close()
  return()

N = 47000000
f = open(F)           
f.readline()          
nlines = islice(f, N) 

for line in nlines:
    if line != '':
        line = line.strip()
        line = line.replace(',', ' ')
        columns = line.split()
        tid = int(columns[1])
        m = float(columns[3])
        r = float(columns[5])
        filesave(tid, m, r)

Upvotes: 0

Views: 166

Answers (3)

jfs

Reputation: 414089

Here's a simplified but complete version of your code:

#!/usr/bin/env python
from __future__ import print_function
from itertools import islice

nlines_limit = 47000000
with open(r'C:\Users\mass_red.csv') as input_file, \
     open(r'C:\Users\Outfiles\acc1_out3.txt', 'w') as output_file:
    next(input_file) # skip line
    for line in islice(input_file, nlines_limit):
        columns = line.split()       
        try:
            tid = int(columns[1])
            m = float(columns[3])  
            r = float(columns[5])             
        except (ValueError, IndexError):
            pass # skip invalid lines
        else:
            print(tid, m, r, sep='\t', file=output_file)

I don't see any commas in your input, so I've removed line.replace(',', ' ') from the code.

Upvotes: 1

Terry Jan Reedy

Reputation: 19144

In modern Python, most file use should be done with with statements: the open is easily seen to happen once in the header, and the close is automatic. Here is a general template for line processing.

inp = r'C:\Users\mass_red.csv'
out = r'C:\Users\Outfiles\acc1_out3.txt'
with open(inp) as fi, open(out, 'a') as fo:
    for line in fi:
        ...
        if keep:
            ...
            fo.write(whatever)

Upvotes: 1

StenSoft

Reputation: 9609

You open and close the file for each line. Open it once at the beginning.
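A minimal sketch of that change (the output path and sample rows here are illustrative, not from the question): the file is opened once before the loop, and every iteration writes to the same handle instead of paying the open/close cost per line.

```python
import os
import tempfile

# Illustrative sample rows standing in for the parsed (tid, m, r) values.
rows = [(26980300000000, 13456502685696.0, 0.0),
        (26980300000000, 13282082553856.0, 0.01)]

out_path = os.path.join(tempfile.gettempdir(), 'acc1_out3.txt')

with open(out_path, 'w') as w:          # opened once, not once per line
    for tid, m, r in rows:
        w.write('{}\t{}\t{}\n'.format(tid, m, r))
```

This also lets Python's internal write buffering do its job: with one long-lived handle, many small writes are coalesced before hitting the disk, whereas closing the file after every line forces a flush each time.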

Upvotes: 2
