proutray

Reputation: 2033

How do I optimize the Python code to read a batch of multiple lines rather than one line at a time?

I have written code that reads a large (>15 GB) text file and converts the data to a CSV file, one line at a time.

txt_file = fileName + ".txt"
csv_file = fileName + ".csv"
with open(txt_file, "r") as tf, open(csv_file, "w") as cf:
    for line in tf:
        cf.writelines(" ".join(line.split()).replace(' ', ','))
        cf.write("\n")

Edit:

Data in the input file:
abc def ghi jkl

Expected data in output file:
abc,def,ghi,jkl

I am using Python 2.7.6 on Mac OS X 10.10.3.

Upvotes: 0

Views: 1866

Answers (3)

rassa45

Reputation: 3560

The easiest way to do it is to buffer lines and write them out in batches:

with open("file.json", "r") as r, open("write.csv", "a") as w:
    lines = []
    for l in r:
        # Process l
        lines.append(l)
        if len(lines) >= 1000000:  # reportedly only ~54 MB of RAM
            w.writelines(lines)
            del lines[:]
    if lines:  # flush whatever is left in the buffer
        w.writelines(lines)
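If you specifically want to *read* a fixed number of lines per iteration, rather than buffering the writes, a minimal sketch using itertools.islice (the function name and batch size here are illustrative choices, not part of the question):

```python
import itertools

def read_in_batches(path, batch_size=100000):
    """Yield lists of at most batch_size lines from the file at path."""
    with open(path, "r") as f:
        while True:
            batch = list(itertools.islice(f, batch_size))
            if not batch:
                # islice returned nothing: end of file reached
                break
            yield batch
```

Each yielded list holds at most batch_size lines, so memory use stays bounded no matter how large the file is.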

Upvotes: -1

Malonge

Reputation: 2040

I know this is not technically answering your question, but if you can preprocess the file before your Python script runs, I believe sed would be the fastest way to do this. Given your large file sizes, I think the non-Python suggestion is worth it.

How to replace space with comma using sed

You can do this via command line before starting your python script, or even invoke it within your script using subprocess.
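For example, a minimal sketch of invoking sed through subprocess (the function name and file paths are placeholders; the pattern `s/ /,/g` assumes fields are separated by single spaces, as in the sample data):

```python
import subprocess

def spaces_to_commas(txt_path, csv_path):
    """Run sed to replace every space with a comma, writing to csv_path."""
    with open(csv_path, "w") as out:
        subprocess.check_call(["sed", "-e", "s/ /,/g", txt_path], stdout=out)
```

This streams the file through sed without loading it into Python at all, which is typically much faster than a line-by-line Python loop.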

Upvotes: 0

Martijn Pieters

Reputation: 1124548

Leave parsing and formatting CSV to the csv module:

import csv

txt_file = fileName + ".txt"
csv_file = fileName + ".csv"
with open(txt_file, "rb") as tf, open(csv_file, "wb") as cf:
    reader = csv.reader(tf, delimiter=' ')
    writer = csv.writer(cf)
    writer.writerows(reader)

or if you have strange quoting, treating the input file as text and manually splitting:

import csv

txt_file = fileName + ".txt"
csv_file = fileName + ".csv"
with open(txt_file, "rb") as tf, open(csv_file, "wb") as cf:
    writer = csv.writer(cf)
    writer.writerows(line.split() for line in tf)

File streams already buffer internally, so data is read and written in chunks regardless of how you iterate over the lines.
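If you do want a larger buffer than the default, open() accepts a buffer size as its third argument; a minimal sketch (the function name and the 1 MiB size are arbitrary choices for illustration):

```python
ONE_MIB = 1024 * 1024  # buffer size; the exact value is an arbitrary choice

def convert(txt_path, csv_path, bufsize=ONE_MIB):
    # Pass an explicit buffer size as open()'s third argument,
    # so reads and writes happen in ~1 MiB chunks.
    with open(txt_path, "r", bufsize) as tf, open(csv_path, "w", bufsize) as cf:
        for line in tf:
            cf.write(",".join(line.split()) + "\n")
```

Tuning the buffer size rarely matters much in practice, though; the per-line Python work usually dominates.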

Upvotes: 2
