learner

Reputation: 549

Loading huge text file in python

I need to process a large text file (4 GB), which contains data like this:

12 23 34
22 78 98
76 56 77

I need to read each line and do some work based on it. Currently I am doing this:

sample = 'filename.txt'

with open(sample) as f:
    for line in f:
        line = line.split()
        line = [int(i) for i in line]
        a = line[0]
        b = line[1]
        c = line[2]
        do_someprocess()

It is taking a huge amount of time to execute. Is there a better way to do this in Python?

Upvotes: 0

Views: 76

Answers (2)

d-coder

Reputation: 13953

split() returns a list, and then you are accessing the first, second and third elements with:

line = [int(i) for i in line]
a = line[0]
b = line[1]
c = line[2]

Instead, you can directly write a, b, c = line.split(); then a will contain line[0], b will contain line[1] and c will contain line[2] (as strings; convert with int() if you need numbers, as in the timing example below). This should save you some time.

with open(sample) as f:
    for line in f:
        a, b, c = line.split()
        do_someprocess()

An example:

with open("sample.txt","r") as f:
    for line in f:
        a,b,c = line.split()
        print a,b,c

Contents of sample.txt:

12 34 45
78 67 45

Output:

12 34 45
78 67 45

EDIT: I thought of elaborating on this. I have used the timeit module to compare the time taken by the two versions of the code. Please let me know if I'm doing something wrong here. The following is the OP's way of writing the code:

v = """ with open("sample.txt","r") as f:
    for line in f:
      line = line.split() 
      line = [int(i) for i in line]
      a = line[0]
      b = line[1]
      c = line[2]"""
import timeit
print timeit.timeit(stmt=v, number=100000)

Output:

8.94879606286   ## seconds to complete 100000 times.

The following is my way of writing the code.

s = """ with open("sample.txt","r") as f:
            for line in f:
                a,b,c = [int(s) for s in line.split()]"""

import timeit
print timeit.timeit(stmt=s, number=100000)

Output:

7.60287380216 ## seconds to complete the same number of times.

Upvotes: 0

John La Rooy

Reputation: 304147

If do_someprocess() takes a long time compared to reading the lines, and you have extra CPU cores, you could use the multiprocessing module.
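
A minimal sketch of that idea, assuming do_someprocess() is the expensive, CPU-bound part (the file name, the chunksize and the stand-in work() function are placeholders, not taken from the question):

from multiprocessing import Pool

def work(line):
    # parse and process one line in a worker process; the body of
    # do_someprocess() isn't shown, so a stand-in calculation is used
    a, b, c = (int(x) for x in line.split())
    return a + b + c

if __name__ == "__main__":
    pool = Pool()  # one worker process per CPU core by default
    with open("filename.txt") as f:
        # chunksize batches lines so the inter-process overhead stays small
        for result in pool.imap_unordered(work, f, chunksize=10000):
            pass  # consume the results here
    pool.close()
    pool.join()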

Try using PyPy if possible. For some compute-intensive tasks it is dozens of times faster than CPython.

If there are a lot of duplicate ints in the file, it can surprisingly be faster to look the strings up in a dict than to call int() on each one, as it saves the time spent creating new int objects.
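
For example, a rough sketch of that caching idea (cached_int() is a hypothetical helper, not a standard function):

# map each string token to the int object already created for it
int_cache = {}

def cached_int(token):
    if token not in int_cache:
        int_cache[token] = int(token)
    return int_cache[token]

with open("filename.txt") as f:
    for line in f:
        a, b, c = (cached_int(t) for t in line.split())
        # do_someprocess() would go here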

The first step is to profile, as @nathancahill suggests in the comments. Then focus your efforts on the parts where the biggest gains can be made.
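
A rough sketch of that step using the standard cProfile module (process_file() is just a hypothetical wrapper around the loop from the question):

import cProfile
import pstats

def process_file(path):
    with open(path) as f:
        for line in f:
            a, b, c = (int(x) for x in line.split())
            # do_someprocess() would go here

cProfile.run('process_file("filename.txt")', "profile.out")
# show the ten functions with the largest cumulative time
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)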

Upvotes: 1
