Gianni Spear

Reputation: 8010

fast method in Python to split a large text file using number of lines as input variable

I am splitting a text file using the number of lines as a variable. I wrote this function to save the split files in a temporary directory. Each file has 4 million lines except the last file.

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        # Number each line with count() and integer-divide by chunk, so
        # groupby yields consecutive groups of `chunk` lines each.
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)

The main problem is the speed of this function. Splitting one file of 8 million lines into two files of 4 million lines takes more than 30 minutes on my Windows OS with Python 2.7.

Upvotes: 4

Views: 3622

Answers (4)

radtek

Reputation: 36370

If you're in a Linux or Unix environment you could cheat a little and use the split command from inside Python. It does the trick for me, and it's very fast too:

import os
import subprocess

def split_file(file_path, chunk=4000):
    # Shell out to the Unix `split` command: 2-character suffixes ('aa', 'ab', ...),
    # `chunk` lines per piece, written alongside the original file.
    p = subprocess.Popen(['split', '-a', '2', '-l', str(chunk), file_path,
                          os.path.dirname(file_path) + '/'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    # Remove the original file if required
    try:
        os.remove(file_path)
    except OSError:
        pass
    return True
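
For example (hypothetical path; this assumes the split binary is available on the PATH):

# Hypothetical usage: writes 4000-line pieces named 'aa', 'ab', ... into the
# directory of the source file, then removes '/data/big.txt' itself.
split_file('/data/big.txt', chunk=4000)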

Upvotes: 0

dawg

Reputation: 104102

You can use tempfile.NamedTemporaryFile directly in the context manager:

import tempfile
import time
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4*10**6):
    fns={}
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            with tempfile.NamedTemporaryFile(delete=False,
                           dir=temp_dir,prefix='{}_'.format(str(k))) as outfile:
                outfile.write(''.join(group))
                fns[k]=outfile.name   
    return fns                     

def make_test(size=8*10**6+1000):
    with tempfile.NamedTemporaryFile(delete=False) as fn:
        for i in xrange(size):
            fn.write('Line {}\n'.format(i))

    return fn.name        

fn=make_test()
t0=time.time()
print tempfile_split(fn,tempfile.mkdtemp()),time.time()-t0   

On my computer (OS X), the tempfile_split part runs in 3.6 seconds.

Upvotes: 1

Wing Tang Wong

Reputation: 792

Just did a quick test with an 8 million line file (uptime lines) to measure the length of the file and split it in half. Basically, one pass to get the line count, and a second pass to do the split write.

On my system, the first pass took about 2-3 seconds. Completing the run and writing the split files took under 21 seconds in total.

Did not implement the lambda functions in the OP's post. Code used below:

#!/usr/bin/env python

import sys
import math

infile = open("input","r")

linecount=0

for line in infile:
    linecount=linecount+1

splitpoint=linecount/2

infile.close()

infile = open("input","r")
outfile1 = open("output1","w")
outfile2 = open("output2","w")

print linecount , splitpoint

linecount=0

for line in infile:
    linecount=linecount+1
    if ( linecount <= splitpoint ):
        outfile1.write(line)
    else:
        outfile2.write(line)

infile.close()
outfile1.close()
outfile2.close()

No, it's not going to win any performance or code elegance tests. :) But short of something else being a performance bottleneck, the lambda functions causing the file to be cached in memory and forcing a swap issue, or the lines in the file being extremely long, I don't see why it would take 30 minutes to read/split the 8 million line file.

EDIT:

My environment: Mac OS X, storage was a single FW800 connected hard drive. File was created fresh to avoid filesystem caching benefits.

Upvotes: 1

unutbu

Reputation: 880947

            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)

is opening the file and writing one line for each line in the group. This is slow.

Instead, write once per group.

            with open(output_name, 'a') as outfile:
                outfile.write(''.join(group))
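
Applied to the function in the question, a minimal sketch of that change might look like this (same imports and Python 2.7 setup assumed as in the question):

import os
import tempfile
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.join(temp_dir, "tempfile_%s.tmp" % k)
            # Open once per group and write all of its lines in a single call.
            with open(output_name, 'a') as outfile:
                outfile.write(''.join(group))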

Upvotes: 6
