Reputation: 8010
I am splitting a text file using the number of lines as a variable. I wrote this function to save the split files in a temporary directory. Each file has 4 million lines except the last one.
import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)
The main problem is the speed of this function: splitting one file of 8 million lines into two files of 4 million lines takes more than 30 minutes on my Windows machine with Python 2.7.
Upvotes: 4
Views: 3622
Reputation: 36370
If you're in a Linux or Unix environment you could cheat a little and use the split
command from inside Python. It does the trick for me, and is very fast too:
import os
import subprocess

def split_file(file_path, chunk=4000):
    p = subprocess.Popen(['split', '-a', '2', '-l', str(chunk), file_path,
                          os.path.dirname(file_path) + '/'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    # Remove the original file if required
    try:
        os.remove(file_path)
    except OSError:
        pass
    return True
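For example (the path below is hypothetical), calling it with a 4-million-line chunk size writes pieces named aa, ab, ... next to the source file and then removes the original:

import glob

# Hypothetical input path; chunk matches the 4-million-line target from the question.
split_file('/tmp/big/input.txt', chunk=4000000)

# With '-a 2', split names the pieces 'aa', 'ab', ...; the original was removed above.
for part in sorted(glob.glob('/tmp/big/*')):
    print part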
Upvotes: 0
Reputation: 104102
You can use tempfile.NamedTemporaryFile directly as a context manager:
import tempfile
import time
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4*10**6):
    fns = {}
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            with tempfile.NamedTemporaryFile(delete=False,
                                             dir=temp_dir, prefix='{}_'.format(str(k))) as outfile:
                outfile.write(''.join(group))
                fns[k] = outfile.name
    return fns

def make_test(size=8*10**6+1000):
    with tempfile.NamedTemporaryFile(delete=False) as fn:
        for i in xrange(size):
            fn.write('Line {}\n'.format(i))
    return fn.name

fn = make_test()
t0 = time.time()
print tempfile_split(fn, tempfile.mkdtemp()), time.time() - t0
On my computer (OS X), the tempfile_split part runs in 3.6 seconds.
Upvotes: 1
Reputation: 792
Just did a quick test with an 8-million-line file (uptime lines) to measure the length of the file and split it in half. Basically, one pass to get the line count, then a second pass to do the split write.
On my system, the first pass took about 2-3 seconds. Completing the run and writing the split files took under 21 seconds total.
I did not implement the lambda functions in the OP's post. Code used below:
#!/usr/bin/env python

import sys
import math

infile = open("input", "r")

linecount = 0
for line in infile:
    linecount = linecount + 1

splitpoint = linecount / 2
infile.close()

infile = open("input", "r")
outfile1 = open("output1", "w")
outfile2 = open("output2", "w")

print linecount, splitpoint

linecount = 0
for line in infile:
    linecount = linecount + 1
    if linecount <= splitpoint:
        outfile1.write(line)
    else:
        outfile2.write(line)

infile.close()
outfile1.close()
outfile2.close()
No, it's not going to win any performance or code elegance tests. :) But short of something else being the performance bottleneck, the lambda functions causing the file to be cached in memory and forcing a swap issue, or the lines in the file being extremely long, I don't see why it would take 30 minutes to read and split an 8-million-line file.
EDIT:
My environment: Mac OS X, storage was a single FW800-connected hard drive. The file was created fresh to avoid filesystem caching benefits.
Upvotes: 1
Reputation: 880947
for line in group:
    with open(output_name, 'a') as outfile:
        outfile.write(line)
opens the file and writes one line, once for every line in group. This is slow.
Instead, write once per group:
with open(output_name, 'a') as outfile:
    outfile.write(''.join(group))
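Putting it together, a minimal sketch of the question's function with only this change applied (same groupby keying and a similar file-naming scheme) might look like:

import os
import tempfile
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.join(temp_dir, "tempfile_%s.tmp" % k)
            # One open and one write per 4-million-line group, not per line.
            with open(output_name, 'w') as outfile:
                outfile.write(''.join(group))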
Upvotes: 6