Reputation: 63
I have a large tab-delimited file containing about 1.4 million lines and 50 columns. Before I do anything with the data in the file, I want to split it into a few thousand smaller files. The first column of my file contains position information, and I want each smaller file to cover a specific interval of those positions. The start and stop of each interval are held in separate lists, start_L and stop_L. Here is the part of my code that does this operation:
import csv

for i in range(len(id)):
    out1=(("file%s.txt")%(id[i]))
    table=open('largefile.tsv',"r")
    start=int(start_L[i])
    stop=int(stop_L[i])
    table.next()
    temp_out=open(out1,"w")
    reader=csv.reader(table,delimiter="\t")
    for line in reader:
        if int(line[0]) in range(start,stop):
            for y in line:
                temp_out.write(("%s\t")%(y))
            temp_out.write("\n")
        else:
            if int(line[0]) > stop:
                break
            else:
                pass
    print "temporary file..." , id[i]
The above code achieves what I want, but it is extremely slow. It can process the first hundred or so intervals in a matter of minutes, but it gets progressively slower with each passing interval, so the whole run takes days. Is there a faster or more efficient way of doing this? I believe the problem is that it has to scan over the entire file to find the positions within the specified interval each time through the loop.
Upvotes: 1
Views: 3269
Reputation: 41
For the most part the solution offered below helped me, but since my input does not have a line number (position) column, I had to modify it with the following changes.
table=fileinput.input('largefile.csv',mode="r")
#
#
#
if fileinput.lineno() >= stop :
My file is pipe (|) delimited, with approximately 600k lines and about 120 MB in size; the entire file was split in just a few seconds.
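In case it helps, here is a rough sketch of what the full line-number-based variant might look like; the id list and the stop_L cut-offs below are made-up values, used only for illustration:

import fileinput

# Rough sketch of the line-number-based split described above.
# id and stop_L are illustrative placeholders, not real values.
id = ("a", "b")
stop_L = (300000, 600000)

i = 0
table = fileinput.input('largefile.csv', mode="r")
temp_out = open("file%s.txt" % id[i], "w")

for line in table:
    stop = int(stop_L[i])
    # Use the running line number instead of a position column
    if fileinput.lineno() >= stop:
        temp_out.close()
        i += 1
        if i < len(id):
            temp_out = open("file%s.txt" % id[i], "w")
        else:
            break
    temp_out.write(line)

temp_out.close()
fileinput.close()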
Upvotes: 1
Reputation: 106
OK, I tried to keep this in the spirit of your code. It iterates through the big file only once, and it does not bother to parse the lines via the csv module, since you were just rejoining them during the write.
id=("a","b")
start_L=(1,15)
stop_L=(16,40)
i=0
table=open('largefile.tsv',"r")
out1=(("file%s.txt")%(id[i]))
temp_out=open(out1,"w")
# start iterating through the file
for line in table:
    stop=int(stop_L[i])
    # Split the line into a position piece, and a
    # throw away variable based upon the 1st tab char
    position,the_rest= line.split("\t",1)
    # I'm ignoring start as you mentioned it was sorted in the file
    if int(position) >= stop:
        # Close the current file
        temp_out.close()
        # Increment index so the file name is pulled from id properly.
        # If the index is past the length of the id list then break,
        # otherwise open the new file for writing
        i += 1
        if (i < len(id)):
            out1=(("file%s.txt")%(id[i]))
            temp_out=open(out1,"w")
        else:
            break
    temp_out.write(line)
My test file lines looked like
1 1a b c d e
2 2a b c d e
3 3a b c d e
This could be simplified quite a bit depending on your specific data, but I hope it at least gives you a start.
Upvotes: 0
Reputation: 19601
The reason your program slows down over time is that you are re-reading the CSV file over and over again for each output file. As the range you are looking for moves further down the CSV file, you have to read more and more data (most of which you skip) for every output file, hence the steep drop in performance.
You need to re-organise your code so that you read the CSV only once, sequentially, and pick out the ranges of interest (writing each to its own file) within that single pass. This is only possible if the CSV is ordered by position (you said it is) and if your start_L/stop_L lists are ordered correspondingly.
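For illustration, a minimal single-pass sketch along those lines might look like the following; it borrows the id, start_L and stop_L names from the question and assumes both the file and the interval lists are sorted by position:

import csv

# Single-pass sketch: read the CSV once and switch output files as the
# position column crosses each interval boundary (sorted input assumed).
intervals = iter(zip(id, start_L, stop_L))
name, start, stop = intervals.next()
temp_out = open("file%s.txt" % name, "w")

table = open('largefile.tsv', "rb")
reader = csv.reader(table, delimiter="\t")
reader.next()                        # skip the header line

for line in reader:
    pos = int(line[0])
    # Advance to the next interval once we pass the current stop
    while pos >= int(stop):
        temp_out.close()
        try:
            name, start, stop = intervals.next()
        except StopIteration:
            temp_out = None
            break
        temp_out = open("file%s.txt" % name, "w")
    if temp_out is None:
        break                        # ran out of intervals
    if pos >= int(start):
        temp_out.write("\t".join(line) + "\n")

if temp_out is not None:
    temp_out.close()
table.close()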
Upvotes: 1