Reputation: 3199
I have three huge files, with just 2 columns, and I need both. I want to merge them into one file which I can then write to a SQLite database.
I used Python and got the job done, but it took over 30 minutes and hung my system for 10 of those. I was wondering if there is a faster way using awk or any other Unix tool. A faster way within Python would be great too. My code is below:
'''We have tweets of three months in 3 different files.
Combine them to a single file '''
import sys, os
data1 = open(sys.argv[1], 'r')
data2 = open(sys.argv[2], 'r')
data3 = open(sys.argv[3], 'r')
data4 = open(sys.argv[4], 'w')
for line in data1:
    data4.write(line)
data1.close()
for line in data2:
    data4.write(line)
data2.close()
for line in data3:
    data4.write(line)
data3.close()
data4.close()
Upvotes: 6
Views: 5226
Reputation: 28076
The standard Unix way to merge files is cat. It may not be dramatically faster, but it will be faster.
cat file1 file2 file3 > bigfile
Rather than making a temporary file, you may be able to cat directly to sqlite:
cat file1 file2 file3 | sqlite database
In Python, you will probably get better performance if you copy the file in blocks rather than line by line. Use file.read(65536) to read 64 KB of data at a time, rather than iterating through the files with a for loop.
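For example, a minimal block-copy sketch of that idea (the argument order mirrors the original script; 65536 is just one plausible block size, and opening in binary mode is an assumption):
import sys

BLOCK_SIZE = 65536  # 64 KB per read; tune as needed

# Append each input file to the output in fixed-size blocks instead of line by line.
with open(sys.argv[4], 'wb') as out:
    for name in sys.argv[1:4]:
        with open(name, 'rb') as src:
            while True:
                block = src.read(BLOCK_SIZE)
                if not block:
                    break
                out.write(block)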
Upvotes: 13
Reputation: 2166
I'm assuming that you need to repeat this process and that speed is a critical factor.
Try opening the files in binary mode and experiment with the size of the block that you are reading. Try 4096 and 8192 bytes, as these are common underlying buffer sizes.
There is a similar question, Is it possible to speed-up python IO?, that might be of interest too.
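A rough way to compare block sizes is to time a straight binary copy with each candidate size (a sketch; the file names here are only placeholders):
import time

# Time a binary copy of one file using a given block size.
def copy_with_block_size(src_name, dst_name, block_size):
    start = time.time()
    with open(src_name, 'rb') as src, open(dst_name, 'wb') as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
    return time.time() - start

for size in (4096, 8192, 65536):
    print(size, copy_with_block_size('tweets_month1.txt', 'combined.txt', size))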
Upvotes: 1