crazyaboutliv

Reputation: 3199

Fast way of merging huge files (>= 7 GB) into one

I have three huge files, each with just two columns, and I need both. I want to merge them into one file, which I can then write to an SQLite database.

I used Python and got the job done, but it took more than 30 minutes and also hung my system for 10 of them. Is there a faster way using awk or any other Unix tool? A faster way within Python would be great too. My code is below:

'''We have tweets of three months in 3 different files.
Combine them to a single file '''
import sys, os
data1 = open(sys.argv[1], 'r')
data2 = open(sys.argv[2], 'r')
data3 = open(sys.argv[3], 'r')
data4 = open(sys.argv[4], 'w')
for line in data1:
    data4.write(line)
data1.close()
for line in data2:
    data4.write(line)
data2.close()
for line in data3:
    data4.write(line)
data3.close()
data4.close()

Upvotes: 6

Views: 5226

Answers (3)

rjmunro

Reputation: 28076

The standard Unix way to merge files is cat. It may not be much faster, but it will be faster.

cat file1 file2 file3 > bigfile

Rather than making a temporary file, you may be able to pipe cat directly into sqlite:

cat file1 file2 file3 | sqlite database

In Python, you will probably get better performance if you copy the files in blocks rather than line by line. Use file.read(65536) to read 64 KB of data at a time, rather than iterating through the files with for.
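
A minimal sketch of that block-copy idea, assuming the same command-line arguments as in the question (three input files followed by the output file):

import sys

BLOCK_SIZE = 65536  # read 64 KB at a time instead of one line at a time

with open(sys.argv[4], 'wb') as out:
    for name in sys.argv[1:4]:
        with open(name, 'rb') as src:
            while True:
                block = src.read(BLOCK_SIZE)
                if not block:  # end of this input file
                    break
                out.write(block)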

Upvotes: 13

Stuart Woodward

Reputation: 2166

I'm assuming that you need to repeat this process and that speed is a critical factor.

Try opening the files as binary files and experiment with the size of the block that you are reading. Try 4096 and 8192 bytes as these are common underlying buffer sizes.
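
A sketch of that, again assuming the same command-line arguments as in the question; shutil.copyfileobj copies in chunks of the given size, so experimenting with the block size is a matter of changing one constant:

import shutil, sys

BLOCK_SIZE = 8192  # try 4096, 8192, or larger, and measure

with open(sys.argv[4], 'wb') as out:
    for name in sys.argv[1:4]:
        with open(name, 'rb') as src:
            # copies src to out in BLOCK_SIZE-byte chunks
            shutil.copyfileobj(src, out, BLOCK_SIZE)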

There is a similar question, Is it possible to speed-up python IO?, that might be of interest too.

Upvotes: 1

Sjoerd

Reputation: 75619

On UNIX-like systems:

cat file1 file2 file3 > file4

Upvotes: 2
