John

Reputation: 129

Merge multiple large csv files quickly

I have multiple large csv files, each between 1 GB and 7 GB in size.

None of my files have headers. They are in this format: ID,col1,col2,col3

File 1 has this structure.

1,23,22,27
2,67,29,22
3,34,34,23

File 2 has this structure.

4,23,22,27
5,67,29,22
6,34,34,23

i.e., the ID is unique across all the files.

I would like to merge these files into a single csv file and then sort the rows by the ID column. The resulting file will be around 75 GB, starting with ID 1:

1,23,22,27
2,67,29,22
3,34,34,23
4,23,22,27
5,67,29,22
6,34,34,23

At the moment I'm doing it like this:

import pandas as pd

CHUNK_SIZE = 10000000  # number of rows per chunk
output_file = 'combined.csv'

# sorted_fnames is the list of input csv paths, in ID order
for csv_file_name in sorted_fnames:
    print(csv_file_name)
    # header=None because the files have no header row
    chunk_container = pd.read_csv(csv_file_name, header=None, chunksize=CHUNK_SIZE)
    for chunk in chunk_container:
        # header=False so a header line isn't written for every chunk
        chunk.to_csv(output_file, mode="a", index=False, header=False)

And then I'm sorting the file like this.

sort --parallel=2 -t, -k1,1 -n combined.csv > combined_sorted.csv

However, the merging process is incredibly slow. It takes more than an hour to merge the files.

Note: I have only 16 GB RAM. That's why I'm using the chunking option.

Is there any faster solution available?

Thanks

Upvotes: 0

Views: 2776

Answers (3)

Ed Morton

Reputation: 203924

If this isn't all you need:

$ cat file1 file2
1,23,22,27
2,67,29,22
3,34,34,23
4,23,22,27
5,67,29,22
6,34,34,23

then edit your question to provide more useful sample input/output that truly demonstrates your requirements and shows where the above doesn't work.

Upvotes: 1

John

Reputation: 129

Found a fast solution. The files are processed in a few minutes instead of hours.

The following assumes your csv files don't have header rows. If they do, you need to remove them first; a sketch for that case follows the code below. [Note: no fix is needed if only the first csv file has a header row.]

import subprocess

sorted_fnames = ["1.csv",
                 "2.csv",
                 "3.csv"]

# cat copies bytes straight through without parsing them,
# which is why this is so much faster than pandas
my_cmd = ['cat'] + sorted_fnames
with open('combined.csv', "w") as outfile:
    subprocess.run(my_cmd, stdout=outfile)
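
If your files do have header rows, here's a minimal sketch of one way to drop them while concatenating, assuming tail is available (the same environment assumption as cat and sort):

import subprocess

sorted_fnames = ["1.csv",
                 "2.csv",
                 "3.csv"]

with open('combined.csv', 'w') as outfile:
    for fname in sorted_fnames:
        # 'tail -n +2' prints from the second line onward,
        # i.e. it skips each file's header row
        subprocess.run(['tail', '-n', '+2', fname], stdout=outfile)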

If you want to sort, you can use the sort command.

sort --parallel=2 -t, -k1,1 -n combined.csv > combined_sorted.csv

Upvotes: 1

eshirvana

Reputation: 24603

On second thought, you can use the HDF5 format, which handles big data really well:

import pandas as pd

hdf_path = '_combined.h5'
cols = ['ID', 'col1', 'col2', 'col3']  # the files have no header row

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    for csv_file_name in sorted_fnames:
        # read in chunks so a multi-GB file doesn't exhaust RAM
        for chunk in pd.read_csv(csv_file_name, header=None, names=cols,
                                 chunksize=1000000):
            store.append('data', chunk, index=False)

You can eventually save it back to csv if you want, but working with HDF5 would be more efficient.
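
If you do want csv back, here's a minimal sketch of streaming the store out in slices to keep memory bounded (the 'data' key and file names are the assumptions from the snippet above):

import pandas as pd

with pd.HDFStore('_combined.h5', mode='r') as store:
    n_rows = store.get_storer('data').nrows
    with open('combined.csv', 'w') as out:
        # pull the table back in fixed-size slices instead of all at once
        for start in range(0, n_rows, 1000000):
            chunk = store.select('data', start=start, stop=start + 1000000)
            chunk.to_csv(out, header=False, index=False)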

Upvotes: 1
