James Adams
James Adams

Reputation: 8737

How to perform a CSV concatenation in Python as effectively as this awk command?

I'm trying to automate a file concatenation process in Python which works as effectively as the bash command line process I've been using. My bash CLI process uses awk to merge the files, and the Python I've tried using for this uses pandas.

For example, let's say I have a directory containing multiple CSV files named part_0.csv, part_1.csv, ..., part_n.csv. Each file contains a header in the first line of the file. The bash CLI commands I use for this:

$ cd directory_containing_csv_files
$ mv part_0.csv merged.csv
$ awk 'FNR > 1' part*.csv > merged.csv

The Python/pandas code which does the same, but croaks when the total size gets big:

# read all the CSVs into a single file using concatenation (assumes same schema for all files)
combined_df = pd.concat([pd.read_csv(pth, header=0) for pth in csv_paths])

# write as single CSV file
combined_df.to_csv(dest_path, index=False, header=True)

The Python code seems to work well until it hits a size limit (the machine in question has 16GB RAM). The bash command line processing hasn't failed yet no matter what size the final file.

Maybe there's another Python approach that doesn't use pandas which is more memory efficient?

Upvotes: 0

Views: 107

Answers (2)

Daweo
Daweo

Reputation: 36500

Maybe there's another Python approach that doesn't use pandas which is more memory efficient?

Using pandas means you are parsing your csvs which is not neccessary if you want just to concat all but first of lines. Your task seems fit for fileinput built-in module. Create file merger.py as follows:

import fileinput
for line in fileinput.input():
    if not fileinput.isfirstline():
        print(line, end='')

then replace

awk 'FNR > 1' part*.csv > merged.csv

using

python3 merger.py part*.csv > merged.csv

Note that as fileinput.input does left newlines intact and print function by default adds newline at end so I informed print not to do so (end=''), as otherwise you will get superfluous newlines (blank lines interleaved between your data).

(tested in python 3.7.3)

Upvotes: 1

Quang Hoang
Quang Hoang

Reputation: 150745

You can try to_csv with a file stream:

first = True

# open a file
with open('a.csv', 'w') as f:

    # loop through the csv's
    for pth in csv_paths:
        df = pd.read_csv(pth)
        
        # write to the stream
        # only write the first header
        df.to_csv(f, header=first, index=False)

        # toggle the first header
        if first: first = False

Upvotes: 2

Related Questions