Reputation: 8737
I'm trying to automate a file concatenation process in Python which works as effectively as the bash command line process I've been using. My bash CLI process uses awk to merge the files, and the Python I've tried using for this uses pandas.

For example, let's say I have a directory containing multiple CSV files named part_0.csv, part_1.csv, ..., part_n.csv. Each file contains a header in its first line. The bash CLI commands I use for this:
$ cd directory_containing_csv_files
$ mv part_0.csv merged.csv                # keep part_0's header (and data) as the start of merged.csv
$ awk 'FNR > 1' part*.csv >> merged.csv   # append every line except the header line of each remaining part
The Python/pandas code which does the same, but croaks when the total size gets big:
import pandas as pd

# read all the CSVs into a single DataFrame using concatenation (assumes the same schema for all files)
combined_df = pd.concat([pd.read_csv(pth, header=0) for pth in csv_paths])
# write it out as a single CSV file
combined_df.to_csv(dest_path, index=False, header=True)
The Python code seems to work well until it hits a size limit (the machine in question has 16GB RAM). The bash command line processing hasn't failed yet, no matter the size of the final file.
Maybe there's another Python approach that doesn't use pandas which is more memory efficient?
Upvotes: 0
Views: 107
Reputation: 36500
Maybe there's another Python approach that doesn't use pandas which is more memory efficient?
Using pandas means you are parsing your CSVs, which is not necessary if you just want to concatenate everything except the first line of each file. Your task is a good fit for the fileinput built-in module. Create a file merger.py as follows:
import fileinput

# iterate over every line of every file named on the command line
for line in fileinput.input():
    # skip the first (header) line of each file
    if not fileinput.isfirstline():
        # lines keep their trailing newline, so suppress print's own
        print(line, end='')
then replace
awk 'FNR > 1' part*.csv >> merged.csv
with
python3 merger.py part*.csv >> merged.csv
Note that fileinput.input leaves the trailing newlines intact and print adds a newline of its own by default, so I told print not to add one (end=''); otherwise you would get superfluous newlines (blank lines interleaved with your data).
(tested in Python 3.7.3)
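If you'd rather sidestep print's newline handling entirely, a minimal equivalent sketch writes each kept line verbatim to stdout instead:

import fileinput
import sys

# same idea as merger.py: keep every line except each file's first (header) line,
# but write the raw line directly so there is no extra newline to suppress
for line in fileinput.input():
    if not fileinput.isfirstline():
        sys.stdout.write(line)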
Upvotes: 1
Reputation: 150745
You can try to_csv with a file stream:
import pandas as pd

first = True
# open the output file once
with open('a.csv', 'w') as f:
    # loop through the csv's
    for pth in csv_paths:
        df = pd.read_csv(pth)
        # write to the stream, emitting the header only for the first file
        df.to_csv(f, header=first, index=False)
        # after the first file, stop writing the header
        if first: first = False
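If even a single part file is too large to read into memory at once, a further sketch along the same lines (assuming csv_paths and dest_path are defined as in the question; the chunksize value is just an example) streams each file through read_csv in chunks:

import pandas as pd

first = True
# stream every CSV in chunks so no whole file is ever held in memory
with open(dest_path, 'w') as f:
    for pth in csv_paths:
        # chunksize is an arbitrary example value
        for chunk in pd.read_csv(pth, chunksize=100_000):
            chunk.to_csv(f, header=first, index=False)
            first = False

Memory use then scales with the chunk size rather than with the total size of the data.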
Upvotes: 2