tlazas912

Reputation: 13

Difficulty combining csv files into a single file

My dataset looks at flight delays and cancellations from 2009 to 2018. Here are the important points to consider:

  1. Each year is its own csv file so '2009.csv', '2010.csv', all the way to '2018.csv'
  2. Each file is roughly 700mb
  3. I used the following to combine csv files
import pandas as pd
import numpy as np
import os, sys
import glob

os.chdir('c:\\folder')

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

combined_airline_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

combined_airline_csv.to_csv('combined_airline_csv.csv', index=False, encoding='utf-8-sig')
  4. When I run this, I receive the following message: MemoryError: Unable to allocate 43.3 MiB for an array with shape (5674621,) and data type float64.

I am presuming that my files are too large and that I will need to run this on a virtual machine (e.g., AWS).
Any thoughts?

Thank you!

Upvotes: 1

Views: 50

Answers (1)

Semmel

Reputation: 575

This is a duplicate of how to merge 200 csv files in Python.

Since you just want to combine them into one file, there is no need to load all the data into a DataFrame at the same time. Since the files all have the same structure, I would advise creating one file writer, then opening each file with a reader and writing (if we want to be fancy, let's call it streaming) the data line by line. Just be careful not to copy the header each time, since you only want it once. Pandas is simply not the best tool for this task :)

In general, this is a typical task that can also be done easily, and even faster, directly on the command line (the exact code depends on the OS).
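On a Unix-like system, for example, one possible sketch (filenames assumed to follow the '2009.csv' ... '2018.csv' pattern from the question):

```shell
# keep the header row from the first file, then append only the data rows of each file
head -n 1 2009.csv > combined_airline.csv
for f in 20*.csv; do
    tail -n +2 "$f" >> combined_airline.csv
done
```

Note that the output name deliberately does not match the `20*.csv` pattern, so the combined file never gets appended to itself.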

Upvotes: 2
