Reputation: 1574
So I currently have a directory, we'll call it /mydir, that contains 36 CSV files, each 2.1 GB and all with the same dimensions. I want to read them into pandas, concatenate them together side-by-side (so the number of rows stays the same), and then output the resulting dataframe as one large CSV. The code I have for this works for combining a few of them, but it hits a memory error after a certain point. I was wondering if there is a more efficient way to do this than what I have.
import os
import pandas as pd

df = pd.DataFrame()
for file in os.listdir('/mydir'):
    df = pd.concat([df, pd.read_csv(os.path.join('/mydir', file), dtype='float')], axis=1)
df.to_csv('mydir/file.csv')
It was suggested to me to break it up into smaller pieces: combine the files in groups of 6, then combine those intermediate results in turn (roughly like the sketch below), but I don't know if this is a valid solution that will avoid the memory error problem.
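For reference, I think the suggested approach would look roughly like this; the intermediate paths and file names here are just placeholders I made up:

import os
import pandas as pd

# Sketch of the suggested two-pass merge; group size of 6 as suggested,
# intermediate/output paths are placeholders.
files = sorted(f for f in os.listdir('/mydir') if f.endswith('.csv'))

# First pass: merge the 36 inputs in groups of 6 and write intermediates.
for i in range(0, len(files), 6):
    group = files[i:i + 6]
    merged = pd.concat(
        [pd.read_csv(os.path.join('/mydir', f), dtype='float') for f in group],
        axis=1)
    merged.to_csv('/tmp/intermediate_{}.csv'.format(i // 6), index=False)
    del merged  # release this group before starting the next one

# Second pass: merge the six intermediate files into the final output.
final = pd.concat(
    [pd.read_csv('/tmp/intermediate_{}.csv'.format(j), dtype='float')
     for j in range(6)],
    axis=1)
final.to_csv('/tmp/combined.csv', index=False)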
EDIT: view of the directory:
-rw-rw---- 1 m2762 2.1G Jul 11 10:35 2010.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:32 2001.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:28 1983.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 2009.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 1991.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:07 2000.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:06 1982.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 1990.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 2008.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:55 1999.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:54 1981.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 2007.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1998.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1989.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1980.csv
Upvotes: 5
Views: 2880
Reputation: 294278
from glob import glob
import os
import pandas as pd
# grab files
files = glob('./[0-9][0-9][0-9][0-9].csv')
# simplify the file reading
# notice this will create a generator
# that goes through chunks of the file
# at a time
def read_csv(f, n=100):
    return pd.read_csv(f, index_col=0, chunksize=n)
# simplify the concatenation
def concat(lot):
    return pd.concat(lot, axis=1)
# simplify the writing
# make sure mode is append and header is off
# if file already exists
def to_csv(f, df):
    if os.path.exists(f):
        mode = 'a'
        header = False
    else:
        mode = 'w'
        header = True
    df.to_csv(f, mode=mode, header=header)
# Fun stuff! zip will take the next element of the generator
# for each generator created for each file
# concat one chunk at a time and write
for lot in zip(*[read_csv(f, n=10) for f in files]):
    to_csv('out.csv', concat(lot))
Upvotes: 6
Reputation: 475
Assuming the answer to MaxU's question is that all the files have the same number of rows, and assuming further that minor CSV differences like quoting are handled the same way in all the files, you don't need Pandas for this. Plain file readline calls will give you the strings that you can concatenate and write out. Assuming further that you can supply the number of rows, something like this code:
numrows = 999  # whatever. Probably pass as argument to function or on cmdline
outfile = open('myout.csv', 'w')
infile_names = ['file01.csv',
                'file02.csv',
                # ...
                'file36.csv']
# open all the input files
infiles = []
for fname in infile_names:
    infiles.append(open(fname))
for i in range(numrows):
    # read a line from each input file and add it to the output string
    out_csv = ''
    for infile2read in infiles:
        out_csv += infile2read.readline().strip() + ','
    out_csv = out_csv[:-1] + '\n'  # replace final comma with newline
    # write this row's data out to the output file
    outfile.write(out_csv)
# close the files
for f in infiles:
    f.close()
outfile.close()
Upvotes: 0