Reputation: 1574
So I currently have a directory, we'll call it /mydir, that contains 36 CSV files, each 2.1 GB and all with the same dimensions. I want to read them into pandas, concatenate them together side-by-side (so the number of rows stays the same), and then output the resulting dataframe as one large CSV. The code I have for this works for combining a few of them, but it hits a memory error after a certain point. I was wondering if there is a more efficient way to do this than what I have.
import os
import pandas as pd

df = pd.DataFrame()
for file in os.listdir('/mydir'):
    df = pd.concat([df, pd.read_csv(os.path.join('/mydir', file), dtype='float')], axis=1)
df.to_csv('mydir/file.csv')
It was suggested to me to break it up into smaller pieces: combine the files in groups of 6, then combine those intermediate results in turn (roughly like the sketch below), but I don't know if this is a valid solution that will avoid the memory error problem.
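For reference, I think the suggested approach would look roughly like this; the intermediate paths and file names here are just placeholders I made up:

import os
import pandas as pd

# Sketch of the suggested two-pass merge; group size of 6 as suggested,
# intermediate/output paths are placeholders.
files = sorted(f for f in os.listdir('/mydir') if f.endswith('.csv'))

# First pass: merge the 36 inputs in groups of 6 and write intermediates.
for i in range(0, len(files), 6):
    group = files[i:i + 6]
    merged = pd.concat(
        [pd.read_csv(os.path.join('/mydir', f), dtype='float') for f in group],
        axis=1)
    merged.to_csv('/tmp/intermediate_{}.csv'.format(i // 6), index=False)
    del merged  # release this group before starting the next one

# Second pass: merge the six intermediate files into the final output.
final = pd.concat(
    [pd.read_csv('/tmp/intermediate_{}.csv'.format(j), dtype='float')
     for j in range(6)],
    axis=1)
final.to_csv('/tmp/combined.csv', index=False)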
EDIT: view of the directory:
-rw-rw---- 1 m2762 2.1G Jul 11 10:35 2010.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:32 2001.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:28 1983.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 2009.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 1991.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:07 2000.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:06 1982.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 1990.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 2008.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:55 1999.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:54 1981.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 2007.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1998.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1989.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1980.csv
Upvotes: 5
Views: 2880
Reputation: 294278
from glob import glob
import os
import pandas as pd
# grab files
files = glob('./[0-9][0-9][0-9][0-9].csv')
# simplify the file reading
# notice this will create a generator
# that goes through chunks of the file
# at a time
def read_csv(f, n=100):
    return pd.read_csv(f, index_col=0, chunksize=n)
# simplify the concatenation
def concat(lot):
    return pd.concat(lot, axis=1)
# simplify the writing
# make sure mode is append and header is off
# if file already exists
def to_csv(f, df):
    if os.path.exists(f):
        mode = 'a'
        header = False
    else:
        mode = 'w'
        header = True
    df.to_csv(f, mode=mode, header=header)
# Fun stuff! zip will take the next element of the generator
# for each generator created for each file
# concat one chunk at a time and write
for lot in zip(*[read_csv(f, n=10) for f in files]):
    to_csv('out.csv', concat(lot))
Upvotes: 6
Reputation: 475
Assuming the answer to MaxU's question is that all the files have the same number of rows, and assuming further that minor CSV differences like quoting are handled the same way in all the files, you don't need Pandas for this. Plain file readline calls will give you the strings that you can concatenate and write out. Assuming further that you can supply the number of rows, something like this code:
numrows = 999  # whatever. Probably pass as argument to function or on cmdline
outfile = open('myout.csv', 'w')
infile_names = ['file01.csv',
                'file02.csv',
                # ...
                'file36.csv']
# open all the input files
infiles = []
for fname in infile_names:
    infiles.append(open(fname))
for i in range(numrows):
    # read a line from each input file and add it to the output string
    out_csv = ''
    for infile2read in infiles:
        out_csv += infile2read.readline().strip() + ','
    out_csv = out_csv[:-1] + '\n'  # replace final comma with newline
    # write this row's data out to the output file
    outfile.write(out_csv)
# close the files
for f in infiles:
    f.close()
outfile.close()
Upvotes: 0