Reputation: 1241
I have 20-30 CSV files containing an 'id' column plus three columns 'col1','col2','col3', and one big CSV file of about 20 GB that I want to read in chunks and merge with these smaller CSV files. The bigger CSV file has the columns 'id','name','zipdeails'. Both have the 'id' column in the same sequence. A sample looks like:
'id','name','zipdeails'
1,Ravi,2031345
2,Shayam,201344
3,Priya,20134
.........
1000,Pravn,204324
Chunk file 1 looks like:
'id','col1','col2','col3'
1,Heat,,
2,Goa,Next,
3,,,Delhi
All the smaller CSV files have the same length (number of rows) except for the last one, which may be shorter; each has its own header. The bigger CSV file to which these are to be merged can be read in chunks whose size equals the length of the smaller files, so the last chunk looks like:
'id','col1','col2','col3'
1000,Jaipur,Week,Trip
Now the output should look like:
'id','name','zipdeails','col1','col2','col3'
1,Ravi,2031345,Heat,NAN,NAN
2,Shayam,201344,Goa,Next,NAN
3,Priya,20134,NAN,NAN,Delhi
.........
1000,Pravn,204324,Jaipur,Week,Trip
Upvotes: 2
Views: 416
Reputation: 863166
I think you need to create a list of DataFrames for all the small files, then read the big file into memory and concat them all together by the index created from the id column:
import glob
import pandas as pd

#concat the 20-30 small files
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp, index_col=['id']) for fp in files]

#if necessary
#df_big = df_big.set_index('id')

#stack the small frames vertically, then align them with the big frame on the id index
df_fin = pd.concat([df_big, pd.concat(dfs)], axis=1)
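Note that df_big is assumed to already be loaded in memory here. A minimal sketch of reading it (using the same big_df_file.csv name as the next snippet, and only workable if the 20 GB file actually fits in RAM) could be:
import pandas as pd
#read the whole big file and index it by id so the concat above can align on it
#(for a 20 GB file this needs a lot of memory - see the chunked version in the EDIT below)
df_big = pd.read_csv('big_df_file.csv', index_col=['id'])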
A slightly modified solution is possible if the id values are in the same order in all DataFrames, without duplicates, like 1,2,3...N: use the parameter nrows to read only the first rows of the big DataFrame, up to the maximum id found in the smaller DataFrames:
#concat the 20-30 small files
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp, index_col=['id']) for fp in files]
#highest id across the small frames = number of rows needed from the big file
len_max = max([x.index.max() for x in dfs])
#read only that many rows of the big file
df_big = pd.read_csv('big_df_file.csv', index_col=['id'], nrows=len_max)
#stack the small frames, then align them with the big frame on the id index
df_fin = pd.concat([df_big, pd.concat(dfs)], axis=1)
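If the id values should ever not start at 1, a small variation (my own, not part of the original snippet) is to count the total rows of the small frames and pass that as nrows:
#total number of rows across the small frames, used as nrows for the big file
n_rows = sum(len(x) for x in dfs)
df_big = pd.read_csv('big_df_file.csv', index_col=['id'], nrows=n_rows)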
EDIT:
#collect the 20-30 small files
#(make sure the pattern does not also match big.csv or out.csv)
files = glob.glob('files/*.csv')
#order of files is important for concatenating values -
#in the first file id = (1,100), in the second (101,200)...
print (files)
#set N to the number of rows per small file
N = 100
#loop over the big file in chunks of N rows
for i, x in enumerate(pd.read_csv('files/big.csv', chunksize=N, index_col=['id'])):
    #try/except avoids an error if a non-existent file is selected from the list
    try:
        df = pd.read_csv(files[i], index_col=['id'])
        df1 = pd.concat([x, df], axis=1)
        print (df1)
        #in the first iteration create the header of the output file
        #(index_label makes sure the id column is labelled in the header)
        if i == 0:
            pd.DataFrame(columns=df1.columns).to_csv('files/out.csv', index_label='id')
        #append the chunk's data to the output file
        df1.to_csv('files/out.csv', mode='a', header=False)
    except IndexError:
        print ('no files in list')
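As a quick check of my own after the loop (not part of the original answer, reusing the files/out.csv path and the chunk size of 100), the merged output can be read back lazily to confirm the combined columns without loading the whole file:
import pandas as pd
#peek at the first merged chunk only
reader = pd.read_csv('files/out.csv', chunksize=100)
print (next(reader).head())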
Upvotes: 2