Aman Singh

Reputation: 1241

How do I merge n .csv files (possibly 20-30 files) with 1 big .csv file horizontally (axis=1) using pandas?

I have 20-30 csv files containing columns like 'id','col1','col2','col3' and 1 big csv file of 20GB size that I want to read in chunks and merge with these smaller csv files. The bigger csv file has the columns 'id','name','zipdeails'. Both have the id column in the same sequence; a sample looks like

 'id','name','zipdeails'
  1,Ravi,2031345
  2,Shayam,201344
  3,Priya,20134
  .........
  1000,Pravn,204324

chunk file 1 looks like

 'id','col1','col2','col3'
    1,Heat,,
    2,Goa,Next,
    3,,,Delhi

All the smaller csv files are of the same length (number of rows), except for the last file which may be shorter, and each has a header. The bigger csv file to which these are to be merged can be read with a chunksize equal to the length of these smaller files, so the last chunk looks like

'id','col1','col2','col3'
   1000,Jaipur,Week,Trip

Now the output should look like

'id','name','zipdeails','col1','col2','col3'
 1,Ravi,2031345,Heat,NAN,NAN
 2,Shayam,201344,Goa,Next,NAN
 3,Priya,20134,NAN,NAN,Delhi
  .........
 1000,Pravn,204324,Jaipur,Week,Trip

Upvotes: 2

Views: 416

Answers (1)

jezrael

Reputation: 863166

I think you need to create a list of DataFrames for all the small files, then read the big file into memory and concat them all together by the index created from the id column:

import pandas as pd
import glob

#concat 30 files
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp, index_col=['id']) for fp in files]

#read the big file into memory, indexed by id
df_big = pd.read_csv('big_df_file.csv', index_col=['id'])

#concat takes a single list, so prepend the big DataFrame to the small ones
df_fin = pd.concat([df_big] + dfs, axis=1)
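Note that pd.concat expects a single list of DataFrames, which is why the big DataFrame is prepended to the list of small ones; because everything is indexed by id, rows are aligned by id value rather than by position.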

A slightly modified solution is possible if the id values are in the same order in all DataFrames, without duplicates, like 1,2,3...N: use the parameter nrows to read only the first rows of the big DataFrame, up to the maximum length of the smaller DataFrames:

#concat 30 files
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp, index_col=['id']) for fp in files]

#the largest id in the small files is how many rows of the big file are needed
len_max = max([x.index.max() for x in dfs])

df_big = pd.read_csv('big_df_file.csv', index_col=['id'], nrows=len_max)

df_fin = pd.concat([df_big] + dfs, axis=1)
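This variant only loads as many rows of the big file as the small files actually cover, but those rows still have to fit in memory at once; for a 20GB file the chunked approach below avoids holding everything in memory.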

EDIT:

import pandas as pd
import glob

#concat 30 files
files = glob.glob('files/*.csv')
#the order of files is important for concatenating values -
#the first file holds id = (1, 100), the second (101, 200), ...
print (files)

#set to the maximum number of rows of the small files
N = 100
#loop over the big file in chunks of N rows
for i, x in enumerate(pd.read_csv('files/big.csv', chunksize=N, index_col=['id'])):
    #try/except avoids an error if there are more chunks than small files
    try:
        df = pd.read_csv(files[i], index_col=['id'])
        df1 = pd.concat([x, df], axis=1)
        print (df1)
        #in the first loop write only the header (including the id index name) to the output
        if i == 0:
            df1.head(0).to_csv('files/out.csv')
        #append the data to the output file
        df1.to_csv('files/out.csv', mode='a', header=False)

    except IndexError:
        print ('no more small files in list')
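To sanity-check the result, the output can be read back; a minimal sketch, assuming the path 'files/out.csv' from the snippet above and that id is restored as the index:

import pandas as pd

#read the merged output back, with id as the index
out = pd.read_csv('files/out.csv', index_col=['id'])

#quick checks: row count should match the big file, columns should be name, zipdeails, col1, col2, col3
print (out.shape)
print (out.head())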

Upvotes: 2
