Loop through Multiple CSV Files and Merge with Specific Columns [Pandas]

Question

I have a list of CSV files. Each file has 5 columns, with ‘id’ as the only common column (primary key). The other 4 columns are different in every file.

My point of interest is the 5th (last) column, which is different for each file. I want to merge them on ‘id’.

I have tried the following code, but it concatenates row-wise, giving me many duplicate ‘id’ values as well as ‘NaN’ values:

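import glob
import pandas as pd

filelist = glob.glob(path + "/*.csv")  # path: directory containing the CSV files
li = []
for filename in filelist:
    # usecols is zero-based: 0 is 'id', 4 is the 5th (last) column
    df = pd.read_csv(filename, index_col=None, header=0, usecols=[0, 4])
    li.append(df)

# axis=0 stacks the frames row-wise, which causes the duplicate ids and NaNs
frame = pd.concat(li, axis=0, ignore_index=True)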

I want to concatenate them column-wise, keeping only my point-of-interest column (the 5th column) from each file.

For example:

My list of files: ['df1.csv', 'df2.csv', 'df3.csv', 'df4.csv']

df1.csv has the following structure:

   ID  No1 AA
0   1   0   4
1   2   1   5
2   3   0   6

df2.csv has this structure:

   ID  No2 BB
0   2   0   5
1   3   1   6
2   4   0   7

The list goes on. My desired output would be:

    ID  AA  BB  CC  DD
0   1   4.0 NaN 0   1
1   2   5.0 5.0 1   0
2   3   6.0 6.0 1   0
3   4   NaN 7.0 1   1

Any suggestions would be appreciated. Thank you.

apaolillo · Accepted Answer

Starting from your example, the easiest approach seems to be to set 'ID' as the index and join on it implicitly, retrieving the last column of each file by position (index -1):

import pandas as pd

filelist = [
    '/tmp/csvs/df1.csv',
    '/tmp/csvs/df2.csv',
]

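result = pd.DataFrame()

for f in filelist:
    # The example files are whitespace-separated; drop sep for comma-separated CSVs
    df = pd.read_csv(f, sep=r'\s+').set_index('ID')
    last_col = df.columns[-1]  # column of interest, selected by position
    # An outer join on the 'ID' index keeps every ID seen in any file
    result = result.join(df[last_col], how='outer')
result.reset_index(inplace=True)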

result

Out[1]: 
   ID   AA   BB
0   1  4.0  NaN
1   2  5.0  5.0
2   3  6.0  6.0
3   4  NaN  7.0
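
If you would rather avoid joining inside the loop, the same result can probably be had by collecting the last column of each file as a Series indexed by 'ID' and concatenating once with axis=1. This is only a sketch, not the method from the answer above: the in-memory strings stand in for df1.csv and df2.csv, the helper name last_column_by_id is made up for illustration, and it assumes comma-separated files in which each 'ID' appears at most once per file (as a primary key should).

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for df1.csv and df2.csv from the question.
csv_texts = [
    "ID,No1,AA\n1,0,4\n2,1,5\n3,0,6\n",
    "ID,No2,BB\n2,0,5\n3,1,6\n4,0,7\n",
]

def last_column_by_id(source):
    """Read one CSV and return its last column as a Series indexed by 'ID'."""
    df = pd.read_csv(source, index_col="ID")
    return df.iloc[:, -1]

# axis=1 aligns the Series on their 'ID' index instead of stacking rows;
# IDs missing from a file show up as NaN in that file's column.
columns = [last_column_by_id(io.StringIO(text)) for text in csv_texts]
result = (
    pd.concat(columns, axis=1)
    .sort_index()
    .rename_axis("ID")  # make sure the index is named before resetting it
    .reset_index()
)
print(result)
```

With real files you would pass each filename from filelist to the helper instead of an io.StringIO object. Concatenating once at the end also avoids rebuilding the result frame on every iteration, which may matter when there are many files.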
