Reputation: 1044
I have a list of CSV files. Each file has 5 columns, with ‘id’ as the only common column (primary key). The other 4 columns are different in every file.
My point of interest is the 5th (last) column, which is different for each file. I want to merge them on ‘id’.
I have tried the following code, but it concatenates row-wise, giving me many duplicate ‘id’ values as well as ‘NaN’ values:
filelist = glob.glob(path + "/*.csv")
li = []
for filename in filelist:
    df = pd.read_csv(filename, index_col=None, header=0, usecols=[0,5])
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
I want to concatenate them column-wise instead, keeping only my column of interest (the 5th column).
For example:
My list of files: ['df1.csv', 'df2.csv', 'df3.csv', 'df4.csv']
df1.csv has the following structure:
   ID  No1  AA
0   1    0   4
1   2    1   5
2   3    0   6
df2.csv has this structure:
   ID  No2  BB
0   2    0   5
1   3    1   6
2   4    0   7
The list goes on. My desired output would be:
   ID   AA   BB  CC  DD
0   1  4.0  NaN   0   1
1   2  5.0  5.0   1   0
2   3  6.0  6.0   1   0
3   4  NaN  7.0   1   1
Any suggestions would be appreciated. Thank you.
Upvotes: 0
Views: 2334
Reputation: 135
Starting from your example, the easiest approach seems to be setting 'ID' as the index and joining implicitly on it (the last column is retrieved by position, using the numerical index -1):
import pandas as pd
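filelist = [
    '/tmp/csvs/df1.csv',
    '/tmp/csvs/df2.csv',
]
result = pd.DataFrame()
for f in filelist:
    # raw string so that \s is not treated as a string escape sequence
    df = pd.read_csv(f, sep=r'\s+').set_index('ID')
    last_col = df.columns[-1]
    # the outer join keeps every ID seen in any file
    result = result.join(df[last_col], how='outer')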
result.reset_index(inplace=True)
result
Out[1]:
   ID   AA   BB
0   1  4.0  NaN
1   2  5.0  5.0
2   3  6.0  6.0
3   4  NaN  7.0
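The same index-based alignment can also be done in one call with pd.concat and axis=1, which is the column-wise counterpart of the axis=0 call in the question. The following is only a sketch: the helper name combine_last_columns and the in-memory frames are made up for illustration, and it assumes that 'ID' is unique within each file (concat cannot align duplicated index values).

```python
import pandas as pd

def combine_last_columns(frames, key="ID"):
    """Align the last column of each frame on `key` (hypothetical helper)."""
    # Index each frame by the key and keep only its last column as a Series
    columns = [df.set_index(key).iloc[:, -1] for df in frames]
    # axis=1 places the Series side by side; the default join="outer"
    # keeps every ID that appears in any frame and fills gaps with NaN
    return pd.concat(columns, axis=1).sort_index().reset_index()

# In-memory stand-ins for the frames you would get from pd.read_csv
df1 = pd.DataFrame({"ID": [1, 2, 3], "No1": [0, 1, 0], "AA": [4, 5, 6]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "No2": [0, 1, 0], "BB": [5, 6, 7]})

result = combine_last_columns([df1, df2])
print(result)
```

With real files, the list passed in would be something like [pd.read_csv(f) for f in filelist], with read options adjusted to your files.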
Upvotes: 1
Reputation: 2776
Merge on ID using only the first and last columns:
df = df1.iloc[:,[0,-1]].merge(df2.iloc[:,[0,-1]],on="ID",how="outer")
After the first merge you'll want just:
df = df.merge(df3.iloc[:,[0,-1]],on="ID",how="outer")
In use:
import pandas as pd
data1 = {"ID":[1,2,3], "No1":[0,1,0], "AA":[4,5,6]}
data2 = {"ID":[2,3,4], "No2":[0,1,0], "BB":[5,6,7]}
data3 = {"ID":[1,3,4], "No2":[0,1,0], "CC":[2,3,4]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
df = df1.iloc[:,[0,-1]].merge(df2.iloc[:,[0,-1]],on="ID",how="outer")
print(df.merge(df3.iloc[:,[0,-1]],on="ID",how="outer"))
Output:
   ID   AA   BB   CC
0   1  4.0  NaN  2.0
1   2  5.0  5.0  NaN
2   3  6.0  6.0  3.0
3   4  NaN  7.0  4.0
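For an arbitrary number of files, the repeated merge can be folded over a list with functools.reduce. This is a sketch under assumptions rather than a definitive implementation: merge_last_columns is a hypothetical helper name, 'ID' is assumed to be present in every frame, and the value column is assumed to be the last one.

```python
from functools import reduce

import pandas as pd

def merge_last_columns(frames, key="ID"):
    """Outer-merge the key and last column of every frame (hypothetical helper)."""
    # Keep only the key column and the last column of each frame
    trimmed = [df[[key, df.columns[-1]]] for df in frames]
    # Fold the list pairwise: ((f1 merge f2) merge f3) merge ...
    merged = reduce(
        lambda left, right: left.merge(right, on=key, how="outer"), trimmed
    )
    # Sort explicitly so the row order does not depend on merge internals
    return merged.sort_values(key).reset_index(drop=True)

df1 = pd.DataFrame({"ID": [1, 2, 3], "No1": [0, 1, 0], "AA": [4, 5, 6]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "No2": [0, 1, 0], "BB": [5, 6, 7]})
df3 = pd.DataFrame({"ID": [1, 3, 4], "No2": [0, 1, 0], "CC": [2, 3, 4]})

merged = merge_last_columns([df1, df2, df3])
print(merged)
```

Selecting the columns by name (df[[key, df.columns[-1]]]) rather than by position 0 means the key does not have to be the first column in every file.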
Upvotes: 1