Reputation: 29
I am trying to merge (combine) using an outer join such that the result contains lines with the id in column 0 and all columns from both files. My files contain headings on the first line.
I have tried lots of variations but I continue to get errors complaining about the key. While there are many examples in stackoverflow, none give answers about the underlying methodology to be used.
The files have headings with the first column header = 'Code' and the key field is in fact 5 digits. I am not sure if that is causing me problems.
df1 = pd.read_csv('file1.csv', header=[0], index_col=['Code'])
df2 = pd.read_csv('file2.csv', header=[0], index_col=['Code'])
and I have tried
df1 = pd.read_csv('file1.csv', header=[0])
df2 = pd.read_csv('file2.csv', header=[0])
I have tried variations of...
dfx = pd.merge(df1, df2, left_on=['Code'], right_on=['Code'], how='outer')
dfx = df1[['Code','A-Score']].merge(df2[['Code','B-Score']], how='outer')
df1.merge(df2, on=['Code'], how='outer')
df = pd.merge(df1[['Code', 'Field1', 'Field2']], df2['Code', 'Field3', 'Field4'], on='Code', how='outer', suffixes=('-A','-B'))
dfx = pd.concat([df1,df2], axis=1, join='outer')
I want all rows from both files to be combined into one file. There are no duplicate keys in either file.
So I just want to perform a quite simple merge of the two files, and understand what parameters are required and where/why.
Postedit: My problem is that the key is interpreted as numeric as I can merge on a string key. So,
How do I 1. Override the key as being string and not numeric? 2. How do I specify the key as int64?
Upvotes: 0
Views: 64
Reputation: 6128
The following works for me.
$ cat a.csv
Code,Field1,Field2
1,10,100
2,20,200
3,30,300
5,50,500
$ cat b.csv
Code,Field3
1,11
2,21
4,41
>>> df1 = pd.read_csv('a.csv', header=0, index_col='Code')
>>> df1
Field1 Field2
Code
1 10 100
2 20 200
3 30 300
5 50 500
>>> df2 = pd.read_csv('b.csv', header=0, index_col='Code')
Field3
Code
1 11
2 21
4 41
>>> df1.merge(df2, how='outer', on='Code')
Field1 Field2 Field3
Code
1 10.0 100.0 11.0
2 20.0 200.0 21.0
3 30.0 300.0 NaN
5 50.0 500.0 NaN
4 NaN NaN 41.0
Update:
Per @OP's comment, without an index_col
and Code
as a numeric column:
>>> df1 = pd.read_csv('a.csv', header=[0])
>>> df1
Code Field1 Field2
0 1 10 100
1 2 20 200
2 3 30 300
3 5 50 500
>>> df1.dtypes
Code int64
Field1 int64
Field2 int64
dtype: object
>>> df2 = pd.read_csv('b.csv', header=[0])
>>> df2
Code Field3
0 1 11
1 2 21
2 4 41
>>> df2.dtypes
Code int64
Field3 int64
dtype: object
pd.merge(df1, df2, on='Code', how='outer')
Out[266]:
Code Field1 Field2 Field3
0 1 10.0 100.0 11.0
1 2 20.0 200.0 21.0
2 3 30.0 300.0 NaN
3 5 50.0 500.0 NaN
4 4 NaN NaN 41.0
Upvotes: 1