Cluso99
Cluso99

Reputation: 29

Merging two csv files on the key in column 1

I am trying to merge (combine) using an outer join such that the result contains lines with the id in column 0 and all columns from both files. My files contain headings on the first line.

I have tried lots of variations but I continue to get errors complaining about the key. While there are many examples in stackoverflow, none give answers about the underlying methodology to be used.

The files have headings with the first column header = 'Code' and the key field is in fact 5 digits. I am not sure if that is causing me problems.

df1 = pd.read_csv('file1.csv', header=[0], index_col=['Code'])
df2 = pd.read_csv('file2.csv', header=[0], index_col=['Code'])

and I have tried

df1 = pd.read_csv('file1.csv', header=[0])
df2 = pd.read_csv('file2.csv', header=[0])

I have tried variations of...

dfx = pd.merge(df1, df2, left_on=['Code'], right_on=['Code'], how='outer')
dfx = df1[['Code','A-Score']].merge(df2[['Code','B-Score']], how='outer')
df1.merge(df2, on=['Code'], how='outer')
df  = pd.merge(df1[['Code', 'Field1', 'Field2']], df2['Code', 'Field3', 'Field4'], on='Code', how='outer', suffixes=('-A','-B'))
dfx = pd.concat([df1,df2], axis=1, join='outer')

I want all rows from both files to be combined into one file. There are no duplicate keys in either file.

So I just want to perform a quite simple merge of the two files, and understand what parameters are required and where/why.

Postedit: My problem is that the key is interpreted as numeric as I can merge on a string key. So,

How do I 1. Override the key as being string and not numeric? 2. How do I specify the key as int64?

Upvotes: 0

Views: 64

Answers (1)

user650654
user650654

Reputation: 6128

The following works for me.

$ cat a.csv
Code,Field1,Field2
1,10,100
2,20,200
3,30,300
5,50,500

$ cat b.csv
Code,Field3
1,11
2,21
4,41
>>> df1 = pd.read_csv('a.csv', header=0, index_col='Code')
>>> df1
      Field1  Field2
Code
1         10     100
2         20     200
3         30     300
5         50     500

>>> df2 = pd.read_csv('b.csv', header=0, index_col='Code')
      Field3
Code
1         11
2         21
4         41
>>> df1.merge(df2, how='outer', on='Code')
      Field1  Field2  Field3
Code
1       10.0   100.0    11.0
2       20.0   200.0    21.0
3       30.0   300.0     NaN
5       50.0   500.0     NaN
4        NaN     NaN    41.0

Update: Per @OP's comment, without an index_col and Code as a numeric column:

>>> df1 = pd.read_csv('a.csv', header=[0])
>>> df1
   Code  Field1  Field2
0     1      10     100
1     2      20     200
2     3      30     300
3     5      50     500
>>> df1.dtypes
Code      int64
Field1    int64
Field2    int64
dtype: object

>>> df2 = pd.read_csv('b.csv', header=[0])
>>> df2
   Code  Field3
0     1      11
1     2      21
2     4      41
>>> df2.dtypes
Code      int64
Field3    int64
dtype: object

pd.merge(df1, df2, on='Code', how='outer')
Out[266]:
   Code  Field1  Field2  Field3
0     1    10.0   100.0    11.0
1     2    20.0   200.0    21.0
2     3    30.0   300.0     NaN
3     5    50.0   500.0     NaN
4     4     NaN     NaN    41.0

Upvotes: 1

Related Questions