ferus89
ferus89

Reputation: 177

Key error when performing pd.merge on string columns

I'm having trouble performing a merge on two dataframes.

My data is as follows, one comprises a dataset of 140,000 occurrence records of 121 different indicator species, eg:

taxon
spp1
spp1
spp118
spp98
spp64
spp119

The other comprises a list of the 121 species with a numeric score, eg:

taxon  score
spp1   8
spp2   4
spp3   2
spp4   7
...
spp118 4
spp119 2
spp120 1
spp121 8

I wish to join the score column to the dataset using the taxon column as the join field for subsequent analysis, I envisage an output like so:

taxon  score
spp1   8
spp1   8
spp118 4
spp98  6
spp64  2
spp119 2

So far I have tried the following:

data_in1 = pd.read_csv(r'occourence_list.csv', encoding='utf-8')
df1 = pd.DataFrame(data_in1)
df1['taxon'] = df1['taxon'].astype(str)
print(df1.head())
print(df1.dtypes)

data_in2 = pd.read_csv(r'score_list.csv', encoding='utf-8')
df2 = pd.DataFrame(data_in2)
df2['taxon'] = df2['taxon'].astype(str)
print(df2.head())
print(df2.dtypes)

result = pd.merge(df1,
                  df2[['iapa_score']],
                  left_on=['taxon'],
                  right_on=['taxon'],
                  how='outer')

print(result.head())

However, this returns the following error:

Traceback (most recent call last):
  File "C:/Users/", line 19, in <module>
    how='outer')
  File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\reshape\merge.py", line 47, in merge
    validate=validate)
  File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\reshape\merge.py", line 529, in __init__
    self.join_names) = self._get_merge_keys()
  File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\reshape\merge.py", line 833, in _get_merge_keys
    right._get_label_or_level_values(rk))
  File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\generic.py", line 1706, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'taxon'

Upvotes: 0

Views: 1070

Answers (2)

Phoenixo
Phoenixo

Reputation: 2113

In my opinion the problem comes from this line df2[['iapa_score']], because the column "taxon" does not appear anymore if you restrict your dataframe to the column "iapa_score", hence the "KeyError". You should also use the parameter how='left' in the merge since you want to keep only the lines from df1.

I created 2 dataframes by myself to reproduce your situation, please have a look :

df1 = pd.DataFrame({'taxon': ["spp1", "spp1", "spp3"]})
df2 = pd.DataFrame({'taxon': ["spp1", "spp2", "spp3", "spp4"], 'score': ["8", "7", "6", "17"]})
result = pd.merge(df1,
                  df2,
                  left_on=['taxon'],
                  right_on=['taxon'],
                  how='left')
print(result)

df1 :

  taxon
0  spp1
1  spp1
2  spp3

df2 :

  taxon score
0  spp1     8
1  spp2     7
2  spp3     6
3  spp4    17

result :

  taxon score
0  spp1     8
1  spp1     8
2  spp3     6

Upvotes: 1

ferus89
ferus89

Reputation: 177

I stumbled upon a solution, I changed:

result = pd.merge(df1,
                  df2[['iapa_score']],
                  left_on=['taxon'],
                  right_on=['taxon'],
                  how='outer')

to

result = pd.merge(df1,
                  df2,
                  left_on=['taxon'],
                  right_on=['taxon'],
                  how='outer')

and it worked. I was under the impression that column names within [] after the second dataframe implied those columns you wished to join.

Upvotes: 0

Related Questions