Reputation: 177
I'm having trouble performing a merge on two dataframes.
My data is as follows, one comprises a dataset of 140,000 occurrence records of 121 different indicator species, eg:
taxon
spp1
spp1
spp118
spp98
spp64
spp119
The other comprises a list of the 121 species with a numeric score, eg:
taxon score
spp1 8
spp2 4
spp3 2
spp4 7
...
spp118 4
spp119 2
spp120 1
spp121 8
I wish to join the score column to the dataset using the taxon column as the join field for subsequent analysis, I envisage an output like so:
taxon score
spp1 8
spp1 8
spp118 4
spp98 6
spp64 2
spp119 2
So far I have tried the following:
data_in1 = pd.read_csv(r'occourence_list.csv', encoding='utf-8')
df1 = pd.DataFrame(data_in1)
df1['taxon'] = df1['taxon'].astype(str)
print(df1.head())
print(df1.dtypes)
data_in2 = pd.read_csv(r'score_list.csv', encoding='utf-8')
df2 = pd.DataFrame(data_in2)
df2['taxon'] = df2['taxon'].astype(str)
print(df2.head())
print(df2.dtypes)
result = pd.merge(df1,
df2[['iapa_score']],
left_on=['taxon'],
right_on=['taxon'],
how='outer')
print(result.head())
However, this returns the following error:
Traceback (most recent call last):
File "C:/Users/", line 19, in <module>
how='outer')
File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\reshape\merge.py", line 47, in merge
validate=validate)
File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\reshape\merge.py", line 529, in __init__
self.join_names) = self._get_merge_keys()
File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\reshape\merge.py", line 833, in _get_merge_keys
right._get_label_or_level_values(rk))
File "C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\generic.py", line 1706, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'taxon'
Upvotes: 0
Views: 1070
Reputation: 2113
In my opinion the problem comes from this line df2[['iapa_score']]
, because the column "taxon" does not appear anymore if you restrict your dataframe to the column "iapa_score", hence the "KeyError". You should also use the parameter how='left'
in the merge since you want to keep only the lines from df1.
I created 2 dataframes by myself to reproduce your situation, please have a look :
df1 = pd.DataFrame({'taxon': ["spp1", "spp1", "spp3"]})
df2 = pd.DataFrame({'taxon': ["spp1", "spp2", "spp3", "spp4"], 'score': ["8", "7", "6", "17"]})
result = pd.merge(df1,
df2,
left_on=['taxon'],
right_on=['taxon'],
how='left')
print(result)
df1 :
taxon
0 spp1
1 spp1
2 spp3
df2 :
taxon score
0 spp1 8
1 spp2 7
2 spp3 6
3 spp4 17
result :
taxon score
0 spp1 8
1 spp1 8
2 spp3 6
Upvotes: 1
Reputation: 177
I stumbled upon a solution, I changed:
result = pd.merge(df1,
df2[['iapa_score']],
left_on=['taxon'],
right_on=['taxon'],
how='outer')
to
result = pd.merge(df1,
df2,
left_on=['taxon'],
right_on=['taxon'],
how='outer')
and it worked. I was under the impression that column names within [] after the second dataframe implied those columns you wished to join.
Upvotes: 0