Reputation: 79
I would like to join (or merge?) two dataframes. They look like the following:
Table 1 ( = df)
index | year | country
----------------------------
0 | 1970 | NL
1 | 1970 | UK
2 | 1980 | US
3 | 1990 | NL
4 | 1990 | US
Table 2 (= df_gdp)
cntry | 1970 | 1980 | 1990
-----------------------------------
NL | 5 | 3 | 0
UK | 1 | 7 | 1
US | 9 | 2 | 0
The result should be Table1 with an additional column 'GDP'. The values of Table1.year and Table.country should be used to look up the value in Table2. So the result would be:
index | year | country | GDP
--------------------------------------
0 | 1970 | NL | 5
1 | 1970 | UK | 1
2 | 1980 | US | 2
3 | 1990 | NL | 0
4 | 1990 | US | 0
I already wrote the function with .iterrows()
but as expected, this does not have a good performance. Instead, I'm wondering whether the result can also be achieved by either .join()
or .merge()
. What I do not understand is how to merge/join based on the index (cntry) and a changing column (the year). The code of .iterrows()
looks like the following:
# Add GDP data
for index, row in df.iterrows():
gdp_year = str(df.iloc[index].year)
gdp_country = str(df.iloc[index].country)
try:
df.at[index, 'GDP'] = df_gdp.loc[gdp_country][gdp_year]
except:
df.at[index, 'GDP'] = 0
df
Upvotes: 2
Views: 122
Reputation: 1
This code worked for me.
import pandas as pd
valueset_data = {
'varName': ['ICLEVEL', 'ICLEVEL', 'State', 'State'],
'codevalue': [1, 2, 1, 2],
'varLabel': ['Less than 2 years (below associate)', 'Four years', 'MA', 'CT']
}
HD2019_data = {
'InstituteID': [23, 32],
'ICLEVEL': [1, 2],
'State': [2, 1]
}
valueset = pd.DataFrame(valueset_data)
HD2019 = pd.DataFrame(HD2019_data)
valueset_dict = valueset.set_index(['varName', 'codevalue']).to_dict()['varLabel']
HD2019['ICLEVEL'] = HD2019.apply(lambda x: valueset_dict[('ICLEVEL', x['ICLEVEL'])], axis=1)
HD2019['State'] = HD2019.apply(lambda x: valueset_dict[('State', x['State'])], axis=1)
print(HD2019)
Upvotes: 0
Reputation: 820
You can also use merge
as follows.
Set index to country
column and then perform stack
operation so that you can get columns in row. Then you need to reset_index()
to get dataframe structure similar to df1. After that you need to specify column names using set_axis()
so that you can map it later while performing merge operation.
And at last just perform merge operation on df1
and preprocessed df_m
dataframe(default is inner join operation).
import pandas as pd
data1 = {
'year':['1970','1970','1980','1990','1990'],
'country':['NL','UK','US','NL','US']
}
data2 = {
'country':['NL','UK','US'],
'1970':[5,1,9],
'1980':[3,7,2],
'1990':[0,1,0]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# See below code
df_m = df2.set_index('country').stack().reset_index().set_axis(['country', 'year', 'GDP'], axis=1)
pd.merge(df1, df_m, on=['year','country'])
Upvotes: 1
Reputation: 17794
You can use melt
and merge
:
df2.rename({'cntry': 'country'}, axis=1)\
.melt('country', var_name='year', value_name='GDP')\
.merge(df1, on=['country', 'year'])
Output:
country year GDP
0 NL 1970 5
1 UK 1970 1
2 US 1980 2
3 NL 1990 0
4 US 1990 0
Upvotes: 0
Reputation: 10624
You can create a function that takes a dataframe as argument, and apply it to df:
def f(x):
return df_gdp.loc[x['country'],x['year']]
df['GDP']=df.apply(f, axis=1)
Result:
year country GDP
0 1970 NL 5
1 1970 UK 1
2 1980 US 2
3 1990 NL 0
4 1990 US 0
Upvotes: 1