Reputation: 173
refering to my a few days ago asked question i now have an additional problem with my data. I have following two DataFrames:
>>> df1
A B date
0 1 1 2015-02
1 1 1 2015-03
2 2 2 2017-01
3 2 2 2017-02
>>> df2
A B C 02-2015 03-2015 01-2017 02-2017
0 1 1 2013-07-01 0.10 0.22 0.55 0.77
1 1 1 2015-01-01 0.20 0.12 0.99 0.125
2 2 2 2016-12-01 0.13 0.15 0.15 0.245
3 2 2 2016-01-01 0.33 0.1 0.888 0.64
What i want is following DataFrame:
>>> df1
A B date value
0 1 1 2015-02 0.20
1 1 1 2015-03 0.12
2 2 2 2017-01 0.15
3 2 2 2017-02 0.245
My current code looks like following:
df1['value'] = df2.set_index('A', 'B').lookup(
df1.set_index('A', 'B').index, df1['date'])
This does not work and my df1 is a NoneType because in df2 are duplicate rows with condition A and B == 1. What I want is an additional condition where it first extracts the earliest date for each unqiue A and B, which would be for A and B == 1 the date 2015-02.
From df2 it should take row number 1 because the delta in months is only 1 instead of row 0 where the delta will be 18.
Many thanks in advance!
Upvotes: 0
Views: 1324
Reputation: 3097
It can be achieved by using melt, lambda, sort_values, drop_dulicates
as below
df3 = df2.melt(id_vars = ['A', 'B', 'C'], var_name='date')
df3[['A', 'B']] = df3[['A', 'B']].astype(float)
df3['Diff'] = df3.apply(lambda row: abs(datetime.strptime(row['date'], '%Y-%m') - datetime.strptime(row['C'], '%Y-%m-%d')), axis=1)
df3.sort_values(['Diff'], ascending=[True], inplace=True)
df3.drop_duplicates(subset=['A', 'B', 'date'], keep='first', inplace=True)
df3.drop(['C', 'Diff'], 1, inplace=True)
df4 = df1.merge(df3, on=['A', 'B', 'date'], how='left')
output is
A B date value
0 1.0 1.0 2015-02 0.200
1 1.0 1.0 2015-03 0.120
2 2.0 2.0 2017-01 0.150
3 2.0 2.0 2017-02 0.245
complete example is as below.
import pandas as pd
from datetime import datetime
df1 = pd.DataFrame(columns = ['A', 'B', 'date'])
df1.loc[len(df1)] = [1, 1, '2015-02']
df1.loc[len(df1)] = [1, 1, '2015-03']
df1.loc[len(df1)] = [2, 2, '2017-01']
df1.loc[len(df1)] = [2, 2, '2017-02']
df1[['A', 'B']] = df1[['A', 'B']].astype(float)
df2 = pd.DataFrame(columns = ['A', 'B', 'C', '2015-02', '2015-03', '2017-01', '2017-02'])
df2.loc[len(df2)] = [1, 1, '2013-07-01', 0.10, 0.22, 0.55, 0.77]
df2.loc[len(df2)] = [1, 1, '2015-01-01', 0.20, 0.12, 0.99, 0.125]
df2.loc[len(df2)] = [2, 2, '2016-12-01', 0.13, 0.15, 0.15, 0.245]
df2.loc[len(df2)] = [2, 2, '2016-01-01', 0.33, 0.1, 0.888, 0.64]
df3 = df2.melt(id_vars = ['A', 'B', 'C'], var_name='date')
df3[['A', 'B']] = df3[['A', 'B']].astype(float)
df3['Diff'] = df3.apply(lambda row: abs(datetime.strptime(row['date'], '%Y-%m') - datetime.strptime(row['C'], '%Y-%m-%d')), axis=1)
df3.sort_values(['Diff'], ascending=[True], inplace=True)
df3.drop_duplicates(subset=['A', 'B', 'date'], keep='first', inplace=True)
df3.drop(['C', 'Diff'], 1, inplace=True)
df4 = df1.merge(df3, on=['A', 'B', 'date'], how='left')
print(df4)
Upvotes: 1