Reputation: 3677
I have two dataframes, 'df_orders' and 'df_birthday':
I need to return email address from 'df_birthday' by performing match on consumer_id column in 'df_orders' to id column in 'df_birthday' ONLY if certain conditions are met.
Condition 1: consumer_id field in 'df_orders' appears ONLY once.
Condition 2: payment_complete field in 'df_orders' field equals '1.0'
Condition 3: if TIME NOW is ONLY 24 hours ahead of updated_at (datetime) field in 'df_orders'.
Condition 4: If condition 1 and 2 are true return columns 'first_name', 'last_name' and 'email_addr' from df_birthday by matching 'consumer_id' from 'df_order' to 'id' in 'df_birthday'.
To Sum up all conditions: Need to only return email_addr, first_name, and last_name from df_birthday if consumer_id field appears once, if payment_complete field equals '1.0', and updated_at field is exactly 24hours less than time now.
Here is my code(not sure how to retrieve columns 'first_name', 'last_name' and 'email_addr' from df_birthday if conditions 1-3 are true):
def first_purchase():
if df_order.groupby("consumer_id").filter(lambda x: len(x) == 1):
return "consumer_id"
elif df_order.loc[df_orders['payment_complete'] == 1.0]:
return 'payment_complete'
Should I write another function to compare the results? I am not even sure if this needs to in a function or for loop?
Here is for loop I have been tinkering with(not right):
for first_purchase in df_orders:
if df_orders.groupby("consumer_id").filter(lambda x: len(x) == 1):
elif df_orders.loc[df_orders['payment_complete'] == 1.0]:
else print 'fail'
Thank you in advance
Edit:
Sample Input:
df_birthday:
first_name last_name email_addr id
0 a A a@A 1
1 b B b@B 2
2 c B c@C 3
df_orders:
consumer_id payment_complete updated_at
0 1 1.0 2018-01-28
1 1 1.0 2018-01-28
2 2 1.0 2018-01-28
3 3 0 2018-01-28
Sample Output:
first_name last_name email_addr
0 b B b@B
Upvotes: 1
Views: 104
Reputation: 25405
You could filter a temporary copy of the orders dataframe first, and use that to filter the birthday dataframe, so only the records remain that we want to join. Then we can join the birthday dataframe back on the orders dataframe.
Working example below, hope this helps!
import pandas as pd
import numpy as np
import datetime as dt
df_birthday = pd.DataFrame([['a', 'A','a@A',1],
['b', 'B','b@B',2],
['c', 'B','c@C',3]],
columns=["first_name", "last_name",'email_addr','id'])
df_orders = pd.DataFrame([[1, 1.1],
[1, 1.0],
[2, 1.0],
[3, 0.0]],
columns=["consumer_id", "payment_complete"])
df_orders['updated_at'] = pd.to_datetime('today') + dt.timedelta(hours=1)
# Filters:
# Only occurs once
# Has payment complete == 1
# datetime difference with timestamp less than 24 hours.
df_temp = df_orders.groupby("consumer_id").filter(lambda x: len(x) == 1)
df_temp = df_temp[np.isclose(df_temp.payment_complete,1)]
df_temp = df_temp[(dt.datetime.now()- df_temp['updated_at']).astype('timedelta64[m]')<(24*60)]
# Filter the df_birthday dataframe, and join on our df_orders
df_birthday2 = df_birthday[df_birthday.id.isin(df_temp.consumer_id)]
print(df_birthday2)
# Only necessary if you want to join
df_orders = pd.merge(df_orders, df_birthday2, how='left', left_on='consumer_id', right_on='id')
df_orders = df_orders.drop('id',axis=1)
print(df_orders)
df_birthday:
first_name last_name email_addr id
0 a A a@A 1
1 b B b@B 2
2 c B c@C 3
df_orders:
consumer_id payment_complete updated_at
0 1 1 2018-01-28 01:00:00
1 1 1 2018-01-28 01:00:00
2 2 1 2018-01-28 01:00:00
3 3 0 2018-01-28 01:00:00
Resulting df_birthday2:
first_name last_name email_addr id
1 b B b@B 2
Resulting df_orders (if you run the last three lines):
consumer_id payment_complete updated_at first_name last_name email_addr
0 1 1 2018-01-28 01:00:00 NaN NaN NaN
1 1 1 2018-01-28 01:00:00 NaN NaN NaN
2 2 1 2018-01-28 01:00:00 b B b@B
3 3 0 2018-01-28 01:00:00 NaN NaN NaN
Upvotes: 1