I have a list of email addresses, and I want to find the matching entries in an ordered dictionary that has been turned into a DataFrame.
Here is my list of email addresses:
email_list = ['[email protected]','[email protected]','[email protected]','[email protected]']
Here is my dictionary turned into a DataFrame (df2):
sender type _time
0 [email protected] email 2020-12-09 19:45:48.013140
1 [email protected] email 2020-13-09 19:45:48.013140
2 [email protected] email 2020-12-09 19:45:48.013140
3 [email protected] email 2020-14-11 19:45:48.013140
I want to create a new DataFrame with columns for the matched sender, the # of matches (count), the first seen date, and the last seen date, all grouped by the matched sender. The first seen date would be the min timestamp in the _time column for that sender, and the last seen value would be the max timestamp in the _time column for that sender.
Sample output after the script is run would look like this:
sender count type first_seen last_seen
0 [email protected] 2 email 2020-12-09 19:45:48.013140 2020-13-09 19:45:48.013140
1 [email protected] 1 email 2020-12-09 19:45:48.013140 2020-12-09 19:45:48.013140
2 [email protected] 1 email 2020-14-11 19:45:48.013140 2020-14-11 19:45:48.013140
3 [email protected] 0 email NA NA
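To make the first_seen/last_seen logic concrete for one sender, here is a minimal sketch for [email protected], which appears twice in df2 (plain string min/max stands in for timestamp comparison, since every _time value shares the same fixed format):
times = ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140']  # _time values for [email protected]
count = len(times)       # 2
first_seen = min(times)  # '2020-12-09 19:45:48.013140'
last_seen = max(times)   # '2020-13-09 19:45:48.013140'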
Here is my python so far:
import pandas as pd

# Collect list of email addresses I want to find in df2
email_list = ['[email protected]','[email protected]','[email protected]','[email protected]']
# Turn email list into a dataframe
df1 = pd.DataFrame(email_list, columns=['sender'])
# Collect the table that holds the dictionary of emails sent
email_result_dict = {'sender': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'], 'type': ['email', 'email', 'email', 'email'], '_time': ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140', '2020-12-09 19:45:48.013140', '2020-14-11 19:45:48.013140']}
# Turn dictionary into dataframe
df2 = pd.DataFrame.from_dict(email_result_dict)
# Calculate stats
c = df2.loc[df2['sender'].isin(df1['sender'].values)].groupby('sender').size().reset_index()
output = df1.merge(c, on='sender', how='left').fillna(0)
output['first_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmin)] # Get the earliest value in '_time' column
output['last_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmax)] # Get the latest value in '_time' column
# Set the columns of the new dataframe
output.columns = ['sender', 'count', 'first_seen', 'last_seen']
Any ideas or suggestions as to how to get my expected output in a DataFrame? I have tried everything and keep getting stuck on getting the first_seen and last_seen values for each match where the count is greater than 0.
Upvotes: 2
Views: 159
I believe this code will do the trick.
Data Point Creation:
data = pd.DataFrame()
data['sender'] = ['[email protected]', '[email protected]', '[email protected]', '[email protected]']
data['type'] = 'email'
data['_time'] = ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
                 '2020-12-09 19:45:48.013140', '2020-14-11 19:45:48.013140']
Create a new df with the expected columns:
new_data = pd.DataFrame(columns=['count', 'first_seen', 'last_seen', 'sender', 'type'])
new_data['sender'] = list(set(data['sender'].values))  # senders from the input df
new_data['type'] = 'email'  # constant
Iteration through the unique senders list:
for j in new_data['sender']:
    temp_data = data[data['sender'] == j]  # data with only a particular sender
    new_data.loc[new_data['sender'] == j, 'count'] = len(temp_data)  # count
    if len(temp_data) > 1:  # if multiple timings for a sender
        timings = list(set(temp_data['_time']))  # get all possible timings for sender
        new_data.loc[new_data['sender'] == j, 'first_seen'] = min(timings)
        new_data.loc[new_data['sender'] == j, 'last_seen'] = max(timings)
    elif len(temp_data) == 1:  # if a single timing per sender
        new_data.loc[new_data['sender'] == j, 'first_seen'] = new_data.loc[new_data['sender'] == j, 'last_seen'] = temp_data.iloc[0]['_time']
You will find your required format in the new_data df.
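Note that new_data only contains senders that actually appear in data, so an address like [email protected] would be missing. If you also need the unmatched addresses with a count of 0 (as in your expected output), one way, assuming the email_list from the question is available, is a left merge afterwards:
# Add addresses from email_list that never appear in data,
# so they show up with count 0 and empty first/last seen
wanted = pd.DataFrame({'sender': email_list})
new_data = wanted.merge(new_data, on='sender', how='left')
new_data['count'] = new_data['count'].fillna(0).astype(int)
new_data['type'] = new_data['type'].fillna('email')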
Upvotes: 1
Based on your input df, you can do Groupby.agg:
In [1190]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()
In [1191]: res
Out[1191]:
sender type _time
min max count
0 [email protected] email 2020-14-11 19:45:48.013140 2020-14-11 19:45:48.013140 1
1 [email protected] email 2020-12-09 19:45:48.013140 2020-13-09 19:45:48.013140 2
2 [email protected] email 2020-12-09 19:45:48.013140 2020-12-09 19:45:48.013140 1
EDIT: To drop nested columns, do:
In [1206]: res.columns = res.columns.droplevel()
In [1207]: res
Out[1207]:
min max count
0 [email protected] email 2020-14-11 19:45:48.013140 2020-14-11 19:45:48.013140 1
1 [email protected] email 2020-12-09 19:45:48.013140 2020-13-09 19:45:48.013140 2
2 [email protected] email 2020-12-09 19:45:48.013140 2020-12-09 19:45:48.013140 1
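After droplevel the first two column labels are empty strings, so if you want the question's column names instead of min/max, you can rename the flattened columns directly:
res.columns = ['sender', 'type', 'first_seen', 'last_seen', 'count']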
EDIT-2: Using df1 also:
In [1246]: df = df1.merge(df, how='left')
In [1254]: df.type = df.type.fillna('email')
In [1259]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()
In [1260]: res.columns = res.columns.droplevel()
In [1261]: res
Out[1261]:
min max count
0 [email protected] email NaN NaN 0
1 [email protected] email 2020-14-11 19:45:48.013140 2020-14-11 19:45:48.013140 1
2 [email protected] email 2020-12-09 19:45:48.013140 2020-13-09 19:45:48.013140 2
3 [email protected] email 2020-12-09 19:45:48.013140 2020-12-09 19:45:48.013140 1
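Alternatively, named aggregation (available in pandas 0.25+) produces the expected column names without the droplevel step; a sketch on the same merged df:
res = (df.groupby(['sender', 'type'])['_time']
         .agg(count='count', first_seen='min', last_seen='max')
         .reset_index())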
Upvotes: 1