CoderGuru

Reputation: 63

Filter a DataFrame on matched values in a column, and get the min/max timestamps of those matches

I have a list of email addresses whose matches I want to find in an ordered dictionary that has been turned into a DataFrame.

Here is my list of email addresses:

email_list = ['[email protected]','[email protected]','[email protected]','[email protected]']

Here is my dictionary turned into a DataFrame (df2):

                   sender   type                       _time
0          [email protected]  email  2020-12-09 19:45:48.013140
1          [email protected]  email  2020-13-09 19:45:48.013140
2   [email protected]  email  2020-12-09 19:45:48.013140
3           [email protected]  email  2020-14-11 19:45:48.013140

I want to create a new DataFrame with columns for the matched sender, the number of matches (count), the first-seen date, and the last-seen date, all grouped by the matched sender. The first-seen value is the minimum timestamp in the _time column for that sender, and the last-seen value is the maximum timestamp in the _time column for that sender.

Sample output after the script is run would look like this:

                   sender  count   type                  first_seen                   last_seen
0          [email protected]      2  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140
1   [email protected]      1  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140
2           [email protected]      1  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140
3          [email protected]      0  email                          NA                          NA

Here is my Python so far:

import pandas as pd

# Collect list of email addresses I want to find in df2
email_list = ['[email protected]', '[email protected]', '[email protected]', '[email protected]']

# Turn email list into a DataFrame
df1 = pd.DataFrame(email_list, columns=['sender'])

# Collect the table that holds the dictionary of emails sent
email_result_dict = {'sender': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
                     'type': ['email', 'email', 'email', 'email'],
                     '_time': ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140', '2020-12-09 19:45:48.013140', '2020-14-09 19:45:48.013140']}

# Turn dictionary into a DataFrame
df2 = pd.DataFrame.from_dict(email_result_dict)

# Calculate stats
c = df2.loc[df2['sender'].isin(df1['sender'].values)].groupby('sender').size().reset_index()
output = df1.merge(c, on='sender', how='left').fillna(0)
output['first_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmin)]  # Get the earliest value in the '_time' column
output['last_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmax)]   # Get the latest value in the '_time' column

# Set the columns of the new DataFrame
output.columns = ['sender', 'count', 'first_seen', 'last_seen']

Any ideas or suggestions on how to get my expected output in a DataFrame? I have tried everything and keep getting stuck on getting the first_seen and last_seen values for each match where the count is greater than 0.

Upvotes: 2

Views: 159

Answers (2)

Sahil_Angra

Reputation: 161

I believe this code will do the trick.

Data Point Creation:

    data = pd.DataFrame()
    data['sender'] = ['[email protected]', '[email protected]', '[email protected]', '[email protected]']
    data['type'] = 'email'
    data['_time'] = ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
                     '2020-12-09 19:45:48.013140', '2020-14-11 19:45:48.013140']

Create a new df with the expected columns:

    new_data = pd.DataFrame(columns=['count', 'first_seen', 'last_seen', 'sender', 'type'])
    new_data['sender'] = list(set(data['sender'].values))  # senders from the input df
    new_data['type'] = 'email'  # constant

Iteration through the unique senders list:

    for j in new_data['sender']:
        temp_data = data[data['sender'] == j]  # data with only a particular sender
        new_data.loc[new_data['sender'] == j, 'count'] = len(temp_data)  # count

        if len(temp_data) > 1:  # if multiple timings for a sender
            timings = list(set(temp_data['_time']))  # get all possible timings for the sender
            new_data.loc[new_data['sender'] == j, 'first_seen'] = min(timings)
            new_data.loc[new_data['sender'] == j, 'last_seen'] = max(timings)

        elif len(temp_data) == 1:  # if a single timing for the sender
            new_data.loc[new_data['sender'] == j, 'first_seen'] = new_data.loc[new_data['sender'] == j, 'last_seen'] = temp_data.iloc[0]['_time']

You will find the required format in the new_data df.
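Note that the loop above only fills rows for senders that actually appear in data. If the zero-match addresses from the question's email_list should also show up (e.g. [email protected] with a count of 0), here is a minimal sketch, assuming email_list from the question is still in scope:

    # Append a zero-count row for every address from email_list that never appears in data.
    # Column order matches new_data: ['count', 'first_seen', 'last_seen', 'sender', 'type'].
    missing = set(email_list) - set(new_data['sender'])
    for sender in missing:
        new_data.loc[len(new_data)] = [0, None, None, sender, 'email']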

Upvotes: 1

Mayank Porwal

Reputation: 34056

Based on your input df, you can use GroupBy.agg:

In [1190]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()

In [1191]: res
Out[1191]: 
      sender   type                       _time                                  
                                            min                         max count
0  [email protected]  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140     1
1  [email protected]  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140     2
2  [email protected]  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140     1

EDIT: To drop nested columns, do:

In [1206]: res.columns = res.columns.droplevel()

In [1207]: res
Out[1207]: 
                                            min                         max  count
0  [email protected]  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140      1
1  [email protected]  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140      2
2  [email protected]  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140      1

EDIT-2: Using df1 also:

In [1246]: df = df1.merge(df, how='left')
In [1254]: df.type = df.type.fillna('email')

In [1259]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()

In [1260]: res.columns = res.columns.droplevel()

In [1261]: res
Out[1261]: 
                                            min                         max  count
0  [email protected]  email                         NaN                         NaN      0
1  [email protected]  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140      1
2  [email protected]  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140      2
3  [email protected]  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140      1
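If the column names should match the expected output (count, first_seen, last_seen) without the droplevel step, one option is pandas named aggregation; a sketch, assuming the same merged df as in EDIT-2:

    # Named aggregation gives the result columns their final names directly
    res = (df.groupby(['sender', 'type'])['_time']
             .agg(count='count', first_seen='min', last_seen='max')
             .reset_index())

Since count ignores NaN, the unmatched sender ([email protected]) still comes out with a count of 0.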

Upvotes: 1
