CoderGuru

Reputation: 63

Filter a DataFrame on matched values in a column, and get the min/max timestamps of those matches

I have a list of email addresses whose matches I want to find in an ordered dictionary that has been turned into a DataFrame.

Here is my list of email addresses:

email_list = ['[email protected]','[email protected]','[email protected]','[email protected]']

Here is my dictionary turned into a DataFrame (df2):

                   sender   type                       _time
0          [email protected]  email  2020-12-09 19:45:48.013140
1          [email protected]  email  2020-13-09 19:45:48.013140
2   [email protected]  email  2020-12-09 19:45:48.013140
3           [email protected]  email  2020-14-11 19:45:48.013140

I want to create a new DataFrame with columns for the matched sender, the number of matches (count), the first-seen date, and the last-seen date, all grouped by the matched sender. The first-seen value is the minimum timestamp in the _time column for that sender, and the last-seen value is the maximum timestamp in the _time column for that sender.

Sample output after the script is run would look like this:

                   sender  count   type                  first_seen                   last_seen
0          [email protected]      2  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140
1   [email protected]      1  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140
2           [email protected]      1  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140
3          [email protected]      0  email                          NA                          NA

Here is my Python so far:

import pandas as pd

# Collect list of email addresses I want to find in df2
email_list = ['[email protected]', '[email protected]', '[email protected]', '[email protected]']

# Turn email list into a DataFrame
df1 = pd.DataFrame(email_list, columns=['sender'])

# Collect the table that holds the dictionary of emails sent
email_result_dict = {'sender': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
                     'type': ['email', 'email', 'email', 'email'],
                     '_time': ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140', '2020-12-09 19:45:48.013140', '2020-14-09 19:45:48.013140']}

# Turn dictionary into a DataFrame
df2 = pd.DataFrame.from_dict(email_result_dict)

# Calculate stats
c = df2.loc[df2['sender'].isin(df1['sender'].values)].groupby('sender').size().reset_index()
output = df1.merge(c, on='sender', how='left').fillna(0)
output['first_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmin)]  # Get the earliest value in the '_time' column
output['last_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmax)]   # Get the latest value in the '_time' column

# Set the columns of the new DataFrame
output.columns = ['sender', 'count', 'first_seen', 'last_seen']

Any ideas or suggestions on how to get my expected output in a DataFrame? I have tried everything and keep getting stuck on getting the first_seen and last_seen values for each match where the count is greater than 0.

Upvotes: 2

Views: 159

Answers (2)

Sahil_Angra

Reputation: 161

I believe this code will do the trick.

Data Point Creation:

    data = pd.DataFrame()
    data['sender'] = ['[email protected]', '[email protected]', '[email protected]', '[email protected]']
    data['type'] = 'email'
    data['_time'] = ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
                     '2020-12-09 19:45:48.013140', '2020-14-11 19:45:48.013140']

Create a new df with the expected columns:

    new_data = pd.DataFrame(columns=['count', 'first_seen', 'last_seen', 'sender', 'type'])
    new_data['sender'] = list(set(data['sender'].values))  # senders from the input df
    new_data['type'] = 'email'  # constant

Iteration through the unique senders list:

    for j in new_data['sender']:
        temp_data = data[data['sender'] == j]  # data with only a particular sender
        new_data.loc[new_data['sender'] == j, 'count'] = len(temp_data)  # count

        if len(temp_data) > 1:  # if multiple timings for a sender
            timings = list(set(temp_data['_time']))  # get all possible timings for the sender
            new_data.loc[new_data['sender'] == j, 'first_seen'] = min(timings)
            new_data.loc[new_data['sender'] == j, 'last_seen'] = max(timings)

        elif len(temp_data) == 1:  # if a single timing for the sender
            new_data.loc[new_data['sender'] == j, 'first_seen'] = new_data.loc[new_data['sender'] == j, 'last_seen'] = temp_data.iloc[0]['_time']

You will find the required format in the new_data df.
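Note that the loop above only fills rows for senders that actually appear in data. If the zero-match addresses from the question's email_list should also show up (e.g. [email protected] with a count of 0), here is a minimal sketch, assuming email_list from the question is still in scope:

    # Append a zero-count row for every address from email_list that never appears in data.
    # Column order matches new_data: ['count', 'first_seen', 'last_seen', 'sender', 'type'].
    missing = set(email_list) - set(new_data['sender'])
    for sender in missing:
        new_data.loc[len(new_data)] = [0, None, None, sender, 'email']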

Upvotes: 1

Mayank Porwal

Reputation: 34056

Based on your input df, you can use GroupBy.agg:

In [1190]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()

In [1191]: res
Out[1191]: 
      sender   type                       _time                                  
                                            min                         max count
0  [email protected]  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140     1
1  [email protected]  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140     2
2  [email protected]  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140     1

EDIT: To drop nested columns, do:

In [1206]: res.columns = res.columns.droplevel()

In [1207]: res
Out[1207]: 
                                            min                         max  count
0  [email protected]  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140      1
1  [email protected]  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140      2
2  [email protected]  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140      1

EDIT-2: Using df1 also:

In [1246]: df = df1.merge(df, how='left')
In [1254]: df.type = df.type.fillna('email')

In [1259]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()

In [1260]: res.columns = res.columns.droplevel()

In [1261]: res
Out[1261]: 
                                            min                         max  count
0  [email protected]  email                         NaN                         NaN      0
1  [email protected]  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140      1
2  [email protected]  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140      2
3  [email protected]  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140      1
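If the column names should match the expected output (count, first_seen, last_seen) without the droplevel step, one option is pandas named aggregation; a sketch, assuming the same merged df as in EDIT-2:

    # Named aggregation gives the result columns their final names directly
    res = (df.groupby(['sender', 'type'])['_time']
             .agg(count='count', first_seen='min', last_seen='max')
             .reset_index())

Since count ignores NaN, the unmatched sender ([email protected]) still comes out with a count of 0.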

Upvotes: 1
