Reputation: 2953
I have a dataframe as follows:
Date ID
2014-12-31 1
2014-12-31 2
2014-12-31 3
2014-12-31 4
2014-12-31 5
2014-12-31 6
2014-12-31 7
2015-01-01 1
2015-01-01 2
2015-01-01 3
2015-01-01 4
2015-01-01 5
2015-01-02 1
2015-01-02 3
2015-01-02 7
2015-01-02 9
What I would like to do is determine which ID(s) on one date are exclusive to that date compared with the IDs of another date.
Example: the result df would contain the IDs present on 2014-12-31 but not on 2015-01-01, and the IDs present on 2015-01-01 but not on 2015-01-02, each labelled with the later date:
2015-01-01 6
2015-01-01 7
2015-01-02 2
2015-01-02 4
2015-01-02 5
I would also like to choose how many days back each comparison goes. For instance, with daysback=1 each day is compared to the previous day; with daysback=2 each day is compared to two days earlier; and so on.
Beyond df.groupby('Date'), I'm not sure where to go with this. Possibly some use of diff()?
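For reference, here is a sketch that rebuilds the sample frame above (assuming Date should hold plain date objects rather than strings, and should be a regular column, not the index):

import pandas as pd

# Rebuild the sample frame shown above
df = pd.DataFrame({
    'Date': ['2014-12-31'] * 7 + ['2015-01-01'] * 5 + ['2015-01-02'] * 4,
    'ID': [1, 2, 3, 4, 5, 6, 7] + [1, 2, 3, 4, 5] + [1, 3, 7, 9],
})
# Parse the strings into date objects
df['Date'] = pd.to_datetime(df['Date']).dt.date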
Upvotes: 1
Views: 91
Reputation: 585
I'm assuming that the "Date" in your DataFrame is: 1) a date object and 2) not the index.
If those assumptions are wrong, then that changes things a bit.
import datetime
from datetime import timedelta

def find_unique_ids(df, date, daysback=1):
    date_new = date
    date_old = date - timedelta(days=daysback)
    # IDs present on each of the two dates
    ids_new = df[df['Date'] == date_new]['ID']
    ids_old = df[df['Date'] == date_old]['ID']
    # Keep the rows of the newer date whose ID does not appear on the older date
    return df.loc[ids_new[~ids_new.isin(ids_old)].index]

date = datetime.date(2015, 1, 2)
daysback = 1
print(find_unique_ids(df, date, daysback))
Running that produces the following output:
Date ID
14 2015-01-02 7
15 2015-01-02 9
If Date is your index field instead, the two lookups and the return change to select by the index (df.ix is deprecated, so df.loc is used here):
    ids_new = df.loc[date_new]['ID']
    ids_old = df.loc[date_old]['ID']
    return ids_new[~ids_new.isin(ids_old)]
Output:
Date
2015-01-02    7
2015-01-02    9
Name: ID, dtype: int64
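(Side note, not part of the original answer: if Date starts out as a regular column, the index layout assumed here can be produced with set_index.)

# Move Date from a column into the index so the .loc lookups above apply
df_indexed = df.set_index('Date')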
EDIT:
This is kind of dirty, but it should accomplish what you want to do. I added comments inline that explain what is going on. There are probably cleaner and more efficient ways to go about this if this is something that you're going to be running regularly or across massive amounts of data.
import pandas as pd

def find_unique_ids(df, daysback):
    # We need Date and ID to both be regular columns -- no mix of index and columns.
    df = df.reset_index()
    # Calculate DateComp by adding our daysback value as a timedelta
    df['DateComp'] = df['Date'].apply(lambda dc: dc + timedelta(days=daysback))
    # Join df back onto itself, SQL-style LEFT OUTER
    df2 = pd.merge(df, df, left_on=['DateComp', 'ID'], right_on=['Date', 'ID'], how='left')
    # Boolean series marking IDs with no match on the later date
    missing_ids = df2['Date_y'].isnull()
    # Boolean series of valid DateComp values.
    # DateComp is the "future" date that we're comparing against. Without this
    # step, all records on the last Date value would be flagged as unique IDs.
    valid_dates = df2['DateComp_x'].isin(df['Date'].unique())
    # Use those to find missing IDs on valid dates and build the output DataFrame
    output = df2[valid_dates & missing_ids][['DateComp_x', 'ID']]
    # Rename the output columns and return
    output.columns = ['Date', 'ID']
    return output
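For instance, called on the sample frame from the question (the daysback=1 call below is my assumption of how the test output was produced):

# Compare each date's IDs with those from one day earlier
print(find_unique_ids(df, daysback=1))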
Test output:
Date ID
5 2015-01-01 6
6 2015-01-01 7
8 2015-01-02 2
10 2015-01-02 4
11 2015-01-02 5
EDIT:
missing_ids = df2[df2['Date_y'].isnull()]  # use this instead if you want the full rows of the merged frame rather than a boolean mask
Upvotes: 1
Reputation: 4375
Another way, by aggregating the IDs of each date into a list:
df
Out[146]:
Date Unnamed: 2
0 2014-12-31 1
1 2014-12-31 2
2 2014-12-31 3
3 2014-12-31 4
4 2014-12-31 5
5 2014-12-31 6
6 2014-12-31 7
7 2015-01-01 1
8 2015-01-01 2
9 2015-01-01 3
10 2015-01-01 4
11 2015-01-01 5
12 2015-01-02 1
13 2015-01-02 3
14 2015-01-02 7
15 2015-01-02 9
abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
abbs
Out[142]:
Date
2014-12-31 [1, 2, 3, 4, 5, 6, 7]
2015-01-01 [1, 2, 3, 4, 5]
2015-01-02 [1, 3, 7, 9]
Name: Unnamed: 2, dtype: object
abbs.loc['2015-01-01']
Out[143]: [1, 2, 3, 4, 5]
list(set(abbs.loc['2014-12-31']) - set(abbs.loc['2015-01-01']))
Out[145]: [6, 7]
As a function:
def uid(df, date1, date2):
    abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
    return list(set(abbs.loc[date1]) - set(abbs.loc[date2]))
uid(df,'2015-01-01','2015-01-02')
Out[162]: [2, 4, 5]
You could also write the function to take date objects and a daysback offset instead of two strings :)
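A rough sketch of that idea (the daysback handling, the set-based difference, and the assumption that the Date column holds real date objects rather than strings are all mine, not part of the answer above; the column name 'Unnamed: 2' follows the frame shown above):

from datetime import date, timedelta

def uid_daysback(df, day, daysback=1):
    # Collect the IDs of each date into a set for fast difference tests
    groups = df.groupby('Date')['Unnamed: 2'].apply(set)
    earlier = day - timedelta(days=daysback)
    # IDs present `daysback` days before `day` but gone on `day`
    return sorted(groups.loc[earlier] - groups.loc[day])

# e.g. uid_daysback(df, date(2015, 1, 2), daysback=1) -> [2, 4, 5]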
Upvotes: 1