Reputation: 2953
I have a dataframe as follows:
Date ID
2014-12-31 1
2014-12-31 2
2014-12-31 3
2014-12-31 4
2014-12-31 5
2014-12-31 6
2014-12-31 7
2015-01-01 1
2015-01-01 2
2015-01-01 3
2015-01-01 4
2015-01-01 5
2015-01-02 1
2015-01-02 3
2015-01-02 7
2015-01-02 9
What I would like to do is determine which ID(s) on one date are exclusive to that date compared with the IDs of another date.
Example: the result df would contain the IDs present on 2014-12-31 but not on 2015-01-01, and the IDs present on 2015-01-01 but not on 2015-01-02, each labelled with the later date:
2015-01-01 6
2015-01-01 7
2015-01-02 2
2015-01-02 4
2015-01-02 5
I would also like to choose how many days back each comparison goes. For instance, with daysback=1 each day is compared to the previous day; with daysback=2 each day is compared to two days earlier; and so on.
Beyond df.groupby('Date'), I'm not sure where to go with this. Possibly some use of diff()?
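For reference, here is a sketch that rebuilds the sample frame above (assuming Date should hold plain date objects rather than strings, and should be a regular column, not the index):

import pandas as pd

# Rebuild the sample frame shown above
df = pd.DataFrame({
    'Date': ['2014-12-31'] * 7 + ['2015-01-01'] * 5 + ['2015-01-02'] * 4,
    'ID': [1, 2, 3, 4, 5, 6, 7] + [1, 2, 3, 4, 5] + [1, 3, 7, 9],
})
# Parse the strings into date objects
df['Date'] = pd.to_datetime(df['Date']).dt.date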
Upvotes: 1
Views: 91
Reputation: 585
I'm assuming that the "Date" in your DataFrame is: 1) a date object and 2) not the index.
If those assumptions are wrong, then that changes things a bit.
import datetime
from datetime import timedelta

def find_unique_ids(df, date, daysback=1):
    date_new = date
    date_old = date - timedelta(days=daysback)
    # IDs present on each of the two dates
    ids_new = df[df['Date'] == date_new]['ID']
    ids_old = df[df['Date'] == date_old]['ID']
    # Keep the rows of the newer date whose ID does not appear on the older date
    return df.loc[ids_new[~ids_new.isin(ids_old)].index]

date = datetime.date(2015, 1, 2)
daysback = 1
print(find_unique_ids(df, date, daysback))
Running that produces the following output:
Date ID
14 2015-01-02 7
15 2015-01-02 9
If Date is your index field instead, the two lookups and the return change to select by the index (df.ix is deprecated, so df.loc is used here):
    ids_new = df.loc[date_new]['ID']
    ids_old = df.loc[date_old]['ID']
    return ids_new[~ids_new.isin(ids_old)]
Output:
Date
2015-01-02    7
2015-01-02    9
Name: ID, dtype: int64
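(Side note, not part of the original answer: if Date starts out as a regular column, the index layout assumed here can be produced with set_index.)

# Move Date from a column into the index so the .loc lookups above apply
df_indexed = df.set_index('Date')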
EDIT:
This is kind of dirty, but it should accomplish what you want to do. I added comments inline that explain what is going on. There are probably cleaner and more efficient ways to go about this if this is something that you're going to be running regularly or across massive amounts of data.
import pandas as pd

def find_unique_ids(df, daysback):
    # We need Date and ID to both be regular columns -- no mix of index and columns.
    df = df.reset_index()
    # Calculate DateComp by adding our daysback value as a timedelta
    df['DateComp'] = df['Date'].apply(lambda dc: dc + timedelta(days=daysback))
    # Join df back onto itself, SQL-style LEFT OUTER
    df2 = pd.merge(df, df, left_on=['DateComp', 'ID'], right_on=['Date', 'ID'], how='left')
    # Boolean series marking IDs with no match on the later date
    missing_ids = df2['Date_y'].isnull()
    # Boolean series of valid DateComp values.
    # DateComp is the "future" date that we're comparing against. Without this
    # step, all records on the last Date value would be flagged as unique IDs.
    valid_dates = df2['DateComp_x'].isin(df['Date'].unique())
    # Use those to find missing IDs on valid dates and build the output DataFrame
    output = df2[valid_dates & missing_ids][['DateComp_x', 'ID']]
    # Rename the output columns and return
    output.columns = ['Date', 'ID']
    return output
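For instance, called on the sample frame from the question (the daysback=1 call below is my assumption of how the test output was produced):

# Compare each date's IDs with those from one day earlier
print(find_unique_ids(df, daysback=1))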
Test output:
Date ID
5 2015-01-01 6
6 2015-01-01 7
8 2015-01-02 2
10 2015-01-02 4
11 2015-01-02 5
EDIT:
missing_ids = df2[df2['Date_y'].isnull()]  # use this instead if you want the full rows of the merged frame rather than a boolean mask
Upvotes: 1
Reputation: 4375
Another way, by aggregating the IDs of each date into a list:
df
Out[146]:
Date Unnamed: 2
0 2014-12-31 1
1 2014-12-31 2
2 2014-12-31 3
3 2014-12-31 4
4 2014-12-31 5
5 2014-12-31 6
6 2014-12-31 7
7 2015-01-01 1
8 2015-01-01 2
9 2015-01-01 3
10 2015-01-01 4
11 2015-01-01 5
12 2015-01-02 1
13 2015-01-02 3
14 2015-01-02 7
15 2015-01-02 9
abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
abbs
Out[142]:
Date
2014-12-31 [1, 2, 3, 4, 5, 6, 7]
2015-01-01 [1, 2, 3, 4, 5]
2015-01-02 [1, 3, 7, 9]
Name: Unnamed: 2, dtype: object
abbs.loc['2015-01-01']
Out[143]: [1, 2, 3, 4, 5]
list(set(abbs.loc['2014-12-31']) - set(abbs.loc['2015-01-01']))
Out[145]: [6, 7]
As a function:
def uid(df, date1, date2):
    abbs = df.groupby(['Date'])['Unnamed: 2'].apply(list)
    return list(set(abbs.loc[date1]) - set(abbs.loc[date2]))
uid(df,'2015-01-01','2015-01-02')
Out[162]: [2, 4, 5]
You could also write the function to take date objects and a daysback offset instead of two strings :)
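A rough sketch of that idea (the daysback handling, the set-based difference, and the assumption that the Date column holds real date objects rather than strings are all mine, not part of the answer above; the column name 'Unnamed: 2' follows the frame shown above):

from datetime import date, timedelta

def uid_daysback(df, day, daysback=1):
    # Collect the IDs of each date into a set for fast difference tests
    groups = df.groupby('Date')['Unnamed: 2'].apply(set)
    earlier = day - timedelta(days=daysback)
    # IDs present `daysback` days before `day` but gone on `day`
    return sorted(groups.loc[earlier] - groups.loc[day])

# e.g. uid_daysback(df, date(2015, 1, 2), daysback=1) -> [2, 4, 5]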
Upvotes: 1