Roel van Endhoven
Roel van Endhoven

Reputation: 180

Comparing date columns between two dataframes

The project I'm working on requires me to find out which 'project' has been updated since the last time it was processed. For this purpose I have two dataframes which both contain three columns, the last one of which is a date signifying the last time a project is updated. The first dataframe is derived from a query on a database table which records the date a 'project' is updated. The second is metadata I store myself in a different table about the last time my part of the application processed a project.

I think I came pretty far but I'm stuck on the following error, see the code provided below:

lastmatch = pd.DataFrame({
    'projectid': ['1', '2', '2', '3'],
    'stage': ['c', 'c', 'v', 'v'],
    'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24',
                      '2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])

processed = pd.DataFrame({
    'projectid': ['1', '2'],
    'stage': ['c', 'v'],
    'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(
    processed['process_date']
)

unprocessed = lastmatch[~lastmatch.isin(processed)].dropna()

processed.set_index(['projectid', 'stage'], inplace=True)
lastmatch.set_index(['projectid', 'stage'], inplace=True)

processed.sort_index(inplace=True)
lastmatch.sort_index(inplace=True)

print(lastmatch['lastmatchdate'])
print(processed['process_date'])

to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]

The result I want to achieve is a dataframe containing the rows where the 'lastmatchdate' is greater than the date that the project was last processed (process_date). However this line:

to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]

produces a ValueError: Can only compare identically-labeled Series objects. I think it might be a syntax I don't know of or got wrong.

The output I expect is in this case:

                lastmatchdate
projectid stage              
1         c        2020-08-31

So concretely the question is: how do I get a dataframe containing only the rows of another dataframe having the (datetime) value of column a greater than column b of the other dataframe.

Upvotes: 1

Views: 479

Answers (2)

user8560167
user8560167

Reputation:

you 've receiver ValueError because you tried to compare two different dataframes, if you want to compare row by row two dataframes, merge them before

 lastmatch = pd.DataFrame({
        'projectid': ['1', '2', '2', '3'],
        'stage': ['c', 'c', 'v', 'v'],
        'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24',
                          '2020-08-31']
    })
    lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])
    
    processed = pd.DataFrame({
        'projectid': ['1', '2'],
        'stage': ['c', 'v'],
        'process_date': ['2020-08-30', '2013-11-24']
    })
    processed['process_date'] = pd.to_datetime(
        processed['process_date']
    )
    
    df=pd.merge(lastmatch,processed,on=['stage','projectid'])
    
    df=df[
        df.lastmatchdate>df.process_date
    ]
    print(df)
      projectid stage lastmatchdate process_date
0         1     c    2020-08-31   2020-08-30

Upvotes: 1

lrh09
lrh09

Reputation: 587

merged = pd.merge(processed, lastmatch, left_index = True, right_index = True)
merged = merged.assign(to_process = merged['lastmatchdate']> merged['process_date'])

You will get the following:

                process_date lastmatchdate  to_process
projectid stage                                       
1         c       2020-08-31    2020-08-31       False
2         v       2013-11-24    2013-11-24       False

Upvotes: 1

Related Questions