jake wong
jake wong

Reputation: 5228

filtering dataframe based on date and time (in 2 separate columns)

I have the below pandas dataframe that I am trying to filter, to provide me with an updated dataframe based on the last date & time in my database.

This is a sample of the dataframe I am trying to filter:

>>> df
  # The time is in '%H:%M:%S' format, and the date is in '%d-%b-%Y'
  Time      Date          Items
  00:05:00  29-May-2018   foo
  00:06:00  30-May-2018   barr
  00:07:00  31-May-2018   gaaa
  00:11:00  31-May-2018   raaa
  ...       ...           ...

What I am trying to do, is to filter this dataframe based on the last entry in my sql database. For example, the last entry is: ['20:05:00','30-May-2018']. The below code is what I used to filter out from df:

last_entry = ['20:05:00','30-May-2018']
# Putting time into a datetime format to work within the dataframe.
last_entry_time = datetime.strptime(last_entry[0], '%H:%M:%S').time()
new_entry = df[(df['Date'] >= last_entry[1]) & (df['Time'] > last_entry_time)]

If I were to just have the filter as: new_entry = df[df['Date'] >= last_entry[1])] instead, this works well to return the current date and newer based on the last date, which is: 30-May-2018 and 31-May-2018.

However, regarding the time portion, because my last_entry time is 20:05:00, it starts to filter out the rest of the data that i'm trying to collect...

Question:

How can I perform the filter of the dataframe, such that it returns me the new entries in the dataframe, that is based off the old date and time in the database?

Ideal result

last_entry = ['20:05:00','30-May-2018']
>>> new_entry

  Time      Date          Items
  00:07:00  31-May-2018   gaaa
  00:11:00  31-May-2018   raaa
  ...       ...           ...

Upvotes: 0

Views: 100

Answers (1)

Toby Petty
Toby Petty

Reputation: 4660

One option is to create a datetime column in your DataFrame, and then filter on this column, for example:

df["real_date"] = pd.to_datetime(df["Date"], format="%d-%b-%Y")
df["real_time"] = pd.to_timedelta(df["Time"])
df["real_datetime"] = df["real_date"] + df["real_time"]

You also need to convert your last_entry variable to a proper datetime, for example like this:

from dateutil.parser import parse
from datetime import datetime


date = parse(last_entry[1], dayfirst=True)
time_elements = [int(t) for t in last_entry[0].split(":")]
last_entry_dt = datetime(date.year, date.month, date.day, time_elements[0], time_elements[1], time_elements[2])

Then you can filter the new DataFrame column like so:

df[df["real_datetime"] >= last_entry_dt]

Upvotes: 1

Related Questions