Reputation: 43
I have two pandas dataframes. One is my main file with lots of rows containing "events" such as in the example below, lets call it main_df. While only 2 rows are given, there are around 200.000 in the real file.
event_nr | date | detail_1 | detail_2 |
---|---|---|---|
e_0001 | 2010-01-02 07:30:00 | 1 | 100 |
e_0002 | 2010-06-05 07:30:00 | 0 | 132 |
I then have another df_ranges with date ranges, here also there are a multitude of ranges, albeit not overlapping.
start | end | name |
---|---|---|
2010-02-25 | 2010-03-06 | range_1 |
2010-06-02 | 2010-06-08 | range_2 |
I now want to check if each date in main_df occurs during any of the time ranges in df_ranges, and if so, create a new column in main_df with the name of the range as a value in the same row.
Desired result:
event_nr | date | detail_1 | detail_2 | in_range |
---|---|---|---|---|
e_0001 | 2010-01-02 07:30:00 | 1 | 100 | None |
e_0002 | 2010-06-05 07:30:00 | 0 | 132 | range_2 |
Most solutions I have encountered only wanted to check if certain dates were between start and end of one single range. I am sure I have to iterate over all ranges and all dates, but don't know how to proceed in practice.
Thank you!
Upvotes: 3
Views: 1267
Reputation: 10819
Here is a solution.
Given the following DataFrames:
events = pd.DataFrame(
{
"event_nr":["e_0001", "e_0002"],
"date":["2010-01-02 07:30:00", "2010-06-05 07:30:00"]
}
)
ranges = pd.DataFrame(
{
"start":["2010-02-25", "2010-06-02"],
"end":["2010-03-06", "2010-06-08"],
"range_name":["range_1","range_2"]
}
)
First, let's make sure all dates are actually datetime objects:
events["date"] = pd.to_datetime(events["date"])
ranges["start"] = pd.to_datetime(ranges["start"])
ranges["end"] = pd.to_datetime(ranges["end"])
Then, let's make a function that implement the desired selection criteria.
def f(row):
global ranges
for _i, _row in ranges.iterrows():
if _row.start <= row.date <= _row.end:
return _row.range_name
return None
Finally, let's apply the function to the dataframe to generate the desired new column.
events["in_range"] = events.apply(f, axis=1)
And here is the result:
print(events)
event_nr date in_range
0 e_0001 2010-01-02 07:30:00 None
1 e_0002 2010-06-05 07:30:00 range_2
Upvotes: 2