Reputation: 1352
I want to create a new column in a data frame. The new column is based on the range of date and time by comparing the date of the existing data frame. The first data frame is as follows:
Exp = pd.DataFrame()
Exp ['Date'] = ['2021-09-10 10:00:00','2021-09-10 11:00:01','2021-09-11 10:00:00','2021-09-11 11:00:01']
Exp ['Date'] = pd.to_datetime(Exp['Date'])
Exp ['E_Date'] = ['2021-09-10 11:00:00','2021-09-10 12:00:00','2021-09-11 11:00:00','2021-09-11 12:00:00']
Exp ['E_Date'] = pd.to_datetime(Exp['E_Date'])
Exp ['Conditions'] = ['A','B','C','D']
So the columns date and E_date are datetime64[ns] format.
The second data frame is as follows:
df = pd.DataFrame()
df ['Date'] = ['2021-09-10 10:00:00','2021-09-10 10:20:00','2021-09-10 11:30:00',
'2021-09-10 11:00:00','2021-09-10 11:15:00','2021-09-10 12:00:00',
'2021-09-11 10:10:00','2021-09-11 11:09:00','2021-09-11 11:30:00',
'2021-09-11 11:05:00','2021-09-11 11:45:00','2021-09-11 11:12:00']
df ['Date'] = pd.to_datetime(df['Date'])
df ['Para'] = list(np.linspace (0,12, num = 12))
what I want to do is add a new column in the second data frame based on the date range using date and E_date from the first data frame.
The expected output is as follow:
This means that a new column in the second data frame is based on the following logical condition:
Exp['Date]<= df['Date]<=Exp['E_Date]
What can I do to achieve that?
Upvotes: 0
Views: 187
Reputation: 28644
Kindly test it and see if it fits your purpose
Read in the data, and convert columns to appropriate date time dtypes:
df1 = pd.read_csv('Df2.csv')
df2 = pd.read_csv('Exp.csv')
df1['Date_TIme'] = pd.to_datetime(df1.Date_Time, format='%d-%m-%Y %H:%M')
df2['Start Time'] = pd.to_datetime(df2['Experiment Date'].str.cat(df2['Start Time'], sep =' '),
format='%d-%m-%Y %H:%M:%S')
df2['End Time'] = pd.to_datetime(df2['Experiment Date'].str.cat(df2['End Time'], sep =' '),
format='%d-%m-%Y %H:%M:%S')
You can do a cartesian join, followed by a filter:
df1.merge(df2, how = 'cross').query('`Start Time`<= Date_Time <= `End Time`')
Date T RH Position Time id Date_Time Experiment Date Start Time End Time Conditions
61050 08-09-2021 0:00 24.087 69.04 A-01 10:00:00 925 2021-09-08 10:00:00 08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00 0.087l/s_Tin_8.5C_0.5kW_Load
61116 08-09-2021 0:00 24.001 68.30 A-01 10:01:00 926 2021-09-08 10:01:00 08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00 0.087l/s_Tin_8.5C_0.5kW_Load
61182 08-09-2021 0:00 23.873 67.86 A-01 10:02:00 927 2021-09-08 10:02:00 08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00 0.087l/s_Tin_8.5C_0.5kW_Load
61248 08-09-2021 0:00 23.787 67.06 A-01 10:03:00 928 2021-09-08 10:03:00 08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00 0.087l/s_Tin_8.5C_0.5kW_Load
61314 08-09-2021 0:00 23.658 66.71 A-01 10:04:00 929 2021-09-08 10:04:00 08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00 0.087l/s_Tin_8.5C_0.5kW_Load
... ... ... ... ... ... ... ... ... ... ... ...
2991059 08-09-2021 0:00 23.058 71.21 H-04 16:56:00 1341 2021-09-08 16:56:00 08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00 0.087l/s_Tin_9.5C_1kW_Load
2991125 08-09-2021 0:00 22.929 70.89 H-04 16:57:00 1342 2021-09-08 16:57:00 08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00 0.087l/s_Tin_9.5C_1kW_Load
2991191 08-09-2021 0:00 22.843 70.70 H-04 16:58:00 1343 2021-09-08 16:58:00 08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00 0.087l/s_Tin_9.5C_1kW_Load
2991257 08-09-2021 0:00 22.714 70.60 H-04 16:59:00 1344 2021-09-08 16:59:00 08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00 0.087l/s_Tin_9.5C_1kW_Load
2991323 08-09-2021 0:00 22.693 70.44 H-04 17:00:00 1345 2021-09-08 17:00:00 08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00 0.087l/s_Tin_9.5C_1kW_Load
This however can be quite inefficient and consume a lot of memory, depending on the data size.
Another option is to use the development version of conditional_join from pyjanitor to execute this - it is a convenience function to avoid cartesian join.
I'm using this, instead of IntervalIndex, as your data has overlapping intervals; if there were no overlapping intervals, IntervalIndex is great and very efficient as well.
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor as jn
import pandas as pd
df1.conditional_join(df2,
('Date_Time', 'Start Time', '>='),
('Date_Time', 'End Time', '<=')
)
Date T RH Position Time id Date_Time Experiment Date Start Time End Time Conditions
0 08-09-2021 0:00 24.087 69.04 A-01 10:00:00 925 2021-09-08 10:00:00 08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00 0.087l/s_Tin_8.5C_0.5kW_Load
1 08-09-2021 0:00 24.001 68.30 A-01 10:01:00 926 2021-09-08 10:01:00 08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00 0.087l/s_Tin_8.5C_0.5kW_Load
2 08-09-2021 0:00 23.873 67.86 A-01 10:02:00 927 2021-09-08 10:02:00 08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00 0.087l/s_Tin_8.5C_0.5kW_Load
3 08-09-2021 0:00 23.787 67.06 A-01 10:03:00 928 2021-09-08 10:03:00 08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00 0.087l/s_Tin_8.5C_0.5kW_Load
4 08-09-2021 0:00 23.658 66.71 A-01 10:04:00 929 2021-09-08 10:04:00 08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00 0.087l/s_Tin_8.5C_0.5kW_Load
... ... ... ... ... ... ... ... ... ... ... ...
12563 08-09-2021 0:00 23.058 71.21 H-04 16:56:00 1341 2021-09-08 16:56:00 08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00 0.087l/s_Tin_9.5C_1kW_Load
12564 08-09-2021 0:00 22.929 70.89 H-04 16:57:00 1342 2021-09-08 16:57:00 08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00 0.087l/s_Tin_9.5C_1kW_Load
12565 08-09-2021 0:00 22.843 70.70 H-04 16:58:00 1343 2021-09-08 16:58:00 08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00 0.087l/s_Tin_9.5C_1kW_Load
12566 08-09-2021 0:00 22.714 70.60 H-04 16:59:00 1344 2021-09-08 16:59:00 08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00 0.087l/s_Tin_9.5C_1kW_Load
12567 08-09-2021 0:00 22.693 70.44 H-04 17:00:00 1345 2021-09-08 17:00:00 08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00 0.087l/s_Tin_9.5C_1kW_Load
If you can find a common column that already exists within the dataframes, then a merge, followed by a filter is quite efficient, and better/ more efficient than the conditional_join.
Upvotes: 1
Reputation: 195438
Try pd.IntervalIndex
+ pd.cut
:
Exp["idx"] = pd.IntervalIndex.from_arrays(
Exp["Date"], Exp["E_Date"], closed="both"
)
Exp = Exp.set_index("idx")
df["Conditions"] = Exp.loc[
pd.cut(df["Date"], Exp.index).values, "Conditions"
].values
print(df)
Prints:
Date Para Conditions
0 2021-09-10 10:00:00 0.000000 A
1 2021-09-10 10:20:00 1.090909 A
2 2021-09-10 11:30:00 2.181818 B
3 2021-09-10 11:00:00 3.272727 A
4 2021-09-10 11:15:00 4.363636 B
5 2021-09-10 12:00:00 5.454545 B
6 2021-09-11 10:10:00 6.545455 C
7 2021-09-11 11:09:00 7.636364 D
8 2021-09-11 11:30:00 8.727273 D
9 2021-09-11 11:05:00 9.818182 D
10 2021-09-11 11:45:00 10.909091 D
11 2021-09-11 11:12:00 12.000000 D
Upvotes: 2