Reputation: 1352

Creating a column by using condition of ranges of date and time in two different data frames

I want to create a new column in a data frame. The new column is based on the range of date and time by comparing the date of the existing data frame. The first data frame is as follows:

Exp = pd.DataFrame()
Exp ['Date'] = ['2021-09-10 10:00:00','2021-09-10 11:00:01','2021-09-11 10:00:00','2021-09-11 11:00:01']
Exp ['Date'] = pd.to_datetime(Exp['Date'])

Exp ['E_Date'] =  ['2021-09-10 11:00:00','2021-09-10 12:00:00','2021-09-11 11:00:00','2021-09-11 12:00:00']
Exp ['E_Date'] = pd.to_datetime(Exp['E_Date'])
Exp ['Conditions'] = ['A','B','C','D']

So the columns date and E_date are datetime64[ns] format.

The second data frame is as follows:

df = pd.DataFrame()
df ['Date'] = ['2021-09-10 10:00:00','2021-09-10 10:20:00','2021-09-10 11:30:00',
               '2021-09-10 11:00:00','2021-09-10 11:15:00','2021-09-10 12:00:00',
               '2021-09-11 10:10:00','2021-09-11 11:09:00','2021-09-11 11:30:00',
               '2021-09-11 11:05:00','2021-09-11 11:45:00','2021-09-11 11:12:00']
df ['Date'] = pd.to_datetime(df['Date'])

df ['Para'] =  list(np.linspace (0,12, num = 12))

what I want to do is add a new column in the second data frame based on the date range using date and E_date from the first data frame.

The expected output is as follow:

This means that a new column in the second data frame is based on the following logical condition:

Exp['Date]<= df['Date]<=Exp['E_Date]

What can I do to achieve that?

Exp file here

Df file here

Upvotes: 0

Answers (2)

sammywemmy

Reputation: 28644

Kindly test it and see if it fits your purpose

Read in the data, and convert columns to appropriate date time dtypes:

df1 = pd.read_csv('Df2.csv')
df2 = pd.read_csv('Exp.csv')
df1['Date_TIme'] = pd.to_datetime(df1.Date_Time, format='%d-%m-%Y %H:%M')
df2['Start Time'] = pd.to_datetime(df2['Experiment Date'].str.cat(df2['Start Time'], sep =' '), 
                                  format='%d-%m-%Y %H:%M:%S')
df2['End Time'] = pd.to_datetime(df2['Experiment Date'].str.cat(df2['End Time'], sep =' '), 
                                 format='%d-%m-%Y %H:%M:%S')

You can do a cartesian join, followed by a filter:

df1.merge(df2, how = 'cross').query('`Start Time`<= Date_Time <= `End Time`')

                      Date       T     RH Position      Time    id           Date_Time Experiment Date          Start Time            End Time                    Conditions
61050    08-09-2021 0:00  24.087  69.04     A-01  10:00:00   925 2021-09-08 10:00:00      08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00  0.087l/s_Tin_8.5C_0.5kW_Load
61116    08-09-2021 0:00  24.001  68.30     A-01  10:01:00   926 2021-09-08 10:01:00      08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00  0.087l/s_Tin_8.5C_0.5kW_Load
61182    08-09-2021 0:00  23.873  67.86     A-01  10:02:00   927 2021-09-08 10:02:00      08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00  0.087l/s_Tin_8.5C_0.5kW_Load
61248    08-09-2021 0:00  23.787  67.06     A-01  10:03:00   928 2021-09-08 10:03:00      08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00  0.087l/s_Tin_8.5C_0.5kW_Load
61314    08-09-2021 0:00  23.658  66.71     A-01  10:04:00   929 2021-09-08 10:04:00      08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00  0.087l/s_Tin_8.5C_0.5kW_Load
...                  ...     ...    ...      ...       ...   ...                 ...             ...                 ...                 ...                           ...
2991059  08-09-2021 0:00  23.058  71.21     H-04  16:56:00  1341 2021-09-08 16:56:00      08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00    0.087l/s_Tin_9.5C_1kW_Load
2991125  08-09-2021 0:00  22.929  70.89     H-04  16:57:00  1342 2021-09-08 16:57:00      08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00    0.087l/s_Tin_9.5C_1kW_Load
2991191  08-09-2021 0:00  22.843  70.70     H-04  16:58:00  1343 2021-09-08 16:58:00      08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00    0.087l/s_Tin_9.5C_1kW_Load
2991257  08-09-2021 0:00  22.714  70.60     H-04  16:59:00  1344 2021-09-08 16:59:00      08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00    0.087l/s_Tin_9.5C_1kW_Load
2991323  08-09-2021 0:00  22.693  70.44     H-04  17:00:00  1345 2021-09-08 17:00:00      08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00    0.087l/s_Tin_9.5C_1kW_Load

This however can be quite inefficient and consume a lot of memory, depending on the data size.

Another option is to use the development version of conditional_join from pyjanitor to execute this - it is a convenience function to avoid cartesian join.

I'm using this, instead of IntervalIndex, as your data has overlapping intervals; if there were no overlapping intervals, IntervalIndex is great and very efficient as well.

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor as jn
import pandas as pd
df1.conditional_join(df2,
    ('Date_Time', 'Start Time', '>='),
    ('Date_Time', 'End Time', '<=')
    )
                  Date       T     RH Position      Time    id           Date_Time Experiment Date          Start Time            End Time                    Conditions
0      08-09-2021 0:00  24.087  69.04     A-01  10:00:00   925 2021-09-08 10:00:00      08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00  0.087l/s_Tin_8.5C_0.5kW_Load
1      08-09-2021 0:00  24.001  68.30     A-01  10:01:00   926 2021-09-08 10:01:00      08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00  0.087l/s_Tin_8.5C_0.5kW_Load
2      08-09-2021 0:00  23.873  67.86     A-01  10:02:00   927 2021-09-08 10:02:00      08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00  0.087l/s_Tin_8.5C_0.5kW_Load
3      08-09-2021 0:00  23.787  67.06     A-01  10:03:00   928 2021-09-08 10:03:00      08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00  0.087l/s_Tin_8.5C_0.5kW_Load
4      08-09-2021 0:00  23.658  66.71     A-01  10:04:00   929 2021-09-08 10:04:00      08-09-2021 2021-09-08 10:00:00 2021-09-08 11:00:00  0.087l/s_Tin_8.5C_0.5kW_Load
...                ...     ...    ...      ...       ...   ...                 ...             ...                 ...                 ...                           ...
12563  08-09-2021 0:00  23.058  71.21     H-04  16:56:00  1341 2021-09-08 16:56:00      08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00    0.087l/s_Tin_9.5C_1kW_Load
12564  08-09-2021 0:00  22.929  70.89     H-04  16:57:00  1342 2021-09-08 16:57:00      08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00    0.087l/s_Tin_9.5C_1kW_Load
12565  08-09-2021 0:00  22.843  70.70     H-04  16:58:00  1343 2021-09-08 16:58:00      08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00    0.087l/s_Tin_9.5C_1kW_Load
12566  08-09-2021 0:00  22.714  70.60     H-04  16:59:00  1344 2021-09-08 16:59:00      08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00    0.087l/s_Tin_9.5C_1kW_Load
12567  08-09-2021 0:00  22.693  70.44     H-04  17:00:00  1345 2021-09-08 17:00:00      08-09-2021 2021-09-08 16:00:00 2021-09-08 17:00:00    0.087l/s_Tin_9.5C_1kW_Load

If you can find a common column that already exists within the dataframes, then a merge, followed by a filter is quite efficient, and better/ more efficient than the conditional_join.

Upvotes: 1

Andrej Kesely

Reputation: 195438

Try pd.IntervalIndex + pd.cut:

Exp["idx"] = pd.IntervalIndex.from_arrays(
    Exp["Date"], Exp["E_Date"], closed="both"
)
Exp = Exp.set_index("idx")
df["Conditions"] = Exp.loc[
    pd.cut(df["Date"], Exp.index).values, "Conditions"
].values

print(df)

Prints:

                  Date       Para Conditions
0  2021-09-10 10:00:00   0.000000          A
1  2021-09-10 10:20:00   1.090909          A
2  2021-09-10 11:30:00   2.181818          B
3  2021-09-10 11:00:00   3.272727          A
4  2021-09-10 11:15:00   4.363636          B
5  2021-09-10 12:00:00   5.454545          B
6  2021-09-11 10:10:00   6.545455          C
7  2021-09-11 11:09:00   7.636364          D
8  2021-09-11 11:30:00   8.727273          D
9  2021-09-11 11:05:00   9.818182          D
10 2021-09-11 11:45:00  10.909091          D
11 2021-09-11 11:12:00  12.000000          D

Upvotes: 2

Creating a column by using condition of ranges of date and time in two different data frames

Answers (2)

Related Questions