Jane Borges
Jane Borges

Reputation: 592

How to compare date records and add new column to dataframe as criteria

I have the following sensor data dataframe:

    Data_Digital    Data_Analog         Time
        1               10           2015-02-01 00:00:00
        1               12           2015-02-01 00:00:05
        1               25           2015-02-01 07:45:07
        1               25           2015-02-01 07:45:08
        1               25           2015-02-01 21:45:10
        0               25           2015-03-04 00:00:08

I need to compare the 'Time' of position 0 and the 'Time' of position 1. If the difference between the two teams is greater than six hours they must belong to different classes. However, if the time difference is small (< 6 hours) they must belong to the same class. I need this class to be represented in a new dataframe column.

The desired output is:

      Data_Digital  Data_Analog         Time                New_Col_Target
        1               10           2015-02-01 00:00:00            1      # init with 1
        1               12           2015-02-01 00:00:05            1
        1               25           2015-02-01 07:45:07            2      # far from the previous
        1               25           2015-02-01 07:45:08            2
        1               25           2015-02-01 21:45:10            3      # far from the previous
        0               25           2015-03-04 00:00:08            4      # far from the previous

Just below is the original dataframe:

    import pandas as pd

    df = pd.DataFrame({'Data_Digital': [1, 1, 1, 1, 1, 0],
                       'Data_Analog':  [10, 12, 25, 25, 25, 25],
                       'Time': ['2015-02-01 00:00:00', '2015-02-01 00:00:05','2015-02-01 07:45:07',
                                '2015-02-01 07:45:08', '2015-02-01 21:45:10', '2015-03-04 00:00:08']})

    print(df)

I tried to do (but it's wrong):

    index = 0
    index2 = 1

   df['New_Col_Target'] = 1

   for i in range(0, len(df) -1):
       for j in range(1, len(df)):

           if(abs(pd.to_datetime(df['Time'].iloc[i]) -         
                                 pd.to_datetime(df['Time'].iloc[j])) > 
                                 pd.to_timedelta('0 day 06:00:00')): 

               # I don't know how to do the assignments
               df['New_Col_Target'].iloc[i] = index
           else:
               # I don't know how to do the assignments
               df['New_Col_Target'].iloc[i] = index2
               index2 += 1




    # New process

       Date Init                 Date End        Mean_Dig   Mean_Analog
      2015-02-01 00:00:00   2015-02-01 00:00:05       1         11
      2015-02-01 07:45:07   2015-02-01 07:45:08       1         25
      2015-02-01 07:45:08   2015-02-01 07:45:08       1         25
      2015-03-04 00:00:08   2015-03-04 00:00:08       0         25

    df_mean_group_New_Col_Target = pd.DataFrame({'Date Init': ['2015-02-01 00:00:00', '2015-02-01 07:45:07', '2015-02-01 07:45:08', '2015-03-04 00:00:08'],
                                         'Date End': ['2015-02-01 00:00:05', '2015-02-01 07:45:08', '2015-02-01 07:45:08', '2015-03-04 00:00:08'],
                                         'Mean_Data_Digital': [1, 1, 1, 0], 
                                         'Mean_Data_Analog': [11, 25, 25, 25]})

    print(df_mean_group_New_Col_Target)

Upvotes: 2

Views: 37

Answers (1)

Erfan
Erfan

Reputation: 42886

Use diff, pd.Timedelta and cumsum:

df['New_col_target'] = (df['Time'].diff() > pd.Timedelta(hours=6)).cumsum().add(1)

Output

   Data_Digital  Data_Analog                Time  New_col_target
0             1           10 2015-02-01 00:00:00               1
1             1           12 2015-02-01 00:00:05               1
2             1           25 2015-02-01 07:45:07               2
3             1           25 2015-02-01 07:45:08               2
4             1           25 2015-02-01 21:45:10               3
5             0           25 2015-03-04 00:00:08               4

If your Time column is not datetime yet, use:

df['Time'] = pd.to_datetime(df['Time'])

Scenario 2: per group of Data_Digital:

We have to use groupby:

m1 = df.groupby('Data_Digital')['Time'].diff().ge(pd.Timedelta(hours=6))
m2 = df['Data_Digital'].diff().ne(0)

df['New_col_target'] = (m1|m2).cumsum()

output

   Data_Digital  Data_Analog                Time  New_col_target
0             1           10 2015-02-01 00:00:00               1
1             1           12 2015-02-01 00:00:05               1
2             1           25 2015-02-01 07:45:07               2
3             1           25 2015-02-01 07:45:08               2
4             1           25 2015-02-01 21:45:10               3
5             0           25 2015-03-04 00:00:08               4

Finally, get the mean per target

We can use groupby.mean:

df.groupby('New_col_target',as_index=False)[['Data_Digital', 'Data_Analog']].mean()

Or

df.groupby('New_col_target',as_index=False).agg({'Data_Digital':'mean',
                                                 'Data_Analog':'mean'})

Or if you have pandas >= 0.25.0 (check with pd.__version__), we can use named_agreggations:

df.groupby('New_col_target').agg(
    Digital_mean=('Data_Digital', 'mean'),
    Analog_mean=('Data_Analog', 'mean')
).reset_index()

Output

   New_col_target  Data_Digital  Data_Analog
0               1             1           11
1               2             1           25
2               3             1           25
3               4             0           25

Output named aggregations

   New_col_target  Digital_mean  Analog_mean
0               1             1           11
1               2             1           25
2               3             1           25
3               4             0           25

Upvotes: 3

Related Questions