Leonardo Ferreira

Reputation: 384

How to exclude consecutive values in dataframe

Using this code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.options.mode.chained_assignment = None
pd.options.display.float_format = '{:.2f}'.format
dateparse = lambda x: pd.to_datetime(x, format='%Y%m%d%H%M')
a = pd.read_csv(r'C:\Users\Leonardo\Desktop\Nova pasta\TU_boia0401.out', parse_dates = ['data'], index_col = 0, date_parser = dateparse)

The output is something like this:

index               hs
2015-02-23 14:50:00 0.99
2015-02-23 15:50:00 0.96
2015-02-23 16:50:00 1.04
2015-02-23 17:50:00 0.96
.                   .
.                   .
.                   .
2017-09-01 12:40:00 1.25

Up to here everything was fine, but when plotting the whole dataset a problem became apparent:

[Plot of hs over time: a long flat stretch of identical values appears near 2015-03-06.]

As can be seen, near 2015-03-06 there are a lot of repeated values that should not exist. Looking at the dataframe confirms this:

2015-03-04 10:50:00 1.18
2015-03-04 11:50:00 1.18
2015-03-04 12:50:00 1.18
2015-03-04 13:50:00 1.18

This repetition occurs many times throughout the dataframe. The objective is to filter out these bad data: across the entire dataframe, set np.nan on every value that repeats 3 or more consecutive times. The expected output is something like this:

index               hs
2015-02-23 14:50:00 0.99
2015-02-23 15:50:00 0.96
2015-02-23 16:50:00 1.04
2015-02-23 17:50:00 0.96
.                   .
.                   .
.                   .
2015-03-04 10:50:00 1.18
2015-03-04 11:50:00 nan
2015-03-04 12:50:00 nan
2015-03-04 13:50:00 nan
.                   .
.                   .
.                   .
2016-01-20 12:40:00 0.98
2016-01-20 12:50:00 nan
2016-01-20 13:00:00 nan
2016-01-20 13:10:00 nan
.                   .
.                   .
.                   .
2017-09-01 12:40:00 1.25

I would be thankful if someone could help.

Upvotes: 1

Views: 249

Answers (4)

Alexander

Reputation: 109626

This sets to NaN every run of forward duplicates whose count is greater than or equal to n (e.g. 3).

# Set-up.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 1),
                  index=pd.date_range(start='2017-01-01', freq='min', periods=10),
                  columns=['hs'])
df.iloc[3:6] = df.iat[2, 0]   # rows 3-5 duplicate row 2
df.iloc[8:10] = df.iat[7, 0]  # rows 8-9 duplicate row 7

>>> df
                           hs
2017-01-01 00:00:00  1.764052
2017-01-01 00:01:00  0.400157
2017-01-01 00:02:00  0.978738
2017-01-01 00:03:00  0.978738  # Duplicate x3
2017-01-01 00:04:00  0.978738  # Duplicate x3
2017-01-01 00:05:00  0.978738  # Duplicate x3
2017-01-01 00:06:00  0.950088
2017-01-01 00:07:00 -0.151357
2017-01-01 00:08:00 -0.151357  # Duplicate x2
2017-01-01 00:09:00 -0.151357  # Duplicate x2

# Set forward duplicates to NaN.
n = 3
bool_mask = df.hs.shift() == df.hs
df = df.assign(
    mask=bool_mask,
    group=(bool_mask != bool_mask.shift()).cumsum())
filter_groups = df.groupby('group')[['mask']].sum().query('mask >= {}'.format(n)).index
df.loc[df.group.isin(filter_groups), 'hs'] = np.nan
df = df[['hs']]
>>> df
                           hs
2017-01-01 00:00:00  1.764052
2017-01-01 00:01:00  0.400157
2017-01-01 00:02:00  0.978738
2017-01-01 00:03:00       NaN
2017-01-01 00:04:00       NaN
2017-01-01 00:05:00       NaN
2017-01-01 00:06:00  0.950088
2017-01-01 00:07:00 -0.151357
2017-01-01 00:08:00 -0.151357
2017-01-01 00:09:00 -0.151357

It creates a boolean mask marking forward duplicates and adds it to the dataframe via assign, together with a group column that labels consecutive duplicate regions (the shift-cumsum pattern). A groupby over group then sums the mask booleans, giving the number of consecutive repeats in each region, and query keeps the groups whose repeat count is at least n (e.g. 3).

Finally, loc sets hs to NaN for the rows belonging to those groups, and the temporary columns are dropped by taking only hs back out via df = df[['hs']].
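The same pipeline can be condensed without the temporary columns. Below is a sketch of the identical shift-cumsum idea on a small made-up frame (the data is invented for illustration, not the asker's):

```python
import numpy as np
import pandas as pd

# Toy data: the 3.0 run has three forward duplicates, the 5.0 run only two.
df = pd.DataFrame({'hs': [1.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 5.0, 5.0]})

n = 3                                  # minimum number of forward duplicates
dup = df['hs'].eq(df['hs'].shift())    # True on each forward duplicate
run = (dup != dup.shift()).cumsum()    # label consecutive stretches
bad = dup.groupby(run).transform('sum').ge(n) & dup
df.loc[bad, 'hs'] = np.nan
```

Only the 3.0 run is blanked (three forward duplicates); the 5.0 run has just two and survives, matching the `query('mask >= n')` filter above.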

Upvotes: 3

piRSquared

Reputation: 294488

Setup
Borrowing @Alexander's dataframe

np.random.seed(0)
df = pd.DataFrame(
    np.random.randn(10, 1),
    pd.date_range(start='2017-01-01', freq='min', periods=10),
    ['hs'])

df.iloc[4:6] = df.iat[3, 0]  # rows 4-5 duplicate row 3

Solution
Using pd.DataFrame.mask and pd.DataFrame.diff
Note: This is a generalized solution that performs this same task for all columns at once.

df.mask(df.diff() == 0)

                           hs
2017-01-01 00:00:00  1.764052
2017-01-01 00:01:00  0.400157
2017-01-01 00:02:00  0.978738
2017-01-01 00:03:00  2.240893
2017-01-01 00:04:00       NaN
2017-01-01 00:05:00       NaN
2017-01-01 00:06:00  0.950088
2017-01-01 00:07:00 -0.151357
2017-01-01 00:08:00 -0.103219
2017-01-01 00:09:00  0.410599

Larger Example

np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(5, size=(10, 5)).astype(float),
    pd.date_range(start='2017-01-01', freq='min', periods=10),
).add_prefix('col')

df

                     col0  col1  col2  col3  col4
2017-01-01 00:00:00   0.0   3.0   2.0   3.0   2.0
2017-01-01 00:01:00   2.0   3.0   2.0   3.0   0.0
2017-01-01 00:02:00   2.0   0.0   0.0   4.0   0.0
2017-01-01 00:03:00   2.0   2.0   0.0   4.0   1.0
2017-01-01 00:04:00   3.0   2.0   4.0   4.0   4.0
2017-01-01 00:05:00   4.0   3.0   3.0   3.0   4.0
2017-01-01 00:06:00   3.0   1.0   3.0   0.0   4.0
2017-01-01 00:07:00   4.0   2.0   2.0   0.0   2.0
2017-01-01 00:08:00   4.0   0.0   4.0   1.0   4.0
2017-01-01 00:09:00   4.0   2.0   2.0   0.0   2.0

df.mask(df.diff() == 0)

                     col0  col1  col2  col3  col4
2017-01-01 00:00:00   0.0   3.0   2.0   3.0   2.0
2017-01-01 00:01:00   2.0   NaN   NaN   NaN   0.0
2017-01-01 00:02:00   NaN   0.0   0.0   4.0   NaN
2017-01-01 00:03:00   NaN   2.0   NaN   NaN   1.0
2017-01-01 00:04:00   3.0   NaN   4.0   NaN   4.0
2017-01-01 00:05:00   4.0   3.0   3.0   3.0   NaN
2017-01-01 00:06:00   3.0   1.0   NaN   0.0   NaN
2017-01-01 00:07:00   4.0   2.0   2.0   NaN   2.0
2017-01-01 00:08:00   NaN   0.0   4.0   1.0   4.0
2017-01-01 00:09:00   NaN   2.0   2.0   0.0   2.0
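One caveat worth flagging: df.mask(df.diff() == 0) blanks every consecutive repeat, including pairs, whereas the question only wants runs of 3 or more. If that threshold matters, diff-style duplicate detection can be combined with a run-length check; this is a sketch under that assumption (the helper name and toy data are mine):

```python
import numpy as np
import pandas as pd

def mask_long_runs(s, n=3):
    """NaN everything after the first value in runs of n or more equal values."""
    dup = s.eq(s.shift())                  # True on each forward duplicate
    run = (dup != dup.shift()).cumsum()    # label consecutive stretches
    long_run = dup.groupby(run).transform('size').ge(n - 1) & dup
    return s.mask(long_run)

# Runs of 3+ are blanked after their first value; the pair of 2.0s is kept.
df = pd.DataFrame({'hs': [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0]})
out = df.apply(mask_long_runs)
```

df.apply runs the helper on every column, so this keeps the "all columns at once" property of the original one-liner.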

Upvotes: 1

user2285236

Reputation:

If you want to control the window size, you can use the rolling object. The idea is that if n successive elements are the same, their standard deviation will be 0. The rest is back-filling that signal over the window and masking the flagged positions:

successive = (ser.where(np.isclose(ser.rolling(3).std(), 0, atol=10**-6))
                 .bfill(limit=2).notnull())
ser[successive] = np.nan

For Series, ser

ser = pd.Series([1, 1, 1, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 2, 1, 3, 2, 1, 1, 1])

this yields

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5     NaN
6     1.0
7     1.0
8     NaN
9     NaN
10    NaN
11    NaN
12    1.0
13    2.0
14    1.0
15    3.0
16    2.0
17    NaN
18    NaN
19    NaN
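Note that the window size and the bfill limit have to move together: a window of n pairs with limit=n - 1. Parametrizing both gives a sketch of the same rolling-std idea (using mask instead of item assignment; this generalization is mine, not part of the answer):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 1, 1, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 2, 1, 3, 2, 1, 1, 1])

n = 3                                                  # run length to blank out
flat = np.isclose(ser.rolling(n).std(), 0, atol=1e-6)  # last element of each flat window
successive = ser.where(flat).bfill(limit=n - 1).notnull()
out = ser.mask(successive)
```

With n = 3 this reproduces the output above; raising n relaxes the filter to longer runs only.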

Upvotes: 2

Acccumulation

Reputation: 3591

You can do a for loop that keeps track of how many times something has been repeated:

# values holds the column data as a mutable sequence, e.g. values = a['hs'].to_list()
replacement_value = np.nan
last_value = None
number_of_repetitions = 0
for index in range(len(values)):
    value = values[index]
    if value == last_value:
        if number_of_repetitions == 2:
            # if we previously had 2 repetitions, replace both the current and the previous values
            values[index - 1] = replacement_value
            values[index] = replacement_value
        if number_of_repetitions == 3:
            # third or later repetition: the previous value is already replaced,
            # so only the current one needs handling
            values[index] = replacement_value
        else:
            number_of_repetitions = number_of_repetitions + 1
            # if it hasn't reached 3 yet, increment on every repetition;
            # past 3 there is no need to keep counting
    else:
        # a new value: reset
        number_of_repetitions = 1
        last_value = value
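Wrapped into a function and run on a short made-up list (the harness below is mine, not part of the answer), the loop keeps the first value of each long run and blanks the rest:

```python
import numpy as np

def replace_repeats(values, replacement_value=np.nan):
    """Blank every value after the first in runs of 3 or more equal values."""
    last_value = None
    number_of_repetitions = 0
    for index in range(len(values)):
        value = values[index]
        if value == last_value:
            if number_of_repetitions == 2:
                # third occurrence: blank this value and the previous one
                values[index - 1] = replacement_value
                values[index] = replacement_value
            if number_of_repetitions == 3:
                # fourth or later occurrence: the previous one is already blanked
                values[index] = replacement_value
            else:
                number_of_repetitions += 1
        else:
            # a new value: reset the counter
            number_of_repetitions = 1
            last_value = value
    return values

result = replace_repeats([1.18, 1.18, 1.18, 1.18, 0.96])
# keeps the first 1.18, blanks the other three, leaves 0.96 alone
```

To apply it to the question's dataframe you would pull the column out with to_list(), run the function, and assign the list back to the column.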

Upvotes: 1
