Reputation: 384
Using this code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None
pd.options.display.float_format = '{:.2f}'.format
dateparse = lambda x: pd.datetime.strptime(x,'%Y%m%d%H%M')
a = pd.read_csv(r'C:\Users\Leonardo\Desktop\Nova pasta\TU_boia0401.out', parse_dates = ['data'], index_col = 0, date_parser = dateparse)
The output is something like this:
index hs
2015-02-23 14:50:00 0.99
2015-02-23 15:50:00 0.96
2015-02-23 16:50:00 1.04
2015-02-23 17:50:00 0.96
. .
. .
. .
2017-09-01 12:40:00 1.25
Till here everything was fine, but analysing all data by plotting it something was noticed. This is the problem:
As can be seeable near of 2015-03-06 there's a lot of repeated values that should not exist. Looking on dataframe this can be seeable:
2015-03-04 10:50:00 1.18
2015-03-04 11:50:00 1.18
2015-03-04 12:50:00 1.18
2015-03-04 13:50:00 1.18
It repeats a lot of times along of the dataframe. The main objective is to filter these BAD data and erase them from the dataframe and set np.nan to every value that repeats 3 consecutive times (or more than 3 too) to the entire dataframe. The output expectation is something like this:
index hs
2015-02-23 14:50:00 0.99
2015-02-23 15:50:00 0.96
2015-02-23 16:50:00 1.04
2015-02-23 17:50:00 0.96
. .
. .
. .
2015-03-04 10:50:00 1.18
2015-03-04 11:50:00 nan
2015-03-04 12:50:00 nan
2015-03-04 13:50:00 nan
. .
. .
. .
2016-01-20 12:40:00 0.98
2016-01-20 12:50:00 nan
2016-01-20 13:00:00 nan
2016-01-20 13:10:00 nan
. .
. .
. .
2017-09-01 12:40:00 1.25
Would be thankful if someone could help.
Upvotes: 1
Views: 249
Reputation: 109626
This sets to NaN all forward duplicates greater than or equal to n
(e.g. 3).
# Set-up.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 1), index=pd.DatetimeIndex(start='2017-01-01', freq='min', periods=10), columns=['hs'])
df.loc[3:6] = df.iat[2, 0]
df.loc[8:10] = df.iat[7, 0]
>>> df
hs
2017-01-01 00:00:00 1.764052
2017-01-01 00:01:00 0.400157
2017-01-01 00:02:00 0.978738
2017-01-01 00:03:00 0.978738 # Duplicate x3
2017-01-01 00:04:00 0.978738 # Duplicate x3
2017-01-01 00:05:00 0.978738 # Duplicate x3
2017-01-01 00:06:00 0.950088
2017-01-01 00:07:00 -0.151357
2017-01-01 00:08:00 -0.151357 # Duplicate x2
2017-01-01 00:09:00 -0.151357 # Duplicate x2
# Set forward duplicates to NaN.
n = 3
bool_mask = df.hs.shift() == df.hs
df = df.assign(
mask=bool_mask,
group=(bool_mask != bool_mask.shift()).cumsum())
filter_groups = df.groupby('group')[['mask']].sum().query('mask >= {}'.format(n)).index
df.loc[df.group.isin(filter_groups), 'hs'] = np.nan
df = df[['hs']]
>>> df
hs
2017-01-01 00:00:00 1.764052
2017-01-01 00:01:00 0.400157
2017-01-01 00:02:00 0.978738
2017-01-01 00:03:00 NaN
2017-01-01 00:04:00 NaN
2017-01-01 00:05:00 NaN
2017-01-01 00:06:00 0.950088
2017-01-01 00:07:00 -0.151357
2017-01-01 00:08:00 -0.151357
2017-01-01 00:09:00 -0.151357
It creates a boolean mask to check for duplicates and adds this column to the dataframe via assign
. The code also adds a group
column that looks for consecutive duplicate regions (created using the shift-cumsum pattern). A groupby is applied to the groups, summing the mask
boolean values. This gives the number of consecutive repeats. These results are then filtered using query
to locate those where the repeat count is greater than or equal to n
(e.g. 3).
Finally, the dataframe uses loc
to set hs
to NaN
for the groups where the consecutive count exceeds three. The temp columns are excluded by just taking hs
from the dataframe via df = df[['hs']]
.
Upvotes: 3
Reputation: 294488
Setup
Borrowing @Alexander's dataframe
np.random.seed(0)
df = pd.DataFrame(
np.random.randn(10, 1),
pd.DatetimeIndex(start='2017-01-01', freq='min', periods=10),
['hs'])
df.loc[4:6] = df.iat[3, 0]
Solution
Using pd.DataFrame.mask
and pd.DataFrame.diff
Note: This is a generalized solution that performs this same task for all columns at once.
df.mask(df.diff() == 0)
hs
2017-01-01 00:00:00 1.764052
2017-01-01 00:01:00 0.400157
2017-01-01 00:02:00 0.978738
2017-01-01 00:03:00 2.240893
2017-01-01 00:04:00 NaN
2017-01-01 00:05:00 NaN
2017-01-01 00:06:00 0.950088
2017-01-01 00:07:00 -0.151357
2017-01-01 00:08:00 -0.103219
2017-01-01 00:09:00 0.410599
Larger Example
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(5, size=(10, 5)).astype(float),
pd.DatetimeIndex(start='2017-01-01', freq='min', periods=10),
).add_prefix('col')
df
col0 col1 col2 col3 col4
2017-01-01 00:00:00 0.0 3.0 2.0 3.0 2.0
2017-01-01 00:01:00 2.0 3.0 2.0 3.0 0.0
2017-01-01 00:02:00 2.0 0.0 0.0 4.0 0.0
2017-01-01 00:03:00 2.0 2.0 0.0 4.0 1.0
2017-01-01 00:04:00 3.0 2.0 4.0 4.0 4.0
2017-01-01 00:05:00 4.0 3.0 3.0 3.0 4.0
2017-01-01 00:06:00 3.0 1.0 3.0 0.0 4.0
2017-01-01 00:07:00 4.0 2.0 2.0 0.0 2.0
2017-01-01 00:08:00 4.0 0.0 4.0 1.0 4.0
2017-01-01 00:09:00 4.0 2.0 2.0 0.0 2.0
df.mask(df.diff() == 0)
col0 col1 col2 col3 col4
2017-01-01 00:00:00 0.0 3.0 2.0 3.0 2.0
2017-01-01 00:01:00 2.0 NaN NaN NaN 0.0
2017-01-01 00:02:00 NaN 0.0 0.0 4.0 NaN
2017-01-01 00:03:00 NaN 2.0 NaN NaN 1.0
2017-01-01 00:04:00 3.0 NaN 4.0 NaN 4.0
2017-01-01 00:05:00 4.0 3.0 3.0 3.0 NaN
2017-01-01 00:06:00 3.0 1.0 NaN 0.0 NaN
2017-01-01 00:07:00 4.0 2.0 2.0 NaN 2.0
2017-01-01 00:08:00 NaN 0.0 4.0 1.0 4.0
2017-01-01 00:09:00 NaN 2.0 2.0 0.0 2.0
Upvotes: 1
Reputation:
If you want to control the window size, you can use the rolling object. The idea is that if n successive elements are the same, their standard deviation will be 0. The rest is
successive = (ser.where(np.isclose(ser.rolling(3).std(), 0, atol=10**-6))
.bfill(limit=2).notnull())
ser[successive] = np.nan
For Series, ser
ser = pd.Series([1, 1, 1, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 2, 1, 3, 2, 1, 1, 1])
this yields
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 1.0
7 1.0
8 NaN
9 NaN
10 NaN
11 NaN
12 1.0
13 2.0
14 1.0
15 3.0
16 2.0
17 NaN
18 NaN
19 NaN
Upvotes: 2
Reputation: 3591
You can do a for loop that keeps track of how many times something has been repeated:
replacement_value = np.nan
last_value = None
number_of_repetitions = 0
for index in range(len(values)):
if value == last_value:
if number_of_repetitions == 2:
#if we previously had 2 repetitions, we should replace both the current and the previous values
values[index-1] = replacement_value
values[index] = replacement_value
if number_of_repetitions == 3:
#if this is the third or more repetition, we've already replaced the previous value, so we just need to handle the current one
values[index] = replacement_value
else:
number_of_repetitions = number_of_repetitions+1
#if it hasn't reach 3 yet, we should increment every time we see a repetition
#but we don't need to keep track after 3
else:
#if this is a new value, we should reset
number_of_repetitions = 1
last_value = value
Upvotes: 1