Reputation: 139
I have a dataset which consists entirely of float values representing spatial data (basically a horizontal cutaway of a surface). Sometimes the sensor generating these values malfunctions and repeats a value several times in a row. I want to remove repeating sequences of length 3 or greater while leaving the first value of the repetition, and leave all other instances of the value (including sequences of length 2) where they are.
For example, suppose a row contains [0.5, 0.2, 0.2, 0.2, 0.2, 0.3, 0.5, 0.2, 0.2, ...]. There is a 4-long repetition of 0.2 near the beginning of the row and a 2-long repetition of 0.2 at the end. What I want to do is remove every value of the 4-long repetition while leaving the first instance where it is, and do nothing to the 2-long repetition. So the output I desire would be [0.5, 0.2, NaN, NaN, NaN, 0.3, 0.5, 0.2, 0.2, ...].
I know I can do this by just iterating through the rows and finding these sequences, but I'm wondering if there's a more efficient way using Pandas's built-in functions or another library? The data files can be absolutely massive, so I need an efficient way to filter out these repetitions.
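For reference, a minimal sketch of the plain loop alluded to above might look like the following, assuming each row is a 1-D NumPy array; the name mask_runs_naive and the min_run parameter are introduced here purely for illustration:

import numpy as np

def mask_runs_naive(row, min_run=3):
    # Replace all but the first value of any run of length >= min_run with NaN.
    row = row.astype(float).copy()
    i, n = 0, len(row)
    while i < n:
        j = i
        # advance j to the last element of the run of equal values starting at i
        while j + 1 < n and row[j + 1] == row[i]:
            j += 1
        if j - i + 1 >= min_run:
            row[i + 1:j + 1] = np.nan  # keep only the first value of the run
        i = j + 1
    return row

mask_runs_naive(np.array([0.5, 0.2, 0.2, 0.2, 0.2, 0.3, 0.5, 0.2, 0.2]))
# array([0.5, 0.2, nan, nan, nan, 0.3, 0.5, 0.2, 0.2])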
Upvotes: 1
Views: 185
Reputation: 191
Here is my proposal:
import numpy as np
import pandas as pd
from collections import Counter

data = pd.DataFrame({'vals': [0.2, 0.2, 0.2, 0.3, 0.4, 0.1, 0.1, 0.1, 0.2, 0.5, 0.6, 0.8]})

# split the data into blocks of consecutive equal values
data['blocks'] = (data.vals.shift(1) != data.vals).astype(int).cumsum()

# count the number of values in each block
blocks = Counter(data['blocks'])

def filter_block(block_id, nb_vals):
    # blocks shorter than 3 values are left unchanged
    if nb_vals < 3:
        return
    # otherwise set every value except the first one to NaN
    indxs = data[data.blocks == block_id].index.values[1:]
    data.loc[indxs, 'vals'] = np.nan

# loop over the blocks
for block_id, nb_vals in blocks.items():
    filter_block(block_id, nb_vals)
After the loop, data has been modified in place:
vals blocks
0 0.2 1
1 NaN 1
2 NaN 1
3 0.3 2
4 0.4 3
5 0.1 4
6 NaN 4
7 NaN 4
8 0.2 5
9 0.5 6
10 0.6 7
11 0.8 8
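As a check against the example in the question (a 4-long run of 0.2 followed later by a 2-long run), the same steps can be rerun on that row; the frame name and filter_block are reused from the snippet above, and the expected output is the one the question asks for:

# sanity check: the 4-long run is masked after its first value,
# while the 2-long run at the end is left alone
data = pd.DataFrame({'vals': [0.5, 0.2, 0.2, 0.2, 0.2, 0.3, 0.5, 0.2, 0.2]})
data['blocks'] = (data.vals.shift(1) != data.vals).astype(int).cumsum()
for block_id, nb_vals in Counter(data['blocks']).items():
    filter_block(block_id, nb_vals)
print(data.vals.tolist())
# [0.5, 0.2, nan, nan, nan, 0.3, 0.5, 0.2, 0.2]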
Upvotes: 0
Reputation: 59579
Use shift + ne (not equal) + cumsum to create a unique label for each group of consecutive values. Then we group to find the size of each group. You can then use where to NaN the duplicated values within each consecutive group when the group reaches a certain size (3 or more here).
import pandas as pd
df = pd.DataFrame({'data': [0.5, 0.2, 0.2, 0.2, 0.2, 0.3, 0.6, 0.2, 0.2]})
df['grp'] = df['data'].ne(df['data'].shift()).cumsum()
df['size'] = df.groupby('grp').grp.transform('size')
df['data'].where(~(df['grp'].duplicated() & df['size'].ge(3))).tolist()
#[0.5, 0.2, nan, nan, nan, 0.3, 0.6, 0.2, 0.2]
With the created columns the DataFrame is:
print(df)
data grp size
0 0.5 1 1
1 0.2 2 4
2 0.2 2 4
3 0.2 2 4
4 0.2 2 4
5 0.3 3 1
6 0.6 4 1
7 0.2 5 2
8 0.2 5 2
Upvotes: 2
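Since the question mentions many rows of spatial data, the same labelling can be wrapped in a function and applied row-wise to a wide DataFrame. This is a sketch under that assumption; mask_long_runs, wide, and min_run are names introduced here, and apply with axis=1 trades some speed for simplicity:

import numpy as np
import pandas as pd

def mask_long_runs(row, min_run=3):
    # label consecutive runs, measure their sizes, and NaN everything
    # after the first element of runs of length >= min_run
    grp = row.ne(row.shift()).cumsum()
    size = row.groupby(grp).transform('size')
    return row.where(~(grp.duplicated() & size.ge(min_run)))

wide = pd.DataFrame([[0.5, 0.2, 0.2, 0.2, 0.2, 0.3],
                     [0.1, 0.1, 0.4, 0.4, 0.4, 0.4]])
masked = wide.apply(mask_long_runs, axis=1)
# row 0 -> [0.5, 0.2, NaN, NaN, NaN, 0.3]
# row 1 -> [0.1, 0.1, 0.4, NaN, NaN, NaN]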