Tarek Allam

Reputation: 139

How to remove repetitions of length 3 or greater in Pandas DataFrame row?

I have a dataset which consists entirely of float values representing spatial data (basically a horizontal cutaway of a surface). Sometimes the sensor generating these values malfunctions and repeats the same value several times in a row. I want to remove repeating sequences of length 3 or greater, keeping the first value of each such run in place, while leaving all other occurrences of the value (including runs of length 2) untouched.

For example, suppose a row contains [0.5, 0.2, 0.2, 0.2, 0.2, 0.3, 0.5, 0.2, 0.2, ...]. There is a 4-long repetition of 0.2 at the beginning of the row and a 2-long repetition of 0.2 at the end. What I want to do is remove every value of the 4-long repetition of 0.2 while leaving the first instance where it is, and do nothing to the 2-long repetition. So the output I desire would be [0.5, 0.2, NaN, NaN, NaN, 0.3, 0.5, 0.2, 0.2, ...].

I know I can do this by just iterating through the rows and finding these sequences, but I'm wondering if there's a more efficient way using pandas' built-in functions or another library? The data files can be absolutely massive, so I need an efficient way to filter out these repetitions.

Upvotes: 1

Views: 185

Answers (2)

Adrien Wehrlé

Reputation: 191

Here is my approach:

import numpy as np
import pandas as pd
from collections import Counter

data = pd.DataFrame({'vals': [0.2, 0.2, 0.2, 0.3, 0.4, 0.1, 0.1, 0.1, 0.2, 0.5, 0.6, 0.8]})

# split data into blocks of consecutive equal values
data['blocks'] = (data.vals.shift(1) != data.vals).astype(int).cumsum()

# count the number of values in each block
blocks = Counter(data['blocks'])

def filter_block(block_id, nb_vals):

    # if the block has fewer than 3 values, leave it untouched
    if nb_vals < 3:
        return

    # otherwise, set all of its values to NaN except the first one
    indxs = data[data.blocks == block_id].index.values[1:]
    data.loc[indxs, 'vals'] = np.nan


# loop over the blocks
for block_id, nb_vals in blocks.items():

    filter_block(block_id, nb_vals)

data has been modified:

    vals  blocks
0    0.2       1
1    NaN       1
2    NaN       1
3    0.3       2
4    0.4       3
5    0.1       4
6    NaN       4
7    NaN       4
8    0.2       5
9    0.5       6
10   0.6       7
11   0.8       8

Upvotes: 0

ALollz

Reputation: 59579

Use shift + ne (not equal) + cumsum to create a unique label for each group of consecutive values. Then group by that label to find each group's size. Finally, use where to NaN the duplicated values within each consecutive group whose size is at least 3.

import pandas as pd
df = pd.DataFrame({'data': [0.5, 0.2, 0.2, 0.2, 0.2, 0.3, 0.6, 0.2, 0.2]})

df['grp'] = df['data'].ne(df['data'].shift()).cumsum()
df['size'] = df.groupby('grp').grp.transform('size')

df['data'].where(~(df['grp'].duplicated() & df['size'].ge(3))).tolist()
#[0.5, 0.2, nan, nan, nan, 0.3, 0.6, 0.2, 0.2]

With the created columns the DataFrame is:

print(df)
   data  grp  size
0   0.5    1     1
1   0.2    2     4
2   0.2    2     4
3   0.2    2     4
4   0.2    2     4
5   0.3    3     1
6   0.6    4     1
7   0.2    5     2
8   0.2    5     2

Upvotes: 2
