Tarek Allam

Reputation: 139

How to remove repetitions of length 3 or greater in Pandas DataFrame row?

I have a dataset which consists entirely of float values representing spatial data (basically a horizontal cutaway of a surface). Sometimes the sensor generating these values malfunctions and repeats the same value several times in a row. I want to remove repeating sequences of length 3 or greater, keeping the first value of each such run in place, while leaving all other occurrences of the value (including runs of length 2) untouched.

For example, suppose a row contains [0.5, 0.2, 0.2, 0.2, 0.2, 0.3, 0.5, 0.2, 0.2, ...]. There is a 4-long repetition of 0.2 at the beginning of the row and a 2-long repetition of 0.2 at the end. What I want to do is remove every value of the 4-long repetition of 0.2 while leaving the first instance where it is, and do nothing to the 2-long repetition. So the output I desire would be [0.5, 0.2, NaN, NaN, NaN, 0.3, 0.5, 0.2, 0.2, ...].

I know I can do this by just iterating through the rows and finding these sequences, but I'm wondering if there's a more efficient way using pandas' built-in functions or another library? The data files can be absolutely massive, so I need an efficient way to filter out these repetitions.

Upvotes: 1

Views: 185

Answers (2)

Adrien Wehrlé

Reputation: 191

Here is my approach:

import numpy as np
import pandas as pd
from collections import Counter

data = pd.DataFrame({'vals': [0.2, 0.2, 0.2, 0.3, 0.4, 0.1, 0.1, 0.1, 0.2, 0.5, 0.6, 0.8]})

# split data into blocks of consecutive equal values
data['blocks'] = (data.vals.shift(1) != data.vals).astype(int).cumsum()

# count the number of values in each block
blocks = Counter(data['blocks'])

def filter_block(block_id, nb_vals):

    # if the block has fewer than 3 values, leave it untouched
    if nb_vals < 3:
        return

    # otherwise, set all of its values to NaN except the first one
    indxs = data[data.blocks == block_id].index.values[1:]
    data.loc[indxs, 'vals'] = np.nan


# loop over the blocks
for block_id, nb_vals in blocks.items():

    filter_block(block_id, nb_vals)

data has been modified:

    vals  blocks
0    0.2       1
1    NaN       1
2    NaN       1
3    0.3       2
4    0.4       3
5    0.1       4
6    NaN       4
7    NaN       4
8    0.2       5
9    0.5       6
10   0.6       7
11   0.8       8

Upvotes: 0

ALollz

Reputation: 59579

Use shift + ne (not equal) + cumsum to create a unique label for each group of consecutive values. Then group by that label to find each group's size. Finally, use where to NaN the duplicated values within each consecutive group whose size is at least 3.

import pandas as pd
df = pd.DataFrame({'data': [0.5, 0.2, 0.2, 0.2, 0.2, 0.3, 0.6, 0.2, 0.2]})

df['grp'] = df['data'].ne(df['data'].shift()).cumsum()
df['size'] = df.groupby('grp').grp.transform('size')

df['data'].where(~(df['grp'].duplicated() & df['size'].ge(3))).tolist()
#[0.5, 0.2, nan, nan, nan, 0.3, 0.6, 0.2, 0.2]

With the created columns the DataFrame is:

print(df)
   data  grp  size
0   0.5    1     1
1   0.2    2     4
2   0.2    2     4
3   0.2    2     4
4   0.2    2     4
5   0.3    3     1
6   0.6    4     1
7   0.2    5     2
8   0.2    5     2

Upvotes: 2
