jonboy
jonboy

Reputation: 368

Count of unique rows based on preceding row - Pandas

I want to return unique rows from multiple columns in a df. The issue is I want to include the same set of values if they don't appear in the previous row. This is a little hard to explain so I'll display it;

df = pd.DataFrame({   
        'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.3','2019-08-02 09:50:10.4','2019-08-02 09:50:10.5','2019-08-02 09:50:10.6','2019-08-02 09:50:10.7','2019-08-02 09:50:10.8','2019-08-02 09:50:10.9','2019-08-02 09:50:11.0'],
        'Code1' : ['A','A','B','B','C','C','A','A','B','B'],    
        'Code2' : ['B','B','A','A','B','B','B','B','A','A'],    
        'Code3' : [np.nan,np.nan,'C','C','A','A','C','C','C','C'],                                   
        })


df = df[df.iloc[:, 1:].shift().ne(df.iloc[:, 1:]).any(axis=1)].reset_index(drop = True)

Intended Output:

                    Time Code1 Code2 Code3
0  2019-08-02 09:50:10.1     A     B   NaN
1  2019-08-02 09:50:10.3     B     A     C
2  2019-08-02 09:50:10.5     C     B     A
3  2019-08-02 09:50:10.7     A     B     C
4  2019-08-02 09:50:10.9     B     A     C

Upvotes: 0

Views: 52

Answers (2)

Alexander
Alexander

Reputation: 109528

You can use a sentinel value for the nulls, then group on the result.

The groupby is a variant of the shift-cumsum pattern.

sentinel = 99999999
mask = df[cols].fillna(sentinel).ne(df[cols].fillna(sentinel).shift())
>>> df.groupby(
        mask.any(axis=1).cumsum(), 
        sort=False,
        as_index=False
    ).first()
                    Time Code1 Code2 Code3
0  2019-08-02 09:50:10.1     A     B   NaN
1  2019-08-02 09:50:10.3     B     A     C
2  2019-08-02 09:50:10.5     C     B     A
3  2019-08-02 09:50:10.7     A     B     C
4  2019-08-02 09:50:10.9     B     A     C

Upvotes: 1

Erfan
Erfan

Reputation: 42886

First we use iloc to select the correct columns, then we use shiftto check if current row is not equal to the next one. Finally we use any over axis=1 (columns). Because A B C and B A C are different, but have C in common:

df[df.iloc[:, 1:].shift().ne(df.iloc[:, 1:]).any(axis=1)]

Or the same, but written down little less concise:

mask = df.iloc[:, 1:].shift().ne(df.iloc[:, 1:])
df[mask.any(axis=1)]
                    Time Code1 Code2 Code3
0  2019-08-02 09:50:10.1     A     B     C
2  2019-08-02 09:50:10.3     B     A     C
4  2019-08-02 09:50:10.5     C     B     A
6  2019-08-02 09:50:10.7     A     B     C
8  2019-08-02 09:50:10.9     B     A     C

Upvotes: 1

Related Questions