monart

Reputation: 139

Find all rows which have different values in the columns of a Pandas DataFrame (time series)

Let's assume I have a pandas DataFrame in Python which shows the name of the business unit leader for different units over time. It could look something like this and can be recreated as follows:

import pandas as pd
import numpy as np

d = pd.DataFrame({
    'Boss_January': ['Nina', 'Lena', 'Max', np.nan],
    'Boss_February': ['Nina', 'Emilia', 'Max', 'Leonie'],
    'Boss_March': ['Nina', 'Lena', 'Mark', 'Leonie'],
})
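
Printed, the frame should look roughly like this, where each row corresponds to one unit (row 0 is Unit 0, and so on):

  Boss_January Boss_February Boss_March
0         Nina          Nina       Nina
1         Lena        Emilia       Lena
2          Max           Max       Mark
3          NaN        Leonie     Leonie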

I would like to find the units where there has been a change in the business unit leader, i.e. rows where the values across the different columns aren't all the same.

The result should include Unit 1, Unit 2 and Unit 3, because the unit leader changes (or a value is missing), but exclude Unit 0, where the leader stays the same.

Since the real DataFrame has many more columns, I don't want to check all the values by iterating over the rows and testing Boss_January == Boss_February == Boss_March by hand.
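
For illustration, the hard-coded check I would like to avoid looks something like this (only a sketch with the three example columns; the real data has many more Boss_* columns):

# Hard-coded pairwise comparison that does not scale to many columns
mask = (d['Boss_January'] != d['Boss_February']) | (d['Boss_February'] != d['Boss_March'])
print(d[mask])  # keeps Unit 1, 2 and 3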

Upvotes: 2

Views: 3838

Answers (2)

jpp

Reputation: 164623

You can check each column for equality with the first column, test whether all values in a row are True, and then take the negation:

res = d[~d.eq(d.iloc[:, 0], axis=0).all(1)]

print(res)

  Boss_January Boss_February Boss_March
1         Lena        Emilia       Lena
2          Max           Max       Mark
3          NaN        Leonie     Leonie
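
Spelled out step by step (just an illustration on the example frame), the chained expression does the following:

eq_first = d.eq(d.iloc[:, 0], axis=0)  # compare every column against the first column, row-wise
all_match = eq_first.all(1)            # True for rows where every column equals the first one
res = d[~all_match]                    # keep the rows where that is not the case

Since NaN never compares equal to anything, the row with the missing value is kept as well.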

If you just need indices, subsetting the dataframe is not required:

bools = ~d.eq(d.iloc[:, 0], axis=0).all(1)
idx = bools[bools].index

print(idx)

Int64Index([1, 2, 3], dtype='int64')

Upvotes: 2

Franco Piccolo

Reputation: 7410

You can use apply with axis=1 to look at each row and keep the rows that contain more than one unique element (x.unique() also counts NaN, so the row with the missing value is kept as well):

d[d.apply(lambda x: len(x.unique()) != 1, axis=1)]
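
If you prefer to avoid the Python-level lambda, a similar idea can be written with DataFrame.nunique (a sketch; dropna=False makes the missing value count as a distinct value as well):

# Count distinct values per row, including NaN, and keep rows with more than one
d[d.nunique(axis=1, dropna=False) > 1]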

Upvotes: 1
