defraggled

Reputation: 1218

What is an efficient way to isolate dataframe rows with identical values in specific columns?

I have a Pandas DataFrame that looks like this:

AA,    BB,    CC,    _id1,     _id2,     _id3
"foo", "foo", "foo", "1A1A1A", "1A1A1A", "1A1A1A"
"foo", "foo", "foo", "2B2B2B", "4L4L4L", "2B2B2B"

I want to isolate rows with _id values that are not all identical. In the above example, it would isolate the second row, and ignore the first.

I do not know how many _id columns will exist (and need checking) in any given run, and their names will vary (but they will be available as a list of strings: id_column_names). The names will always have the _id prefix.

Is there any table-wise (or at least series-wise) operation that is reasonably efficient here?

I can of course write some function (to compare the relevant row values) and apply() it row-wise, as sketched below, but is there a better option?
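For concreteness, here is a minimal sketch of that apply() baseline, assuming the example data above and the id_column_names list:

import pandas as pd

df = pd.DataFrame({
    "AA": ["foo", "foo"], "BB": ["foo", "foo"], "CC": ["foo", "foo"],
    "_id1": ["1A1A1A", "2B2B2B"],
    "_id2": ["1A1A1A", "4L4L4L"],
    "_id3": ["1A1A1A", "2B2B2B"],
})
id_column_names = ["_id1", "_id2", "_id3"]

# Row-wise check: keep rows whose _id values are not all identical
mask = df[id_column_names].apply(lambda row: row.nunique() > 1, axis=1)
print(df[mask])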

Upvotes: 2

Views: 186

Answers (2)

jezrael

Reputation: 863266

First filter the _id columns with DataFrame.filter, then test whether every value in a row equals the first _id column using DataFrame.eq with DataFrame.all:

# Select only the _id columns
df1 = df.filter(regex='^_id')

# Keep rows where not every _id value equals the first _id column
df = df[~df1.eq(df1.iloc[:, 0], axis=0).all(axis=1)]
print(df)
   AA   BB   CC    _id1    _id2    _id3
1  foo  foo  foo  2B2B2B  4L4L4L  2B2B2B
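An equivalent one-liner (a variant sketch, not part of the original answer) counts distinct values per row with DataFrame.nunique instead of comparing against the first column:

# Keep rows where the _id columns hold more than one distinct value
df = df[df.filter(regex='^_id').nunique(axis=1) > 1]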

Upvotes: 2

wasif

Reputation: 15498

Yeah, use df.filter to select the _id columns:

df.filter(regex=r'^[_]')

To select the remaining (non-_id) columns instead:

df.filter(regex=r'^[^_]')
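Note that these expressions only select columns, so a row-wise check like the one in the question still has to follow. A quick sketch of what each filter returns on the example frame:

print(df.filter(regex=r'^[_]').columns.tolist())   # ['_id1', '_id2', '_id3']
print(df.filter(regex=r'^[^_]').columns.tolist())  # ['AA', 'BB', 'CC']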

Upvotes: 0
