defraggled

Reputation: 1218

What is an efficient way to isolate dataframe rows with identical values in specific columns?

I have a Pandas DataFrame that looks like this:

AA,    BB,    CC,    _id1,     _id2,     _id3
"foo", "foo", "foo", "1A1A1A", "1A1A1A", "1A1A1A"
"foo", "foo", "foo", "2B2B2B", "4L4L4L", "2B2B2B"

I want to isolate rows with _id values that are not all identical. In the above example, it would isolate the second row, and ignore the first.

I do not know how many _id columns will exist (and need checking) in any given run, and their names will vary (but they will be available as a list of strings: id_column_names). The names will always have the _id prefix.

Is there any table-wise (or at least series-wise) operation that is reasonably efficient here?

I can of course write some function (to compare the relevant row values) and apply() it row-wise, as sketched below, but is there a better option?
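For concreteness, here is a minimal sketch of that apply() baseline, assuming the example data above and the id_column_names list:

import pandas as pd

df = pd.DataFrame({
    "AA": ["foo", "foo"], "BB": ["foo", "foo"], "CC": ["foo", "foo"],
    "_id1": ["1A1A1A", "2B2B2B"],
    "_id2": ["1A1A1A", "4L4L4L"],
    "_id3": ["1A1A1A", "2B2B2B"],
})
id_column_names = ["_id1", "_id2", "_id3"]

# Row-wise check: keep rows whose _id values are not all identical
mask = df[id_column_names].apply(lambda row: row.nunique() > 1, axis=1)
print(df[mask])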

Upvotes: 2

Views: 186

Answers (2)

jezrael

Reputation: 863266

First filter the _id columns with DataFrame.filter, then test whether every value in a row equals the first _id column using DataFrame.eq with DataFrame.all:

# Select only the _id columns
df1 = df.filter(regex='^_id')

# Keep rows where not every _id value equals the first _id column
df = df[~df1.eq(df1.iloc[:, 0], axis=0).all(axis=1)]
print(df)
   AA   BB   CC    _id1    _id2    _id3
1  foo  foo  foo  2B2B2B  4L4L4L  2B2B2B
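An equivalent one-liner (a variant sketch, not part of the original answer) counts distinct values per row with DataFrame.nunique instead of comparing against the first column:

# Keep rows where the _id columns hold more than one distinct value
df = df[df.filter(regex='^_id').nunique(axis=1) > 1]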

Upvotes: 2

wasif

Reputation: 15498

Yeah, use df.filter to select the _id columns:

df.filter(regex=r'^[_]')

To select the remaining (non-_id) columns instead:

df.filter(regex=r'^[^_]')
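Note that these expressions only select columns, so a row-wise check like the one in the question still has to follow. A quick sketch of what each filter returns on the example frame:

print(df.filter(regex=r'^[_]').columns.tolist())   # ['_id1', '_id2', '_id3']
print(df.filter(regex=r'^[^_]').columns.tolist())  # ['AA', 'BB', 'CC']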

Upvotes: 0
