rra

Reputation: 809

Drop rows on multiple conditions (based on 2 columns) in a pandas dataframe

I have a pandas data frame with 2 million records and two columns, DA1 and DA2. If I sort the data frame based on the DA1 value, it looks like this:

df
    DA1   DA2
0   1.0   1.0
1   1.0   11256.0
2   1.0   131248.0
3   2.0   2.0
4   2.0   11585.0
5   2.0   25894.0
6   3.0   3.0

If I sort the data frame based on the DA2 value, it looks like this:

df.head(6)
        DA1         DA2
0       1.0         1.0
87472   11256.0     1.0
267142  131248.0    1.0
3       2.0         2.0
568875  11585.5     2.0
598253  25894.0     2.0
652875  365845.0    2.0

If a row's values appear flipped in another row, I want to drop one of them. For example, if one row has DA1 = 2.0 and DA2 = 365845.0 and another row has DA1 = 365845.0 and DA2 = 2.0, I want to drop one of those two rows.

I basically want to drop the repeated values at indexes 87472, 267142, 568875, and 598253. These are the rows I want to drop:

        DA1         DA2
87472   11256.0     1.0
267142  131248.0    1.0
568875  11585.5     2.0
598253  25894.0     2.0

Can I do that with a for loop or a lambda function? **Most people are confused by the data frame above, so here is what I really want to do, with a simple example. If any row flips and has the same values, I want to drop it: if one row has DA1 = 1 and DA2 = 2 and another row has DA1 = 2 and DA2 = 1, I want to drop one of them.**

df = pd.DataFrame({"Column1": [1, 1, 1, 3, 2],
                   "Column2": [1, 2, 3, 1, 1]})

df
    Column1     Column2
0       1           1
1       1           2
2       1           3
3       3           1
4       2           1

# I basically need to delete indexes 3 and 4, since those values already appear at indexes 1 and 2
I need this data frame:
df
    Column1     Column2
0       1           1
1       1           2
2       1           3
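For reference, here is a lambda-based sketch of the kind of thing I have in mind (my own attempt, using the example columns above): build an order-independent key per row, then drop rows whose key has already been seen.

```python
import pandas as pd

df = pd.DataFrame({"Column1": [1, 1, 1, 3, 2],
                   "Column2": [1, 2, 3, 1, 1]})

# Build an order-independent key per row, so (3, 1) and (1, 3)
# compare equal, then keep only the first occurrence of each key.
key = df.apply(lambda row: tuple(sorted(row)), axis=1)
result = df[~key.duplicated()]
print(result)
```

This keeps rows 0, 1, and 2 and drops indexes 3 and 4, but I don't know whether a row-wise apply is reasonable on 2 million records.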

Thanks in advance.

Upvotes: 2

Views: 621

Answers (2)

Ricardo Erikson

Reputation: 485

It will take some engineering to find the duplicated rows based on the combination of the two columns. You can create a Series object that identifies the duplicated rows:

key = df.apply(lambda x: '{}-{}'.format(min(x), max(x)), axis=1)

This creates a key for each row from the row's two values in sorted order, separated by a dash, so flipped pairs produce the same key. Then you can use this key to remove the duplicated rows:

df[~key.duplicated()]

Example:

>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 1, 4, 3, 5]})
>>> df

Output: 

   a  b
0  1  2
1  2  1
2  3  4
3  4  3
4  5  5

>>> key = df.apply(lambda x: '{}-{}'.format(min(x), max(x)), axis=1)
>>> key
>>> key

Output:

0    1-2
1    1-2
2    3-4
3    3-4
4    5-5
dtype: object

>>> df[~key.duplicated()]

Output:

   a  b
0  1  2
2  3  4
4  5  5
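On 2 million rows the row-wise `apply` with string formatting can be slow. A vectorized variant of the same idea (my suggestion, not part of the approach above) sorts the two values of each row with NumPy and deduplicates on that key:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 1, 4, 3, 5]})

# Sort each (a, b) pair so flipped duplicates become identical rows,
# then drop later repeats of the sorted key.
key = pd.DataFrame(np.sort(df[['a', 'b']].to_numpy(), axis=1), index=df.index)
dedup = df[~key.duplicated()]
print(dedup)
```

This keeps the same rows (0, 2, 4) as the string-key version but avoids building a Python string per row.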

Upvotes: 2

BENY

Reputation: 323326

After your sort by DA2:

df = df.drop_duplicates('DA2')

Upvotes: 1
