Reputation: 809
I have a pandas data frame with 2 million records. (2 columns such as DA1 and DA2) If I sort the data frame base on DA1 value, it looks like below. My real data frame has 2 million records.
df
DA1 DA2
0 1.0 1.0
1 1.0 11256.0
2 1.0 131248.0
3 2.0 2.0
4 2.0 11585.0
5 2.0 25894.0
6 3.0 3.0
If I sort the data frame base on DA2 value, it looks like below.
df.head(6)
DA1 DA2
0 1.0 1.0
87472 11256.0 1.0
267142 131248.0 1.0
3 2.0 2.0
568875 11585.5 2.0
598253 25894.0 2.0
652875 365845.0 2.0
If any row is flipping and has it in another row I want to drop that row. I mean if DA1 2.0 and DA2 365845.0 and another row DA1 365845.0 and DA2 2.0 I want to drop one of that.
I basically want to drop those repeat values in indexes 87472, 267142, 568875, and 598253. below rows basically want to drop.
DA1 DA2
87472 11256.0 1.0
267142 131248.0 1.0
568875 11585.5 2.0
598253 25894.0 2.0
Can I do that in a for loop or lambda function? **Most of them confuse with the above data frame. Here it's what I really want to do with a simple example. If any row flips and have the same values I want to drop that. DA1 1 and DA2 2 another row DA1 2 and DA2 1 I want to drop that. **.
df = pd.DataFrame({"Column1":[1, 1,1, 3,2],
"Column2":[1,2,3,1,1]})
df
Column1 Column2
0 1 1
1 1 2
2 1 3
3 3 1
4 2 1
# I basically need to delete 3, 4 indexes since those values already in 1,2 indexes
I need this data frame
df
Column1 Column2
0 1 1
1 1 2
2 1 3
Thanks in advance.
Upvotes: 2
Views: 621
Reputation: 485
It will require some engineering to find the duplicated rows considering the combination of the two columns. You can create a Series
object to show you the duplicated rows:
key=df.apply(lambda x: '{}-{}'.format(min(x), max(x)), axis=1)
This will basically create a key for each row with the ordered values in each column separated by a dash. Then you can use this key to remove the duplicated rows:
df[~key.duplicated()]
Example:
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 1, 4, 3, 5]})
>>> df
Output:
a b
0 1 2
1 2 1
2 3 4
3 4 3
4 5 5
>>> key=df.apply(lambda x: '{}-{}'.format(min(x), max(x)), axis=1)
>>> key
Output:
0 1-2
1 1-2
2 3-4
3 3-4
4 5-5
dtype: object
>>> df[~key.duplicated()]
Output:
a b
0 1 2
2 3 4
4 5 5
Upvotes: 2