Javier Monsalve

Reputation: 326

Drop duplicates based on condition

I have the following pandas dataframe:

import pandas as pd

df = pd.DataFrame(
    [[5, 10], [8, 40], [8, 50], [10, 390], [10, 395], [10, 405],
     [11, 390], [11, 395], [11, 405], [13, 390], [13, 395], [13, 405]],
    columns=['index', 'so_id'])
index  so_id
    5     10
    8     40
    8     50
   10    390
   10    395
   10    405
   11    390
   11    395
   11    405
   13    390
   13    395
   13    405

The desired output would be the following:

index  so_id
    5     10
    8     40
   10    390
   11    395
   13    405

Basically my goal is to drop duplicates on the column 'index' while keeping a different, ascending value of the column 'so_id' for each remaining row.

The key point is that I don't want a simple drop_duplicates on 'index', since that would leave me with the same 'so_id' in several rows. I want to drop duplicates on 'index' and at the same time end up with different values in the column 'so_id'.
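For illustration, a plain drop_duplicates keeps the first row of each 'index' group, so 'so_id' 390 would be repeated for indexes 10, 11 and 13:

# keeps only the first row per 'index', so 390 appears three times
print(df.drop_duplicates(subset='index'))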

Upvotes: 1

Views: 173

Answers (2)

piterbarg

Reputation: 8219

Somewhat brute force, but it will find a solution even when a simplistic 'take the first available number' strategy leads into a dead end (so it is more general). It will also tell you if a solution cannot be found (as may well be the case):

n_tries = 1000
for _ in range(n_tries):
    # draw one random row per 'index' group
    df2 = df.groupby('index').apply(lambda g: g.sample(n=1))
    # accept the draw only if every chosen so_id is distinct
    if df2['so_id'].is_unique:
        print('solution\n', df2)
        break
else:
    # the loop ran out of tries without finding a valid draw
    print('no solution found')

output

solution
          index  so_id
index                
5     0      5     10
8     2      8     50
10    5     10    405
11    7     11    395
13    9     13    390
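
If the extra group level that groupby/apply adds to the index gets in the way, a quick cleanup could be (a small sketch reusing the df2 from above):

# replace the MultiIndex created by groupby/apply with a plain 0..n-1 index
df2 = df2.reset_index(drop=True)
print(df2)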

Upvotes: 2

Andrej Kesely

Reputation: 195438

If your values are sorted, you can do:

seen = set()


def fn(x):
    # return the first so_id in this group that has not been used yet
    for val in x:
        if val in seen:
            continue
        seen.add(val)
        return val


df = df.groupby("index")["so_id"].apply(fn).reset_index()
print(df)

Prints:

   index  so_id
0      5     10
1      8     40
2     10    390
3     11    395
4     13    405

Upvotes: 4
