Reputation: 326
I have the following pandas dataframe:
df = pd.DataFrame([[5, 10],[8, 40],[8, 50],[10, 390], [10, 395], [10, 405], [11, 390], [11, 395], [11, 405], [13, 390], [13, 395], [13, 405]], columns=['index', 'so_id'])
index | so_id |
---|---|
5 | 10 |
8 | 40 |
8 | 50 |
10 | 390 |
10 | 395 |
10 | 405 |
11 | 390 |
11 | 395 |
11 | 405 |
13 | 390 |
13 | 395 |
13 | 405 |
The desired output would be the following:
index | so_id |
---|---|
5 | 10 |
8 | 40 |
10 | 390 |
11 | 395 |
13 | 405 |
Basically my goal is to drop duplicates on the column 'index' while keeping a different ascending value for the column 'so_id'.
The key point is that I don't want a simple drop_duplicates on the variable 'index' since this would get me the same 'so_id' after the drop_duplicates. I want drop_duplicates on 'index' and at the same time get the different values of the column 'so_id'.
Upvotes: 1
Views: 173
Reputation: 8219
Somewhat brute force but will find a solution even if a simplistic 'take first available number' strategy leads into a dead end (so more general). Also will tell you if a solution cannot be found (as may well be the case):
n_tries = 1000
for _ in range(n_tries):
df2 = df.groupby('index').apply(lambda g:g.sample(n=1))
if df2['so_id'].is_unique:
print('solution\n',df2)
break
else:
print('no solution found')
output
solution
index so_id
index
5 0 5 10
8 2 8 50
10 5 10 405
11 7 11 395
13 9 13 390
Upvotes: 2
Reputation: 195438
If your values are sorted, you can do:
seen = set()
def fn(x):
for val in x:
if val in seen:
continue
seen.add(val)
return val
df = df.groupby("index")["so_id"].apply(fn).reset_index()
print(df)
Prints:
index so_id
0 5 10
1 8 40
2 10 390
3 11 395
4 13 405
Upvotes: 4