Reputation: 715
I am trying to see what are the unique values corresponding to each index in my data frame. Assume I have the following dataframe:
A
ind1 a
ind1 a
ind1 a
ind1 b
ind1 b
ind2 a
ind2 b
ind2 c
How can I get a table of:
index_related_uniqe
ind1 a
ind1 b
ind2 a
ind2 b
ind2 c
I tried to use drop_duplicates, but it applies its removal to the whole column.
For example df.drop_duplicates('A', keep='first') gives:
ind1 a
ind1 b
ind2 c
Any idea is highly appreciated.
Upvotes: 1
Views: 1569
Reputation: 294258
Option 1
Use pd.DataFrame.duplicated
after resetting the index and use the values as a boolean mask. This reduces the need to reset the index then set it again. We simply reset it in order to take advantage of pd.DataFrame.duplicated
and slice
df[~df.reset_index().duplicated().values]
A
ind1 a
ind1 b
ind2 a
ind2 b
ind2 c
Option 2
We can use generators to remove ourselves from creating new pandas objects altogether.
d = {}
z = zip(df.index.values.tolist(), df.A.values.tolist())
df[[False if k in d else d.setdefault(k, True) for k in z]]
A
ind1 a
ind1 b
ind2 a
ind2 b
ind2 c
Timing
We can see this improves performance over this tiny data sample.
%timeit df.reset_index().drop_duplicates().set_index('index')
%timeit df[~df.reset_index().duplicated().values]
%%timeit
d = {}
z = zip(df.index.values.tolist(), df.A.values.tolist())
df[[False if k in d else d.setdefault(k, True) for k in z]]
1.53 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
945 µs ± 44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
188 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
It also avoids producing an artifact of naming the previously unnamed index 'index'
Upvotes: 1