Find unique values in one column that have the same index in Pandas DataFrame

Question

I am trying to see what are the unique values corresponding to each index in my data frame. Assume I have the following dataframe:

        A   
ind1    a   
ind1    a   
ind1    a   
ind1    b   
ind1    b   
ind2    a 
ind2    b 
ind2    c

How can I get a table of:

        index_related_uniqe
ind1    a
ind1    b
ind2    a
ind2    b
ind2    c

I tried to use drop_duplicates, but it applies its removal to the whole column.

For example df.drop_duplicates('A', keep='first') gives:

ind1    a   
ind1    b   
ind2    c

Any idea is highly appreciated.

piRSquared · Accepted Answer

Option 1
Use pd.DataFrame.duplicated after resetting the index and use the values as a boolean mask. This reduces the need to reset the index then set it again. We simply reset it in order to take advantage of pd.DataFrame.duplicated and slice

df[~df.reset_index().duplicated().values]

      A
ind1  a
ind1  b
ind2  a
ind2  b
ind2  c

Option 2
We can use generators to remove ourselves from creating new pandas objects altogether.

d = {}
z = zip(df.index.values.tolist(), df.A.values.tolist())
df[[False if k in d else d.setdefault(k, True) for k in z]]

      A
ind1  a
ind1  b
ind2  a
ind2  b
ind2  c

Timing

We can see this improves performance over this tiny data sample.

%timeit df.reset_index().drop_duplicates().set_index('index')
%timeit df[~df.reset_index().duplicated().values]

%%timeit
d = {}
z = zip(df.index.values.tolist(), df.A.values.tolist())
df[[False if k in d else d.setdefault(k, True) for k in z]]

1.53 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
945 µs ± 44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
188 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

It also avoids producing an artifact of naming the previously unnamed index 'index'

Find unique values in one column that have the same index in Pandas DataFrame

Answers (1)

Related Questions