aghd
aghd

Reputation: 715

Find unique values in one column that have the same index in Pandas DataFrame

I am trying to see what are the unique values corresponding to each index in my data frame. Assume I have the following dataframe:

        A   
ind1    a   
ind1    a   
ind1    a   
ind1    b   
ind1    b   
ind2    a 
ind2    b 
ind2    c 

How can I get a table of:

        index_related_uniqe
ind1    a
ind1    b
ind2    a
ind2    b
ind2    c

I tried to use drop_duplicates, but it applies its removal to the whole column.

For example df.drop_duplicates('A', keep='first') gives:

ind1    a   
ind1    b   
ind2    c 

Any idea is highly appreciated.

Upvotes: 1

Views: 1569

Answers (1)

piRSquared
piRSquared

Reputation: 294258

Option 1
Use pd.DataFrame.duplicated after resetting the index and use the values as a boolean mask. This reduces the need to reset the index then set it again. We simply reset it in order to take advantage of pd.DataFrame.duplicated and slice

df[~df.reset_index().duplicated().values]

      A
ind1  a
ind1  b
ind2  a
ind2  b
ind2  c

Option 2
We can use generators to remove ourselves from creating new pandas objects altogether.

d = {}
z = zip(df.index.values.tolist(), df.A.values.tolist())
df[[False if k in d else d.setdefault(k, True) for k in z]]

      A
ind1  a
ind1  b
ind2  a
ind2  b
ind2  c

Timing

We can see this improves performance over this tiny data sample.

%timeit df.reset_index().drop_duplicates().set_index('index')
%timeit df[~df.reset_index().duplicated().values]

%%timeit
d = {}
z = zip(df.index.values.tolist(), df.A.values.tolist())
df[[False if k in d else d.setdefault(k, True) for k in z]]

1.53 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
945 µs ± 44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
188 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

It also avoids producing an artifact of naming the previously unnamed index 'index'

Upvotes: 1

Related Questions