Find unique value in Pandas

Question

I have a Pandas column in which each row contains multiple comma-separated values. I would like to obtain the set of unique values across the whole column. For example:

From:

+-------------------------------------------+
|                  Column                   |
+-------------------------------------------+
| 300000,50000,500000,100000,1000000,200000 |
| 100000,1000000,200000,300000,50000,500000 |
|                                       ... |
+-------------------------------------------+

To:

+--------+
| Column |
+--------+
|  50000 |
| 100000 |
| 200000 |
| 300000 |
|    ... |
+--------+

Thank you very much

jezrael · Accepted Answer

A pure pandas solution is likely to be slower on large data. The idea is to create a Series with split and stack, remove duplicates, convert to integers, and sort:

df = (df['Column'].str.split(',', expand=True)
                  .stack()
                  .drop_duplicates()
                  .astype(int)
                  .sort_values()
                  .reset_index(drop=True)
                  .to_frame('col'))
print (df)
       col
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000
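As a variation on the chain above, here is a minimal sketch using Series.explode (available in pandas 0.25 and later) instead of expand=True plus stack. The sample frame copies the two rows from the question; the extra None row is an assumption added only to show that missing values are dropped.

```python
import pandas as pd

# Sample data mirroring the question; the None row is an added assumption
df = pd.DataFrame({'Column': ['300000,50000,500000,100000,1000000,200000',
                              '100000,1000000,200000,300000,50000,500000',
                              None]})

out = (df['Column'].str.split(',')      # each row becomes a list of strings
                   .explode()           # one list element per row
                   .dropna()            # discard rows that were missing
                   .astype(int)         # strings to integers
                   .drop_duplicates()
                   .sort_values()
                   .reset_index(drop=True)
                   .to_frame('col'))
print(out)
```

Converting to integers before drop_duplicates means values are compared as numbers rather than as strings.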

Alternatively, use a set comprehension to flatten the split lists, convert the values to integers, sort them, and pass the result to the DataFrame constructor. This should be faster on a large DataFrame:

# works only if the column has no missing values (no NaN, no None)
L = sorted({int(y) for x in df['Column'] for y in x.split(',')})

# solution 1: skips NaN (NaN != NaN, so x == x is False for NaN)
L = sorted({int(y) for x in df['Column'] if x == x for y in x.split(',')})

# solution 2: skips None
L = sorted({int(y) for x in df['Column'] if x is not None for y in x.split(',')})

# solution 3: skips both NaN and None
L = sorted({int(y) for x in df['Column'] if pd.notna(x) for y in x.split(',')})

df = pd.DataFrame({'col':L})
print (df)
       col
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000
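For reference, the comprehension approach can be wrapped into a small self-contained sketch. The helper name unique_ints and the sample Series (including its None and NaN entries) are hypothetical, chosen only to exercise the pd.notna filter.

```python
import pandas as pd

def unique_ints(series):
    """Hypothetical helper: sorted unique integers from comma-separated strings.

    Entries that are None or NaN are skipped via pd.notna.
    """
    return sorted({int(y) for x in series if pd.notna(x) for y in x.split(',')})

# Assumed sample data with both kinds of missing value
s = pd.Series(['300000,50000', None, float('nan'), '50000,100000'])

result = pd.DataFrame({'col': unique_ints(s)})
print(result)
```

Building a set first removes duplicates in a single pass, and sorting happens once on the already deduplicated values.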
