Winston

Reputation: 1428

Find unique values in a Pandas column

I have a Pandas column in which each row holds multiple comma-separated values. I would like to obtain the set of unique values across the whole column. For example:

From:

+-------------------------------------------+
|                  Column                   |
+-------------------------------------------+
| 300000,50000,500000,100000,1000000,200000 |
| 100000,1000000,200000,300000,50000,500000 |
|                                       ... |
+-------------------------------------------+

To:

+--------+
| Column |
+--------+
|  50000 |
| 100000 |
| 200000 |
| 300000 |
|    ... |
+--------+

Thank you very much

Upvotes: 0

Views: 82

Answers (2)

luigigi

Reputation: 4215

This:

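>>> import pandas as pd
>>> data = {'column' : ["300000,50000,500000,100000,1000000,200000","100000,1000000,200000,300000,50000,500000"]}
>>> df = pd.DataFrame(data)
>>> # split on commas, one value per row, cast to int, deduplicate, sort, renumber, back to a DataFrame
>>> df.column.str.split(',').explode().astype(int).drop_duplicates().sort_values(ascending=True).reset_index(drop=True).to_frame()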

Outputs:

    column
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000
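
One detail worth noting when running a chain like this: `explode()` repeats the source row's label for every value it produces, so the index is only numbered 0 to n-1 after an explicit `reset_index(drop=True)`. A minimal sketch, assuming the two sample rows from the question (the variable names `exploded` and `unique_sorted` are just for illustration):

```python
import pandas as pd

# Sample data copied from the question; the column name follows the answer above.
df = pd.DataFrame({
    "column": [
        "300000,50000,500000,100000,1000000,200000",
        "100000,1000000,200000,300000,50000,500000",
    ]
})

# Split each cell on commas, then give every value its own row.
exploded = df["column"].str.split(",").explode()

# Each value keeps the label of the row it came from.
print(exploded.index.tolist())

unique_sorted = (
    exploded.astype(int)        # strings -> integers
    .drop_duplicates()          # keep the first occurrence of each value
    .sort_values()
    .reset_index(drop=True)     # renumber the index 0..n-1
)
print(unique_sorted.tolist())
```

Calling `.to_frame()` on the result turns the Series back into a one-column DataFrame if that is the shape needed downstream.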

Upvotes: 3

jezrael

Reputation: 862511

A pure pandas solution is likely to be slower on large data. The idea is to create a Series with split and stack, remove duplicates, convert to integers, and sort:

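df = (df['Column'].str.split(',', expand=True)  # one column per comma-separated value
                  .stack()                      # reshape into a single long Series
                  .drop_duplicates()            # keep the first occurrence of each value
                  .astype(int)                  # strings -> integers
                  .sort_values()
                  .reset_index(drop=True)       # renumber the index 0..n-1
                  .to_frame('col'))
print(df)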
       col
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000
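
To see what the split and stack steps produce before deduplication, here is a small sketch that inspects the intermediate objects. It assumes the two sample rows from the question; the names `wide` and `stacked` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Column": [
        "300000,50000,500000,100000,1000000,200000",
        "100000,1000000,200000,300000,50000,500000",
    ]
})

# expand=True spreads the split values across columns: one column per position.
wide = df["Column"].str.split(",", expand=True)
print(wide.shape)

# stack() folds those columns into one long Series indexed by (row, position).
stacked = wide.stack()
print(len(stacked), stacked.index.nlevels)
```

Because the stacked Series carries a two-level index, the `reset_index(drop=True)` at the end of the chain is what restores a plain 0 to n-1 numbering.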

Alternatively, flatten the split lists into a set, convert the values to integers, sort them, and finally pass the result to a DataFrame. This should be faster on a large DataFrame:

# Pick one of the following, depending on the missing values in the column.

# works only if there are no missing values (no NaN, no None)
L = sorted({int(y) for x in df['Column'] for y in x.split(',')})

# solution 1 (handles NaN): NaN is the only value not equal to itself
L = sorted({int(y) for x in df['Column'] if x == x for y in x.split(',')})

# solution 2 (handles None)
L = sorted({int(y) for x in df['Column'] if x is not None for y in x.split(',')})

# solution 3 (handles both NaN and None)
L = sorted({int(y) for x in df['Column'] if pd.notna(x) for y in x.split(',')})

df = pd.DataFrame({'col': L})
print(df)
       col
0    50000
1   100000
2   200000
3   300000
4   500000
5  1000000
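
The variant that handles both NaN and None can be wrapped in a small helper and tried on a column that actually contains missing values. This is only a sketch: the function name `unique_sorted_ints` and the sample data with gaps are assumptions made for the example, not part of the question.

```python
import numpy as np
import pandas as pd


def unique_sorted_ints(series: pd.Series) -> list:
    """Return the sorted distinct integers found in comma-separated strings."""
    return sorted({
        int(token)
        for cell in series
        if pd.notna(cell)              # skips both NaN and None
        for token in cell.split(",")
    })


# Hypothetical sample: two populated rows plus one NaN and one None.
df = pd.DataFrame({"Column": ["300000,50000,500000", np.nan, None, "50000,100000"]})

result = pd.DataFrame({"col": unique_sorted_ints(df["Column"])})
print(result)
```

Building the set directly skips the intermediate list, and `pd.notna` filters each cell before `split` is ever called on it.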

Upvotes: 1
