Reputation: 1428
I currently have a Pandas column in which each row holds multiple comma-separated values. I would like to obtain the set of unique values across the whole column. For example:
From:
+-------------------------------------------+
| Column |
+-------------------------------------------+
| 300000,50000,500000,100000,1000000,200000 |
| 100000,1000000,200000,300000,50000,500000 |
| ... |
+-------------------------------------------+
To:
+--------+
| Column |
+--------+
| 50000 |
| 100000 |
| 200000 |
| 300000 |
| ... |
+--------+
Thank you very much
Upvotes: 0
Views: 82
Reputation: 4215
This:
>>> import pandas as pd
>>> data = {'column': ["300000,50000,500000,100000,1000000,200000", "100000,1000000,200000,300000,50000,500000"]}
>>> df = pd.DataFrame(data)
>>> df.column.str.split(',').explode().astype(int).drop_duplicates().sort_values(ascending=True).reset_index(drop=True)
Outputs:
0      50000
1     100000
2     200000
3     300000
4     500000
5    1000000
Name: column, dtype: int64
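If you only need the values rather than a Series, a close variant of the same idea uses unique() with numpy's sort (a small sketch, not part of the original answer):
>>> import numpy as np
>>> np.sort(df.column.str.split(',').explode().astype(int).unique())
array([  50000,  100000,  200000,  300000,  500000, 1000000])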
Upvotes: 3
Reputation: 862511
A pure pandas solution should be slower on large data. The idea is to create a Series with split (using expand=True) and stack, remove duplicates, convert to integers and sort:
df = (df['Column'].str.split(',', expand=True)
.stack()
.drop_duplicates()
.astype(int)
.sort_values()
.reset_index(drop=True)
.to_frame('col'))
print (df)
col
0 50000
1 100000
2 200000
3 300000
4 500000
5 1000000
Or use a set comprehension to flatten the split lists, convert to integers, sort, and finally pass the result to a DataFrame; this solution should be faster on a large DataFrame:
#solution working if there are no missing values (no NaN, no None)
L = sorted({int(y) for x in df['Column'] for y in x.split(',')})
#solution1 (working with NaNs)
L = sorted({int(y) for x in df['Column'] if x == x for y in x.split(',')})
#solution2 (working with Nones)
L = sorted({int(y) for x in df['Column'] if x is not None for y in x.split(',')})
#solution3 (working with NaNs and Nones)
L = sorted({int(y) for x in df['Column'] if pd.notna(x) for y in x.split(',')})
df = pd.DataFrame({'col':L})
print (df)
col
0 50000
1 100000
2 200000
3 300000
4 500000
5 1000000
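If you want to check the performance claim on your own data, a minimal timing sketch along these lines should work (the large frame below is just the sample rows repeated, purely for illustration; actual numbers depend on your data and pandas version):
import timeit
import pandas as pd

# repeat the sample rows to build a larger frame (illustrative data only)
big = pd.DataFrame({'Column': ["300000,50000,500000,100000,1000000,200000",
                               "100000,1000000,200000,300000,50000,500000"] * 50000})

def pandas_chain():
    # pure pandas: split, stack, drop duplicates, convert, sort
    return (big['Column'].str.split(',', expand=True)
               .stack()
               .drop_duplicates()
               .astype(int)
               .sort_values()
               .reset_index(drop=True)
               .to_frame('col'))

def set_comprehension():
    # flatten with a set comprehension, then sort and wrap in a DataFrame
    return pd.DataFrame({'col': sorted({int(y) for x in big['Column'] for y in x.split(',')})})

print(timeit.timeit(pandas_chain, number=3))
print(timeit.timeit(set_comprehension, number=3))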
Upvotes: 1