Reputation: 341
I have a spreadsheet where I had records of Cities in different columns and spread across different rows
For eg this is my dataset
df = pd.DataFrame([['Jersey City','NYC','Miami','Charlotte'],
['Santa Clara','Santa Barbara','NYC'],
['Jersey City','Miami']])
So this is how I tried to do, first a transpose
df.transpose()
Output is
print(df.transpose().to_string())
0 1 2
0 Jersey City Santa Clara Jersey City
1 NYC Santa Barbara Miami
2 Miami NYC None
3 Charlotte None None
(duplicates removed)
I am not sure if my approach is the desired way and or there is a better approach
Upvotes: 3
Views: 247
Reputation: 30940
Use:
pd.DataFrame(df.stack().unique())
Output
0
0 Jersey City
1 NYC
2 Miami
3 Charlotte
4 Santa Clara
5 Santa Barbara
Time comparision.
%%timeit
df.stack().drop_duplicates().reset_index(drop=True)
1.88 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
pd.DataFrame(df.stack().unique())
1.12 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 3
Reputation: 1018
This will work:
Data
df = pd.DataFrame([['Jersey City','NYC','Miami','Charlotte'],
['Santa Clara','Santa Barbara','NYC'],
['Jersey City','Miami']])
print(df.to_string())
0 1 2 3
0 Jersey City NYC Miami Charlotte
1 Santa Clara Santa Barbara NYC None
2 Jersey City Miami None None
Solution (You get same results with or without the transpose)
df.transpose().stack().drop_duplicates().reset_index(drop=True)
0 Jersey City
1 Santa Clara
2 NYC
3 Santa Barbara
4 Miami
5 Charlotte
dtype: object
df.stack().drop_duplicates().reset_index(drop=True)
0 Jersey City
1 NYC
2 Miami
3 Charlotte
4 Santa Clara
5 Santa Barbara
Upvotes: 1
Reputation: 403128
stack
and drop_duplicates
# read CSV
# df = pd.read_csv('data.csv', sep=',' header=None)
df.stack().drop_duplicates().reset_index(drop=True)
0 Jersey City
1 NYC
2 Miami
3 Charlotte
4 Santa Clara
5 Santa Barbara
6 Jersey City
dtype: object
Upvotes: 3