user3423407
user3423407

Reputation: 341

Collapse multiple columns into one, removing duplicates

I have a spreadsheet where I had records of Cities in different columns and spread across different rows

For eg this is my dataset

df = pd.DataFrame([['Jersey City','NYC','Miami','Charlotte'],
   ['Santa Clara','Santa Barbara','NYC'],
   ['Jersey City','Miami']])

So this is how I tried to do, first a transpose

df.transpose()

Output is

print(df.transpose().to_string())

             0              1            2
0  Jersey City    Santa Clara  Jersey City
1          NYC  Santa Barbara        Miami
2        Miami            NYC         None
3    Charlotte           None         None

(duplicates removed)

I am not sure if my approach is the desired way and or there is a better approach

Upvotes: 3

Views: 247

Answers (3)

ansev
ansev

Reputation: 30940

Use:

pd.DataFrame(df.stack().unique())

Output

               0
0    Jersey City
1            NYC
2          Miami
3      Charlotte
4    Santa Clara
5  Santa Barbara

Time comparision.

%%timeit
df.stack().drop_duplicates().reset_index(drop=True)
1.88 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
pd.DataFrame(df.stack().unique())
1.12 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 3

Brandon
Brandon

Reputation: 1018

This will work:

Data

df = pd.DataFrame([['Jersey City','NYC','Miami','Charlotte'],
   ['Santa Clara','Santa Barbara','NYC'],
   ['Jersey City','Miami']])

print(df.to_string())

             0              1      2          3
0  Jersey City            NYC  Miami  Charlotte
1  Santa Clara  Santa Barbara    NYC       None
2  Jersey City          Miami   None       None

Solution (You get same results with or without the transpose)

df.transpose().stack().drop_duplicates().reset_index(drop=True)

0      Jersey City
1      Santa Clara
2              NYC
3    Santa Barbara
4            Miami
5        Charlotte
dtype: object
df.stack().drop_duplicates().reset_index(drop=True)

0      Jersey City
1              NYC
2            Miami
3        Charlotte
4      Santa Clara
5    Santa Barbara

Upvotes: 1

cs95
cs95

Reputation: 403128

stack and drop_duplicates

# read CSV
# df = pd.read_csv('data.csv', sep=',' header=None)

df.stack().drop_duplicates().reset_index(drop=True)

0       Jersey City
1               NYC
2             Miami
3         Charlotte
4       Santa Clara
5     Santa Barbara
6       Jersey City
dtype: object

Upvotes: 3

Related Questions