user1911092
user1911092

Reputation: 4241

Two Columns of a pandas dataframe - Concat in Python

New to pandas python.

I have a dataframe (df) with two columns of cusips. I want to turn those columns into a list of the unique entries of the two columns.

My first attempt was to do the following:

cusips = pd.concat(df['long'], df['short']).

This returned the error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

I have read a few postings, but I am still having trouble with why this comes up. What am I missing here?

Also, what's the most efficient way to select the unique entries in a column or a dataframe? Can I call it in one function? Does the function differ if I want to create a list or a new, one-coulmn dataframe?

Thank you.

Upvotes: 1

Views: 2380

Answers (2)

Andy Hayden
Andy Hayden

Reputation: 375435

To obtain the unique values in a column you can use the unique Series method, which will return a numpy array of the unique values (and it is fast!).

df.long.unique()
# returns numpy array of unique values

You could then use numpy.append:

np.append(df.long.unique(), df.short.unique())

Note: This just appends the two unique results together and so itself is not unique!

.

Here's a (trivial) example:

import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2], [1, 4]], columns=['long','short'])

In [4]: df
Out[4]: 
   long  short
0     1      2
1     1      4

In [5]: df.long.unique()
Out[5]: array([1])

In [6]: df.short.unique()
Out[6]: array([2, 4])

And then appending the resulting two arrays:

In [7]: np.append(df.long.unique(), df.short.unique())
Out[7]: array([1, 2, 4])

Using @Zalazny7's set is significantly faster (since it runs over the array only once) and somewhat upsettingly it's even faster than np.unique (which sorts the resulting array!).

Upvotes: 1

Zelazny7
Zelazny7

Reputation: 40628

Adding to Hayden's answer, you could also use the set() method for the same result. The performance is slightly better if that's a consideration:

In [28]: %timeit set(np.append(df[0],df[1]))
100000 loops, best of 3: 19.6 us per loop

In [29]: %timeit np.append(df[0].unique(), df[1].unique())
10000 loops, best of 3: 55 us per loop

Upvotes: 1

Related Questions