TheLastCoder

Reputation: 610

Find total number of unique elements in two columns combined

I have a dataframe names_df with 800 million rows and two columns, first_name and last_name. I need to find the total number of unique names from the two columns combined.

           first_name last_name
0          john       doe
1          jane       doe
2          doe        john
3          doe        jane
:
799999999  Levi       Ackerman
800000000  Gojo       Satoru

I can simply do:

unique_names = np.concatenate((names_df.first_name.unique(), names_df.last_name.unique()), axis=None)
unique_names = set(unique_names.tolist())
print(len(unique_names))

However, this takes a lot of time and is inefficient. What is a more efficient way of finding the total number of unique values from the two columns combined? The resulting unique_names would look like this:

>>> print(unique_names)
{'john', 'jane', 'doe', 'Levi', 'Ackerman', 'Gojo', 'Satoru'}

Upvotes: 1

Views: 98

Answers (4)

ernesi

Reputation: 378

Edit (upon the edited question): if you want all unique values from multiple columns, you can use:

names = names_df[["first_name", "last_name"]].values.ravel()
unique_names = pd.unique(names)
n_unique_names = len(unique_names)

Please see here
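On a small frame mirroring the question's columns (a minimal sketch with made-up rows), this approach works as follows:

```python
import pandas as pd

# Small frame mirroring the question's two columns
names_df = pd.DataFrame({
    "first_name": ["john", "jane", "doe", "doe"],
    "last_name": ["doe", "doe", "john", "jane"],
})

# Flatten both columns into one 1-D array, then deduplicate once
names = names_df[["first_name", "last_name"]].values.ravel()
unique_names = pd.unique(names)
print(len(unique_names))  # 3 -> {'john', 'jane', 'doe'}
```

pd.unique avoids the sort that set-based approaches sometimes trigger and deduplicates both columns in a single pass over the flattened array.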


Old answer:

You could use drop_duplicates() from pandas and then look at the shape of the returned pandas.DataFrame. Whether, and by how much, that is faster remains to be seen, though.

Upvotes: 0

DS-Dalton

Reputation: 26

Creating a temporary third column and computing the length of its unique values should be less computationally expensive.

# Use a separator so pairs like ('ab', 'c') and ('a', 'bc') don't collide
names_df['full_name'] = names_df.first_name + ' ' + names_df.last_name
total_unique_length = len(names_df.full_name.unique())
names_df = names_df.drop(columns='full_name')
print(total_unique_length)

EDIT - based on your edit, you just want the unique names from both columns. If there is a Jane Doe and a John Doe, you want [John, Jane, Doe].

In this case it's much easier. Note that a set union is needed rather than adding the two lengths, since a name such as 'doe' that appears in both columns would otherwise be counted twice.

total_unique_length = len(set(names_df.first_name.unique()) | set(names_df.last_name.unique()))
print(total_unique_length)

Upvotes: 0

Anurag Dabas

Reputation: 24314

Use this (it is faster than your method):

set(names_df['first_name'].unique().tolist()+names_df['last_name'].unique().tolist())
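A quick sanity check of the one-liner on rows like the question's sample (a minimal sketch with made-up data):

```python
import pandas as pd

names_df = pd.DataFrame({
    "first_name": ["john", "jane", "doe", "doe"],
    "last_name": ["doe", "doe", "john", "jane"],
})

# Deduplicate each column first, then union the two small results
unique_names = set(names_df['first_name'].unique().tolist()
                   + names_df['last_name'].unique().tolist())
print(sorted(unique_names))  # ['doe', 'jane', 'john']
print(len(unique_names))     # 3
```

Deduplicating each column before building the set means the final set union runs over two already-small lists instead of 1.6 billion raw values.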

Upvotes: 1

Prakitidev Verma

Reputation: 76

If you have a multicore machine, sort the data, divide it into 26 batches, and use the Python multiprocessing library. Get the uniques from each batch, then merge those unique results.
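The batch-then-merge idea can be sketched with multiprocessing.Pool as below (a minimal sketch: the chunk count and helper names are illustrative, and the sorting step is omitted since a set union handles the merge):

```python
import multiprocessing as mp

import pandas as pd

def chunk_uniques(chunk):
    """Distinct names within one chunk, both columns flattened."""
    return set(pd.unique(chunk[["first_name", "last_name"]].values.ravel()))

def split_frame(df, n_chunks):
    """Split a DataFrame row-wise into roughly equal slices."""
    step = -(-len(df) // n_chunks)  # ceiling division
    return [df.iloc[i:i + step] for i in range(0, len(df), step)]

def parallel_unique_count(df, n_chunks=4):
    # Each worker deduplicates its own slice; the partial sets are
    # then merged with a union, so shared names are counted once.
    with mp.Pool(processes=n_chunks) as pool:
        partial = pool.map(chunk_uniques, split_frame(df, n_chunks))
    return len(set().union(*partial))

if __name__ == "__main__":
    names_df = pd.DataFrame({
        "first_name": ["john", "jane", "doe", "doe"],
        "last_name": ["doe", "doe", "john", "jane"],
    })
    print(parallel_unique_count(names_df, n_chunks=2))  # 3
```

Note that the union in the merge step is what keeps the count correct: a name appearing in several batches is collapsed to one entry rather than being counted once per batch.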

Upvotes: 0
