Reputation: 610
I have a dataframe names_df with 800 million rows and two columns, first_name and last_name. I need to find the total number of unique names from the two columns combined.
first_name last_name
0 john doe
1 jane doe
2 doe john
3 doe jane
:
799999999 Levi Ackerman
800000000 Gojo Satoru
I can simply do:
unique_names = np.concatenate((names_df.first_name.unique(), names_df.last_name.unique()), axis=None)
unique_names=set(unique_names.tolist())
print(len(unique_names))
However, this takes a lot of time and is inefficient. What is a more efficient way of finding the total number of unique values from the two columns combined? unique_names would look like this:
>>>print(unique_names)
>>> {'john','jane','doe','levi','ackerman','Gojo','satoru'}
Upvotes: 1
Views: 98
Reputation: 378
Edit (for the updated question): If you want all unique values from multiple columns, you can use:
names= names_df[["first_name", "last_name"]].values.ravel()
unique_names = pd.unique(names)
n_unique_names = len(unique_names)
Old answer:
You could use drop_duplicates() from pandas and then look at the shape of the returned pandas.DataFrame. Whether, and by how much, that is faster remains to be seen, though.
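A minimal sketch of that idea on toy data (column names taken from the question); note this counts unique (first_name, last_name) pairs, i.e. it answers the original question about full names:

```python
import pandas as pd

# Toy frame standing in for the 800-million-row names_df
names_df = pd.DataFrame({
    "first_name": ["john", "jane", "doe", "doe"],
    "last_name":  ["doe", "doe", "john", "jane"],
})

# Drop duplicate (first_name, last_name) rows, then count what remains
n_unique_pairs = names_df.drop_duplicates().shape[0]
print(n_unique_pairs)  # prints 4: all four name pairs are distinct
```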
Upvotes: 0
Reputation: 26
Creating a temporary third column and computing the length of its unique values should be less computationally expensive.
names_df['full_name'] = names_df.first_name + ' ' + names_df.last_name
total_unique_length = len(names_df.full_name.unique())
names_df = names_df.drop(columns='full_name')
print(total_unique_length)
EDIT - based on your edit, you just want the unique names from both columns. If there is a Jane Doe and a John Doe, you want [John, Jane, Doe].
In this case it's much easier.
total_unique_length = len(set(names_df.first_name.unique()) | set(names_df.last_name.unique()))
print(total_unique_length)
Upvotes: 0
Reputation: 24314
Use this (it is faster than your method):
set(names_df['first_name'].unique().tolist()+names_df['last_name'].unique().tolist())
Upvotes: 1
Reputation: 76
If you have a multicore machine, sort the data and divide it into 26 batches, then use the Python multiprocessing library: get the uniques from each batch, and merge those per-batch uniques at the end.
Upvotes: 0