Python - count distinct rows from a dataframe

Question

I have a dataframe in the following format:

UserId, CurrentUserLocationId, RegisteredUserLocationId, RestorauntId

I want to count the number of unique combinations of the key (UserId, CurrentUserLocationId, RegisteredUserLocationId).

For example, if the combination (1, 1, 1) appears at all, it should add exactly one to the final result, no matter how many rows contain it. In other words, each unique combination is counted only once.

I tried groupby(['col1', 'col2', 'col3']).size(), but this counts all the records in each group. The dataset I will be running the code on has a billion records.

Is there a built-in way to accomplish what I'm trying to do? Or to be more precise, what's the fastest way to do this sort of counting?

TLOwater · Accepted Answer

Use DataFrame.drop_duplicates() followed by DataFrame.count().

If necessary, copy the dataframe before dropping duplicates, and when making the copy select only the columns whose combinations should be unique.
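A minimal sketch of this approach, assuming pandas and the column names from the question. The sample rows and the key_cols variable are made up for illustration. Selecting only the key columns first returns a new dataframe, so the original is left untouched. GroupBy.ngroups is shown as one possible alternative that should give the same number when the key columns contain no missing values.

```python
import pandas as pd

# Made-up sample data using the column names from the question.
df = pd.DataFrame({
    "UserId": [1, 1, 2, 2, 3],
    "CurrentUserLocationId": [1, 1, 5, 6, 7],
    "RegisteredUserLocationId": [1, 1, 5, 5, 7],
    "RestorauntId": [10, 11, 12, 13, 14],
})

# Hypothetical name for the columns that form the key.
key_cols = ["UserId", "CurrentUserLocationId", "RegisteredUserLocationId"]

# Keep only the key columns, drop repeated combinations, count what is left.
unique_keys = df[key_cols].drop_duplicates()
n_unique = len(unique_keys)

# Alternative: the number of groups is the number of distinct combinations
# (groups with missing key values are excluded by default).
n_groups = df.groupby(key_cols).ngroups

print(n_unique, n_groups)
```

Here (1, 1, 1) appears in two rows but contributes only once to the count.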
