Nebojsa V NLPTrader
Nebojsa V NLPTrader

Reputation: 91

Number of occurrence of pair of value in dataframe

I have dataframe with following columns:

Name, Surname, dateOfBirth, city, country

I am interested to find what is most common combination of name and surname and how much it occurs as well. Would be nice also to see list of top 10 combinations.

My idea for top one was:

mostFreqComb= df.groupby(['Name','Surname'])['Name'].count().argmax()

But I think it is not giving me correct answer. Help would be much appreciated !

Thanks, Neb

Upvotes: 7

Views: 3536

Answers (2)

jpp
jpp

Reputation: 164783

For performance implications of the below solutions, see Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series. They are presented below with best performance first.

GroupBy.size

You can create a series of counts with (Name, Surname) tuple indices using GroupBy.size:

res = df.groupby(['Name', 'Surname']).size().sort_values(ascending=False)

By sorting these values, we can easily extract the most common:

most_common = res.head(1)
most_common_dups = res[res == res.iloc[0]].index.tolist()  # handles duplicate top counts

value_counts

Another way is to construct a series of tuples, then apply pd.Series.value_counts:

res = pd.Series(list(zip(df.Name, df.Surname))).value_counts()

The result will be a series of counts indexed by Name-Surname combinations, sorted from most common to least.

name, surname = res.index[0]  # return most common
most_common_dups = res[res == res.max()].index.tolist()

collections.Counter

If you wish to create a dictionary of (name, surname): counts entries, you can do so via collections.Counter:

from collections import Counter

zipper = zip(df.Name, df.Surname)
c = Counter(zipper)

Counter has useful methods such as most_common, which you can use to extract your result.

Upvotes: 11

cs95
cs95

Reputation: 402912

Seems like a good use case for the performant Counter:

from collections import Counter
popular_names = Counter(zip(df.Name, df.Surname)).most_common(10) 

Upvotes: 2

Related Questions