Verbal_Kint

Reputation: 1416

Count commonalities/frequency among subgroups

I have a Series as follows:

(multi-index)              |  Count  |
Group  |  Source  |  Name  |
______________________________________
  A    |    X     |  Jolly |   2
       |          |  Stone |   1
       |    Y     |  Sand  |   1
  B    |    X     |  Sand  |   1
       |          |  Beach |   1
       |    Y     |  Stone |   1

I also have the precursor dataframe to this, which looks like:

Group  |  Source  |  Name
___________________________
  A    |    X     |  Jolly
  A    |    X     |  Stone
  A    |    X     |  Jolly
  A    |    Y     |  Sand
  B    |    X     |  Sand
  B    |    X     |  Beach
  B    |    Y     |  Stone
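
For reference, this frame (called dfex in the edits below) can be built as:

import pandas as pd

dfex = pd.DataFrame({
    'Group':  ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Source': ['X', 'X', 'X', 'Y', 'X', 'X', 'Y'],
    'Name':   ['Jolly', 'Stone', 'Jolly', 'Sand', 'Sand', 'Beach', 'Stone'],
})

# the multi-index Count Series above (row order may differ)
counts = dfex.groupby(['Group', 'Source', 'Name']).size()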

Each Group is guaranteed to have a Source X and a Source Y, and the corresponding Names for X and Y within a Group do not overlap. That is to say, if a Name is found for Source X, it won't be found for Source Y within that Group.

For example, I want to count how many times Stone and Sand are seen together for different Sources within a Group. In Group A, the Source X Names Jolly (seen twice) and Stone (seen once) each appear alongside the Source Y Name Sand. If I see Jolly and Sand frequently together across Groups, where each comes from a different Source, this may mean they reference the same entity.

How do I accomplish this with pandas?

I don't have a particular expected output in mind, but perhaps something like this:

                                Y
            Stone | Sand  |  Beach  |  Water
   Stone      NaN     1       NaN       NaN
X  Sand       1      NaN      NaN       NaN
   ...      ...      ...      ...       ...
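
For what it's worth, a matrix of that shape can be produced with pd.crosstab; a minimal sketch, assuming unique Names per (Group, Source) and accepting 0 instead of NaN where a pair never co-occurs:

import pandas as pd

# one row per unique (Group, Name) for each Source
x = dfex.loc[dfex['Source'] == 'X', ['Group', 'Name']].drop_duplicates()
y = dfex.loc[dfex['Source'] == 'Y', ['Group', 'Name']].drop_duplicates()

# all cross-source Name pairs within each Group
pairs = x.merge(y, on='Group', suffixes=('_x', '_y'))

# rows: Source X Names, columns: Source Y Names
matrix = pd.crosstab(pairs['Name_x'], pairs['Name_y'])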

EDIT #1: I was able to arrive at this based on @mozway's answer:

from itertools import product

def get_source_grouped_combos(df):
    # all cross-source (X Name, Y Name) pairs within a single Group
    src_x = df.loc[df['Source'] == 'X', 'Name'].unique()
    src_y = df.loc[df['Source'] == 'Y', 'Name'].unique()
    return list(product(src_x, src_y))

((dfex.loc[dfex['Group'].map(
    dfex.groupby(['Group', 'Source'])['Name']
    .agg(set)
    .groupby(level=0)
    # keep only Groups whose X and Y Name sets don't intersect
    .agg(lambda x: len(set.intersection(*x)) == 0))])
 .groupby('Group')[['Source', 'Name']]
 .apply(get_source_grouped_combos)
 .explode()      # one (X, Y) pair per row
 .agg(set)       # make each pair order-insensitive
 .value_counts()
)

Which outputs this:

{Sand, Stone}     2
{Jolly, Sand}     1
{Beach, Stone}    1
dtype: int64

The apply step is pretty slow, however. Is there a faster way?
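
One possible vectorized sketch, leaning on the stated guarantee that X and Y Names never overlap within a Group (so the intersection filter above can be skipped) and treating each pair as unordered; not benchmarked:

import numpy as np
import pandas as pd

x = dfex.loc[dfex['Source'] == 'X', ['Group', 'Name']].drop_duplicates()
y = dfex.loc[dfex['Source'] == 'Y', ['Group', 'Name']].drop_duplicates()

# all cross-source Name pairs per Group via a self-merge
pairs = x.merge(y, on='Group', suffixes=('_x', '_y'))

# sort each pair so (Stone, Sand) and (Sand, Stone) count as one
unordered = pd.DataFrame(
    np.sort(pairs[['Name_x', 'Name_y']].to_numpy(), axis=1),
    columns=['a', 'b'],
)
counts = unordered.value_counts()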

EDIT #2: When applying this to my actual dataset (which has all the same dtypes per column), the last two lines of the above solution:

 .agg(set)
 .value_counts()

throw the following error:

AttributeError: 'set' object has no attribute 'value_counts'

.agg(set) seems to be wrapping the entire Series in a set instead of each value. How can I fix it so it counts the occurrence of each set of values?

EDIT #3: It seems the presence of NaN values caused this. I would be interested to know why .agg(set) wraps the whole Series in a set when NaN values are present, instead of converting the individual elements.
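
A plausible reading, based on how older pandas versions implement Series.agg with a plain callable: agg first tries to apply the function elementwise, and if that raises, it falls back to calling the function on the whole Series. set(np.nan) raises TypeError ('float' object is not iterable), which triggers the fallback. A minimal sketch of both behaviors, plus a dropna/frozenset fix (frozenset, unlike set, is hashable, so value_counts works on the result):

import numpy as np
import pandas as pd

s = pd.Series([('Sand', 'Stone'), ('Jolly', 'Sand')])
print(s.agg(set))    # applied elementwise: one set per row

s_nan = pd.Series([('Sand', 'Stone'), np.nan])
# set(np.nan) raises TypeError, so (in older pandas) agg falls back
# to set(s_nan) -- a single set, which has no .value_counts()
print(s_nan.agg(set))

# dropping NaNs first restores elementwise application
print(s_nan.dropna().agg(frozenset).value_counts())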

Upvotes: 0

Views: 52

Answers (1)

mozway

Reputation: 260640

The expected output is unclear, but do you want something like:

from itertools import combinations
counts = (df
 .groupby('Group')['Name']
 # every pair of distinct Names seen within each Group
 .agg(lambda x: set(combinations(set(x), 2)))
 .explode()    # one pair per row
 .value_counts()
)

output:

(Sand, Stone)     2
(Jolly, Sand)     1
(Jolly, Stone)    1
(Sand, Beach)     1
(Beach, Stone)    1
Name: Name, dtype: int64
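
A variation (a sketch, assuming only cross-source pairs should count, as the question's example suggests, and that pair order should not matter): split the Names by Source within each Group and use frozensets as order-insensitive, hashable pair keys:

from itertools import product

counts = (df
 .groupby('Group')[['Source', 'Name']]
 .apply(lambda g: {frozenset(p) for p in product(
     g.loc[g['Source'] == 'X', 'Name'].unique(),
     g.loc[g['Source'] == 'Y', 'Name'].unique())})
 .explode()
 .value_counts()
)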

Upvotes: 1
