Reputation: 1416
I have a Series as follows:
(multi-index) | Count |
Group | Source | Name |
______________________________________
A | X | Jolly | 2
| | Stone | 1
| Y | Sand | 1
B | X | Sand | 1
| Y | Beach | 1
I also have the precursor dataframe to this, which looks like:
Group | Source | Name
___________________________
A | X | Jolly
A | X | Stone
A | X | Jolly
A | Y | Sand
B | X | Sand
B | X | Beach
B | Y | Stone
Each Group
is guaranteed to have a Source
X
and a Source
Y
, where the corresponding Name
s for X
and Y
within a Group
do not overlap. That is to say if a Name
is found for Source
X
, it won't be found for Source
Y
within that group.
For example, I want to count how many times Stone
and Sand
are seen together for different Source
s within a Group
. So, Jolly
is seen twice along with Stone
once for Source
Y
Name
Sand
within Group
A
. If I see Jolly
and Sand
frequently together across Group
s, where each is from a different Source
this may mean they reference the same entity.
How do I accomplish this with pandas?
I don't have a particular expected output in mind, perhaps this:
Y
Stone | Sand | Beach | Water
Stone NaN 1 NaN NaN
X Sand 1 NaN NaN NaN
... ... ... ... ...
EDIT #1: I was able to arrive at this based on @mozway answer:
def get_source_grouped_combos(df):
src_x = df[df['Source'] == 'X'].groupby('Name').count()
src_y = df[df['Source'] == 'Y'].groupby('Name').count()
return list(product(src_x.index.values, src_y.index.values))
((dfex.loc[dfex['Group'].map(
dfex.groupby(['Group','Source'])['Name']
.agg(set)
.groupby(level=0)
.agg(lambda x: len(set.intersection(*x))==0))])
.groupby(['Group'])[['Source','Name']]
.apply(get_source_grouped_combos)
.explode('Name')
.agg(set)
.value_counts()
)
Which outputs this:
{Sand, Stone} 2
{Jolly, Sand} 1
{Beach, Stone} 1
dtype: int64
Apply is pretty slow, however. Is there a faster way?
EDIT #2: When applying this to my actual dataset (which has all the same dtypes per column) the last 2 lines of the above solution:
.agg(set)
.value_counts()
Throws the following error:
AttributeError: 'set' object has no attribute 'value_counts'
.agg(set)
seems to be wrapping the entire Series in a set instead of each value. How can I fix it so it counts the occurence of each set
of values?
EDIT #3:
Seems as though the appearance of NaN values caused this. I would be interested to know why it wraps the series in a set
when NaN values are in the series instead of the individual elements.
Upvotes: 0
Views: 52
Reputation: 260640
The expected output is unclear, but do you want something like:
from itertools import combinations
counts = (df
.groupby('Group')['Name']
.agg(lambda x: set(combinations(set(x), 2)))
.explode('Name')
.value_counts()
)
output:
(Sand, Stone) 2
(Jolly, Sand) 1
(Jolly, Stone) 1
(Sand, Beach) 1
(Beach, Stone) 1
Name: Name, dtype: int64
Upvotes: 1