Reputation: 45
I have the following DF:
import pandas as pd
df = pd.DataFrame({
    "occurrence": [1,1,2,2,2,3,4,4],
    "a": ["a1","a2","a1","a2","a3","a2","a1","a1"]})
   occurrence   a
0           1  a1
1           1  a2
2           2  a1
3           2  a2
4           2  a3
5           3  a2
6           4  a1
7           4  a1
I want an output like in DF2 below:
      a1    a2    a3
a1  0.33  0.66     1
a2  0.66     0     1
a3  0.33  0.33     0
There are 3 unique occurrences for a1 (occurrences 1, 2, and 4). The probability of a1 occurring with itself is 1/3 or 0.33 (only occurrence 4 contains a1 twice), with a2 it is 2/3 or 0.66, and with a3 it is 1/3 or 0.33. So the first column in DF2 is [0.33, 0.66, 0.33].
There are 3 unique occurrences for a2 (occurrences 1, 2, and 3). The probability of occurring with a1 is 2/3 or 0.66, with itself it is 0/3 = 0, and with a3 it is 1/3 or 0.33. So column 2 in DF2 is [0.66, 0, 0.33].
Finally, there is 1 unique occurrence for a3 (occurrence 2). The probability of occurring with a1 is 1/1 = 1, with a2 it is 1/1 = 1, and with itself it is 0/1 = 0. So column 3 in DF2 is [1, 1, 0].
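As a sanity check, the denominators above (the number of unique occurrences each item appears in) can be computed with a quick groupby:

```python
import pandas as pd

df = pd.DataFrame({
    "occurrence": [1, 1, 2, 2, 2, 3, 4, 4],
    "a": ["a1", "a2", "a1", "a2", "a3", "a2", "a1", "a1"]})

# number of distinct occurrences per item: a1 -> 3, a2 -> 3, a3 -> 1
denoms = df.groupby("a")["occurrence"].nunique()
```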
I have a long list of occurrences like this but this example should generalize.
Is there an easy way to do this in pandas or pyspark?
Upvotes: 1
Views: 216
Reputation: 120439
I will add more explanation later if I have time and if it's necessary.
As @TimRoberts said, it was not so easy to find a simple solution...
# Input data
df = pd.DataFrame({
    "a": ["a1","a2","a1","a2","a3","a2","a1","a1"],
    "o": [1,1,2,2,2,3,4,4]})
>>> df
    a  o
0  a1  1
1  a2  1
2  a1  2
3  a2  2
4  a3  2
5  a2  3
6  a1  4
7  a1  4
# Step 1. Count occurrences of each combination
data = []
for a in df.a.unique():
    m1 = df.o.isin(df[df.a == a].o)  # keep all rows that share an occurrence with a
    m2 = df.a != a                   # but not a itself
    data.append(df[m1 & m2].value_counts("a").rename(a))
stats = pd.concat(data, axis="columns")
# equivalent to this comprehension:
# stats = pd.concat([df[(df.o.isin(df[df.a == a].o))
#                       & (df.a != a)].value_counts("a").rename(a)
#                    for a in df.a.unique()], axis="columns")
>>> stats
     a1   a2   a3
a
a1  NaN  2.0  1.0
a2  2.0  NaN  1.0
a3  1.0  1.0  NaN
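To see what the two masks select, it may help to trace a single iteration by hand (here a = "a1", using the same names as in the loop above):

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a1","a2","a1","a2","a3","a2","a1","a1"],
    "o": [1,1,2,2,2,3,4,4]})

a = "a1"
m1 = df.o.isin(df[df.a == a].o)  # rows whose occurrence also contains a1 (1, 2, 4)
m2 = df.a != a                   # drop a1's own rows
sel = df[m1 & m2]                # surviving rows: (a2, 1), (a2, 2), (a3, 2)
counts = sel.value_counts("a").rename(a)
# a2 shares 2 occurrences with a1, a3 shares 1 -> column "a1" of stats
```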
# Step 2. Fill diagonal
diag = df.loc[df.duplicated()].groupby("a").size().reindex(df.a.unique(), fill_value=0)
stats = stats.fillna(diag)
>>> stats
     a1   a2   a3
a
a1  1.0  2.0  1.0
a2  2.0  0.0  1.0
a3  1.0  1.0  0.0
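Note that fillna with a Series fills column-wise: each column's NaN is replaced by the Series value whose label matches that column name, which is what places diag on the diagonal here. A minimal illustration on a toy frame (names are made up):

```python
import numpy as np
import pandas as pd

m = pd.DataFrame({"x": [np.nan, 1.0], "y": [2.0, np.nan]})
s = pd.Series({"x": 10.0, "y": 20.0})

# the NaN in column "x" becomes 10.0, the NaN in column "y" becomes 20.0;
# existing values are untouched
out = m.fillna(s)
```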
# Step 3. Compute probabilities
div = df.drop_duplicates().a.value_counts()
stats = stats / div
>>> stats
          a1        a2   a3
a
a1  0.333333  0.666667  1.0
a2  0.666667  0.000000  1.0
a3  0.333333  0.333333  0.0
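An alternative sketch for Steps 1-2 that avoids the Python loop: self-merge the de-duplicated frame on o and cross-tabulate the item pairs, then overwrite the diagonal with the repeat counts. Same result as above; not benchmarked against the loop version:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a1","a2","a1","a2","a3","a2","a1","a1"],
    "o": [1,1,2,2,2,3,4,4]})

d = df.drop_duplicates()                       # one row per (a, o) pair
pairs = d.merge(d, on="o")                     # all co-occurring item pairs
ct = pd.crosstab(pairs.a_x, pairs.a_y)         # co-occurrence counts (incl. self)
dup = (df[df.duplicated()].groupby("a").size()
         .reindex(ct.index, fill_value=0))     # occurrences where an item repeats
for name in ct.index:                          # replace diagonal with repeat counts
    ct.loc[name, name] = dup[name]
res = ct / d.a.value_counts()                  # divide each column by its denominator
```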
Upvotes: 1