alex riverson

Reputation: 45

Calculating probabilities for a matrix in pandas or PySpark

I have the following DF:

import pandas as pd
df = pd.DataFrame({
        "occurrence": [1,1,2,2,2,3,4,4],
        "a": ["a1","a2","a1","a2","a3","a2","a1","a1"]})
   occurrence   a
0      1        a1
1      1        a2
2      2        a1
3      2        a2
4      2        a3
5      3        a2
6      4        a1
7      4        a1

I want an output like in DF2 below:

      a1     a2    a3
a1    0.33   0.66  1
a2    0.66   0     1
a3    0.33   0.33  0

There are 3 unique occurrences for a1 (occurrences 1, 2, and 4). The probability of a1 occurring with itself (a1) is 1/3 or 0.33, with a2 it is 2/3 or 0.66, and with a3 it is 1/3 or 0.33. So the first column of DF2 is [0.33, 0.66, 0.33].

There are 3 unique occurrences for a2 (occurrences 1, 2, and 3). The probability of occurring with a1 is 2/3 or 0.66, with itself (a2) it is 0/3 or 0, and with a3 it is 1/3 or 0.33. So the second column of DF2 is [0.66, 0, 0.33].

Finally, there is 1 unique occurrence for a3 (occurrence 2). The probability of occurring with a1 is 1/1 = 1, with a2 it is 1/1 = 1, and with itself (a3) it is 0/1 = 0. So the third column of DF2 is [1, 1, 0].
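As a sanity check on the counts above, the target matrix can be rebuilt directly with `pd.crosstab` and a matrix product (just a sketch, not necessarily the simplest route; the diagonal needs separate handling because it counts occurrences where a value appears at least twice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "occurrence": [1, 1, 2, 2, 2, 3, 4, 4],
    "a": ["a1", "a2", "a1", "a2", "a3", "a2", "a1", "a1"]})

ct = pd.crosstab(df["occurrence"], df["a"])       # counts per (occurrence, value)
present = ct > 0                                  # does the value appear at all?
co = present.T.astype(int) @ present.astype(int)  # pairwise co-occurrence counts
np.fill_diagonal(co.values, (ct >= 2).sum())      # "with itself" = appears >= twice in one occurrence
probs = co / present.sum()                        # divide each column by its occurrence count
print(probs.round(2))
```

Dividing by the `present.sum()` Series aligns its index with the columns of `co`, so each column is scaled by that value's number of unique occurrences.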

I have a long list of occurrences like this, but this example should generalize.

Is there an easy way to do this in pandas or pyspark?

Upvotes: 1

Views: 216

Answers (1)

Corralien

Reputation: 120439

I will add more explanation later if I have time and if it's necessary.

As @TimRoberts said, it was not so easy to find a simple solution...

# Input data
import pandas as pd

df = pd.DataFrame({
        "a": ["a1","a2","a1","a2","a3","a2","a1","a1"],
        "o": [1,1,2,2,2,3,4,4]})

>>> df
    a  o
0  a1  1
1  a2  1
2  a1  2
3  a2  2
4  a3  2
5  a2  3
6  a1  4
7  a1  4
# Step 1. Count occurrences of each combination
data = []
for a in df.a.unique():
    m1 = df.o.isin(df[df.a == a].o)  # keep rows whose occurrence also contains a
    m2 = df.a != a                   # but exclude the rows for a itself
    data.append(df[m1 & m2].value_counts("a").rename(a))
stats = pd.concat(data, axis="columns")

# equivalent to this comprehension
# stats = pd.concat([df[(df.o.isin(df[df.a == a].o))
#                       & (df.a != a)].value_counts("a").rename(a)
#                   for a in df.a.unique()], axis="columns")

>>> stats
     a1   a2   a3
a
a1  NaN  2.0  1.0
a2  2.0  NaN  1.0
a3  1.0  1.0  NaN
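To see what one iteration of the loop produces, here is the `a == "a1"` case spelled out on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a1", "a2", "a1", "a2", "a3", "a2", "a1", "a1"],
    "o": [1, 1, 2, 2, 2, 3, 4, 4]})

m1 = df.o.isin(df[df.a == "a1"].o)  # rows in occurrences 1, 2 and 4
m2 = df.a != "a1"                   # drop the a1 rows themselves
partners = df[m1 & m2].value_counts("a")
print(partners)  # a2 co-occurs with a1 twice, a3 once
```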
# Step 2. Fill diagonal (count occurrences where a value appears more than once)
diag = df.loc[df.duplicated()].groupby("a").size().reindex(df.a.unique(), fill_value=0)
stats = stats.fillna(diag)

>>> stats
     a1   a2   a3
a
a1  1.0  2.0  1.0
a2  2.0  0.0  1.0
a3  1.0  1.0  0.0
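The `fillna(diag)` call works because filling a DataFrame with a Series matches the Series index against the DataFrame's *columns*: each diagonal NaN sits in the column named after its value, so it receives that value's self-count. A minimal sketch of this alignment:

```python
import numpy as np
import pandas as pd

m = pd.DataFrame({"a1": [np.nan, 2.0], "a2": [2.0, np.nan]}, index=["a1", "a2"])
filler = pd.Series({"a1": 1.0, "a2": 0.0})
out = m.fillna(filler)  # NaN in column "a1" -> 1.0, NaN in column "a2" -> 0.0
print(out)
```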
# Step 3. Compute probabilities (divide by the number of unique occurrences per value)
div = df.drop_duplicates().a.value_counts()
stats = stats / div

>>> stats
          a1        a2   a3
a
a1  0.333333  0.666667  1.0
a2  0.666667  0.000000  1.0
a3  0.333333  0.333333  0.0
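One detail worth noting in Step 3: `df.drop_duplicates()` removes the repeated `(a, o)` row first, so a1 is divided by its 3 unique occurrences rather than its 4 rows, and `stats / div` again aligns the Series index with the columns. A quick sketch of the denominator:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a1", "a2", "a1", "a2", "a3", "a2", "a1", "a1"],
    "o": [1, 1, 2, 2, 2, 3, 4, 4]})

div = df.drop_duplicates().a.value_counts()
print(div)  # a1: 3 (not 4), a2: 3, a3: 1
```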

Upvotes: 1
