Kurt Peek

Reputation: 57471

In Pandas, how to calculate the probability of a set of values in one column given a set of values of another column?

I have a DataFrame in which the rows represent traffic accidents. Two of the columns are Weather and Skidding:

import pandas as pd

df = pd.DataFrame({'Weather': ['rain', 'fine', 'rain', 'fine', 'snow', 'fine', 'snow'],
                   'Skidding': ['skid', 'skid', 'no skid', 'no skid', 'skid', 'no skid', 'jackknife']})

I'd like to compute how much more likely it is that either skidding or jackknifing occurs when it is raining or snowing compared to when it is not. So far I've come up with a solution using Boolean indexing and four auxiliary data frames:

df_rainsnow = df[[weather in ('rain', 'snow') for weather in df.Weather]]
df_rainsnow_skid = df_rainsnow[[skid in ('skid', 'jackknife') for skid in df_rainsnow.Skidding]]

df_fine = df[df.Weather == 'fine']
df_fine_skid = df_fine[[skid in ('skid', 'jackknife') for skid in df_fine.Skidding]]

relative_probability = len(df_rainsnow_skid)/len(df_fine_skid)

which evaluates to a relative_probability of 3.0 for this example. This seems unnecessarily verbose, however, and I'd like to refactor it.

One solution I tried is

counts = df.groupby('Weather')['Skidding'].value_counts()

relative_probability = (counts['rain']['skid'] + counts['snow']['skid']
    + counts['rain']['jackknife'] + counts['snow']['jackknife']) / (counts['fine']['skid'] + counts['fine']['jackknife'])

However, this raises a KeyError because jackknife doesn't occur in every weather condition, and in any case writing out all the terms is verbose. What is a better way to achieve this?
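For context, the KeyError arises because the grouped value_counts result only contains the (Weather, Skidding) combinations that actually occur. One possible way around that, sketched here under the assumption that a cross-tabulation is acceptable, is pd.crosstab, which fills unobserved combinations with 0. The names table, bad_weather and skid_events are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'Weather': ['rain', 'fine', 'rain', 'fine', 'snow', 'fine', 'snow'],
                   'Skidding': ['skid', 'skid', 'no skid', 'no skid', 'skid', 'no skid', 'jackknife']})

# Rows are weather conditions, columns are skidding outcomes;
# combinations that never occur get a count of 0 instead of being absent.
table = pd.crosstab(df['Weather'], df['Skidding'])

bad_weather = ['rain', 'snow']       # illustrative grouping of weather values
skid_events = ['skid', 'jackknife']  # illustrative grouping of skidding values

# Sum the counts over each block of the table, then take the ratio of counts.
numerator = table.loc[bad_weather, skid_events].to_numpy().sum()
denominator = table.loc[['fine'], skid_events].to_numpy().sum()
relative_probability = numerator / denominator
```

This reproduces the same ratio of counts as the four auxiliary data frames, without writing out each term by hand.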

Upvotes: 0

Views: 1647

Answers (1)

akuiper

Reputation: 214957

You can use isin instead of the ... in ... for ... list comprehensions. There is also no need to filter the data frame if you only need the final number: build the Boolean conditions, sum them, and divide:

rain_snow = df.Weather.isin(['rain', 'snow'])
fine = df.Weather.eq('fine')
skid = df.Skidding.isin(['skid', 'jackknife'])
(rain_snow & skid).sum()/(fine & skid).sum()
# 3.0
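Note that the expression above is a ratio of counts, matching the question's relative_probability. If what is wanted is a ratio of conditional probabilities (the share of skid or jackknife accidents within each weather group), the group sizes need to be taken into account. A minimal sketch, assuming that interpretation; the names p_skid_bad, p_skid_fine and risk_ratio are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Weather': ['rain', 'fine', 'rain', 'fine', 'snow', 'fine', 'snow'],
                   'Skidding': ['skid', 'skid', 'no skid', 'no skid', 'skid', 'no skid', 'jackknife']})

rain_snow = df['Weather'].isin(['rain', 'snow'])
fine = df['Weather'].eq('fine')
skid = df['Skidding'].isin(['skid', 'jackknife'])

# The mean of a Boolean Series is the fraction of True values, so each line
# estimates P(skid or jackknife | weather group) from the rows in that group.
p_skid_bad = skid[rain_snow].mean()   # 3 of the 4 rain/snow rows
p_skid_fine = skid[fine].mean()       # 1 of the 3 fine rows

# Ratio of the two conditional probabilities (assumed interpretation).
risk_ratio = p_skid_bad / p_skid_fine
```

For this sample the ratio of conditional probabilities comes out at about 2.25 rather than 3, because there are four rain/snow rows but only three fine rows.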

Upvotes: 1
