Reputation: 57471
I have a DataFrame in which the rows represent traffic accidents. Two of the columns are Weather and Skidding:
import pandas as pd
df = pd.DataFrame({'Weather': ['rain', 'fine', 'rain', 'fine', 'snow', 'fine', 'snow'],
'Skidding': ['skid', 'skid', 'no skid', 'no skid', 'skid', 'no skid', 'jackknife']})
I'd like to compute how much more likely it is that either skidding or jackknifing occurs when it is raining or snowing compared to when it is not. So far I've come up with a solution using Boolean indexing and four auxiliary data frames:
df_rainsnow = df[[weather in ('rain', 'snow') for weather in df.Weather]]
df_rainsnow_skid = df_rainsnow[[skid in ('skid', 'jackknife') for skid in df_rainsnow.Skidding]]
df_fine = df[df.Weather == 'fine']
df_fine_skid = df_fine[[skid in ('skid', 'jackknife') for skid in df_fine.Skidding]]
relative_probability = len(df_rainsnow_skid)/len(df_fine_skid)
which evaluates to a relative_probability of 3.0 for this example. This seems unnecessarily verbose, however, and I'd like to refactor it.
One solution I tried is
counts = df.groupby('Weather')['Skidding'].value_counts()
relative_probability = (counts['rain']['skid'] + counts['snow']['skid']
+ counts['rain']['jackknife'] + counts['snow']['jackknife']) / (counts['fine']['skid'] + counts['fine']['jackknife'])
However, this leads to a KeyError because jackknife doesn't occur in every weather situation, and in any case it is verbose to write out all the terms. What is a better way to achieve this?
Upvotes: 0
Views: 1647
Reputation: 214957
You can use isin instead of the ... in ... for ... comprehension. There is also no need to filter the data frame if you only need the number at the end: just build the conditions, sum and divide:
rain_snow = df.Weather.isin(['rain', 'snow'])
fine = df.Weather.eq('fine')
skid = df.Skidding.isin(['skid', 'jackknife'])
(rain_snow & skid).sum()/(fine & skid).sum()
# 3.0
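One thing to be aware of: the ratio above compares the number of skid/jackknife accidents in each weather group, not the share of accidents in each group that involved one. If "how much more likely" is meant as a ratio of conditional rates, a minimal sketch could look like the following. It assumes the same sample data as in the question; the names bad_weather, count_ratio and rate_ratio are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Weather': ['rain', 'fine', 'rain', 'fine', 'snow', 'fine', 'snow'],
    'Skidding': ['skid', 'skid', 'no skid', 'no skid', 'skid', 'no skid', 'jackknife'],
})

# Boolean masks, one entry per accident
bad_weather = df.Weather.isin(['rain', 'snow'])
fine = df.Weather.eq('fine')
skid = df.Skidding.isin(['skid', 'jackknife'])

# Ratio of raw counts, as computed in the question
count_ratio = (bad_weather & skid).sum() / (fine & skid).sum()

# Ratio of conditional rates: the mean of a boolean Series is the
# fraction of True values, so each term is the share of accidents
# in that weather group that involved a skid or jackknife
rate_ratio = skid[bad_weather].mean() / skid[fine].mean()

print(count_ratio)  # 3.0
print(rate_ratio)   # about 2.25 for this sample
```

The two numbers differ here because the sample contains four rain/snow accidents but only three fine-weather ones. Neither version raises a KeyError, since the masks never look up a weather/skid combination that is absent from the data.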
Upvotes: 1