tom
tom

Reputation: 1077

Pandas aggregate by unique occurrence per group

In pandas, I'd like to analyze groups if there is a single occurrence of a conditional value. I've included a sample dataframe with a first step attempt at identifying such groups below. So, let's say, in the data frame below, I want to filter the original data frame only for species of iris that ever had a sepal length greater than 6. In the last command, I'm counting the number of unique species groups that had a sepal length greater than 6 (so, at least I can count them).

But, what I really want is the original dataframe where I analyze rows only if the species had a sepal length greater than 6 (so, it would be a dataframe without the species "setosa" since they never have one).

The longer explanation is that I have a real dataset of users. Each user will have values in certain columns that may exceed a threshold value of interest. I haven't figured out how to analyze users who have these threshold values.

Perhaps a loop would be better. I might loop through each unique user name and look if any row with that user ever exceeds a certain value and gets some kind of new column (though I know loops are frowned upon in pandas so I'm posting here to see if there's some kind of well-known method of identifying groups by occurrence).

Thanks and let me know if I can make this question any more clear!

import pandas as pd
import seaborn as sns
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None) 

iris = sns.load_dataset('iris')

iris['longsepal'] = iris['sepal_length'] > 7

iris['longpetal'] = iris['petal_length'] > 5

iris.groupby(['longsepal'])['species'].nunique()

Upvotes: 0

Views: 65

Answers (1)

Parfait
Parfait

Reputation: 107587

Consider groupby().transform() to calculate inline max aggregates to be later filtered on its value by species. Technically, the > 7 only returns one species as veriscolor max reaches 7.0. Below shows the operator and functional form of inequality logic.

iris['longsepal'] = iris.groupby(['species'])['sepal_length'].transform('max')
iris['longpetal'] = iris.groupby(['species'])['petal_length'].transform('max')

# DATA FILTERS
longsepal_iris = iris.loc[iris['longsepal'] > 7]     # GREATER THAN OPERATOR FORM: >
longsepal_iris = iris.loc[iris['longsepal'].gt(7)]   # GREATER THAN FUNCTIONAL FORM: gt()

longpetal_iris = iris.loc[iris['longpetal'] > 5]     # GREATER THAN OPERATOR FORM: >   
longpetal_iris = iris.loc[iris['longpetal'].gt(5)]   # GREATER THAN FUNCTIONAL FORM: gt()

# SPECIES
longsepal_iris['species'].unique()       
# ['virginica']

longpetal_iris['species'].unique()    
# ['versicolor' 'virginica']

Upvotes: 3

Related Questions