I want to identify outlier values in a pandas DataFrame, and the answers I've found online so far don't work for me. The example below contains two zones (A and B). Zone B has an obvious outlier value, but the Python function doesn't flag it. Could someone explain why this code doesn't work?
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'zone': ['A', 'A', 'B', 'B', 'B'],
    'value': [10, 15, 20, 22, 10000],
    'ID': [1, 1, 1, 1, 1]})

# Function to flag outliers using IQR
def flag_outliers_iqr(group):
    Q1 = group.quantile(0.25)
    Q3 = group.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (group < lower_bound) | (group > upper_bound)

# Create a new column 'outlier' and flag outliers by zone
df['outlier'] = df.groupby('zone')['value'].apply(flag_outliers_iqr)
print(df)
I expect the last row (value 10000 in zone B) to be flagged as True.
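For reference, here is a minimal sketch that prints the IQR bounds actually computed for zone B, which may help narrow down where the flag goes missing. It is not part of the code above: the helper names `iqr_bounds` and `flag` are hypothetical, it assumes pandas' default linear interpolation for `quantile`, and it uses `transform` rather than `apply` only so that the result keeps the original row index.

```python
import pandas as pd

df = pd.DataFrame({
    'zone': ['A', 'A', 'B', 'B', 'B'],
    'value': [10, 15, 20, 22, 10000],
})

def iqr_bounds(values):
    # Hypothetical helper: returns the (lower, upper) IQR fences for a Series
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def flag(values):
    # Same rule as in the question: True where a value falls outside the fences
    lower, upper = iqr_bounds(values)
    return (values < lower) | (values > upper)

# Inspect the fences for zone B on its own
zone_b = df.loc[df['zone'] == 'B', 'value']
lower_b, upper_b = iqr_bounds(zone_b)
print(lower_b, upper_b)

# Per-zone flags, aligned with the original row index
flags = df.groupby('zone')['value'].transform(flag)
print(flags)
```

With only three values in zone B, the 0.75 quantile is interpolated halfway between 22 and 10000, so the extreme value itself widens the upper fence. That seems to be why 10000 stays inside the bounds, though I'd welcome confirmation.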