Reputation: 19
I've got a basic scatterplot and want to show all the outliers in a different colour. I define outliers as being more than 2 standard deviations from the mean. The code I've produced only shows up a single outlier whereas I want ALL the outliers to be a different colour:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv('1fXr31hcEemkYxLyQ1aU1g_50fc36ee697c4b158fe26ade3ec3bc24_Banknote-authentication-dataset- (1).csv')
data = np.array(data)
mean = np.mean(data, 0)
min = np.min(data,0)
max = np.max(data,0)
normed = (data - min) / (max - min)
mean = np.mean(normed, 0)
std_dev = np.std (normed, 0)
fig, graph = plt.subplots()
graph.scatter(normed [:,0], normed [:,1])
graph.scatter(mean[0], mean[1])
outliers = normed[normed>2*std_dev]
graph.scatter(outliers [0], outliers [1], c='red')
plt.show()
Upvotes: 1
Views: 370
Reputation: 206
A simple way to do this is to create a new column in your DataFrame that flags the outliers, and then pass that column to the c parameter of plt.scatter():
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'x': np.random.normal(0, size=100),
                   'y': np.random.normal(0, size=100)})
# Identifies the means of x and y
x_mean = df['x'].mean()
y_mean = df['y'].mean()
# Lower and upper bounds: the mean minus/plus 2 standard deviations
x_lower, x_upper = x_mean - 2*df['x'].std(), x_mean + 2*df['x'].std()
y_lower, y_upper = y_mean - 2*df['y'].std(), y_mean + 2*df['y'].std()
# Create a new column that is True when x or y falls outside its bounds
df['outlier'] = (~df['x'].between(x_lower, x_upper) |
                 ~df['y'].between(y_lower, y_upper))
# Here we use the indicator to choose the color of each point:
# with the reversed colormap, outliers (True) are red and the rest are green
plt.scatter(df['x'],
            df['y'],
            s = 15,
            c = df['outlier'],
            cmap = 'RdYlGn_r')
plt.xlabel('X')
plt.ylabel('Y')
# Reference lines at the means (dashed) and at the mean +/- 2 standard deviations (dotted)
plt.axvline(x_mean, linestyle = '--', color = 'k')
plt.axhline(y_mean, linestyle = '--', color = 'k')
plt.axvline(x_upper, linestyle = ':', color = 'k')
plt.axhline(y_upper, linestyle = ':', color = 'k')
plt.axvline(x_lower, linestyle = ':', color = 'k')
plt.axhline(y_lower, linestyle = ':', color = 'k')
plt.show()
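For the array-based code in the question, the same idea can be sketched with a NumPy boolean mask. The expression normed[normed>2*std_dev] returns a flat 1-D array of individual values rather than whole rows, so outliers[0] and outliers[1] are just two scalars and only one point gets drawn. It also compares the raw values with 2 standard deviations instead of their distance from the mean. The helper name outlier_mask, its n_std parameter and the random sample data below are illustrative assumptions, not part of the original code; pass only the columns you actually plot.

```python
import numpy as np
import matplotlib.pyplot as plt


def outlier_mask(data, n_std=2.0):
    """Return a boolean mask with one entry per row of `data`.

    A row is flagged when any of its columns lies more than `n_std`
    standard deviations from that column's mean.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    # Compare each value's distance from its column mean, then reduce per row
    return (np.abs(data - mean) > n_std * std).any(axis=1)


if __name__ == "__main__":
    # Synthetic stand-in for the two plotted columns of the normalised data
    rng = np.random.default_rng(0)
    normed = rng.normal(size=(200, 2))

    mask = outlier_mask(normed)

    fig, graph = plt.subplots()
    # Index rows with the mask so x and y stay paired
    graph.scatter(normed[~mask, 0], normed[~mask, 1], label="within 2 std")
    graph.scatter(normed[mask, 0], normed[mask, 1], c="red", label="outliers")
    graph.legend()
    plt.show()
```

With the real data, something like outlier_mask(normed[:, :2]) should give a mask that can be used the same way in the two scatter calls.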
Upvotes: 1