Reputation: 89
I have a dataframe that looks as follows:
print(df.head(10))
day CO2
1 549.500000
2 663.541667
3 830.416667
4 799.695652
5 813.850000
6 769.583333
7 681.941176
8 653.333333
9 845.666667
10 436.086957
I then use the following function and lines of code to get the outliers from the CO2 column:
import numpy as np

def estimate_gaussian(dataset):
    mu = np.mean(dataset)  # mean
    sigma = np.std(dataset)  # standard deviation
    limit = sigma * 1.5
    min_threshold = mu - limit
    max_threshold = mu + limit
    return mu, sigma, min_threshold, max_threshold

dataset = df['CO2'].values
mu, sigma, min_threshold, max_threshold = estimate_gaussian(dataset)
condition1 = (dataset < min_threshold)
condition2 = (dataset > max_threshold)
outliers1 = np.extract(condition1, dataset)
outliers2 = np.extract(condition2, dataset)
outliers = np.concatenate((outliers1, outliers2), axis=0)
Which gives me the following result:
print(outliers)
[830.41666667 799.69565217 813.85 769.58333333 845.66666667]
Now I would like to mark those outliers in red on a scatter plot.
Below is the code I have used so far. It marks a single outlier in red, but I cannot find a way to do it for every element of outliers, which is a numpy.ndarray:
y = df['CO2']
x = df['day']
col = np.where(x<0,'k',np.where(y<845.66666667,'b','r'))
plt.scatter(x, y, c=col, s=5, linewidth=3)
plt.show()
Here is what I get, but I would like the same result for all the outliers. Could you please help me?
Upvotes: 2
Views: 10223
Reputation: 71
Here's one quick solution:
I'll start by re-creating what you already have. You only shared the head of your dataframe, so I inserted a few artificial outliers. With this sample data, the 1.5-sigma thresholds from your estimate_gaussian() function flag only two values (50 and 1300).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame([549.500000,
                   50.0000000,
                   830.416667,
                   799.695652,
                   1200.00000,
                   769.583333,
                   681.941176,
                   1300.00000,
                   845.666667,
                   436.086957],
                  columns=['CO2'],
                  index=list(range(1, 11)))

def estimate_gaussian(dataset):
    mu = np.mean(dataset)  # mean
    sigma = np.std(dataset)  # standard deviation
    limit = sigma * 1.5
    min_threshold = mu - limit
    max_threshold = mu + limit
    return mu, sigma, min_threshold, max_threshold

mu, sigma, min_threshold, max_threshold = estimate_gaussian(df.values)
condition1 = (df < min_threshold)
condition2 = (df > max_threshold)
outliers1 = np.extract(condition1, df)
outliers2 = np.extract(condition2, df)
outliers = np.concatenate((outliers1, outliers2), axis=0)
Then we'll plot:
df_red = df[df['CO2'].isin(outliers)]  # rows whose CO2 value is an outlier
plt.scatter(df.index, df['CO2'])
plt.scatter(df_red.index, df_red['CO2'], c='red')
plt.show()
Let me know if you need something more nuanced!
Upvotes: 2
Reputation: 23743
There are several ways. One is to create a sequence of colors based on your condition and pass it to the c parameter of plt.scatter.
df = pd.DataFrame({'CO2': {0: 549.5,
1: 663.54166699999996,
2: 830.41666699999996,
3: 799.695652,
4: 813.85000000000002,
5: 769.58333300000004,
6: 681.94117599999993,
7: 653.33333300000004,
8: 845.66666699999996,
9: 436.08695700000004},
'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}})
In [11]: colors = ['r' if n<750 else 'b' for n in df['CO2']]
In [12]: colors
Out[12]: ['r', 'r', 'b', 'b', 'b', 'b', 'r', 'r', 'b', 'r']
In [13]: plt.scatter(df['day'],df['CO2'],c=colors)
Or use np.where to create the sequence:
In [14]: colors = np.where(df['CO2'] < 750, 'r', 'b')
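To tie this to the actual outlier rule rather than a fixed cut-off of 750, here is a minimal sketch that builds the color sequence from the mean ± 1.5 sigma thresholds used in the question. The helper name outlier_colors and its parameters are made up for illustration, and the ten rows from the question stand in for the full dataset, so the flagged points will not match the question's output.

```python
import numpy as np

def outlier_colors(values, n_sigma=1.5, inlier="b", outlier="r"):
    """Return one color per value (hypothetical helper).

    A value is treated as an outlier when it lies outside
    mean +/- n_sigma * standard deviation.
    """
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    limit = n_sigma * values.std()
    is_outlier = (values < mu - limit) | (values > mu + limit)
    return np.where(is_outlier, outlier, inlier)

# The ten CO2 values shown in the question (only the head of the data).
co2 = [549.5, 663.541667, 830.416667, 799.695652, 813.85,
       769.583333, 681.941176, 653.333333, 845.666667, 436.086957]
colors = outlier_colors(co2)
```

The resulting array can be passed straight to the c parameter, e.g. plt.scatter(df['day'], df['CO2'], c=colors).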
Upvotes: 0
Reputation: 175
I am not sure what the idea behind your col list is, but you can replace col with
col = ['red' if yy in list(outliers) else 'blue' for yy in y]
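The list comprehension above tests membership by exact float equality, which is fine here because outliers was extracted from the same array as y. A vectorized variant of the same idea, sketched with np.isin and a small made-up sample (the values of y and outliers below are only illustrative), might look like this:

```python
import numpy as np

# Illustrative sample: y holds the plotted values, outliers a subset of them.
y = np.array([549.5, 663.541667, 830.416667, 799.695652, 813.85])
outliers = np.array([830.416667, 813.85])

# Boolean mask: True where the value of y also appears in outliers.
is_outlier = np.isin(y, outliers)

# One color per point, ready to pass as c=col to plt.scatter.
col = np.where(is_outlier, "red", "blue")
```

If the outlier values were recomputed or rounded separately, exact matching could miss points; comparing against the thresholds directly avoids that.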
Upvotes: 0
Reputation: 7509
Possibly not the most efficient solution, but I feel it's easier to call plt.scatter multiple times, passing a single xy pair each time. Since we never create a new figure (e.g. using plt.figure()), each xy pair is plotted on the same figure.
Then, in each iteration we just need to check whether the y value is an outlier. If it is, we change the color keyword argument on the plt.scatter call.
Try this:
mu, sigma, min_threshold, max_threshold = estimate_gaussian(df['CO2'].values)
xs = df['day']
ys = df['CO2']
for x, y in zip(xs, ys):
    color = 'blue'  # non-outlier color
    if not min_threshold <= y <= max_threshold:  # condition for being an outlier
        color = 'red'  # outlier color
    plt.scatter(x, y, color=color)
plt.show()
Upvotes: 1
Reputation: 1718
You could create an additional boolean column that marks whether each point is an outlier (True) or not (False), and then draw two scatter plots:
# True where the value falls outside the thresholds from estimate_gaussian
df["outlier"] = (df["CO2"] < min_threshold) | (df["CO2"] > max_threshold)
plt.scatter(df.loc[df["outlier"], "day"], df.loc[df["outlier"], "CO2"], color="r")  # outliers in red
plt.scatter(df.loc[~df["outlier"], "day"], df.loc[~df["outlier"], "CO2"], color="k")  # the rest in black
plt.show()
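For reference, a minimal end-to-end sketch of the boolean-column idea could look like the following. It assumes the ten rows from the question stand in for the full dataset (so the thresholds, and therefore the flagged points, differ from the question's output), and plot_outliers is a hypothetical helper name, not something from pandas or matplotlib.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The ten rows shown in the question, standing in for the full dataset.
df = pd.DataFrame({
    "day": range(1, 11),
    "CO2": [549.5, 663.541667, 830.416667, 799.695652, 813.85,
            769.583333, 681.941176, 653.333333, 845.666667, 436.086957],
})

def plot_outliers(df, n_sigma=1.5):
    """Scatter CO2 against day, drawing outliers in red (hypothetical helper)."""
    mu = df["CO2"].mean()
    limit = n_sigma * df["CO2"].std(ddof=0)  # ddof=0 matches np.std
    # Boolean column: True where the value is further than `limit` from the mean.
    df = df.assign(outlier=(df["CO2"] - mu).abs() > limit)

    fig, ax = plt.subplots()
    inliers = df[~df["outlier"]]
    flagged = df[df["outlier"]]
    ax.scatter(inliers["day"], inliers["CO2"], color="blue", label="inlier")
    ax.scatter(flagged["day"], flagged["CO2"], color="red", label="outlier")
    ax.set_xlabel("day")
    ax.set_ylabel("CO2")
    ax.legend()
    return df, ax

flagged_df, ax = plot_outliers(df)
```

Call plt.show() afterwards to display the figure. Keeping the flag as a column also makes it easy to inspect or export the outliers later.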
Upvotes: 0