Synox

Reputation: 89

Marking outliers on a Scatter Plot

I have a dataframe that looks as follows:

 print(df.head(10))

 day         CO2
   1  549.500000
   2  663.541667
   3  830.416667
   4  799.695652
   5  813.850000
   6  769.583333
   7  681.941176
   8  653.333333
   9  845.666667
  10  436.086957

I then use the following function and lines of code to get the outliers from the CO2 column:

def estimate_gaussian(dataset):

    mu = np.mean(dataset)#moyenne cf mu
    sigma = np.std(dataset)#écart_type/standard deviation
    limit = sigma * 1.5

    min_threshold = mu - limit
    max_threshold = mu + limit

    return mu, sigma, min_threshold, max_threshold

dataset = df['CO2'].values  # the conditions below compare this array against the thresholds
mu, sigma, min_threshold, max_threshold = estimate_gaussian(dataset)


condition1 = (dataset < min_threshold)
condition2 = (dataset > max_threshold)

outliers1 = np.extract(condition1, dataset)
outliers2 = np.extract(condition2, dataset)

outliers = np.concatenate((outliers1, outliers2), axis=0)

Which gives me the following result:

print(outliers)

[830.41666667 799.69565217 813.85       769.58333333 845.66666667]

Now I would like to mark those outliers with a red color on a scatter plot.

Below is the code I have used so far. It marks a single outlier in red on the scatter plot, but I cannot find a way to do the same for every element of outliers, which is a numpy.ndarray:

y = df['CO2']

x = df['day']

col = np.where(x<0,'k',np.where(y<845.66666667,'b','r'))

plt.scatter(x, y, c=col, s=5, linewidth=3)
plt.show()

Here is what I get, but I would like the same result for all the outliers. Could you please help me?

https://ibb.co/Ns9V7Zz

Upvotes: 2

Views: 10223

Answers (5)

Robert Boscacci

Reputation: 71

Here's one quick solution:

I'll start by re-creating what you already have. You only shared the head of your dataframe, so I inserted some made-up outliers. Looks like your "estimate_gaussian()" function can only ever return two outliers?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame([549.500000,
                50.0000000,
                830.416667,
                799.695652,
                1200.00000,
                769.583333,
                681.941176,
                1300.00000,
                845.666667,
                436.086957], 
                columns=['CO2'],
                index=list(range(1,11)))

def estimate_gaussian(dataset):

    mu = np.mean(dataset) # moyenne cf mu
    sigma = np.std(dataset) # écart_type/standard deviation
    limit = sigma * 1.5

    min_threshold = mu - limit
    max_threshold = mu + limit

    return mu, sigma, min_threshold, max_threshold

mu, sigma, min_threshold, max_threshold = estimate_gaussian(df.values)

condition1 = (df < min_threshold)
condition2 = (df > max_threshold)

outliers1 = np.extract(condition1, df)
outliers2 = np.extract(condition2, df)

outliers = np.concatenate((outliers1, outliers2), axis=0)

Then we'll plot:

df_red = df[df['CO2'].isin(outliers)]  # keep only the rows whose CO2 value is an outlier

plt.scatter(df.index,df.values)
plt.scatter(df_red.index,df_red.values,c='red')
plt.show()


Let me know if you need something more nuanced!

Upvotes: 2

wwii

Reputation: 23743

There are several ways. One is to create a sequence of colors based on your condition and pass it to the c parameter.

df = pd.DataFrame({'CO2': {0: 549.5,
  1: 663.54166699999996,
  2: 830.41666699999996,
  3: 799.695652,
  4: 813.85000000000002,
  5: 769.58333300000004,
  6: 681.94117599999993,
  7: 653.33333300000004,
  8: 845.66666699999996,
  9: 436.08695700000004},
 'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}})

In [11]: colors = ['r' if n<750 else 'b' for n in df['CO2']]

In [12]: colors
Out[12]: ['r', 'r', 'b', 'b', 'b', 'b', 'r', 'r', 'b', 'r']

In [13]: plt.scatter(df['day'],df['CO2'],c=colors)

Or use np.where to create the sequence:

In [14]: colors = np.where(df['CO2'] < 750, 'r', 'b')
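
The 750 cutoff is only for illustration. A minimal sketch of the same np.where idea using the question's thresholds (mean plus or minus 1.5 standard deviations), assuming only the ten rows shown in the question; the asker's full dataset is larger, so the points flagged here differ from the ones they list:

```python
import numpy as np
import matplotlib.pyplot as plt

# The ten rows shown in the question (the full dataset is not available).
day = np.arange(1, 11)
co2 = np.array([549.5, 663.541667, 830.416667, 799.695652, 813.85,
                769.583333, 681.941176, 653.333333, 845.666667, 436.086957])

# Same thresholds as estimate_gaussian(): mean +/- 1.5 standard deviations.
mu, sigma = co2.mean(), co2.std()
min_threshold, max_threshold = mu - 1.5 * sigma, mu + 1.5 * sigma

# One boolean mask covering both tails, then one color per point.
is_outlier = (co2 < min_threshold) | (co2 > max_threshold)
colors = np.where(is_outlier, 'r', 'b')

plt.scatter(day, co2, c=colors)
plt.show()
```

Building the mask first keeps the outlier rule in one place, so changing the 1.5 multiplier changes the coloring without touching the plotting call.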

Upvotes: 0

Tarje Bargheer

Reputation: 175

I am not sure what the idea behind your col list is, but you can replace col with

col = ['red' if yy in list(outliers) else 'blue' for yy in y] 
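
A vectorized variant of the same membership test, as a rough sketch: np.isin checks every y value against outliers in one call. The short arrays below are hypothetical stand-ins for the question's y and outliers. Exact float comparison is safe here only because the outlier values were extracted from the same array.

```python
import numpy as np

# Hypothetical stand-ins for the question's y values and outliers array.
y = np.array([549.5, 663.541667, 830.416667, 436.086957])
outliers = np.array([830.416667, 436.086957])

# Element-wise equivalent of `yy in list(outliers)` for every yy in y.
is_outlier = np.isin(y, outliers)
col = np.where(is_outlier, 'red', 'blue')
```

The resulting col array can be passed to plt.scatter through the c parameter, just like the list version.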

Upvotes: 0

jfaccioni

Reputation: 7509

Possibly not the most efficient solution, but I feel like it's easier to call plt.scatter multiple times, passing a single xy pair each time. Since we never create a new figure (e.g. using plt.figure()), each xy pair is plotted on the same figure.

Then, in each iteration we just need to check if the y value is an outlier. If it is, we change the color keyword argument on the plt.scatter call.

Try this:

mu, sigma, min_threshold, max_threshold = estimate_gaussian(df['CO2'].values)

xs = df['day']
ys = df['CO2']

for x, y in zip(xs, ys):
    color = 'blue'  # non-outlier color
    if not min_threshold <= y <= max_threshold:  # condition for being an outlier
        color = 'red'  # outlier color
    plt.scatter(x, y, color=color)
plt.show()

Upvotes: 1

Sosel

Reputation: 1718

You could create an additional column (boolean) in which you define if the point is an outlier (True) or not (False), and then work with two scatter plots:

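# Boolean column: True where CO2 falls outside the thresholds from the question
df["outlier"] = (df["CO2"] < min_threshold) | (df["CO2"] > max_threshold)
plt.scatter(df.loc[df["outlier"], "day"], df.loc[df["outlier"], "CO2"], color="r")    # outliers
plt.scatter(df.loc[~df["outlier"], "day"], df.loc[~df["outlier"], "CO2"], color="k")  # other points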
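
A self-contained sketch of the boolean-column approach with a legend, assuming the ten rows from the question and the same 1.5-sigma rule. The labels "normal" and "outlier" are illustrative choices, not something from the original post.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "day": range(1, 11),
    "CO2": [549.5, 663.541667, 830.416667, 799.695652, 813.85,
            769.583333, 681.941176, 653.333333, 845.666667, 436.086957],
})

# Thresholds as in the question: mean +/- 1.5 population standard deviations.
mu = df["CO2"].mean()
sigma = df["CO2"].std(ddof=0)  # ddof=0 matches np.std's default
df["outlier"] = (df["CO2"] - mu).abs() > 1.5 * sigma

# One scatter call per group, so each group gets its own legend entry.
fig, ax = plt.subplots()
for flag, color, label in [(False, "b", "normal"), (True, "r", "outlier")]:
    subset = df[df["outlier"] == flag]
    ax.scatter(subset["day"], subset["CO2"], color=color, label=label)
ax.legend()
plt.show()
```

Drawing each group with its own call costs one extra scatter, but it gives the legend something to label, which a single call with a color array does not.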

Upvotes: 0
