eliza.b

Reputation: 477

Highlight outliers in pandas dataframe for matplotlib graph

I have two dataframes that I built using pandas. Using a Boolean index, I can get pandas to tell me when my data falls outside a given range. I want to highlight these outliers on the same graph as the raw data. My attempts are commented out in the code below; none of them work. My question is this: how can I highlight the outliers in my graph?

This is my code that finds the outliers in my dataframes:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
#plt.style.use("dark_background")
plt.style.use("seaborn-bright")  # named "seaborn-v0_8-bright" in matplotlib >= 3.6

x4 = (e[0].time[:47172])
y4 = (e[0].data.f[:47172])

x6 = (t[0].time[:47211])
y6 = (t[0].data.f[:47211])

df4 = pd.DataFrame({'Time': x4, 'Data': y4})
df4['Outlier'] = (df4['Data'] < 2) | (df4['Data'] > 4)
#----This prints out only outliers
df4[df4.Outlier] 

df6 = pd.DataFrame({'Time': x4, 'Data': y4})
df6['Outlier'] = (df6['Data'] < 2) | (df6['Data'] > 4)
#----This prints out only outliers
df6[df6.Outlier]

plt.xlabel('Relative Time in Seconds', fontsize=12)
plt.ylabel('Data', fontsize=12)
plt.grid(linestyle = 'dashed')

This just plots the raw data:

plt.plot(x4, y4)
plt.plot(x6, y6)
plt.show()

This is an example of what my dataframe looks like:

        Data          Time  Outlier
0      0.000      7.343689     True
1      0.000      7.391689     True
2      0.000      7.439689     True
...    ...       ...          ...
47169  2.315  15402.062500    False
47170  0.000  15402.110352     True
47171  0.000  18682.187500     True
[47172 rows x 3 columns]

These are my attempts that do not work:

#fig = plt.figure()
#ax=fig.add_subplot(111)
#ax.plot((df4 < 2), (df4 > 4), color="r")

^ This one just plots a straight line, which is incorrect.

#df4.plot((df4['Data'] < 2), (df4['Data'] > 4), color = "r")

^ This one produces a graph that has 'True' and 'False' on the x-axis instead of time.

I'm thinking something like this for loop might work but I'm not sure how to implement it. Any help/feedback would be appreciated.

for True in 'Outlier':
    plt.plot(x4, y4, color='r')

Upvotes: 0

Views: 3268

Answers (1)

WhoIsJack

Reputation: 1498

You already managed to print only the outliers, so now you can simply plot them on top of your normal data, for example like this:

plt.plot(x4, y4)  # Data
plt.plot(x4[df4.Outlier], y4[df4.Outlier], 'r.')  # Outlier highlights
plt.plot(x6, y6)
plt.plot(x6[df6.Outlier], y6[df6.Outlier], 'r.')
plt.show()

The important thing is to use the Boolean series (e.g. df4.Outlier) as a mask to retrieve the actual outlier values by indexing. In your non-functional examples, you are instead plotting the mask itself.
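Since `e` and `t` from the question are not available here, the following is only a minimal, self-contained sketch of the same masking idea. The synthetic data (`rng`, the 0.048 s time step, the normal distribution) is an assumption made purely for illustration; the 2 and 4 thresholds are the ones from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the real measurements (an assumption for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Time": np.arange(200) * 0.048,
    "Data": rng.normal(loc=3.0, scale=0.6, size=200),
})

# Same rule as in the question: values below 2 or above 4 count as outliers.
df["Outlier"] = (df["Data"] < 2) | (df["Data"] > 4)

# Use the Boolean column as a mask to select only the outlier rows.
outliers = df[df["Outlier"]]

fig, ax = plt.subplots()
ax.plot(df["Time"], df["Data"], label="Data")                        # raw data
ax.plot(outliers["Time"], outliers["Data"], "r.", label="Outliers")  # highlights
ax.set_xlabel("Relative Time in Seconds")
ax.set_ylabel("Data")
ax.legend()
```

Call `plt.show()` (or `fig.savefig(...)`) afterwards to see the result. Selecting the rows from the dataframe itself keeps the time and data values aligned, whatever type the original `x4` and `y4` happen to be.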


Side note 1: You can skip the entire pandas part in your code (unless you need it somewhere else) and just do:

mask4 = np.logical_or(y4 < 2, y4 > 4)
mask6 = np.logical_or(y6 < 2, y6 > 4)

plt.plot(x4, y4)
plt.plot(x4[mask4], y4[mask4], 'r.')
plt.plot(x6, y6)
plt.plot(x6[mask6], y6[mask6], 'r.')

plt.show()
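The pandas-free version above assumes that `y4` and `y6` are NumPy arrays; a plain Python list would raise an error on `y4 < 2`. A small sketch with made-up values (the array `y` is hypothetical) shows a defensive conversion and the equivalent `|` operator form:

```python
import numpy as np

# Hypothetical sample values; np.asarray makes the comparison work for lists too.
y = np.asarray([0.0, 2.5, 3.9, 4.2, 1.9, 2.0, 4.0])

# Two equivalent ways to build the outlier mask.
mask_func = np.logical_or(y < 2, y > 4)
mask_op = (y < 2) | (y > 4)   # parentheses are required: | binds tighter than <

# Inverting the mask with ~ selects the values that are NOT outliers.
inliers = y[~mask_op]
```

Note that the boundary values 2.0 and 4.0 are not flagged, because the comparisons are strict.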

Side note 2: there's a mistake in the line where you create df6: you're using x4 and y4 instead of x6 and y6 as input.


Side note 3: the loop approach is much less efficient and less elegant than Boolean masking, but here's how it would work (for the sake of learning):

for index,truth_value in enumerate(df4.Outlier):
    if truth_value:
        plt.plot(x4[index], y4[index], 'r.')

Or as a list comprehension:

[plt.plot(x4[i], y4[i], 'r.') for i,t in enumerate(df4.Outlier) if t]
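To make the difference concrete, here is a rough sketch on a tiny made-up dataset (the arrays `x` and `y` are hypothetical). The loop creates one line object per outlier, while the masked call creates a single one for all of them, which matters when there are tens of thousands of points.

```python
import numpy as np
import matplotlib.pyplot as plt

# Tiny hypothetical dataset, thresholds as in the question.
x = np.arange(10, dtype=float)
y = np.array([0.0, 2.5, 3.0, 4.5, 3.2, 1.0, 2.8, 5.0, 3.1, 0.5])
mask = (y < 2) | (y > 4)

fig, (ax_loop, ax_mask) = plt.subplots(1, 2)

# Loop version: one plot call (and one Line2D artist) per outlier.
for i, is_outlier in enumerate(mask):
    if is_outlier:
        ax_loop.plot(x[i], y[i], "r.")

# Masked version: a single plot call for all outliers.
ax_mask.plot(x[mask], y[mask], "r.")

print(len(ax_loop.lines), len(ax_mask.lines))  # prints: 5 1
```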

Upvotes: 2
