Reputation: 477
I have two DataFrames that I built using pandas. Using a Boolean index, I can have pandas tell me when my data falls outside a given range. I want to highlight these outliers on the same graph as the raw data. My attempts are listed further below, commented out; none of them work. My question is this: how can I highlight the outliers in my graph?
This is my code that finds the outliers in my dataframes:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
#plt.style.use("dark_background")
plt.style.use("seaborn-bright")
x4 = (e[0].time[:47172])
y4 = (e[0].data.f[:47172])
x6 = (t[0].time[:47211])
y6 = (t[0].data.f[:47211])
df4 = pd.DataFrame({'Time': x4, 'Data': y4})
df4['Outlier'] = (df4['Data'] < 2) | (df4['Data'] > 4)
#----This prints out only outliers
df4[df4.Outlier]
df6 = pd.DataFrame({'Time': x4, 'Data': y4})
df6['Outlier'] = (df6['Data'] < 2) | (df6['Data'] > 4)
#----This prints out only outliers
df6[df6.Outlier]
plt.xlabel('Relative Time in Seconds', fontsize=12)
plt.ylabel('Data', fontsize=12)
plt.grid(linestyle = 'dashed')
This just plots the raw data:
plt.plot(x4, y4)
plt.plot(x6, y6)
plt.show()
This is an example of what my dataframe looks like:
        Data          Time  Outlier
0      0.000      7.343689     True
1      0.000      7.391689     True
2      0.000      7.439689     True
...      ...           ...      ...
47169  2.315  15402.062500    False
47170  0.000  15402.110352     True
47171  0.000  18682.187500     True
[47172 rows x 3 columns]
These are my attempts that do not work:
#fig = plt.figure()
#ax=fig.add_subplot(111)
#ax.plot((df4 < 2), (df4 > 4), color="r")
^This one just plots a straight line, which is incorrect.
#df4.plot((df4['Data'] < 2), (df4['Data'] > 4), color = "r")
^This one produces a graph that has 'True' and 'False' on the x-axis instead of time.
I'm thinking something like this for loop might work but I'm not sure how to implement it. Any help/feedback would be appreciated.
for True in 'Outlier':
    plt.plot(x4, y4, color='r')
Upvotes: 0
Views: 3268
Reputation: 1498
You already managed to print only the outliers, so now you can simply plot them on top of your normal data, for example like this:
plt.plot(x4, y4) # Data
plt.plot(x4[df4.Outlier], y4[df4.Outlier], 'r.') # Outlier highlights
plt.plot(x6, y6)
plt.plot(x6[df6.Outlier], y6[df6.Outlier], 'r.')
plt.show()
The important thing is to use the Boolean series (e.g. df4.Outlier) as a mask to retrieve the actual outlier values by indexing. In your non-functional examples, you are instead plotting the mask itself.
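To make the masking idea concrete, here is a minimal, self-contained sketch. It uses synthetic data because the original e[0] and t[0] objects are not shown, so the signal shape, the random seed, and the variable names time, data, and outliers are assumptions made purely for illustration. The thresholds 2 and 4 are the ones from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the real signal (assumption: the real data is 1-D).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 100.0, 500)
data = 3.0 + 0.4 * rng.standard_normal(time.size)
data[::50] = 0.0  # force a few values below the lower limit

df = pd.DataFrame({"Time": time, "Data": data})
df["Outlier"] = (df["Data"] < 2) | (df["Data"] > 4)

# Index with the Boolean column to keep only the outlier rows.
outliers = df[df["Outlier"]]

fig, ax = plt.subplots()
ax.plot(df["Time"], df["Data"], label="raw data")                    # full series
ax.plot(outliers["Time"], outliers["Data"], "r.", label="outliers")  # red dots on top
ax.set_xlabel("Relative Time in Seconds")
ax.set_ylabel("Data")
ax.legend()
# plt.show()  # uncomment when running interactively
```

Selecting the rows from the DataFrame first, rather than indexing x4 and y4 directly, keeps the time and data values paired even if the original arrays are not NumPy arrays.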
Side note 1: You can skip the entire pandas part in your code (unless you need it somewhere else) and just do:
mask4 = np.logical_or(y4 < 2, y4 > 4)
mask6 = np.logical_or(y6 < 2, y6 > 4)
plt.plot(x4, y4)
plt.plot(x4[mask4], y4[mask4], 'r.')
plt.plot(x6, y6)
plt.plot(x6[mask6], y6[mask6], 'r.')
plt.show()
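The NumPy-only version relies on y4 and y6 being NumPy arrays; comparing a plain Python list with a number (y4 < 2) raises a TypeError. A small sketch of one way to guard against that is shown here. The helper name outlier_mask and the sample values are hypothetical and not part of the original code.

```python
import numpy as np

def outlier_mask(values, low=2, high=4):
    """Return a Boolean array that is True where a value lies outside [low, high]."""
    arr = np.asarray(values, dtype=float)  # accepts lists, pandas Series, or arrays
    return np.logical_or(arr < low, arr > high)

y = [0.0, 2.5, 4.7, 3.9]  # plain list with hypothetical values
mask = outlier_mask(y)
print(mask)
```

The resulting mask can then be used exactly as mask4 and mask6 are, provided the x and y values being indexed are arrays as well.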
Side note 2: there's a mistake in the line where you create df6: you're using x4 and y4 instead of x6 and y6 as input. It should read:
df6 = pd.DataFrame({'Time': x6, 'Data': y6})
Side note 3: the loop approach is much less efficient and less elegant than Boolean masking, but here's how it would work (for the sake of learning):
for index, truth_value in enumerate(df4.Outlier):
    if truth_value:
        plt.plot(x4[index], y4[index], 'r.')
Or as a list comprehension:
[plt.plot(x4[i], y4[i], 'r.') for i,t in enumerate(df4.Outlier) if t]
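As a quick check that the loop and the mask pick out the same points, the following sketch compares the indices each approach selects. The array y holds made-up sample values standing in for y4. The practical difference is that the loop calls plt.plot once per outlier and creates a separate line object each time, while the mask needs a single call.

```python
import numpy as np

# Hypothetical sample values standing in for y4.
y = np.array([0.0, 2.5, 3.1, 4.7, 3.9, 1.2])
mask = (y < 2) | (y > 4)

# Loop version: collect the index of every True entry.
loop_idx = [i for i, flag in enumerate(mask) if flag]

# Vectorized version: NumPy returns the same indices in one call.
mask_idx = np.flatnonzero(mask)

print(loop_idx)           # [0, 3, 5]
print(mask_idx.tolist())  # [0, 3, 5]
```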
Upvotes: 2