Reputation: 382
I am trying to feed the output of an outlier detector into a pandas DataFrame. The data has multiple columns, each a separate time series that I want to run the detector on, named "1", "2", ..., "n". Here is a small snippet of the data:
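import pandas as pd

# Dates are quoted so they stay dates; unquoted, 2016-6-13 is integer subtraction (1997)
df = pd.DataFrame({"Datetime": ["2016-06-13", "2016-06-14", "2016-06-15", "2016-06-16"],
                   "CompanyID": [271, 271, 271, 271],
                   "1": [140, 143, 142, 143],
                   "2": [42, 43, 49, 230]})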
I do not think the full code is of use, but here it is anyways:
# support vector machines outlier detection
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm

def find_outliers(ts, perc=0.01, figsize=(15,5)):
    ## fit svm
    scaler = preprocessing.StandardScaler()
    ts_scaled = scaler.fit_transform(ts.values.reshape(-1,1))
    model = svm.OneClassSVM(nu=perc, kernel="rbf", gamma=0.01)
    model.fit(ts_scaled)
    ## dtf output
    df_outliers = ts.to_frame(name="ts")
    df_outliers["index"] = ts.index
    df_outliers["outlier"] = model.predict(ts_scaled)
    # OneClassSVM labels outliers as -1; map them to 1 and everything else to 0
    df_outliers["outlier"] = df_outliers["outlier"].apply(lambda x: 1 if x==-1 else 0)
    ## plot
    fig, ax = plt.subplots(figsize=figsize)
    ax.set(title="Outliers detection: found "
           +str(sum(df_outliers["outlier"]==1)))
    ax.plot(df_outliers["index"], df_outliers["ts"],
            color="black")
    ax.scatter(x=df_outliers[df_outliers["outlier"]==1]["index"],
               y=df_outliers[df_outliers["outlier"]==1]["ts"],
               color="red")
    ax.grid(True)
    plt.show()

for column in df.columns[2:]:
    find_outliers(df[column])
The output from the anomaly detector from running
print(df_outliers["outlier"] == 1)
print(type(df_outliers))
inside the function is:
Datetime
2016-06-13     True
2016-06-14     True
2016-06-15     True
2016-06-16     True
2016-06-17     True
              ...
2021-02-03    False
2021-02-04    False
2021-02-05    False
2021-02-06    False
2021-02-07     True
Name: outlier, Length: 1425, dtype: bool
<class 'pandas.core.frame.DataFrame'>
I want to transform this into a DataFrame that looks like the input data, except that each time-series column ("1", "2", ..., "n") contains True/False instead of the original values.
Upvotes: 0
Views: 358
Reputation: 16172
You could return the outlier column from your function and overwrite each input column with the returned values cast to bool.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm

# Note: unquoted, 2016-6-13 is integer subtraction and evaluates to 1997
# (see the Datetime column in the output); quote the values to keep real dates.
df = pd.DataFrame({"Datetime": [2016-6-13,2016-6-14,2016-6-15,2016-6-16],
                   "CompanyID": [271, 271, 271, 271],
                   "1": [140, 143, 142, 143],
                   "2": [42, 43, 49, 230]})

# support vector machines outlier detection
def find_outliers(ts, perc=0.01, figsize=(15,5)):
    ## fit svm
    scaler = preprocessing.StandardScaler()
    ts_scaled = scaler.fit_transform(ts.values.reshape(-1,1))
    model = svm.OneClassSVM(nu=perc, kernel="rbf", gamma=0.01)
    model.fit(ts_scaled)
    ## dtf output
    df_outliers = ts.to_frame(name="ts")
    df_outliers["index"] = ts.index
    df_outliers["outlier"] = model.predict(ts_scaled)
    # OneClassSVM labels outliers as -1; map them to 1 and everything else to 0
    df_outliers["outlier"] = df_outliers["outlier"].apply(lambda x: 1 if x==-1 else 0)
    ## plot
    fig, ax = plt.subplots(figsize=figsize)
    ax.set(title="Outliers detection: found "
           +str(sum(df_outliers["outlier"]==1)))
    ax.plot(df_outliers["index"], df_outliers["ts"],
            color="black")
    ax.scatter(x=df_outliers[df_outliers["outlier"]==1]["index"],
               y=df_outliers[df_outliers["outlier"]==1]["ts"],
               color="red")
    ax.grid(True)
    plt.show()
    # Return outlier column here
    return df_outliers["outlier"]

for column in df.columns[2:]:
    # Capture outlier column
    outliers = find_outliers(df[column])
    # Overwrite values with bool outlier values
    df[column] = outliers.astype(bool)
Output
   Datetime  CompanyID      1      2
0      1997        271  False   True
1      1996        271  False  False
2      1995        271  False  False
3      1994        271  False   True
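If you would rather keep the original values and get the flags in a separate DataFrame, something along these lines might work. This is only a sketch: `outlier_flags`, `value_cols`, and `flags` are names I made up for illustration, the plotting is left out, and I am assuming the first two columns are always the non-series ones, as in your snippet.

```python
import pandas as pd
from sklearn import preprocessing, svm


def outlier_flags(ts, perc=0.01):
    """Return a boolean Series aligned with ts: True where OneClassSVM predicts -1."""
    scaled = preprocessing.StandardScaler().fit_transform(
        ts.to_numpy().reshape(-1, 1)
    )
    model = svm.OneClassSVM(nu=perc, kernel="rbf", gamma=0.01)
    labels = model.fit_predict(scaled)  # -1 marks an outlier, 1 an inlier
    return pd.Series(labels == -1, index=ts.index, name=ts.name)


df = pd.DataFrame({
    "Datetime": pd.to_datetime(
        ["2016-06-13", "2016-06-14", "2016-06-15", "2016-06-16"]
    ),
    "CompanyID": [271, 271, 271, 271],
    "1": [140, 143, 142, 143],
    "2": [42, 43, 49, 230],
})

# Assumption: every column after the first two is a time series to check
value_cols = df.columns[2:]

# Run the detector column by column, then stitch the flags back next to
# the untouched identifier columns; df itself is not modified
flags = pd.concat(
    [df[["Datetime", "CompanyID"]], df[value_cols].apply(outlier_flags)],
    axis=1,
)
print(flags)
```

Because `flags` has the same index and column names as `df`, a single series' outliers can then be selected with something like `df.loc[flags["2"], "2"]`.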
Upvotes: 1