Reputation: 272
I am working on a school project about outlier detection. I plan to create my own small dataset and run DBSCAN on it. The dataset would record whether a click on a website ad is a cheat or not. Below are the details of the dataset I am going to create.
Dataset Name: Cheat Ads Click detection.
Column:value
source: (categorical) url: 0, redirect: 1, search: 2
visited_before: (categorical) no: 0, few_time: 1, fan: 2
time_on_site(seconds): (numerical) time the user spends on the site before leaving, in seconds.
active_type: (categorical) fake_active: 0 (they just open the website and do nothing except click ads), normal_active: 1, real_active: 2 (Maybe I will turn this into an activity score: a float value from 0 to 10.)
Cheat (label): (categorical) no: 0, yes: 1
Maybe I will add some other columns, such as the number of times the user clicks on ads, ...
My question is: do you think DBSCAN can work well on this dataset? If yes, can you please give me some tips for making a good dataset, or for creating the dataset faster? And if no, please suggest some other datasets that DBSCAN works well with.
Thank you so much.
Upvotes: 0
Views: 1699
Reputation: 755
DBSCAN has an inherent ability to detect outliers, since points that are outliers will fail to belong to any cluster. Wikipedia states:
it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away)
This can be easily demonstrated using synthetic datasets from sklearn such as make_moons and make_blobs. Sklearn has a pretty decent demo on this.
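import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
# Two interleaving half-circles with Gaussian noise added
x, label = make_moons(n_samples=200, noise=0.1, random_state=19)
plt.plot(x[:, 0], x[:, 1], 'ro')
plt.show()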
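As a minimal sketch of how DBSCAN marks points that belong to no cluster, the snippet below runs sklearn's DBSCAN on the same make_moons data with three far-away points appended. The extra points and the eps and min_samples values are assumptions picked for illustration, not tuned settings. Points that DBSCAN assigns to no cluster receive the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Same synthetic data as above
x, _ = make_moons(n_samples=200, noise=0.1, random_state=19)

# Three isolated points, added by hand for illustration (assumed values)
far_points = np.array([[3.5, 3.0], [-3.0, -2.5], [4.0, -3.0]])
data = np.vstack([x, far_points])

# eps and min_samples are illustrative guesses, not tuned values
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(data)

# DBSCAN uses the label -1 for points that belong to no cluster
noise_mask = labels == -1
print("points labelled as noise:", int(noise_mask.sum()))
print("labels of the three added points:", labels[-3:])
```

Changing eps or min_samples changes which points end up labelled -1, so on real data these two parameters need to be chosen with care.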
I implemented the DBSCAN algorithm a while ago as a learning exercise (the repo has since been moved). However, as Anony-Mousse has stated:
noise (low density) is not the same as outlier
And the intuition learned from synthetic datasets doesn't necessarily carry over to actual real-life data. So the above-suggested dataset and implementation are only meant for learning purposes.
Upvotes: 1
Reputation: 77454
You are describing a classification problem, not a clustering problem.
Also, that data does not have a notion of density, does it?
Last but not least, (A) click fraud is heavily clustered, not outliers, (B) noise (low density) is not the same as outlier (rare) and (C) first get the data, then speculate about possible algorithms, because what if you can't get the data?
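Points (A) and (B) can be illustrated with a small sketch on made-up data. The two features, the group sizes, and the DBSCAN parameters below are all assumptions chosen for illustration, not real click data. A tight group of "fraud" points is dense, so DBSCAN assigns it to a cluster instead of labelling it as noise (-1).

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical, already-scaled features for 300 normal visits (spread out)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))

# Hypothetical fraud clicks: 50 near-identical visits, tightly packed
fraud = rng.normal(loc=5.0, scale=0.02, size=(50, 2))

data = np.vstack([normal, fraud])

# eps and min_samples are illustrative guesses, not tuned values
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(data)

fraud_labels = labels[-50:]
print("fraud points labelled as noise:", int((fraud_labels == -1).sum()))
print("cluster labels among fraud points:", set(fraud_labels.tolist()))
print("normal points labelled as noise:", int((labels[:300] == -1).sum()))
```

In this toy setup, any points flagged as noise come from the sparse edges of the normal traffic, while the fraud group forms a cluster of its own.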
Upvotes: 0