Reputation: 19
I need to classify given URLs as porn or non-porn with a Python script (not by visiting them in person and watching videos). I thought about calculating a porn probability for each URL by classifying the words it contains, e.g. if a URL contains the words 'bang' and '18', there is a high probability it's a porn site. I tried implementing this, but it isn't very accurate. Are there any Python libraries that can help me classify those URLs? I'm looking for libraries which can learn from labeled data, like smart anti-spam filters, for example:
data = {
'google.com':0,
'superxxx.com':1,
'bigbangtheory.com':0,
'hot18bangbang.com':1,
...
...
}
and so on. I have a pretty big collection of 'bad' URLs, so I think I could train some AI classifier. If this is a bad idea, could you recommend any other way of filtering out 'bad' URLs from 'good' URLs?
Upvotes: 0
Views: 431
Reputation: 3791
The modern approach is to use a character-level LSTM sequence classifier. It requires a fairly large amount of data, but that shouldn't be too hard to find, for example by collecting family-filter blacklists.
Here are some examples of the concept:
Recurrent neural networks are neural networks that take their own output as input for the next step, or that learn to output state vectors that are passed to their own cell at the next step to represent short term memory.
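To make the "state passed to the next step" idea concrete, here is a minimal sketch of a plain recurrent cell in pure Python. It is not an LSTM (an LSTM adds gates on top of this), the weights are random and untrained, and the names `rnn_step` and `run_sequence` are made up for illustration.

```python
import math
import random

def rnn_step(x, h, W_xh, W_hh, b):
    """One recurrent step: the new state depends on the input AND the previous state."""
    new_h = []
    for j in range(len(h)):
        total = b[j]
        total += sum(W_xh[j][i] * x[i] for i in range(len(x)))   # contribution of the input
        total += sum(W_hh[j][k] * h[k] for k in range(len(h)))   # contribution of the old state
        new_h.append(math.tanh(total))
    return new_h

def run_sequence(inputs, hidden_size, seed=0):
    """Feed a list of input vectors through the cell and return the final state."""
    rng = random.Random(seed)
    input_size = len(inputs[0])
    # Random, untrained weights: this only illustrates the data flow.
    W_xh = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
    W_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
    b = [0.0] * hidden_size
    h = [0.0] * hidden_size          # initial "memory" is empty
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)   # the state is fed back in at every step
    return h

# Three toy one-hot inputs; the final state summarizes the whole sequence.
final_state = run_sequence([[1, 0, 0], [0, 1, 0], [0, 0, 1]], hidden_size=4)
```

In a real classifier the final state would go through one more trained layer to produce the porn / non-porn score.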
Basically, your features are sequences of subsequences of letters (e.g. friendship becomes [frien, riend, iends, endsh, ...] in one-hot representation), and you have a neural net whose state evolves with each subsequence it sees and gives you a judgement at the end.
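The feature extraction described above could look roughly like this. The window size of 5 and the helper names `char_ngrams` and `one_hot` are assumptions for the sketch, and the vocabulary here is built from a single word, whereas in practice it would come from the whole training set.

```python
def char_ngrams(text, n=5):
    """Slide a window of n characters over the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def one_hot(index, size):
    """Vector of zeros with a single 1 at the given position."""
    vec = [0] * size
    vec[index] = 1
    return vec

grams = char_ngrams("friendship", n=5)

# Assign each distinct n-gram an integer id, then encode the sequence.
vocab = {g: i for i, g in enumerate(sorted(set(grams)))}
encoded = [one_hot(vocab[g], len(vocab)) for g in grams]
```

Each vector in `encoded` is what the network would see at one step of the sequence.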
Upvotes: 1
Reputation: 18877
This is a good use case for logistic regression, but it's not a very good question for Stack Overflow. If you already have the training data, go find a tool (or implement this yourself, because it wouldn't be that difficult) and then ask a question about the troubles you're having getting it to work. Stack Overflow is not the place to ask for recommendations on tools to use.
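For what "implement this yourself" might look like, here is a rough sketch of logistic regression over character trigrams, trained with plain stochastic gradient descent and using only the standard library. The trigram features, the learning rate, the epoch count and the function names are all assumptions made for this sketch. The four rows are the sample from the question, which is far too small for a real filter.

```python
import math
from collections import defaultdict

def char_ngrams(text, n=3):
    """Character n-grams used as features (n=3 is an arbitrary choice)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def sigmoid(z):
    # Numerically stable form for both signs of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, url):
    """Probability that the URL belongs to class 1 ('bad')."""
    z = bias + sum(weights.get(g, 0.0) for g in char_ngrams(url))
    return sigmoid(z)

def train(data, epochs=200, lr=0.1):
    """Fit one weight per n-gram by stochastic gradient descent on log loss."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for url, label in data.items():
            error = predict_proba(weights, bias, url) - label  # gradient of log loss w.r.t. z
            bias -= lr * error
            for g in char_ngrams(url):
                weights[g] -= lr * error
    return weights, bias

# Sample rows taken from the question; a real run needs far more data.
data = {
    'google.com': 0,
    'superxxx.com': 1,
    'bigbangtheory.com': 0,
    'hot18bangbang.com': 1,
}
weights, bias = train(data)
score = predict_proba(weights, bias, 'superxxx.com')
```

With a large labeled list, the same idea is usually done with an existing library rather than a hand-written loop, and some regularization would be needed to keep rare n-grams from dominating.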
Upvotes: 1