Sunil

Reputation: 307

Algorithm to analyze a list of URLs and blacklist those containing blacklisted words

Let's say there is a list of URLs in a text file (millions of them) and another text file containing a list of blacklisted words.

I want to do the following processing on the list of URLs:

- Parse the URLs and store them in some data structure.
- Process the URLs and blacklist those which contain at least one of the blacklisted words.
- If a URL contains 50% or more blacklisted words, add the remaining words of that URL to the list of blacklisted words.
- Since the list of blacklisted words has now grown, URLs that were not blacklisted earlier may become blacklisted. The algorithm should handle this case as well and mark previously whitelisted URLs as blacklisted if they contain any of the newly added blacklisted words.

In the end, I should have a list of whitelisted URLs.
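
Roughly, this is the process I have in mind. The sketch below is only to illustrate it; tokenizing on non-alphanumeric characters, keeping everything in in-memory sets, and the exact handling of the 50% rule are assumptions, not requirements:

```python
import re

def tokenize(url):
    # Split a URL into lowercase word tokens on non-alphanumeric characters.
    return set(t for t in re.split(r'[^a-z0-9]+', url.lower()) if t)

def classify(urls, blacklist):
    # urls: list of URL strings; blacklist: initial set of blacklisted words.
    # Repeats full passes until no new words are added, so the result
    # does not depend on the order of the input.
    blacklist = set(blacklist)
    tokens = {u: tokenize(u) for u in urls}
    changed = True
    while changed:
        changed = False
        new_words = set()
        for words in tokens.values():
            bad = words & blacklist
            if bad and len(bad) >= 0.5 * len(words):
                # 50% or more of this URL's words are blacklisted:
                # propose its remaining words as blacklisted too.
                new_words |= words - blacklist
        if new_words:
            blacklist |= new_words
            changed = True
    whitelisted = [u for u, words in tokens.items() if not (words & blacklist)]
    return whitelisted, blacklist
```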

Any suggestions on what would be the best algorithm and data structures to achieve the most effective solution in terms of time and space complexity?

Upvotes: 2

Views: 923

Answers (2)

Rob Neuhaus

Reputation: 9290

I'll just answer about the machine learning part of this problem, and leave the data structure/efficient text matching stuff open.

First, if non-determinism is a problem, you can do an entire pass over the data with a given version of the blacklist/classifier before updating it. This way, you will get the same answer regardless of the order of the input. On the other hand, I don't think that determinism should be prized above all else. In machine learning, stochastic gradient descent is not order invariant, but it does well in practice, especially on large data sets.

Second, you probably want to be careful with a straight-up blacklist, unless you are willing to manually filter proposed candidate terms (and maybe even then), and might want to use a softer probabilistic approach. Naive Bayes is the simplest approach that could work reasonably. This might let you label a medical article about viagra as non-spam, for example, even though it contains a spammy word, because it comes from a very trusted host (like WebMD).
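
For illustration, here is a minimal sketch of that kind of soft scoring with a Naive Bayes classifier using scikit-learn; the example URLs, labels, and tokenization are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up seed data: URLs labelled 1 (spam) or 0 (not spam).
train_urls = ["example.com/cheap-viagra-pills",
              "spam.example.net/free-casino-bonus",
              "webmd.com/drugs/viagra-information",
              "news.example.org/election-results"]
train_labels = [1, 1, 0, 0]

# Treat runs of letters/digits in the URL as tokens.
vectorizer = CountVectorizer(token_pattern=r"[a-z0-9]+", lowercase=True)
X = vectorizer.fit_transform(train_urls)

clf = MultinomialNB().fit(X, train_labels)

# Each URL gets a spam probability instead of a hard blacklist hit, so one
# spammy word on an otherwise trusted host need not condemn the whole URL.
probs = clf.predict_proba(vectorizer.transform(["webmd.com/viagra-side-effects"]))
print(probs[0][1])  # P(spam) for the new URL; classes are sorted, so index 1 == label 1
```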

You could use your hand-chosen blacklist terms to generate an initial set of negatively labelled URLs. Then you can come up with another set of positive labels (perhaps even picking randomly from your corpus, assuming it's mostly good), and use that to train an initial Naive Bayes classifier. Then you can leverage your idea of changing the classifier based on its own output on your specific corpus by doing semi-supervised learning. This will give you the nice learning/avalanche effect from a small initial set of labelled documents.
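
A minimal self-training sketch along those lines; the confidence threshold, round count, and function names are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def self_train(seed_urls, seed_labels, unlabeled_urls, threshold=0.95, rounds=5):
    # Start from a small hand-labelled seed set (1 = spam, 0 = not spam), then
    # repeatedly add the unlabelled URLs the classifier is most confident about
    # and retrain on the grown set.
    vec = CountVectorizer(token_pattern=r"[a-z0-9]+", lowercase=True)
    vec.fit(list(seed_urls) + list(unlabeled_urls))
    urls, labels = list(seed_urls), list(seed_labels)
    pool = list(unlabeled_urls)
    clf = MultinomialNB().fit(vec.transform(urls), labels)
    for _ in range(rounds):
        if not pool:
            break
        probs = clf.predict_proba(vec.transform(pool))
        still_unlabeled = []
        for url, p in zip(pool, probs):
            if p.max() >= threshold:
                urls.append(url)
                labels.append(clf.classes_[p.argmax()])
            else:
                still_unlabeled.append(url)
        if len(still_unlabeled) == len(pool):
            break  # nothing was confident enough; stop early
        pool = still_unlabeled
        clf = MultinomialNB().fit(vec.transform(urls), labels)
    return clf
```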

Upvotes: 1

Timothy

Reputation: 4487

Use a matrix to store the URLs.

  1. First, break each URL into words, stem them with a Porter stemmer, and put them in the matrix (one row per URL, one item per word).

  2. Then use TF-IDF to score each word in the matrix and remove the low-scoring words (these would be common words like 'a', 'the', etc., which are not informative for judging spam); see the sketch after the list.

  3. Initialize the blacklist manually (seed it with some common blacklisted words).

  4. Then run the process as you described.
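
A minimal sketch of steps 1 and 2, assuming NLTK's PorterStemmer and scikit-learn's TfidfVectorizer; the example URLs and the score cutoff are arbitrary placeholders:

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def url_to_stems(url):
    # Step 1: split the URL into words and stem each one.
    words = re.split(r"[^a-zA-Z0-9]+", url.lower())
    return " ".join(stemmer.stem(w) for w in words if w)

urls = ["example.com/cheap-viagra-pills", "news.example.org/the-latest-politics"]
docs = [url_to_stems(u) for u in urls]

# Step 2: score every stem with TF-IDF; low-scoring stems can be dropped.
vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)          # one row per URL, one column per stem
scores = matrix.max(axis=0).toarray()[0]  # best score of each stem across URLs
keep = {w for w, s in zip(vec.get_feature_names_out(), scores) if s > 0.2}
print(keep)
```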

Upvotes: 1
