ecole96

Reputation: 33

Concepts to measure text "relevancy" to a subject?

I do side work writing and improving a research project web application for some political scientists. The application collects articles pertaining to the U.S. Supreme Court and runs analysis on them, and after nearly a year and a half, we have a database of around 10,000 articles (and growing) to work with.

One of the primary challenges of the project is determining the "relevancy" of an article - that is, whether its primary focus is the federal U.S. Supreme Court (and/or its justices), rather than a local or foreign supreme court. Since the project's inception, the way we've addressed this is primarily to parse the title for various explicit references to the federal court, as well as to verify that "supreme" and "court" are among the keywords collected from the article text. Basic and sloppy, but it actually works fairly well. That said, irrelevant articles still find their way into the database - usually ones with headlines that don't explicitly mention a state or foreign country (the Indian Supreme Court is the usual offender).

I've reached a point in development where I can focus on this aspect of the project more, but I'm not quite sure where to start. All I know is that I'm looking for a method of analyzing article text to determine its relevance to the federal court, and nothing else. I imagine this will entail some machine learning, but I've got basically no experience in the field. I've done a little reading into things like tf-idf weighting, vector space modeling, and word2vec (plus the CBOW and Skip-Gram models), but I'm not yet seeing a "big picture" that shows me just how applicable these concepts are to my problem. Can anyone point me in the right direction?

Upvotes: 1

Views: 1556

Answers (3)

lunguini

Reputation: 991

Framing the Problem

When starting a novel machine learning project like this, there are a few fundamental questions to think through that can help you refine the problem and make your literature review and experiments more effective.

  1. Do you have the right data to build a model? You have ~10,000 articles that will be your model input; however, to use a supervised learning approach you will need trustworthy labels for every article used in model training. It sounds like you have already done this.

  2. What metric(s) will you use to quantify success? How can you measure whether your model is doing what you want? In your case this sounds like a binary classification problem - you want to label articles as relevant or not. You could measure success with a standard binary classification metric like area under the ROC curve (AUC). Or, since your specific issue is false positives, you could focus on a metric like precision.

  3. How well can you do with a random or naive approach? Once a dataset and metric have been established, you can quantify how well you do at the task with a basic approach. This could be as simple as calculating your metric for a model that chooses at random, but in your case your keyword parser is the perfect way to set a benchmark. Quantify how well the keyword-parsing approach does on your dataset so you can tell when a machine learning model is actually doing better (see the sketch just after this list).
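A minimal sketch of points 2 and 3, assuming you have a list `labels` (1 = relevant, 0 = not) and a parallel list `keyword_predictions` holding the output of your existing keyword parser - both names are just placeholders for illustration:

from sklearn.metrics import precision_score, recall_score

precision = precision_score(labels, keyword_predictions)  # fraction of flagged articles that are truly relevant
recall = recall_score(labels, keyword_predictions)  # fraction of relevant articles that were caught
print(f"keyword baseline: precision={precision:.3f}, recall={recall:.3f}")

Any learned model then has concrete numbers to beat; for a model that outputs probabilities, sklearn.metrics.roc_auc_score(labels, probabilities) gives the threshold-free AUC mentioned above.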

Sorry if this is obvious and basic to you, but I wanted to make sure it was in the answer. In an innovative, open-ended project like this, diving straight into machine learning experiments without thinking through these fundamentals can be inefficient.

Machine Learning Approaches

As suggested by Evan Mata and Stefan G., the best approach is to first reduce your articles to feature vectors. This can be done without machine learning (e.g., a vector space model) or with machine learning (word2vec and the other examples you cited). For your problem, I think something like a bag-of-words (BOW) representation makes sense to try as a starting point.

Once you have a feature representation of your articles you are almost done and there are a number of binary classification models that will do well. Experiment from here to find the best solution.
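As a rough illustration of that two-step pipeline (the variable names `texts` and `labels` and the choice of classifier are assumptions for this sketch, not anything from your codebase):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# texts: raw article strings; labels: 1 = about the federal Supreme Court, 0 = not
vectorizer = CountVectorizer(stop_words="english")  # step 1: bag-of-words features
X = vectorizer.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # step 2: any binary classifier
print(clf.score(X_test, y_test))  # compare this against the keyword-parser benchmark

Swapping in a different vectorizer or classifier only changes a line or two, which makes this setup easy to experiment with.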

Wikipedia has a nice example of a simple way to use this two-step approach for spam filtering, an analogous problem (see the Example Usage section of that article).

Good luck, sounds like a fun project!

Upvotes: 2

Stefan G.

Reputation: 167

There are many, many ways to do this, and the best method changes from project to project. Perhaps the easiest is to run a keyword search over your articles and then empirically choose a cutoff score. Although simple, this actually works pretty well, especially for a topic like this one, where you can think of a small list of words that are highly likely to appear somewhere in a relevant article.
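A minimal sketch of that keyword-plus-cutoff idea (the keyword list, the cutoff of 3, and the variable `article_text` are all placeholders you would choose empirically):

KEYWORDS = {"supreme court", "scotus", "justice", "certiorari", "oral argument"}  # illustrative list only

def keyword_score(text):
    text = text.lower()
    return sum(text.count(k) for k in KEYWORDS)

is_relevant = keyword_score(article_text) >= 3  # pick the cutoff by inspecting scores of labeled articles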

When a topic is broader, like "business" or "sports", keyword search can become unwieldy and miss too much. That is when a machine learning approach starts to become the better idea. If machine learning is the way you want to go, then there are two steps:

  1. Embed your articles into feature vectors
  2. Train your model

Step 1 can be something simple like a TF-IDF vector. However, embedding your documents can also be a deep learning task in its own right; this is where CBOW and Skip-Gram come into play. A popular way to do this is Doc2Vec (PV-DM), and a fine implementation is in the Python Gensim library. Modern and more complicated character, word, and document embeddings are much more of a challenge to start with, but are very rewarding - examples are ELMo embeddings or BERT.
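If you try the Doc2Vec route, a minimal Gensim sketch might look like the following (here `docs` is assumed to be a list of token lists, one per article, and the hyperparameters are placeholders to tune):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(docs)]
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=20, dm=1)  # dm=1 selects PV-DM
article_vector = model.infer_vector(docs[0])  # one dense feature vector per article, ready for step 2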

Step 2 can be a fairly standard model, as the task is now just binary classification. You can try a multilayer neural network, either fully-connected or convolutional, or you can try simpler things like logistic regression or Naive Bayes.

My personal suggestion would be to stick with TFIDF vectors and Naive Bayes. From experience, I can say that this works very well, is by far the easiest to implement, and can even outperform approaches like CBOW or Doc2Vec depending on your data.
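A minimal scikit-learn sketch of that suggestion (the variable names `train_texts` and `train_labels` are assumptions for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
clf.fit(train_texts, train_labels)  # raw article strings and 0/1 relevance labels
print(clf.predict(["The Supreme Court heard oral argument in the case today."]))  # 1 if judged relevant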

Upvotes: 1

Evan Mata

Reputation: 612

If you have sufficient labeled data - not only for "yes, this article is relevant" but also for "no, this article is not relevant" (you're basically building a binary relevant/not-relevant model, so I would research spam filters) - then you can train a fair model. I don't know whether you actually have a decent quantity of "no" data. If you do, you could train a relatively simple supervised model by doing the following (pseudocode):

Corpus = preprocess(Corpus)  # remove stop words, etc.
Vectors = BOW(Corpus)  # or TFIDF, or whatever model you want to use
SomeModel.train(Vectors[~3/4 of them], Labels[corresponding 3/4])  # Labels = 1 if relevant, 0 if not
SomeModel.evaluate(Vectors[remainder], Labels[remainder])  # make sure the model doesn't overfit
SomeModel.predict(BOW(preprocess(new_document)))  # new documents go through the same preprocessing and vectorization

The exact model will depend on your data. A simple Naive Bayes could (and probably will) work fine if you can get a decent number of "no" documents. One note - you imply that you have two kinds of "no" documents: those that are reasonably close (the Indian Supreme Court) and those that are completely irrelevant (say, taxes). You should test training on only the "close" erroneous cases (with the "far" ones filtered out, as you do now) versus training on both "close" and "far" erroneous cases, and see which comes out better.
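A rough sketch of that comparison, assuming each article is a dict with hypothetical keys "text", "label" (1/0), and, for irrelevant ones, a "kind" of "close" or "far" - none of these names come from the question:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

train, test = train_test_split(articles, test_size=0.25, random_state=0)  # articles: list of dicts

vec = CountVectorizer(stop_words="english").fit(a["text"] for a in train)
X_test, y_test = vec.transform([a["text"] for a in test]), [a["label"] for a in test]

for name, subset in [("close negatives only", [a for a in train if a["label"] == 1 or a.get("kind") == "close"]),
                     ("all negatives", train)]:
    X, y = vec.transform([a["text"] for a in subset]), [a["label"] for a in subset]
    print(name, MultinomialNB().fit(X, y).score(X_test, y_test))

Both variants are scored on the same held-out set, so whichever prints the higher score is the training mix to prefer (you could substitute precision here, per the other answer).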

Upvotes: 1
