Ailef

Reputation: 7906

Learning to tag sentences with keywords based on examples

I have a set (~50k elements) of small text fragments (usually one or two sentences) each one tagged with a set of keywords chosen from a list of ~5k words.

How would I go about implementing a system that, learning from these examples, can then tag new sentences with the same set of keywords? I don't need code, I'm just looking for some pointers and methods/papers/possible ideas on how to implement this.

Upvotes: 2

Views: 1442

Answers (2)

giliev

Reputation: 3058

If I understood you well, what you need is a measure of similarity for a pair of documents. I have recently been using TF-IDF for clustering documents and it worked quite well. I think you can compute TF-IDF values here and then calculate the cosine similarity between the TF-IDF vectors of the documents.

  1. TF-IDF computation

TF-IDF stands for Term Frequency - Inverse Document Frequency. Here is a definition of how it can be calculated:

Compute TF-IDF values for all words in all documents                                    
    - TF-IDF score of a word W in document D is

             TF-IDF(W, D) = TF(W, D) * IDF(W) 

      where TF(W, D) is frequency of word W in document D
            IDF(W) = log(N/(2 + #W))
            N  - number of documents
            #W - number of documents that contain word W
   
    - words contained in the title count twice (they are considered more important)
    - normalize TF-IDF values: sum of all TF-IDF(W, D)^2 in a document should be 1.

Depending on the technology you use, this can be achieved in different ways. I implemented it in Python using a nested dictionary: the document name D is the outer key, and for each document D there is a nested dictionary with word W as the key and the calculated TF-IDF score as the value.
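For illustration, here is a minimal sketch of that nested-dictionary approach in Python, following the IDF definition given above (the toy documents corpus is hypothetical):

    import math
    from collections import Counter

    # hypothetical toy corpus: document name D -> list of tokenized words
    documents = {
        "doc1": ["cat", "sits", "on", "mat"],
        "doc2": ["dog", "chases", "cat"],
        "doc3": ["bird", "sings"],
    }

    def compute_tfidf(documents):
        n = len(documents)                      # N: number of documents
        doc_freq = Counter()                    # #W: documents containing word W
        for words in documents.values():
            doc_freq.update(set(words))

        tfidf = {}                              # D -> {W -> normalized TF-IDF}
        for name, words in documents.items():
            tf = Counter(words)                 # TF(W, D): frequency of W in D
            scores = {w: tf[w] * math.log(n / (2 + doc_freq[w])) for w in tf}
            # normalize so the sum of squared TF-IDF values in a document is 1
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            tfidf[name] = {w: v / norm for w, v in scores.items()}
        return tfidf

    tfidf = compute_tfidf(documents)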

  2. Similarity computation

Let's say you have already calculated the TF-IDF values and you want to compare two documents D1 and D2 to see how similar they are. For that we need a similarity metric. There are many choices, each with pros and cons. In this case, IMO, Jaccard similarity and cosine similarity would both work well. Either function takes the TF-IDF values and the names of the two documents D1 and D2 as its arguments and returns a numeric value indicating how similar the two documents are.
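As a rough sketch, cosine similarity over two such TF-IDF dictionaries could look like this (it assumes the nested-dictionary tfidf structure from the sketch above):

    import math

    def cosine_similarity(vec_a, vec_b):
        # vec_a, vec_b: dictionaries mapping word W -> TF-IDF score
        common = set(vec_a) & set(vec_b)
        dot = sum(vec_a[w] * vec_b[w] for w in common)
        norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
        norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    # similarity between two documents D1 and D2
    score = cosine_similarity(tfidf["doc1"], tfidf["doc2"])

Since the TF-IDF vectors were already normalized to unit length in step 1, the dot product alone would give the same result; dividing by the norms just keeps the function usable for unnormalized vectors as well.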

After computing the similarity between two documents you obtain a numeric value. The greater the value, the more similar the two documents D1 and D2 are. Now, depending on what you want to achieve, there are two scenarios.

  • If you want to assign a new document only the tags of the single most similar document, compare it with all other documents and copy the tags of the most similar one.
  • Alternatively, you can set a threshold and assign all tags of the documents whose similarity to the document in question is greater than that threshold. If you set threshold = 0.7, then a new document D gets the tags of every already-tagged document V for which similarity(D, V) > 0.7 (see the sketch below).
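A rough sketch of the threshold variant, assuming a hypothetical tagged_docs list of (TF-IDF vector, tag set) pairs and the cosine_similarity function from above:

    def tag_new_document(new_vec, tagged_docs, threshold=0.7):
        # tagged_docs: list of (tfidf_vector, set_of_tags) pairs
        tags = set()
        for vec, doc_tags in tagged_docs:
            if cosine_similarity(new_vec, vec) > threshold:
                tags |= doc_tags        # take over all tags of similar documents
        return tags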

I hope it helps.

Best of luck :)

Upvotes: 3

Sir Cornflakes

Reputation: 665

Given your description, you are looking for some form of supervised learning. There are many methods in that class, e.g. Naive Bayes classifiers, Support Vector Machines (SVM), k Nearest Neighbours (kNN) and many others.

For a numerical representation of your text, you can choose a bag-of-words or a frequency list (essentially, each text is represented by a vector in a high-dimensional vector space spanned by all the words).
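As an illustration only (the answer does not prescribe a library; scikit-learn and the texts/tags lists below are assumptions), a bag-of-words representation combined with a one-vs-rest Naive Bayes classifier could be sketched like this:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.preprocessing import MultiLabelBinarizer

    # hypothetical training data: text fragments and their keyword sets
    texts = ["the cat sits on the mat", "the dog chases the cat"]
    tags = [{"cats", "furniture"}, {"dogs"}]

    # bag-of-words: each text becomes a vector in a space spanned by all words
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)

    # one binary Naive Bayes classifier per keyword (multi-label setup)
    binarizer = MultiLabelBinarizer()
    Y = binarizer.fit_transform(tags)
    classifier = OneVsRestClassifier(MultinomialNB())
    classifier.fit(X, Y)

    # tag a new sentence with the learned keywords
    new = vectorizer.transform(["a cat on a mat"])
    predicted = binarizer.inverse_transform(classifier.predict(new))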

BTW, it is far easier to tag the texts with one keyword (a classification task) than to assign them up to five (the number of possible label combinations explodes combinatorially).

Upvotes: 1
