Learning to tag sentences with keywords based on examples

Question

I have a set (~50k elements) of small text fragments (usually one or two sentences) each one tagged with a set of keywords chosen from a list of ~5k words.

How would I go to implement a system that, learning from this examples can then tag new sentences with the same set of keywords? I don't need code, I'm just looking for some pointers and methods/papers/possible ideas on how to implement this.

giliev · Accepted Answer

If I understood you well, what you need is a measure of similarity for a pair of documents. I have been recently using TF-IDF for clustering of documents and it worked quiet well. I think here you can use TF-IDF values and calculate a cosine similarity for the corresponding TF-IDF values for each of the documents.

TF-IDF computation

TF-IDF stands for Term Frequency - Inverse Document Frequency. Here is a definition how it can be calculated:

Compute TF-IDF values for all words in all documents                                    
    - TF-IDF score of a word W in document D is

             TF-IDF(W, D) = TF(W, D) * IDF(W) 

      where TF(W, D) is frequency of word W in document D
            IDF(W) = log(N/(2 + #W))
            N  - number of documents
            #W - number of documents that contain word W
   
    - words contained in the title will count twice (means more important)
    - normalize TF-IDF values: sum of all TF-IDF(W, D)^2 in a document should be 1.

Depending of the technology You use, this may be achieved in different ways. I have had implemented it in Python using a nested dictionary. Firstly I use document name D as key and then for each document D I have a nested dictionary with word W as key and each word W has a corresponding numeric value which is the calculated TF-IDF.

Similarity computation

Let say You have calculated the TF-IDF values already and You want to compare 2 documents W1 and W2 how similar they are. For that we need to use some similarity metric. There are many choices, each one having pros and cons. In this case, IMO, Jaccard similarity and cosine similarity would work good. Both functions would have TF-IDF and names of the 2 documentsW1 and W2 as its arguments and it would return a numeric value which indicates how similar the 2 documents are.

After computing the similarity between 2 documents you will obtain a numeric value. The greater the value, the more similar 2 documents W1 and W2 are. Now, depending on what You want to achieve, we have 2 scenarios.

If You want for 1 document to assign only the tags of the most similar document, then You compare it with all other documents and assign to the new document the tags of the most similar one.
You can set some threshold and You can assign all tags of documents which have similarity with the document in question greater than the threshold value. If You set threshold = 0.7, than all document W will have the tags of all already tagged documents V for which similarity(W, V) > 0.7.

I hope it helps.

Best of luck :)

Learning to tag sentences with keywords based on examples

Answers (2)

Related Questions