Reputation: 7906
I have a set (~50k elements) of small text fragments (usually one or two sentences) each one tagged with a set of keywords chosen from a list of ~5k words.
How would I go to implement a system that, learning from this examples can then tag new sentences with the same set of keywords? I don't need code, I'm just looking for some pointers and methods/papers/possible ideas on how to implement this.
Upvotes: 2
Views: 1442
Reputation: 3058
If I understood you well, what you need is a measure of similarity for a pair of documents. I have been recently using TF-IDF for clustering of documents and it worked quiet well. I think here you can use TF-IDF values and calculate a cosine similarity for the corresponding TF-IDF values for each of the documents.
TF-IDF
stands for Term Frequency - Inverse Document Frequency
. Here is a definition how it can be calculated:
Compute TF-IDF values for all words in all documents
- TF-IDF score of a word W in document D is
TF-IDF(W, D) = TF(W, D) * IDF(W)
where TF(W, D) is frequency of word W in document D
IDF(W) = log(N/(2 + #W))
N - number of documents
#W - number of documents that contain word W
- words contained in the title will count twice (means more important)
- normalize TF-IDF values: sum of all TF-IDF(W, D)^2 in a document should be 1.
Depending of the technology You use, this may be achieved in different ways. I have had implemented it in Python using a nested dictionary. Firstly I use document name D
as key and then for each document D
I have a nested dictionary with word W
as key and each word W has a corresponding numeric value which is the calculated TF-IDF
.
Let say You have calculated the TF-IDF
values already and You want to compare 2 documents W1
and W2
how similar they are. For that we need to use some similarity metric. There are many choices, each one having pros and cons. In this case, IMO, Jaccard similarity and cosine similarity would work good. Both functions would have TF-IDF
and names of the 2 documentsW1
and W2
as its arguments and it would return a numeric value which indicates how similar the 2 documents are.
After computing the similarity between 2 documents you will obtain a numeric value. The greater the value, the more similar 2 documents W1
and W2
are. Now, depending on what You want to achieve, we have 2 scenarios.
threshold = 0.7
, than all document W will have the tags of all already tagged documents V
for which similarity(W, V) > 0.7
.I hope it helps.
Best of luck :)
Upvotes: 3
Reputation: 665
Given your description, you are looking for some form of supervised learning. There are many methods in that class, e.g. Naive Bayes classifiers, Support Vector Machines (SVM), k Nearest Neighbours (kNN) and many others.
For a numerical representation of your text, you can chose a bag-of-words or a frequency list (essentially, each text is represented by a vector on a high dimensional vectorspace spanned by all the words).
BTW, it is far easier to tag the texts with one keyword (a classification task) than to assign them up to five ones (the number of possible classes explodes combinatorically)
Upvotes: 1