Reputation: 6259
I am working on a small research project. I am looking to write a program that:
a) takes a large number of short texts (~100 words each, several thousand texts)
b) identifies keywords in the texts
c) presents all of them to a group of users who indicate whether they find them interesting or not
d) has the software learn which keywords or combinations of keywords are likely to be preferred.
Let's assume that the target group is uniform for this example.
Now, there are two main challenges. The first one I have an answer to, the second one I am looking for help with.
1) Keyword identification. Reverse frequency analysis seems to be the way to go here: identify those words that occur proportionally often in a given text compared to all others. This has some drawbacks, though; for example, very common keywords may be overlooked.
2) How to prepare the data set in numeric form. I could map keywords to input neurons and then adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
How would this problem commonly be addressed?
Upvotes: 2
Views: 256
Reputation: 16114
Here is a way to start:
1) Keyword identification. Reverse frequency analysis seems to be the way to go here: identify those words that occur proportionally often in a given text compared to all others. This has some drawbacks, though; for example, very common keywords may be overlooked.
You can skip this part in the first model you build. Treat each text as a bag of words (n-grams) to simplify the first working model; if you want, you can add this as a feature weight later.
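For illustration, a minimal sketch of the bag-of-n-grams idea in plain Python (the whitespace tokenizer and the example sentence are just assumptions for the demo; a real pipeline would also handle punctuation, casing and stop words):
from collections import Counter

def bag_of_ngrams(text, n_max=2):
    # Naive whitespace tokenization; good enough to show the idea.
    tokens = text.lower().split()
    ngrams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return Counter(ngrams)

print(bag_of_ngrams("the movie is in stock"))
# -> a Counter with the unigram and bigram counts of the sentence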
2) How to prepare the data set in numeric form. I could map keywords to input neurons and then adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
You can just use a dictionary mapping n-grams to integer ids. For each training example the features are sparse, so your training examples look like this:
34, 68, 79293, 23232 -> 0 (negative label)
340, 608, 3, 232 -> 1 (positive label)
Imagine you have a dictionary (or vocabulary) mapping:
3: foo
34: movie
68: in-stock
232: bar
340: barz
...
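A minimal sketch of that id mapping, building on the bag-of-n-grams sketch above (the ids assigned here are just sequential, so they will not match the exact numbers in the example; it only shows the mechanism):
vocab = {}  # n-gram -> integer id

def encode(ngrams):
    # Assign a fresh id to each unseen n-gram and return the sparse id list.
    ids = set()
    for g in ngrams:
        if g not in vocab:
            vocab[g] = len(vocab)
        ids.add(vocab[g])
    return sorted(ids)

# e.g. encode(bag_of_ngrams("the movie is in stock")) yields a short list of ids;
# pair each id list with its 0/1 "interesting" label to get a training example.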
To use neural networks, you will need an embedding layer to turn the sparse features into dense features by aggregating (for instance, averaging) the embedding vectors of all features. Using the same example as above, suppose we use a 4-dimensional embedding:
34 -> [0.1, 0.2, -0.3, 0]
68 -> [0, 0.1, -0.1, 0.2]
79293 -> [0.3, 0.0, 0.12, 0]
23232 -> [0.4, 0.0, 0.0, 0]
------------------------------- sum
sum -> [0.8, 0.3, -0.28, 0.2]
------------------------------- L1-normalize
l1 -> [0.8, 0.3, -0.28, 0.2] ./ (0.8 + 0.3 + 0.28 + 0.2)
-> [0.51,0.19,-0.18,0.13]
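The aggregation above is just an embedding lookup, a sum, and an L1 normalization. A minimal numpy sketch reproducing those numbers (in a real network the embedding table is a learned parameter, not hand-written):
import numpy as np

# Toy embedding table for the four feature ids in the example above.
embeddings = {
    34:    np.array([0.1, 0.2, -0.3, 0.0]),
    68:    np.array([0.0, 0.1, -0.1, 0.2]),
    79293: np.array([0.3, 0.0, 0.12, 0.0]),
    23232: np.array([0.4, 0.0, 0.0, 0.0]),
}

feature_ids = [34, 68, 79293, 23232]
summed = sum(embeddings[i] for i in feature_ids)   # [0.8, 0.3, -0.28, 0.2]
dense = summed / np.abs(summed).sum()              # L1-normalize
print(dense.round(2))                              # [ 0.51  0.19 -0.18  0.13]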
At prediction time, you will need to use the same dictionary and the same feature-extraction steps (cleanup / n-gram generation / mapping n-grams to ids) so that your model understands the input.
Upvotes: 1
Reputation: 3550
You can simply use sklearn to learn a TF-IDF bag-of-words model of your texts, which returns a sparse matrix of shape n_samples x n_features, like this:
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer works directly on raw texts (TfidfTransformer would expect
# a precomputed count matrix instead).
vectorizer = TfidfVectorizer(smooth_idf=False)
X_train = vectorizer.fit_transform(list_of_texts)
print(X_train.shape)
X_train is a scipy CSR sparse matrix. If your NN implementation doesn't support sparse matrices, you can convert it to a dense numpy array, but that might fill your RAM; it's better to use an implementation that supports sparse input (e.g. I know Lasagne/Theano does that).
After training, you can inspect the parameters of the NN to find out which features have a high/low weight and are therefore more/less important for the particular label.
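As a minimal sketch of that weight-inspection idea, a plain linear model stands in for the network's output layer here; `labels` is assumed to be the list of 0/1 user judgements aligned with `list_of_texts`, and `get_feature_names_out` needs scikit-learn >= 1.0:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a simple linear classifier on the sparse TF-IDF features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, labels)

# Rank features by their learned weight for the positive label.
feature_names = vectorizer.get_feature_names_out()
top = np.argsort(clf.coef_[0])[-10:]
print([feature_names[i] for i in top])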
Upvotes: 1