Reputation: 6259
I am working on a small research project. I am looking to write a program that:
a) takes a large number of short texts (~100 words each, several thousand texts)
b) identifies keywords in the texts
c) presents all of them to a group of users who indicate whether they find them interesting or not
d) has the software learn which keywords or combinations of keywords are likely to be preferred.
Let's assume that the target group is uniform for this example.
Now, there are two main challenges. The first one I have an answer to, the second one I am looking for help with.
1) Keyword identification. Reverse frequency analysis seems to be the way to go here: identify those words that occur proportionally often in a given text compared to all others. This has some drawbacks, though; for example, very common keywords may be overlooked.
2) How to prepare the data set in numeric form. I could map keywords to input neurons and then adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
How would this problem commonly be addressed?
Upvotes: 2
Views: 256
Reputation: 16114
Here is a way to start:
1) Keyword identification. Reverse frequency analysis seems to be the way to go here: identify those words that occur proportionally often in a given text compared to all others. This has some drawbacks, though; for example, very common keywords may be overlooked.
You can skip this part in the first model you build. Treat each text as a bag of words (n-grams) to simplify the first working model; if you want, you can add this as a feature weight later.
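For illustration, a minimal sketch of the bag-of-n-grams idea in plain Python (the whitespace tokenizer and the example sentence are just assumptions for the demo; a real pipeline would also handle punctuation, casing and stop words):
from collections import Counter

def bag_of_ngrams(text, n_max=2):
    # Naive whitespace tokenization; good enough to show the idea.
    tokens = text.lower().split()
    ngrams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return Counter(ngrams)

print(bag_of_ngrams("the movie is in stock"))
# -> a Counter with the unigram and bigram counts of the sentence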
2) How to prepare the data set in numeric form. I could map keywords to input neurons and then adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
You can just use a dictionary mapping n-grams to integer ids. For each training example the features are sparse, so your training examples look like this:
34, 68, 79293, 23232 -> 0 (negative label)
340, 608, 3, 232 -> 1 (positive label)
Imagine you have a dictionary (or vocabulary) mapping:
3: foo
34: movie
68: in-stock
232: bar
340: barz
...
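A minimal sketch of that id mapping, building on the bag-of-n-grams sketch above (the ids assigned here are just sequential, so they will not match the exact numbers in the example; it only shows the mechanism):
vocab = {}  # n-gram -> integer id

def encode(ngrams):
    # Assign a fresh id to each unseen n-gram and return the sparse id list.
    ids = set()
    for g in ngrams:
        if g not in vocab:
            vocab[g] = len(vocab)
        ids.add(vocab[g])
    return sorted(ids)

# e.g. encode(bag_of_ngrams("the movie is in stock")) yields a short list of ids;
# pair each id list with its 0/1 "interesting" label to get a training example.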
To use neural networks, you will need an embedding layer to turn the sparse features into dense features by aggregating (for instance, averaging) the embedding vectors of all features. Using the same example as above, suppose we use a 4-dimensional embedding:
34 -> [0.1, 0.2, -0.3, 0]
68 -> [0, 0.1, -0.1, 0.2]
79293 -> [0.3, 0.0, 0.12, 0]
23232 -> [0.4, 0.0, 0.0, 0]
------------------------------- sum
sum -> [0.8, 0.3, -0.28, 0.2]
------------------------------- L1-normalize
l1 -> [0.8, 0.3, -0.28, 0.2] ./ (0.8 + 0.3 + 0.28 + 0.2)
-> [0.51,0.19,-0.18,0.13]
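The aggregation above is just an embedding lookup, a sum, and an L1 normalization. A minimal numpy sketch reproducing those numbers (in a real network the embedding table is a learned parameter, not hand-written):
import numpy as np

# Toy embedding table for the four feature ids in the example above.
embeddings = {
    34:    np.array([0.1, 0.2, -0.3, 0.0]),
    68:    np.array([0.0, 0.1, -0.1, 0.2]),
    79293: np.array([0.3, 0.0, 0.12, 0.0]),
    23232: np.array([0.4, 0.0, 0.0, 0.0]),
}

feature_ids = [34, 68, 79293, 23232]
summed = sum(embeddings[i] for i in feature_ids)   # [0.8, 0.3, -0.28, 0.2]
dense = summed / np.abs(summed).sum()              # L1-normalize
print(dense.round(2))                              # [ 0.51  0.19 -0.18  0.13]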
At prediction time, you will need to use the same dictionary and the same feature-extraction steps (cleanup / n-gram generation / mapping n-grams to ids) so that your model understands the input.
Upvotes: 1
Reputation: 3550
You can simply use sklearn to learn a TF-IDF bag-of-words model of your texts, which returns a sparse matrix of shape n_samples x n_features, like this:
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer works directly on raw texts (TfidfTransformer would expect
# a precomputed count matrix instead).
vectorizer = TfidfVectorizer(smooth_idf=False)
X_train = vectorizer.fit_transform(list_of_texts)
print(X_train.shape)
X_train is a scipy CSR sparse matrix. If your NN implementation doesn't support sparse matrices, you can convert it to a dense numpy array, but that might fill your RAM; it's better to use an implementation that supports sparse input (e.g. I know Lasagne/Theano does that).
After training, you can inspect the parameters of the NN to find out which features have a high/low weight and are therefore more/less important for the particular label.
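As a minimal sketch of that weight-inspection idea, a plain linear model stands in for the network's output layer here; `labels` is assumed to be the list of 0/1 user judgements aligned with `list_of_texts`, and `get_feature_names_out` needs scikit-learn >= 1.0:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a simple linear classifier on the sparse TF-IDF features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, labels)

# Rank features by their learned weight for the positive label.
feature_names = vectorizer.get_feature_names_out()
top = np.argsort(clf.coef_[0])[-10:]
print([feature_names[i] for i in top])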
Upvotes: 1