Michael

Reputation: 1839

Measuring the wealth of information in a text using NLP

Is there any metric that measures the wealth of information in a text?

I am thinking of anything that can reliably identify unique information segments within a text. Simple metrics based on frequency distributions or unique word counts are okay, but they don't quite capture the unique information in sentences.

With coding methods I would have to manually code each sentence, word, or anything else that counts as a unique piece of information in a text, and that could take a while. So I wonder whether I could use NLP as an alternative.

UPDATE

As an example:

Navtilos, a small volcanic islet of the Santorini volcano which was created in the eruption of 1928.

If I were to use coding analysis, I could count 4 unique information points: what Navtilos is, where it is, how it was created, and when.

Obviously a human interprets text differently from a computer. I just wonder if there is a measure that can identify unique information within sentences/texts. It does not have to produce the same result as mine, but it should be reliable across different sentences.

A frequency distribution may work well enough, but I wonder if there are other metrics for this.

Upvotes: 0

Views: 638

Answers (1)

vpekar

Reputation: 3355

What you seem to be looking for is a keyword/term extractor (for a list of keyword extractors see, for example, this, "External Links"). An extractor pulls out phrases of one or more words that capture notions mentioned in the text, but without assigning them to classes (as named entity recognisers would do).

See, for example, this demo. From the sentence in your example, it extracts:

small volcanic islet
Navtilos
Santorini
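
To give a rough feel for what such an extractor does, here is a minimal sketch that cuts the sentence into candidate phrases at stopwords and punctuation. It is not the method used by the demo above; the tiny stopword list and the name `candidate_phrases` are assumptions made up for this illustration, and real extractors typically add part-of-speech filtering or statistical ranking on top.

```python
import re

# Tiny, hand-picked stopword list (an assumption for this example only;
# real extractors use much larger lists or statistical models).
STOPWORDS = {"a", "an", "the", "of", "in", "which", "was", "is", "and", "to"}

def candidate_phrases(text):
    """Split text into candidate key phrases at stopwords and punctuation."""
    phrases = []
    # Punctuation always ends a phrase, so handle each fragment separately.
    for fragment in re.split(r"[^\w\s]+", text):
        current = []
        for word in fragment.split():
            if word.lower() in STOPWORDS:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases

sentence = ("Navtilos, a small volcanic islet of the Santorini volcano "
            "which was created in the eruption of 1928.")
print(candidate_phrases(sentence))
# ['Navtilos', 'small volcanic islet', 'Santorini volcano',
#  'created', 'eruption', '1928']
```

The number of candidates per sentence could serve as a crude, repeatable proxy for "information points", though it will not match a human coder's count exactly.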

If you have lots of documents, you can then use the frequency distribution of each keyword across documents to measure how specific it is to each document (assuming that the uniqueness of a keyword to a document reflects how well it describes the document's contents). For this, you can use a measure like tf-idf, which scores a keyword highly when it is frequent in one document but rare in the rest of the collection.
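
As a sketch of the tf-idf idea, assuming keywords have already been extracted for each document: the function below uses tf = count / document length and idf = log(N / df), which is only one of several common weighting variants. The function name `tf_idf` and the three keyword lists are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {keyword: score} dict per document.

    docs is a list of keyword lists, one list per document.
    Uses tf = count / len(doc) and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each keyword appears.
    df = Counter(kw for doc in docs for kw in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            kw: (c / len(doc)) * math.log(n_docs / df[kw])
            for kw, c in counts.items()
        })
    return scores

# Made-up keyword lists standing in for three documents.
docs = [
    ["navtilos", "santorini", "volcanic islet", "santorini"],
    ["santorini", "caldera", "eruption"],
    ["eruption", "lava", "santorini"],
]
for i, doc_scores in enumerate(tf_idf(docs)):
    # Show the most document-specific keyword for each document.
    best = max(doc_scores, key=doc_scores.get)
    print(i, best, round(doc_scores[best], 3))
```

Here "santorini" occurs in every document, so its idf is log(1) = 0 and it scores zero everywhere, while keywords confined to a single document score highest.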

Upvotes: 4
