Chris Wilson

Reputation: 6719

Paring an index down to "interesting" words for future search terms

I have a list of about 18,000 unique words scraped from a database of government transcripts that I would like to make searchable in a web app. The catch: This web app must be client-side. (AJAX is permissible.)

All the original transcripts are in neat text files on my server, so the index file of words will list which files contain each word and how many times, like so:

ADMINSTRATION   {"16": 4, "11": 5, "29": 4, "14": 2}
ADMIRAL {"34": 12, "12": 2, "15": 9, "16": 71, "17": 104, "18": 37, "19": 23}
AMBASSADOR  {"2": 15, "3": 10, "5": 37, "8": 5, "41": 10, "10": 2, "16": 6, "17": 6, "50": 4, "20": 5, "22": 17, "40": 10, "25": 14}

In its final form, I have this reduced to a trie structure to save space and speed up retrieval, but even so, the 18K words come to about 5MB of data with the locations, even with stop words removed. And no one is reasonably going to search for out-of-context adjectives and subordinating conjunctions.
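For reference, the trie I'm using is roughly like this (a minimal sketch; the `"$"` terminal key and function names are just illustrative, not my actual format):

```python
# Minimal trie sketch: each word's postings dict ({file_id: count})
# hangs off its terminal node under the "$" key (key name illustrative).
def build_trie(index):
    root = {}
    for word, postings in index.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = postings
    return root

def lookup(trie, word):
    # Walk the trie character by character; return the postings
    # dict if the word is present, else None.
    node = trie
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

index = {
    "ADMIRAL": {"34": 12, "12": 2},
    "AMBASSADOR": {"2": 15, "3": 10},
}
trie = build_trie(index)
print(lookup(trie, "ADMIRAL"))  # {'34': 12, '12': 2}
```

Shared prefixes are stored once, which is where the space saving comes from.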

I realize this is something of a language question as much as a coding question, but I'm wondering if there is a common solution in NLP for reducing a text to words that are meaningful out of context.

I tried running each word through the Python NLTK POS tagger, but there's a high error rate when the words stand by themselves, as one would expect.

Upvotes: 3

Views: 129

Answers (2)

Blacksad

Reputation: 15422

I wouldn't try to reduce the size of the dictionary (your 18K words), because it's very hard to guess which words are "meaningful" for your application/user.

Instead, I would try to reduce the number of words each document puts in the index. For instance, if 50% of the documents contain a given word W, it may be useless to index it (I can't be sure without seeing your documents and your domain, of course!).

If that's the case, you can calculate TF-IDFs in your documents, and choose a threshold below which you don't bother to feed the index. You could even choose the max size of your index (say 1MB) and find a threshold that fits this requirement.
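A minimal sketch of that pruning step, assuming an index shaped like the one in the question (`{word: {doc_id: count}}`; the threshold value is something you'd tune against your size budget):

```python
import math

def prune_index(index, n_docs, threshold):
    """Drop (word, doc) entries whose TF-IDF falls below `threshold`.

    `index` maps word -> {doc_id: term_count}. A word that appears in
    every document gets idf = log(n_docs / n_docs) = 0, so all of its
    entries score 0 and are pruned regardless of frequency.
    """
    pruned = {}
    for word, postings in index.items():
        idf = math.log(n_docs / len(postings))
        kept = {doc_id: tf for doc_id, tf in postings.items()
                if tf * idf >= threshold}
        if kept:
            pruned[word] = kept
    return pruned

index = {
    "THE": {str(i): 100 for i in range(50)},   # in all 50 docs
    "ADMIRAL": {"34": 12, "12": 2},            # in only 2 docs
}
pruned = prune_index(index, n_docs=50, threshold=1.0)
print(sorted(pruned))  # ['ADMIRAL'] -- "THE" is pruned entirely
```

To hit a fixed index size, you could sort all (word, doc) pairs by score once and keep the top entries until the budget is spent, instead of picking a threshold by hand.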

In any case, I would NEVER try to use POS-tagging. To paraphrase a famous quote about regexes:

You have a simple indexing problem. You try to use POS-tagging to solve it. Now you have two problems.

Upvotes: 1

dkar

Reputation: 2123

NLP is my area, and I'm afraid there is only one way to do this reliably: first POS-tag each sentence in your transcripts, then extract your statistics for (word, POS-tag) tuples. That way you can distinguish instances of, e.g., 'returned' used as an adjective from cases where the word is used as a verb. Finally, decide what to keep and what to discard (for example, keep only nouns and verbs and throw everything else away).
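The pipeline above can be sketched like this (a rough outline, not a drop-in solution; the aggregation is written to take any tagger with the `nltk.pos_tag` interface — tokens in, `(token, tag)` tuples out — so in practice you would pass `nltk.pos_tag` and the Penn Treebank tags shown):

```python
from collections import Counter, defaultdict

# Penn Treebank tags for nouns and verbs -- the "keep" set suggested above.
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS",
             "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def index_file(file_id, sentences, tagger, index):
    """Tag each tokenized sentence in context, then count occurrences
    of (WORD, tag) pairs per file, keeping only nouns and verbs.

    index maps (word, tag) -> Counter({file_id: count}).
    """
    for sentence in sentences:
        for word, tag in tagger(sentence):
            if tag in KEEP_TAGS:
                index[(word.upper(), tag)][file_id] += 1

# Usage with NLTK would look like:
#   import nltk
#   index = defaultdict(Counter)
#   index_file("16", [nltk.word_tokenize(s) for s in sents], nltk.pos_tag, index)
```

The key point is that tagging happens on whole sentences, where the tagger has context, and only the aggregated (word, tag) statistics go into the search index.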

Upvotes: 0
