Steven

Reputation: 1

Method/Tool for Extracting Keywords from List of Sentences

I have a large list of sentences and would like to tag each of them with their own unique keywords, to help me identify which sentences are similar for grouping purposes.

As an example:

The dog ran fast. - tagged as: dog
The cat is sleeping - tagged as: cat
The German Shepherd is awake. - tagged as: dog

I've been looking into tools like AlchemyAPI and OpenCalais for the keyword extraction. However, it seems these are meant more for extracting meaning from a block of data, like an entire document or paragraph, rather than for tagging thousands of unique but similar individual sentences.

In short, ideally I'd like to:

  1. Take a sentence from a document or webpage (perhaps from a large spreadsheet or a list of tweets)
  2. Place a unique identifier on it (some type of keyword)
  3. Group the sentences together by keyword
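
To make step 3 concrete, here is a minimal sketch of grouping sentences into a keyword-to-sentences map, with the keywords from my example above hard-coded for illustration (step 2, however it's implemented, would produce them):

```python
from collections import defaultdict

# Keywords assumed already assigned by some extraction step (step 2).
tagged_sentences = {
    'The dog ran fast.': 'dog',
    'The cat is sleeping': 'cat',
    'The German Shepherd is awake.': 'dog',
}

# Step 3: group sentences that share a keyword.
groups = defaultdict(list)
for sentence, keyword in tagged_sentences.items():
    groups[keyword].append(sentence)

print(dict(groups))
# {'dog': ['The dog ran fast.', 'The German Shepherd is awake.'],
#  'cat': ['The cat is sleeping']}
```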

Upvotes: 0

Views: 1448

Answers (1)

Aamir Mushtaq

Reputation: 306

I think what you mean by attaching an identifier is similar to NLTK's POS tagging (part-of-speech tagging) in conjunction with stemming. The NLTK book might help you out, and the download instructions are here.
The language of choice, IMO, should be Python. I have a few examples that you might want to look into:

Stemming Words

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'

Creating a Part-of-Speech Tagged Word Corpus

>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('.', r'.*\.pos')
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader.tagged_sents()
[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]
>>> reader.tagged_paras()
[[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]]

>>> from nltk.tokenize import SpaceTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', word_tokenizer=SpaceTokenizer())
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]

>>> from nltk.tokenize import LineTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', sent_tokenizer=LineTokenizer())
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]

>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=lambda t: t.lower())
>>> reader.tagged_words(simplify_tags=True)
[('The', 'at-tl'), ('expense', 'nn'), ('and', 'cc'), ...]

>>> from nltk.tag import simplify
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_brown_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'DET'), ('expense', 'N'), ('and', 'CNJ'), ...]
>>> reader = TaggedCorpusReader('.', r'.*\.pos', tag_mapping_function=simplify.simplify_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'A'), ('expense', 'N'), ('and', 'C'), ...]

The above two code samples are taken from the NLTK book's examples; I have posted them so you can judge for yourself whether they are of use.
Think along the lines of both features combined. Do they serve your purpose?
Also, you may want to look into stopwords for getting just "dog" out of the first sentence you gave.
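
Putting stopword filtering and stemming together, a minimal sketch of what keyword extraction might look like (the inline stopword list here is just a stand-in for `nltk.corpus.stopwords.words('english')`, which requires `nltk.download('stopwords')` first):

```python
import string
from nltk.stem import PorterStemmer

# Tiny inline stopword list for illustration; in practice use
# nltk.corpus.stopwords.words('english') after nltk.download('stopwords').
STOPWORDS = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'and', 'or'}

stemmer = PorterStemmer()

def extract_keywords(sentence):
    # Lowercase, strip punctuation, drop stopwords, then stem what remains
    # so that e.g. 'sleeping' and 'sleeps' collapse to the same keyword.
    cleaned = sentence.lower().translate(str.maketrans('', '', string.punctuation))
    return [stemmer.stem(w) for w in cleaned.split() if w not in STOPWORDS]

print(extract_keywords('The cat is sleeping'))  # ['cat', 'sleep']
```

Filtering further on POS tags (keeping only nouns, for instance) would narrow "The dog ran fast." down to just `dog`.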

Upvotes: 4
