KJW

Reputation: 15251

Natural Language Processing: Discover the category of a text?

What library exists that lets you determine whether a column full of text is a certain entity, based on a list?

For example, given many lists of text strings for training (each list may contain occasional outlier strings that are noise), I want to establish a category for each list.

Then, when a new text string is given, I want to know which category or entity it belongs to.

What do you call this in natural language processing?

Upvotes: 0

Views: 99

Answers (4)

Daneel R.

Reputation: 547

The question is not stated clearly, but I think what you are trying to do is text classification based on the features of each text. While the word 'entity' might suggest a tagging task such as named-entity recognition, the word 'category' suggests multiclass classification. I will go with the latter.

If I understand correctly, you have a training set that looks like this:

   label         text
0  'category_a'  'foo foo foo foo'
1  'category_a'  'foo foo bar'
2  'category_b'  'bar bar bar'

And you want to predict the label of each text on the basis of its underlying components (for example, the words it contains). This is a typical supervised learning problem. I suggest you have a look at the CountVectorizer and TfidfVectorizer classes in sklearn, documented here:

CountVectorizer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

TfidfVectorizer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

The classification task is usually performed by a pipeline consisting of one of those two vectorizers plus a suitable classifier, such as Multinomial Naive Bayes. If you add more details about your task, more accurate help can be given.
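As a rough illustration of such a pipeline, here is a minimal sketch, assuming scikit-learn is installed. It reuses the three toy rows from the table above; the variable names and the two test strings are made up for the example, and CountVectorizer is picked arbitrarily over TfidfVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data mirroring the table above (placeholder texts and labels).
texts = ["foo foo foo foo", "foo foo bar", "bar bar bar"]
labels = ["category_a", "category_a", "category_b"]

# Step 1 turns each text into word counts; step 2 fits Multinomial Naive Bayes
# on those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Predict the category of unseen strings (illustrative inputs).
predictions = model.predict(["foo foo", "bar bar"])
print(predictions)
```

With real data you would hold out part of the labelled set to check accuracy before trusting the predictions, and a few noisy strings per list should matter less as the lists grow.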

Upvotes: 1

alvas

Reputation: 122330

You can use nltk.chunk.ne_chunk():

>>> from nltk.tokenize import word_tokenize
>>> from nltk.chunk import ne_chunk
>>> from nltk.tag import pos_tag
>>> from nltk.tree import Tree

>>> txt = 'Michael Jackson is eating at McDonalds, call him at +99-20392842'

# Get the full tree with named-entity (NE) chunks.
>>> ne_chunk(pos_tag(word_tokenize(txt)))
Tree('S', [Tree('PERSON', [('Michael', 'NNP')]), Tree('PERSON', [('Jackson', 'NNP')]), ('is', 'VBZ'), ('eating', 'VBG'), ('at', 'IN'), Tree('ORGANIZATION', [('McDonalds', 'NNP')]), (',', ','), ('call', 'NN'), ('him', 'PRP'), ('at', 'IN'), ('+99-20392842', '-NONE-')])

# Get only the NEs.
>>> [i for i in ne_chunk(pos_tag(word_tokenize(txt))) if isinstance(i, Tree)]
[Tree('PERSON', [('Michael', 'NNP')]), Tree('PERSON', [('Jackson', 'NNP')]), Tree('ORGANIZATION', [('McDonalds', 'NNP')])]

# Get only PERSON NEs (NLTK 3 replaced the Tree.node attribute with Tree.label()).
>>> [i for i in ne_chunk(pos_tag(word_tokenize(txt))) if isinstance(i, Tree) and i.label() == 'PERSON']
[Tree('PERSON', [('Michael', 'NNP')]), Tree('PERSON', [('Jackson', 'NNP')])]

Upvotes: 1

norlesh

Reputation: 1861

What you want to do is a combination of two things: text tagging for items like phone numbers, emails, and addresses, where the type is identified by its format, and named-entity recognition for items like person and business names, which can only be determined with some kind of background knowledge.

Depending on which programming language you want to use, I would recommend starting with the NLTK library, which is very well documented and comes with a corresponding introductory book for beginners in the domain: Natural Language Processing with Python.
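The format-based half of that split can be sketched with plain regular expressions. This is only an illustration: the patterns, the PATTERNS table, and the tag_by_format helper are hypothetical and deliberately loose, and real data would need stricter rules.

```python
import re

# Hypothetical, deliberately loose patterns; tighten them for real data.
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}


def tag_by_format(text):
    """Return the first format label whose pattern matches, else None."""
    candidate = text.strip()
    for label, pattern in PATTERNS.items():
        if pattern.match(candidate):
            return label
    return None


print(tag_by_format("+99-20392842"))
print(tag_by_format("someone@example.com"))
print(tag_by_format("Michael Jackson"))
```

Strings that come back as None, such as names, are the ones to hand to a named-entity recognizer or a trained classifier.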

Upvotes: 1

Jokester

Reputation: 5617

Named-entity recognition may be close to what you want.

Upvotes: 1
