Classifying words inside a document

Question

The problem I'm facing is this: I want to read a document, get its raw string, and classify the information inside it. For example, I want to identify when a piece of the string is a "name", a "date", or some other useful information.

Is it possible to use machine learning to do that? How may I approach the problem?

The hardest part is that I'm not trying to classify the document itself, but the individual strings inside the document.

rabbit · Accepted Answer

It's all about how you frame the problem. I think yours can be formulated as an entity extraction/recognition problem, where you have a document and want to identify specific entities within the text (an entity might be a person, a date, etc.). Take a look at Conditional Random Fields (CRFs) and their applications to named entity recognition (NER for short), since there are already libraries and tools that implement them.

For example, check out StanfordNER.
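
To make the idea a bit more concrete, here is a minimal sketch of how the problem looks once it is framed as sequence labeling: each token gets a label (such as a name or a date) and is described by a dictionary of features, which is the kind of input a CRF toolkit typically trains on. The feature names, the `B-NAME`/`B-DATE` label scheme, and the example sentence are all assumptions made for illustration, not anything prescribed by Stanford NER.

```python
import re

# Hypothetical date pattern for illustration only (e.g. 2015-03-12 or 12/03/2015).
DATE_LIKE = re.compile(r"\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}")


def token_features(tokens, i):
    """Describe tokens[i] with simple surface features plus its neighbours."""
    word = tokens[i]
    features = {
        "lower": word.lower(),
        "is_title": word.istitle(),   # capitalised words are often names
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
        "looks_like_date": bool(DATE_LIKE.fullmatch(word)),
        "suffix3": word[-3:].lower(),
    }
    # Context features: the neighbouring words, or a sentence-boundary marker.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(tokens):
    """One feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Hand-labelled training example (made up): one label per token.
tokens = "Maria Silva signed the contract on 2015-03-12".split()
labels = ["B-NAME", "I-NAME", "O", "O", "O", "O", "B-DATE"]

X = sentence_features(tokens)
for token, label, feats in zip(tokens, labels, X):
    print(token, label, feats["is_title"], feats["looks_like_date"])
```

With many such labelled sentences, the lists of feature dicts and the matching label lists would be handed to a CRF implementation for training; the trained model then predicts a label for every token of a new document, which is the classification of strings inside the document that the question asks about.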
