user62198

Reputation: 1874

Text Tagging in Natural Language Processing

I have a project where I need to tag news items with the names of the companies they are relevant to. The company names are mentioned in the news items, in many cases in the headline.

I have about 2000 news items (in XML format) that were manually tagged with company names and the relevance level (High/Low) of each company to the story. For each news item, I have the following fields:

story_ID; Headline; story_Text; company_name; relevance_level (H/L)

The last two fields were filled in manually.

I need to automate this tagging procedure, i.e. I need to tag each incoming news item with company names and their relevance level, High (H) or Low (L).

Note:

  1. some of the news items are not relevant to any company and so these are not tagged.

  2. some of the news items are relevant to multiple companies and so these are tagged with multiple company names and their corresponding relevance level.

I am wondering what machine learning algorithms we can use. I am very new to Natural Language Processing, so I am not able to get a handle on how to approach the problem. I understand that I need multi-label/multi-class classification, but I have never used multi-label classification before.

Any help would be greatly appreciated.

Thank you.

Upvotes: 2

Views: 3671

Answers (2)

Scott Ge

Reputation: 91

I wrote a blog post listing the best key phrase extraction APIs on the market. It covers commercial APIs, open-source APIs, and live demos.

Upvotes: -1

Luke

Reputation: 5564

1. Vector of words

Probably the best approach for you is a vector space technique. The steps are:

- Build a list of the 25,000 most common words in your documents and put them in some fixed order (e.g. 0="the", 1="cat", ...).

- For each document, make a vector of length 25,000. Each entry is the number of times the corresponding word appears in the document. (Use a sparse vector representation for efficiency.)

- Compute the cosine distance between document vectors. A small distance suggests the two documents discuss the same subject. If a new document is within some threshold of a labeled training example, give it that example's tag.

A brief discussion is here: http://en.wikipedia.org/wiki/Vector_space_model

A presentation is here (the slides on "Distributional word representations"): http://web.stanford.edu/class/cs224u/
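The steps above can be sketched in plain Python. This is only a minimal illustration, not a tuned implementation: the tokenizer, the function names, the two toy training stories, and the 0.5 distance threshold are all assumptions made for the example, and a real system would want a proper threshold chosen on held-out data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into simple word tokens (an assumed, naive tokenizer).
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(documents, max_words=25000):
    # Map the most common words to fixed vector positions.
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_words))}


def to_sparse_vector(text, vocabulary):
    # Sparse count vector: {position: count}, zero entries are left out.
    counts = Counter(tokenize(text))
    return {vocabulary[w]: c for w, c in counts.items() if w in vocabulary}


def cosine_distance(a, b):
    # 1 - cosine similarity; empty vectors are treated as maximally distant.
    dot = sum(value * b.get(index, 0) for index, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def tag_document(text, labeled_examples, vocabulary, threshold=0.5):
    # Collect the tags of every labeled example within the distance threshold.
    vector = to_sparse_vector(text, vocabulary)
    tags = set()
    for example_text, example_tags in labeled_examples:
        example_vector = to_sparse_vector(example_text, vocabulary)
        if cosine_distance(vector, example_vector) <= threshold:
            tags.update(example_tags)
    return tags


# Toy labeled data (hypothetical stories, for illustration only).
training = [
    ("Apple unveils new iPhone at annual event", {"Apple"}),
    ("Exxon reports record oil profits this quarter", {"Exxon"}),
]
vocab = build_vocabulary([text for text, _ in training])

print(tag_document("Apple unveils new iPhone models", training, vocab))
print(tag_document("Local weather stays mild", training, vocab))
```

Because a document can fall within the threshold of several labeled examples, it can pick up several company tags, and a document close to none of them stays untagged, which matches the two notes in the question.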

2. Named Entity Recognition

A more fine-grained Natural Language Processing approach is Named Entity Recognition (NER); an implementation is available here:

http://nlp.stanford.edu/software/CRF-NER.shtml

The algorithm tags words as specific entities (e.g. Apple Computer). You could run such an algorithm and check whether your company of choice is mentioned. NER algorithms are good at identifying a mention of Apple Computer in an article about a totally different topic, which would be hard for the vector space technique above because it looks at documents as a whole. But it sounds like you don't need that level of granularity, so the first approach is probably the best.
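To illustrate only the "check if your company of choice is mentioned" step, here is a rough sketch. It is not a trained NER model like the Stanford tagger linked above; it is a plain dictionary lookup against a known list of company names. The function name, the sample company list, and the rule "headline mention means High, body-only mention means Low" are assumptions made for this example, not something taken from your labeled data.

```python
import re


def find_company_mentions(headline, story_text, company_names):
    # Return {company_name: "H" or "L"} for every listed company that is mentioned.
    # Assumed heuristic: a headline mention is High relevance, body-only is Low.
    results = {}
    for name in company_names:
        # Match the name as a whole phrase, ignoring case, so that "Apple"
        # does not match inside "Pineapple".
        pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        if pattern.search(headline):
            results[name] = "H"
        elif pattern.search(story_text):
            results[name] = "L"
    return results


# Hypothetical company list and story, for illustration only.
companies = ["Apple Computer", "Microsoft", "Exxon"]
tags = find_company_mentions(
    "Apple Computer posts record sales",
    "Analysts also mentioned Microsoft in their notes.",
    companies,
)
print(tags)
```

A lookup like this misses name variants ("Apple" vs. "Apple Computer") and cannot tell a company from an ordinary word with the same spelling, which is where a real NER tagger would help.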

Upvotes: 3
