ashingel

Reputation: 494

Financial news headers classification to positive/negative classes

I'm doing a small research project in which I need to split financial news article headers into positive and negative classes. For classification I'm using an SVM. The main problem I see now is that not many features can be produced for ML. News articles contain a lot of named entities and other "garbage" elements (from my point of view, of course).

Could you please suggest features that could be used for ML training? My current results are precision = 0.6 and recall = 0.8.

Thanks

Upvotes: 2

Views: 1224

Answers (3)

Mark Unsworth

Reputation: 3177

iliasfl is right, this is not a straightforward task.

I would use a bag-of-words approach, but run a POS tagger first to tag each word in the headline. Then you could remove all of the named entities (the tagger marks most of them as proper nouns), which, as you rightly point out, don't affect the sentiment. Other words should appear frequently enough (if your dataset is big enough) to cancel themselves out rather than being polarised as either positive or negative.

One step further, if you still aren't close, would be to select only the adjectives and verbs from the tagged data, as they are the words that tend to convey the emotion or mood.
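A minimal sketch of that filtering, assuming the headline has already been tagged with Penn Treebank tags (for example by a tagger such as NLTK's `nltk.pos_tag`). The helper names and the hand-tagged example headline are made up for illustration, and dropping proper nouns is only a rough stand-in for real named-entity removal:

```python
from collections import Counter

# Penn Treebank tags: NNP/NNPS = proper nouns, JJ* = adjectives, VB* = verbs.
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def drop_proper_nouns(tagged):
    """Remove tokens tagged as proper nouns (a rough proxy for named entities)."""
    return [(word, tag) for word, tag in tagged if tag not in PROPER_NOUN_TAGS]


def keep_adjectives_and_verbs(tagged):
    """Keep only tokens whose tag starts with JJ (adjective) or VB (verb)."""
    return [(word, tag) for word, tag in tagged if tag.startswith(("JJ", "VB"))]


def bag_of_words(tagged):
    """Count lower-cased words, ignoring their tags."""
    return Counter(word.lower() for word, _ in tagged)


# Hand-tagged, hypothetical headline; a real pipeline would get this from a tagger.
tagged_headline = [
    ("Acme", "NNP"), ("shares", "NNS"), ("plunge", "VBP"),
    ("after", "IN"), ("weak", "JJ"), ("earnings", "NNS"),
]

no_entities = bag_of_words(drop_proper_nouns(tagged_headline))
adj_verb_only = bag_of_words(keep_adjectives_and_verbs(tagged_headline))
```

Either bag can then be turned into a feature vector for the SVM; trying both and comparing on held-out data is probably the easiest way to see which filter helps.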

I wouldn't be too disheartened by your precision and recall figures, though: an F-score of 0.8 or above is actually quite good.
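For reference, the F-score combines precision and recall as a (weighted) harmonic mean. A small sketch of the calculation, applied to the figures quoted in the question (the function name is just for illustration):

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the usual F1 (harmonic mean of P and R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the question.
f1 = f_score(0.6, 0.8)
print(round(f1, 2))
```

With precision 0.6 and recall 0.8 this comes out at roughly 0.69, so there is still some way to go before the 0.8 mark.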

Upvotes: 0

iliasfl

Reputation: 559

The task is not trivial at all.

The straightforward approach would be to find or create a training set, that is, a set of headers with positive news and a set of headers with negative news. You turn the training set into a TF-IDF representation and then train a linear SVM to separate the two classes. Depending on the quality and size of your training set you can achieve something decent, though I'm not sure you would reach a 0.7 break-even point (the point where precision equals recall).

Then, to get better results, you need to go for NLP approaches. Try using a part-of-speech tagger to identify adjectives (trivial), and then score them using a sentiment database such as SentiWordNet.
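As a rough illustration of the TF-IDF step, here is a pure-Python sketch. The three headlines and their labels are invented, tokenisation is plain whitespace splitting, and the weighting is the textbook tf * log(N / df) form; real libraries usually apply smoothing and normalisation on top of this:

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log(n_docs / doc_freq[tok]) for tok in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty headlines
        vectors.append([counts[tok] / length * idf[tok] for tok in vocabulary])
    return vocabulary, vectors


# Hypothetical training set: +1 = positive news, -1 = negative news.
headlines = [
    "profits surge at retailer",
    "retailer shares slump",
    "profits slump as sales fall",
]
labels = [1, -1, -1]

vocab, X = tfidf_vectors(headlines)
```

The rows of `X` together with `labels` are what you would hand to a linear SVM implementation (scikit-learn's `LinearSVC` is one option).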
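A sketch of the adjective-scoring idea, assuming the headline is already POS-tagged with Penn Treebank tags. The tiny lexicon below is hand-made purely for illustration and its scores are not taken from SentiWordNet, which as far as I know stores separate positive and negative scores per word sense rather than a single number per word:

```python
# Hypothetical polarity scores in [-1, 1]; a real run would look these up in a
# resource such as SentiWordNet instead of this hand-made table.
TOY_LEXICON = {
    "strong": 0.6,
    "weak": -0.5,
    "record": 0.3,
    "disappointing": -0.7,
}


def adjective_score(tagged, lexicon):
    """Average lexicon polarity over adjectives (JJ* tags); 0.0 if none are found."""
    scores = [
        lexicon[word.lower()]
        for word, tag in tagged
        if tag.startswith("JJ") and word.lower() in lexicon
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hand-tagged, hypothetical headline.
tagged = [
    ("Weak", "JJ"), ("demand", "NN"), ("hits", "VBZ"),
    ("disappointing", "JJ"), ("quarter", "NN"),
]
score = adjective_score(tagged, TOY_LEXICON)
```

The resulting score can be thresholded on its own or, probably better, added as one extra feature next to the TF-IDF vector.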

There is an excellent overview of sentiment analysis by Bo Pang and Lillian Lee that you should read.

Upvotes: 2

TakeS

Reputation: 626

How about these features?

  1. Length of article header in words
  2. Average word length
  3. Number of words in a dictionary of "bad" words, e.g. dictionary = {terrible, horrible, downturn, bankruptcy, ...}. You may have to generate this dictionary yourself.
  4. Ratio of words in that dictionary to total words in sentence
  5. Similar to 3, but the number of words in a "good"-word dictionary, e.g. dictionary = {boon, booming, employment, ...}
  6. Similar to 4, but use the "good"-word dictionary
  7. Time of the article's publication
  8. Date of the article's publication
  9. The medium through which it was published (you'll have to do some subjective classification)
  10. A count of certain punctuation marks, such as the exclamation point
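Several of these surface features can be sketched in a few lines. The word lists below are just the examples from the list above (you would need much larger ones in practice), and the function and key names are made up for illustration:

```python
import string

# Tiny example dictionaries; real ones would have to be far larger.
BAD_WORDS = {"terrible", "horrible", "downturn", "bankruptcy"}
GOOD_WORDS = {"boon", "booming", "employment"}


def headline_features(headline):
    """Compute simple surface features (items 1-6 and 10) for one headline."""
    # Strip surrounding punctuation so "downturn!" matches "downturn".
    words = [w.strip(string.punctuation).lower() for w in headline.split()]
    words = [w for w in words if w]
    n = len(words)
    bad = sum(w in BAD_WORDS for w in words)
    good = sum(w in GOOD_WORDS for w in words)
    return {
        "n_words": n,
        "avg_word_len": sum(len(w) for w in words) / n if n else 0.0,
        "bad_count": bad,
        "bad_ratio": bad / n if n else 0.0,
        "good_count": good,
        "good_ratio": good / n if n else 0.0,
        "exclamations": headline.count("!"),
    }


features = headline_features("Bankruptcy fears grow amid downturn!")
```

Time, date and medium (items 7 to 9) depend on metadata you have about each article, so they are left out here.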

If you're allowed access to the actual article, you could use surface features from it, such as its total length and perhaps even the number of responses or the level of opposition to that article. You could also look at the many other word lists available online, such as Ogden's Basic English list of 850 words, and see whether bad/good articles are likely to draw many words from them. I agree that it seems difficult to come up with a long list (e.g. 100 features) of useful features for this purpose.

Upvotes: 1
