ltfishie

Reputation: 2987

Automatically recognize company names in text

Problem: I have a list of company names/stock symbols and would like to recognize them in some text.

import java.util.List;

public interface AutoTaggingService {
    // Returns the tags (companies) recognized in the given text.
    List<Tags> getTags(String fullText);
}

The simplest implementation would loop over all company names and do an exact match, but this is slow (the list of companies is large) and would not cope well with spelling variations.
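For reference, the naive baseline could look roughly like the sketch below. The class name, the sample companies, and returning plain symbol strings instead of Tags objects are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NaiveTagger {
    // Hypothetical sample data: company name -> stock symbol.
    static final Map<String, String> COMPANIES = new LinkedHashMap<>();
    static {
        COMPANIES.put("Apple", "AAPL");
        COMPANIES.put("Microsoft", "MSFT");
        COMPANIES.put("Google", "GOOG");
    }

    // One exact substring scan per company, so the cost grows with
    // (number of companies) x (text length).
    static List<String> getTags(String fullText) {
        List<String> tags = new ArrayList<>();
        for (Map.Entry<String, String> entry : COMPANIES.entrySet()) {
            if (fullText.contains(entry.getKey())) {
                tags.add(entry.getValue());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Exact matching finds both names here...
        System.out.println(getTags("Apple and Microsoft both reported earnings."));
        // ...but misses a misspelled name entirely.
        System.out.println(getTags("Microsft slipped."));
    }
}
```

The second call shows the weakness mentioned above: a single dropped letter is enough to lose the match.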

Possible Solution: One way I can think of doing this is to feed the list of company names/stock symbols into a Lucene/Solr index and use the fullText as a query. The result of this query would be a list of documents (companies) that match the fullText, with relevancy scores. A threshold can be defined so that only companies with a high score are returned as tags. A custom stemmer and a list of synonyms for company names can be defined to improve accuracy.

Doubts: When I used Lucene/Solr in the past, the documents in the search index contained relatively long text (for example, collections of articles), and the query was relatively short. For what I am looking to do now, the situation is reversed. Would this affect the index or the relevancy scoring and make this method unreliable?

Question

  1. Is my solution a good way to approach this problem?
  2. Can I use a classifier, with the company list as the training data, to achieve this?
  3. Are there any other suggestions on how this could be done efficiently and with high accuracy?

Upvotes: 1

Views: 760

Answers (1)

nickdos

Reputation: 8414

I recently had a similar problem (kind of), and I ended up following the KISS principle and implementing the search part with StringUtils from the Apache Commons Lang library. You haven't provided much detail about either your stock codes (for example, whether they are all the same length) or how large the full text is, but you could possibly use the indexOfAny(CharSequence str, CharSequence... searchStrs) method. Here's some pseudo-Java:

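import org.apache.commons.lang3.StringUtils;

private String[] codes; // e.g. {"ABC", "DEF", "GHI"}; assumes every code is 3 characters long

private void collectTags(String fullText, List<Tags> tagList) {
    // index of the earliest occurrence of any code, or -1 if none is found
    int i = StringUtils.indexOfAny(fullText, codes);

    if (i >= 0) {
        // there's a match
        String code = fullText.substring(i, i + 3);
        tagList.add(doLookup(code)); // lookup util for code -> Tags
        // recursively search again with the substring remainder of the fullText
        collectTags(fullText.substring(i + 3), tagList);
    }
}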

The above example is incomplete and untested; it's just to give you a general idea.
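If it helps, here is a self-contained sketch of the same idea that runs without the Commons Lang dependency. It swaps indexOfAny for plain String.indexOf, loops instead of recursing, and allows codes of different lengths. The class name, method name, and sample codes are made up for illustration, and it returns the matched codes rather than Tags objects.

```java
import java.util.ArrayList;
import java.util.List;

class CodeScanner {
    // Returns every code occurrence in the text, in order of appearance.
    // Assumes all codes are non-empty strings.
    static List<String> findCodes(String text, String[] codes) {
        List<String> found = new ArrayList<>();
        int from = 0;
        while (from < text.length()) {
            int bestIndex = -1;
            String bestCode = null;
            // Pick the code whose next occurrence comes earliest.
            for (String code : codes) {
                int i = text.indexOf(code, from);
                if (i >= 0 && (bestIndex < 0 || i < bestIndex)) {
                    bestIndex = i;
                    bestCode = code;
                }
            }
            if (bestCode == null) {
                break; // no further matches
            }
            found.add(bestCode);
            // Continue scanning just past the match.
            from = bestIndex + bestCode.length();
        }
        return found;
    }

    public static void main(String[] args) {
        String[] codes = {"ABC", "DEFG", "HI"};
        System.out.println(findCodes("Buy ABC, sell DEFG, hold HI and ABC.", codes));
    }
}
```

Note that this is still plain substring matching, so a code embedded in a longer word (say "ABCD") would also count as a hit. Checking for word boundaries around each match would be the next refinement.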

Upvotes: 3
