Reputation: 8018
I am trying to extract keywords from a piece of text using nltk and the Stanford NLP tools. After I run my code, I get a list like this:
companyA
companyB
companyC
Trend Analysis For companyA
This is all good, but notice the last item: it is actually a heading that appears in the text. Since all the words in a heading are capitalized, my program thinks they are all proper nouns and clubs them together as if they were one big company name.
The good thing is that as long as a company has been mentioned somewhere in the text, my program will pick it up, hence I get individual items like companyA as well. These come from the actual pieces of text that talk about that company.
Here is what I want to do.
In the list that I get above, is there a way to look at an item and determine whether any previous item is a substring of the current one? For example, when I come across Trend Analysis For companyA, I can check whether I have seen any part of it before. That way I can determine that I already have companyA and thus ignore Trend Analysis For companyA. I am confident that the text will mention each company enough times for StanfordNER to pick it up, so I do not have to rely on headings to get what I need.
Does that make sense? Is this the correct approach? I am afraid that this will not be very efficient, but I can't think of anything else.
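Here is a rough sketch of the check I have in mind (the function name and the sample list are just illustrative, not part of my real pipeline):

def filter_headings(candidates):
    """Drop any item that contains an earlier, already-kept item."""
    kept = []
    for item in candidates:
        # "Trend Analysis For companyA" contains the earlier "companyA",
        # so it is treated as a heading and skipped.
        if any(prev in item for prev in kept):
            continue
        kept.append(item)
    return kept

names = ["companyA", "companyB", "companyC", "Trend Analysis For companyA"]
print(filter_headings(names))  # ['companyA', 'companyB', 'companyC']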
Edit
Here is the code that I use:
import nltk

# Split the document into sentences, then tokenize and POS-tag each one.
sentences = nltk.sent_tokenize(document)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]
After that I simply use the StanfordNERTagger on each sentence:
from itertools import groupby
from nltk.tag import StanfordNERTagger

result = []
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
for s in sentences:
    # The NER tagger expects plain tokens, so strip the POS tags first.
    taggedwords = stn.tag([word for word, pos in s])
    # Group consecutive tokens that share the same NER label into chunks.
    for tag, chunk in groupby(taggedwords, lambda x: x[1]):
        if tag == "ORGANIZATION":
            result.append((tag, " ".join(w for w, t in chunk)))
In this way I can get all the ORGANIZATIONs.
To @Alvas's point about Truecasing: don't you think it's a bit of an overkill here? When I studied the algorithm, it appeared to me that they try to come up with the most likely spelling for each word, with the likelihood based on a corpus. I don't think I need to build a corpus, as I can use a dictionary like wordnet or something like pyenchant to figure out the appropriate spelling. Also, here I already have all the information I need, i.e. I am picking up all the companies mentioned.
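For illustration, here is the kind of dictionary lookup I have in mind with pyenchant (a rough sketch; the helper name is mine):

import enchant  # pyenchant: pip install pyenchant

d = enchant.Dict("en_US")

def is_common_word(token):
    # If the lowercase form is an ordinary dictionary word, capitalization
    # alone (e.g. in a heading) is weak evidence of a proper noun.
    return d.check(token.lower())

print(is_common_word("Analysis"))  # True  - ordinary English word
print(is_common_word("companyA"))  # False - not in the dictionary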
There is another problem. Consider the company name American Eagle Outfitters, and note that American and american are both proper spellings; similarly for Eagle and eagle. I am afraid that even if I employ Truecasing in my algorithm, it will end up lowercasing terms that should not be lowercased.
Again, my problem right now is that I have all the company names extracted, but I am also extracting the headings. The brute-force way would be to perform a substring check on the list of results; I was just wondering whether there is a more efficient way of doing this. Moreover, I don't think any tweaking I do will improve the tagging, and I don't think I will be able to outperform the StanfordNERTagger.
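For reference, this is the kind of thing I would hope for instead of the quadratic pairwise scan: keep the accepted names in a set and test each new candidate's contiguous token windows against it, so the cost per candidate depends on its own length rather than on the size of the result list (the helper is illustrative):

def filter_headings_fast(names):
    kept, seen = [], set()
    for name in names:
        tokens = tuple(name.split())
        n = len(tokens)
        # Does any contiguous window of tokens, strictly shorter than the
        # whole name, match an already accepted name? Set lookups are O(1).
        contains_seen = any(tokens[i:i + k] in seen
                            for k in range(1, n)
                            for i in range(n - k + 1))
        if not contains_seen:
            kept.append(name)
            seen.add(tokens)
    return kept

print(filter_headings_fast(["companyA", "companyB", "Trend Analysis For companyA"]))
# ['companyA', 'companyB']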
Upvotes: 0
Views: 881
Reputation: 750
I encountered a similar problem, but the whole text was uncapitalized (ASR output). In that case I retrained the NER model on uncapitalized annotated data to obtain better performance.
There are a few options I would consider, in order of preference (and complexity). I would personally start with retraining on artificial data: you can easily create an all-capitalized corpus (from correctly capitalized text) and train a sequence-labelling model (e.g. a CRF) to detect where capitalization should be removed.
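As a rough sketch of that sequence-labelling idea, assuming sklearn-crfsuite (the features, labels, and sample corpus here are illustrative, not the exact setup I used):

import sklearn_crfsuite  # pip install sklearn-crfsuite

def make_example(tokens):
    # From a correctly capitalized sentence, build an artificially
    # all-capitalized input and per-token casing labels.
    x = [{"word": t.upper(), "suffix3": t.upper()[-3:], "is_first": i == 0}
         for i, t in enumerate(tokens)]
    y = ["KEEP" if t[0].isupper() else "LOWER" for t in tokens]
    return x, y

corpus = [["Shares", "of", "American", "Eagle", "Outfitters", "rose"],
          ["Trend", "analysis", "for", "the", "quarter"]]
X, Y = map(list, zip(*(make_example(s) for s in corpus)))

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, Y)

# Predict which tokens of a capitalized heading should be lowercased
# before running POS tagging and NER.
heading = ["Trend", "Analysis", "For", "CompanyA"]
feats = [{"word": w.upper(), "suffix3": w.upper()[-3:], "is_first": i == 0}
         for i, w in enumerate(heading)]
print(crf.predict([feats])[0])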
Whatever approach you use, you will indeed never reach performance as good as on correctly capitalized text. Your input can be considered partly noisy, since a clue is missing for both POS tagging and NER.
Upvotes: 2