KubiK888

Reputation: 4731

Does Python NLTK named entity recognition depend on the (upper)case of the first letter?

I am planning to use Python NLTK for academic research. In particular, I need a way to screen Twitter users and tease out the ones who do not seem to be using a "real name" in their profile.

I am thinking about using NLTK's default named-entity recognition to separate the Twitter users who use a seemingly real name from those who don't. Do you think it's worth a try? Or should I train a classifier myself?

import nltk
import time

# Needs the NLTK data for the tokenizer, POS tagger, and NE chunker
# (fetch them once with nltk.download()).

# contentArray0 = ['Health Alerts', "Kenna Hill"]

contentArray = ['ICU nurse toronto']


def processLanguage():
    try:
        for item in contentArray:
            # Split the profile name into tokens, then POS-tag them
            tokenized = nltk.word_tokenize(item)
            tagged = nltk.pos_tag(tokenized)
            print(tagged)

            # Chunk the tagged tokens into named entities
            namedEnt = nltk.ne_chunk(tagged)
            # namedEnt.draw()

            time.sleep(1)

    except Exception as e:
        print(str(e))


processLanguage()

Edit: I have done a bit of testing. It seems NLTK recognizes a named entity primarily by whether the first letter of the word is capitalized. For example, "ICU Nurse Toronto" will be tagged with NNP while "ICU nurse toronto" will not. This seems overly simplistic and not very useful for my purpose (Twitter), since many Twitter users with real names may write them in lower case, while some commercial organizations capitalize the first letter.

Upvotes: 3

Views: 2311

Answers (2)

alexis

Reputation: 50219

Definitely train one yourself. NLTK's NE recognizer is trained to recognize named entities embedded in full sentences. But don't just retrain NLTK's NE recognizer on new data; it is a "sequential classifier", meaning it takes into account the surrounding words and POS tags and the named-entity classification of the preceding words. Since you already have the usernames in isolation, that context will not be useful or relevant for your purposes.

I suggest you train a regular classifier (e.g., Naive Bayes), feed it whatever custom features you think might be relevant, and ask it to decide "is this a real name?". To train, you must have a training corpus that contains examples of names and examples of non-names. Ideally the corpus should consist of what you're trying to classify: Twitter handles.

Re the question in your comment, don't use entire words as features: your classifier can only reason with features it knows about, so census names can't help you with novel names unless your features are about parts of the name. Usually the features represent the endings (last letter, final bigram, final trigram), but you can try other things too like length and of course capitalization. The NLTK chapter discusses the task of recognizing the gender of names, and gives many examples of suffix features.
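As a rough illustration of the suffix-style features described above, here is a minimal sketch of a per-word feature extractor. The function name `word_features` and the feature keys are made up for this example; they are not taken from the NLTK chapter.

```python
def word_features(word):
    """Build sub-word features for one token (no whole-word feature)."""
    lower = word.lower()
    return {
        "last_letter": lower[-1:],        # e.g. "a" for "Kenna"
        "final_bigram": lower[-2:],
        "final_trigram": lower[-3:],
        "length": len(word),              # treated as a category, not a number
        "capitalized": word[:1].isupper(),
    }
```

A dictionary like this is the shape of feature set that NLTK's classifiers accept, so a novel name can still share features (such as its ending) with names seen in training.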

The catch, in your case, is that you have multiple words. So your classifier needs to be told somehow if some words are recognized as names and some are not. Somehow you must define your features in a way that preserves this information. E.g., you could set the feature "known names" to have the values "None", "One", "Several", "All". (Note that the NLTK's implementation treats feature values as "categories": They are simply distinct values. You can use 3 and 4 as feature values, but as far as the classifier is concerned you might as well have used "green" and "elevator".)
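One possible way to encode the "known names" feature is sketched below. `KNOWN_NAMES` is a tiny hypothetical stand-in for a real name list (such as census names), and `known_names_feature` is an illustrative name, not part of NLTK.

```python
# Hypothetical stand-in for a real list of known first names and surnames.
KNOWN_NAMES = {"kenna", "hill", "tom", "alexis"}


def known_names_feature(profile_name, known_names=KNOWN_NAMES):
    """Summarize how many words of a profile name are known names."""
    words = profile_name.split()
    hits = sum(1 for w in words if w.lower() in known_names)
    if hits == 0:
        return "None"
    if hits == len(words):
        return "All"       # checked before "One", so a lone known name is "All"
    if hits == 1:
        return "One"
    return "Several"
```

Checking "All" before "One" is a design choice in this sketch: a single-word profile that is a known name counts as "All" rather than "One".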

And don't forget to add a "bias" feature with constant value (see the NLTK chapter).
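Putting it together, a minimal sketch of a profile-level feature set with a constant "bias" feature, plus a training helper, might look like the following. The feature keys, the labels, and the helper names are assumptions for illustration; `nltk.NaiveBayesClassifier.train` is the real NLTK call and expects a list of (feature dict, label) pairs.

```python
def profile_features(profile_name):
    """Features for a whole profile name, including a constant bias."""
    words = profile_name.split()
    last = words[-1].lower() if words else ""
    return {
        "bias": True,  # constant feature, identical for every example
        "word_count": len(words),
        "all_capitalized": bool(words) and all(w[:1].isupper() for w in words),
        "last_word_suffix": last[-2:],
    }


def train_name_classifier(labeled_profiles):
    """Train on an iterable of (profile_name, label) pairs."""
    import nltk  # imported here so the feature code runs without NLTK
    train_set = [(profile_features(name), label)
                 for name, label in labeled_profiles]
    return nltk.NaiveBayesClassifier.train(train_set)
```

With a labeled corpus of Twitter profile names, you would call `train_name_classifier(corpus)` and then `classifier.classify(profile_features(new_name))` on unseen profiles.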

Upvotes: 4

N00bsie

Reputation: 469

You would definitely have to train a classifier yourself. As an example, since you are working on names, you could have a look at this NLTK chapter. The simple Naive Bayes classifier that the chapter describes, which tests whether a name is male or female, gives good insight into the kinds of features to use. Your question about which features to use is really specific to your problem and domain. Apart from the generic features that all Information Extraction researchers use, there might be other kinds of features as well, but these depend entirely on your data. Do go through that chapter; it gives you all the basic tools to build your own classifier.

As an aside, since you mentioned Twitter user names, I would also suggest using a normalizer, since user names may contain more than just letters. For example, instead of "Tom", a user name could be "T0m". Perhaps you are already doing this, in which case I am sorry for repeating it.
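A very simple normalizer along these lines could map common digit and symbol substitutions back to letters. The substitution table below is a hypothetical starting point, not a standard list.

```python
# Hypothetical substitution table; extend or trim it for your data.
LEET_MAP = str.maketrans({
    "0": "o", "1": "l", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize_name(name):
    """Replace common look-alike digits and symbols with letters."""
    return name.translate(LEET_MAP)
```

Note that a blanket mapping like this also rewrites digits that are meant as digits (a birth year appended to a name, for instance), so you may want to strip or handle trailing numbers before normalizing.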

Upvotes: 1
