nlp: alternate spelling identification

Question

Help by editing my question title and tags is greatly appreciated!

Sometimes one participant in my corpus of "conversations" will refer to another participant by something other than their handle, usually an abbreviation or misspelling; hereafter I'll call all of these "nicknames". I'm willing to tell my software manually whether each possible nickname is in fact a nickname, but I want the software to come up with the list of possible matches between the handles that identify people and the potential nicknames. How would I go about doing that?

Background on me and then my corpus: I have no experience doing natural language processing, but I'm a competent data analyst with R. My data is produced by 70 teams, each forecasting the likelihood of 100 distinct events occurring some time in the future. The result is that I have 70 x 100 = 7000 text files, each containing the stream of forecasts participants make and the comments they include with their forecasts. I'll paste a very short snippet of one of these text files below; this one concerns whether the Malian government would enter talks with the MNLA:

02/12/2013 20:10: past_returns answered Yes: (50%)

I hadn't done a lot of research when I put in my previous placeholder... I'm bumping up a lot due to DougL's forecast

02/12/2013 19:31: DougL answered Yes: (60%)

Weak President Traore wants talks if MNLA drops territorial claims. Mali's military may not want talks. France wants talks. MNLA sugggests it just needs autonomy. But in 7 weeks?

02/12/2013 10:59: past_returns answered No: (75%)

placeholder forecast... http://www.irinnews.org/Report/97456/What-s-the-way-forward-for-Mali
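A minimal sketch of how a file in this format could be turned into records, so that the list of handles comes from the data itself. It is written in Python rather than R, and it rests on assumptions the excerpt does not confirm: that every forecast starts with a header line shaped exactly like the ones above, that handles contain no whitespace, and that dates are month/day/year. The names `HEADER` and `parse_forecasts` are made up for illustration.

```python
import re
from datetime import datetime

# Header lines look like "02/12/2013 20:10: past_returns answered Yes: (50%)".
# Assumption: handles contain no whitespace and percentages are whole numbers.
HEADER = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}): "
    r"(?P<handle>\S+) answered (?P<answer>.+?): \((?P<pct>\d+)%\)\s*$"
)


def parse_forecasts(text):
    """Split one file's text into forecast records, each with its comment."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = HEADER.match(line)
        if m:
            records.append({
                # Assumption: month/day/year; swap to "%d/%m/%Y" if it is day-first.
                "time": datetime.strptime(m["stamp"], "%m/%d/%Y %H:%M"),
                "handle": m["handle"],
                "answer": m["answer"],
                "pct": int(m["pct"]),
                "comment": "",
            })
        elif records:
            # Any non-header line is treated as comment text for the latest header.
            records[-1]["comment"] = (records[-1]["comment"] + " " + line).strip()
    return records


def handles_in(records):
    """Distinct handles that made at least one forecast in these records."""
    return sorted({r["handle"] for r in records})
```

Run over the snippet above, this would give three records and the handles `DougL` and `past_returns`; the comment text of each record is what a nickname search would then scan.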

My initial thoughts: Obviously I can start by providing the names I'm looking to match things up with; in the above example they would be past_returns and DougL (though no nicknames are used in the above). I wouldn't think it'd be that hard to get a computer to guess at minor misspellings (though I wouldn't personally know where to start). I can imagine other tricks as well. A string is more likely to be a nickname if it is used far more by one team than by the other teams. A nickname is more likely to refer to someone who spoke recently than to someone who spoke long ago, or not at all, on the question at hand. And a nickname should appear in sentences in much the same way the full name/screenname typically does in the corpus. I'm interested to hear about simple approaches, as well as ones that use more sophisticated techniques.
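A minimal sketch of the simplest idea above, guessing at minor misspellings and abbreviations by string similarity alone. It is written in Python and uses the standard library's `difflib.SequenceMatcher`; R's `adist` and `agrep` offer comparable approximate matching. The function name, the 0.7 threshold, the three-character minimum, and the 0.9 floor for prefix abbreviations are all arbitrary assumptions to be tuned, and the contextual cues (team usage, recency, sentence position) are not covered.

```python
import re
from difflib import SequenceMatcher


def candidate_nicknames(comments, handles, threshold=0.7, min_len=3):
    """Return (token, handle, score) triples for a human to accept or reject."""
    known = {h.lower() for h in handles}

    # Collect lower-cased word-like tokens; very short ones (including the
    # "s" left over from a possessive like "DougL's") are dropped.
    tokens = set()
    for comment in comments:
        for tok in re.findall(r"[A-Za-z0-9_]+", comment):
            tok = tok.lower()
            if len(tok) >= min_len and tok not in known:
                tokens.add(tok)

    candidates = []
    for tok in tokens:
        for handle in handles:
            target = handle.lower()
            score = SequenceMatcher(None, tok, target).ratio()
            # Assumption: a token that is a prefix of the handle is a likely
            # abbreviation, so its score gets an arbitrary floor of 0.9.
            if target.startswith(tok):
                score = max(score, 0.9)
            if score >= threshold:
                candidates.append((tok, handle, round(score, 3)))

    # Best guesses first, so the manual review starts with the likeliest pairs.
    return sorted(candidates, key=lambda c: (-c[2], c[0], c[1]))


if __name__ == "__main__":
    comments = ["Agree with Doug on this", "past_return makes a good point"]
    for row in candidate_nicknames(comments, ["DougL", "past_returns"]):
        print(row)
```

The output is only a review list: each triple is a guess that still needs the manual yes/no judgement described earlier, and lowering `threshold` trades more false candidates for fewer missed nicknames.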
