Reputation: 17004
What techniques exist that can distinguish plain common phrases such as "to the" and "and the" from set phrases and idioms that have their own lexical meanings, such as "pick up", "fall in love", "red herring", and "dead end"?
Are there techniques that succeed even without a dictionary, for instance statistical methods such as HMMs trained on large corpora?
Or are there heuristics, such as ignoring or down-weighting "promiscuous" words that can co-occur with just about any word, as opposed to words that occur either alone or in a specific, limited set of idiomatic phrases?
If there are such heuristics, how do we take into account set phrases and verbal phrases which do incorporate promiscuous words such as "up" in "beat up", "eat up", "sit up", "think up"?
UPDATE
I've found an interesting paper online: Unsupervised Type and Token Identification of Idiomatic Expressions
Upvotes: 1
Views: 590
Are you looking for collocation detection?
Take a look at this chapter in the excellent book Foundations of Statistical Natural Language Processing by Manning & Schütze.
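To give a rough idea of what dictionary-free collocation detection can look like, here is a minimal sketch that scores adjacent word pairs by pointwise mutual information (PMI), one common association measure. The function name `pmi_bigrams` and the `min_count` parameter are my own illustrative choices, not taken from the book, and the toy sentence is made up. The intuition matches the "promiscuous word" heuristic in the question: a word like "the" is frequent on its own, so pairs containing it get a lower score than pairs whose words mostly occur together.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
    Pairs seen fewer than `min_count` times are skipped, because PMI
    tends to overrate pairs built from rare words.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring (most strongly associated) pairs first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpus (hypothetical); a real run needs a large corpus.
text = ("the red herring fooled the cat and the red herring "
        "fooled the dog and the cat saw the dog")
for pair, score in pmi_bigrams(text.split()):
    print(pair, round(score, 2))
```

On this toy input, "red herring" scores higher than "the red", since "the" also appears next to many other words. Note that this only looks at adjacent pairs, so it will not catch split verbal phrases like "pick it up", and a frequent particle such as "up" will still drag down the score of "beat up" or "think up". For those cases you would need something beyond plain PMI.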
Upvotes: 2