user1940902

Reputation: 121

Regular expression to match top n occurrences of words

I have a string in the format:

word<class> word<class>...
For example:
I<Noun> like<verb> to<Function> eat<verb>...

Is it possible to use a regular expression to find the top n words that occur for each class, for example the top 4 noun words? It should output the list of words.

Thanks

Upvotes: 0

Views: 268

Answers (3)

VladL

Reputation: 13043

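Your regex pattern is (?<!\S)([a-zA-Z]+)<Noun>(?!\S); for every match, use $1 to get the word. The lookarounds (?<!\S) and (?!\S) check for whitespace or the string boundary without consuming it, so two adjacent <Noun> words are both matched.

In C# you can achieve this with the following code:

     // Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
     string type = "Noun";
     int top = 5;

     // Regex.Escape guards against regex metacharacters in the class name.
     string pattern = String.Format(@"(?<!\S)([a-zA-Z]+)<{0}>(?!\S)", Regex.Escape(type));
     MatchCollection mc = Regex.Matches("I<Noun> like<verb> to<Function> eat<verb> an apple<Noun>", pattern);

     List<string> res = new List<string>();

     // Collects the first `top` matches in order of appearance (no frequency count).
     for (int i = 0; i < mc.Count && i < top; i++)
     {
        res.Add(mc[i].Result("$1"));
     }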
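
The loop above keeps the first matches in order of appearance; ranking them by frequency needs a separate counting step. As a rough sketch of that step (in Python rather than C#, with a hypothetical helper named `top_words` and a made-up sample string), the regex can do the extraction and `collections.Counter` the counting:

```python
import re
from collections import Counter

def top_words(text, tag, n):
    """Return the n most frequent words carrying the given <tag> (hypothetical helper)."""
    # Lookarounds keep the surrounding whitespace out of the match,
    # so adjacent tagged words are all found.
    pattern = r"(?<!\S)([A-Za-z]+)<" + re.escape(tag) + r">(?!\S)"
    words = re.findall(pattern, text)
    # The regex only extracts the words; Counter does the counting and ranking.
    return [word for word, _ in Counter(words).most_common(n)]

# Made-up sample input in the word<class> format from the question.
sample = "cat<Noun> eats<verb> fish<Noun> cat<Noun> likes<verb> fish<Noun> cat<Noun> dog<Noun>"
print(top_words(sample, "Noun", 2))  # ['cat', 'fish']
```

Words with equal counts come back in the order they were first seen, which may or may not be the tie-breaking you want.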

Upvotes: 0

specialscope

Reputation: 4228

To accomplish this, you need a part-of-speech tagger to classify the words used in the sentence. Any natural language processing library can do that; in Python, for example, there is NLTK. http://answers.oreilly.com/topic/1091-how-to-use-an-nltk-part-of-speech-tagger/

After that, you need to group the words by part of speech and count them. That is well outside the scope of regular expressions.
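
The group-and-count step could look roughly like the sketch below. It assumes the text is already tagged in the word<class> format from the question (so no tagger is run here), and the helper name `top_words_by_class` and the sample string are invented for illustration:

```python
import re
from collections import Counter, defaultdict

# Assumption: every token has the shape word<class>, letters only.
TOKEN = re.compile(r"(?<!\S)([A-Za-z]+)<([A-Za-z]+)>(?!\S)")

def top_words_by_class(text, n):
    """Map each class to its n most frequent words (hypothetical helper)."""
    counts = defaultdict(Counter)
    # Group: one Counter per class. Count: increment per word occurrence.
    for word, cls in TOKEN.findall(text):
        counts[cls][word] += 1
    return {cls: [w for w, _ in c.most_common(n)] for cls, c in counts.items()}

# Made-up sample input.
tagged = "I<Noun> like<verb> to<Function> eat<verb> apples<Noun> I<Noun> eat<verb>"
print(top_words_by_class(tagged, 1))
# {'Noun': ['I'], 'verb': ['eat'], 'Function': ['to']}
```

Class names are compared case-sensitively here, so `<Verb>` and `<verb>` would be counted as different classes.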

Upvotes: 1

alinsoar

Reputation: 15803

Regular expressions cannot be used for counting.

So no, you cannot find the top n words using regexps.

Upvotes: 3
