Reputation: 121
I have a string in the format:
word<class> word<class>...
For example:
I<Noun> like<verb> to<Function> eat<verb>...
Is it possible to use a regular expression to find the top n most frequent words for each class, for example the top 4 noun words? The output should be the list of words.
Thanks
Upvotes: 0
Views: 268
Reputation: 13043
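Your regex pattern is (?<=\s|^)([a-zA-Z]+)<Noun>(?=\s|$); for every match, use group 1 to get the word. The lookarounds leave the surrounding whitespace unconsumed, so adjacent tagged words are both matched.
In C# you can achieve this with the following code:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string type = "Noun";
int top = 5;
string input = "I<Noun> like<verb> to<Function> eat<verb> an apple<Noun>";
// Build the pattern for the requested class; Regex.Escape guards against special characters in the class name.
MatchCollection mc = Regex.Matches(input, String.Format(@"(?<=\s|^)([a-zA-Z]+)<{0}>(?=\s|$)", Regex.Escape(type)));
List<string> res = new List<string>();
// Collect at most `top` words in order of appearance (this does not rank by frequency).
for (int i = 0; i < mc.Count && i < top; i++)
{
    res.Add(mc[i].Groups[1].Value);
}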
Upvotes: 0
Reputation: 4228
To accomplish this you need a part-of-speech tagger to classify the words used in the sentence. You can use any natural language processing library for that; in Python, for example, there is NLTK: http://answers.oreilly.com/topic/1091-how-to-use-an-nltk-part-of-speech-tagger/
After that you need to group the words by part of speech and count them. So this is entirely out of scope for regular expressions.
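If the input already carries its tags, as in the question's word<class> format, the tagging step can be skipped and only the grouping and counting remain. A minimal Python sketch under that assumption follows; the function name top_words_by_class, the letters-only token pattern, and the sample string are illustrative and not from the question. The regex only splits each token into word and class, while the counting is done by collections.Counter.

```python
import re
from collections import Counter, defaultdict

# One "word<class>" token; letters-only words and class names are an assumption.
TOKEN = re.compile(r"([A-Za-z]+)<([A-Za-z]+)>")

def top_words_by_class(text, n):
    """Return {class: up to n most frequent words} for already-tagged text."""
    counts = defaultdict(Counter)
    for word, tag in TOKEN.findall(text):
        # Lowercase the class so "Noun" and "noun" land in the same group.
        counts[tag.lower()][word] += 1
    # most_common(n) sorts by count, highest first.
    return {tag: [w for w, _ in c.most_common(n)] for tag, c in counts.items()}

# Hypothetical sample in the question's format.
sample = "I<Noun> like<verb> to<Function> eat<verb> apples<Noun> I<Noun> eat<verb>"
result = top_words_by_class(sample, 2)
print(result)
```

Calling top_words_by_class(text, 4)["noun"] would then give the top 4 noun words asked for in the question.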
Upvotes: 1
Reputation: 15803
Regular expressions cannot be used for counting.
So no -- you cannot find the top n words using regexps.
Upvotes: 3