Andrea Nagar

Reputation: 1263

Algorithm for sentence analysis and tokenization

I need to analyze a document and compile statistics on how many times each sequence of words is used (so the analysis is not of single words but of recurring groups of words). I read that compression algorithms do something similar to what I want: they build dictionaries of blocks of text, each stored with its frequency. Something similar to http://www.codeproject.com/KB/recipes/Patterns.aspx would be ideal. Do you have anything written in C#?

Upvotes: 1

Views: 1835

Answers (1)

Yin Zhu

Reputation: 17119

This is very simple to implement.

  1. Use Split (a member function of the string class) to split the text into words. You can use the delimiters from the CodeProject URL.

  2. Use a for loop to enumerate all the n-grams (runs of n consecutive words), and use a Dictionary<string, int> to keep the count of each.
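The two steps above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the class name NGramCounter, the method name Count, and the delimiter set are all assumptions made for the example (the delimiters are not copied from the CodeProject article), and lowercasing the text is an optional choice so that "The cat" and "the cat" are counted together.

```csharp
using System;
using System.Collections.Generic;

public static class NGramCounter
{
    // Assumed delimiter set; adjust it to the punctuation in your documents.
    private static readonly char[] Delimiters =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Counts every run of n consecutive words in the text.
    // Note: n-grams may span sentence boundaries, because sentence
    // punctuation is treated as an ordinary delimiter here.
    public static Dictionary<string, int> Count(string text, int n)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        // Step 1: split into words, dropping the empty entries that
        // appear between adjacent delimiters.
        string[] words = text.ToLowerInvariant()
            .Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);

        // Step 2: slide a window of n words over the array and count.
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + n <= words.Length; i++)
        {
            string gram = string.Join(" ", words, i, n);
            counts.TryGetValue(gram, out int current); // current is 0 if the key is missing
            counts[gram] = current + 1;
        }
        return counts;
    }
}
```

For example, NGramCounter.Count("The cat sat on the mat. The cat", 2) yields a count of 2 for "the cat" and 1 for each of the other bigrams. To cover sequences of several lengths, call Count once per value of n and merge or report the dictionaries separately.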

Upvotes: 1
