Reputation: 101
I am working on a video processing project in which I extract text from an input video and save it to a text file. The extracted text contains garbage as well as real words. I now need to separate the meaningful words from the rest and convert them into tags. Can anyone suggest an API or algorithm that can be used for this?
Upvotes: 2
Views: 1816
Reputation: 952
You can use SharpNLP, together with SharpEntropy.dll and OpenNLP.dll, to do this. The following snippet splits the text into tokens.
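// mModelPath is a field holding the directory that contains the SharpNLP model files.
private OpenNLP.Tools.Tokenize.EnglishMaximumEntropyTokenizer mTokenizer;

private string[] Tokenize(string text)
{
    // Create the tokenizer lazily so the model file is loaded only once.
    if (mTokenizer == null)
    {
        mTokenizer = new OpenNLP.Tools.Tokenize.EnglishMaximumEntropyTokenizer(mModelPath + "EnglishTok.nbin");
    }
    return mTokenizer.Tokenize(text);
}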
This gives you a string array of tokens covering all of the extracted text, junk included. The next step is to keep only the meaningful tokens. For this you can use NHunspell.dll, a .NET wrapper around the Hunspell spell checker.
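// Requires: using System.Collections.Generic; using NHunspell;
public List<string> validate(string[] tokens)
{
    List<string> valid_tokens = new List<string>();
    // en_US.aff and en_US.dic are the Hunspell dictionary files; they must exist at these paths.
    using (Hunspell hunspell = new Hunspell("en_US.aff", "en_US.dic"))
    {
        foreach (string token in tokens)
        {
            // Spell() returns true when the word is found in the dictionary,
            // so keep the token only in that case.
            if (hunspell.Spell(token))
            {
                valid_tokens.Add(token);
            }
        }
    }
    return valid_tokens;
}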
The list valid_tokens now contains only the tokens that the dictionary recognizes as English words. Hope this solves your problem.
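To get from the validated tokens to tags, one possible follow-up step is to normalize and de-duplicate them. The sketch below is only an illustration: the class TagBuilder, the method ToTags, and the minLength cut-off are hypothetical names and choices of mine, not part of SharpNLP or NHunspell, and you may want different filtering rules (stop-word removal, frequency ranking) for your data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagBuilder
{
    // Hypothetical helper: turns validated tokens into a de-duplicated tag list.
    // minLength is an assumed cut-off that drops very short words such as "of" or "a".
    public static List<string> ToTags(IEnumerable<string> validTokens, int minLength = 3)
    {
        return validTokens
            .Select(t => t.Trim().ToLowerInvariant())       // normalize case and whitespace
            .Where(t => t.Length >= minLength               // drop very short tokens
                        && t.All(char.IsLetter))            // drop tokens with digits or punctuation
            .Distinct()                                     // each tag appears once
            .ToList();
    }
}
```

You would call it on the output of the previous step, for example TagBuilder.ToTags(validate(Tokenize(text))).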
For a step-by-step guide to integrating SharpNLP into your Visual Studio project, go through this detailed article that I have written: "Easy way of Integrating SharpNLP with a Visual Studio C# Project".
Upvotes: 0
Reputation: 23236
You could take a look at Apache OpenNLP (natural language processing) and the C# derivative SharpNLP.
Upvotes: 1