Harry47

Reputation: 101

Separate meaningful words from a text file using C# or any open source text mining API

I am working on a video processing project in which I extract text from an input video and save it to a text file. The extracted text contains garbage as well as real words. I now need to separate the meaningful words from the generated text and convert them into tags. Can anyone suggest an API or algorithm that can be used for this?

Upvotes: 2

Views: 1816

Answers (2)

You can use SharpNLP (SharpEntropy.dll and OpenNLP.dll) to tokenize the text, using the following snippet.

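// mModelPath is assumed to be a field holding the folder that contains
// the SharpNLP model files, including a trailing path separator.
private string mModelPath;
private OpenNLP.Tools.Tokenize.EnglishMaximumEntropyTokenizer mTokenizer;

// Splits the raw text into tokens, creating the tokenizer on first use.
private string[] Tokenize(string text)
{
    if (mTokenizer == null)
    {
        mTokenizer = new OpenNLP.Tools.Tokenize.EnglishMaximumEntropyTokenizer(mModelPath + "EnglishTok.nbin");
    }
    return mTokenizer.Tokenize(text);
}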

You now have a string array of tokens covering all of the text, which may still include junk. To keep only the meaningful tokens, you can check each one against an English dictionary using NHunspell.dll.

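// Requires: using System.Collections.Generic; using NHunspell;
public List<string> validate(string[] tokens)
{
      List<string> valid_tokens = new List<string>();
      // The using block disposes the Hunspell instance even if an exception is thrown.
      using (Hunspell hunspell = new Hunspell("en_US.aff", "en_US.dic"))
      {
           foreach (string token in tokens)
           {
                // Spell returns true when the token is found in the dictionary.
                if (hunspell.Spell(token))
                {
                     valid_tokens.Add(token);
                }
           }
      }
      return valid_tokens;
}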

The list valid_tokens now contains only the tokens that are valid English words. Hope this solves your problem.
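
The question also asks about turning the words into tags. Here is a minimal sketch of one way to do that from the validated tokens, assuming "tags" simply means unique, lower-cased words of a reasonable length. TagBuilder, ToTags and the minLength parameter are hypothetical names of my own; they are not part of SharpNLP or NHunspell.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of SharpNLP or NHunspell.
public static class TagBuilder
{
    // Turns validated tokens into tags: trims and lower-cases each token,
    // drops tokens shorter than minLength or containing non-letter
    // characters, and removes duplicates.
    public static List<string> ToTags(IEnumerable<string> validTokens, int minLength = 3)
    {
        return validTokens
            .Select(t => t.Trim().ToLowerInvariant())
            .Where(t => t.Length >= minLength && t.All(char.IsLetter))
            .Distinct()
            .ToList();
    }
}
```

You could call it as TagBuilder.ToTags(valid_tokens). The length and letters-only filters are arbitrary choices, so adjust them to whatever your tags should look like. You may also want to add a stop-word list so that common words such as "the" do not become tags.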

For a step-by-step guide to integrating SharpNLP into your Visual Studio project, go through this detailed article that I have written: Easy way of Integrating SharpNLP with a Visual Studio C# Project

Upvotes: 0

cfeduke

Reputation: 23236

You could take a look at Apache OpenNLP (natural language processing) and the C# derivative SharpNLP.

Upvotes: 1
