Text tokenization with Stanford NLP : Filter unrequired words and characters

Question

I use Stanford NLP for string tokenization in my classification tool. I want to get only meaningful words, but I get non-word tokens (like ---, >, . etc.) and not important words like am, is, to (stop words). Does anybody know a way to solve this problem?

Jon Gauthier · Accepted Answer

This is a very domain-specific task that we don't perform for you in CoreNLP. You should be able to make this work with a regular expression filter and a stopword filter on top of the CoreNLP tokenizer.

Here's an example list of English stopwords.

Text tokenization with Stanford NLP : Filter unrequired words and characters

Answers (2)

Related Questions