Reputation: 8433
I'm using the node natural tokenizer, which splits a sentence into words. Normally it works like this:
var natural = require('natural'),
tokenizer = new natural.WordTokenizer();
console.log(tokenizer.tokenize("your dog hasn't flees."));
// Returns [ 'your', 'dog', 'hasn', 't', 'flees' ]
That works fine; however, with German or French words it splits a single word in two, such as
var natural = require('natural'),
tokenizer = new natural.WordTokenizer();
console.log(tokenizer.tokenize("fußball"));
// Returns ['fu', 'ball']
which is not correct.
Does anyone know how to avoid that?
Or maybe you know a simpler way to split sentences into words in JavaScript / Node.js?
Thanks!
Upvotes: 0
Views: 1212
Reputation: 63587
var data = "your fußball, hasn't! flees.";
// Remove unwanted punctuation, in this case full-stops,
// commas, and exclamation marks.
data = data.replace(/[.,!]/g, '');
// split the words up
data.split(' '); // ["your", "fußball", "hasn't", "flees"]
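If you want matching rather than splitting, modern JavaScript regexes support Unicode property escapes (`\p{L}` for any letter), so non-ASCII characters like `ß` stay inside the word. This is a plain-regex sketch, not part of natural, and assumes a Node version with the `u` flag and property escapes (Node 10+):

```javascript
var data = "your fußball, hasn't! flees.";
// Match runs of Unicode letters, digits, and apostrophes,
// so punctuation is dropped without a separate replace() pass.
var tokens = data.match(/[\p{L}\p{N}']+/gu);
console.log(tokens); // ["your", "fußball", "hasn't", "flees"]
```

This avoids hard-coding which punctuation to strip: anything that isn't a letter, digit, or apostrophe simply never makes it into a token.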
Upvotes: 1
Reputation: 1350
The natural docs state
[...] At the moment, most of the algorithms are English-specific
So, I wouldn't expect it to work out-of-the-box without some work on your part.
However, if all you want to do is split a string along whitespace boundaries, use something like this:
var s = "your dog hasn't flees.";
console.log(s.split(/\s+/)); // ["your", "dog", "hasn't", "flees."]
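Whitespace splitting leaves punctuation attached (note the `"flees."` above). Another option, assuming a Node version where `Intl.Segmenter` is available (Node 16+), is locale-aware word segmentation, which keeps `fußball` whole and filters out punctuation via the `isWordLike` flag:

```javascript
// Locale-aware word segmentation; 'de' is used here since the
// question involves German text.
const segmenter = new Intl.Segmenter('de', { granularity: 'word' });
const words = [...segmenter.segment("your fußball hasn't flees.")]
  .filter(s => s.isWordLike)  // drop spaces and punctuation segments
  .map(s => s.segment);
console.log(words); // ["your", "fußball", "hasn't", "flees"]
```

Unlike a hand-rolled regex, this follows the Unicode word-boundary rules, so contractions like "hasn't" come back as one token.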
Upvotes: 1