johnsonjp34

Reputation: 3299

NLP with fewer than 20 words on Google Cloud

According to this documentation, the classifyText method requires at least 20 words:

https://cloud.google.com/natural-language/docs/classifying-text#language-classify-content-nodejs

If I send in fewer than 20 words, I get this error no matter how clear the content is:

Invalid text content: too few tokens (words) to process.

I'm looking for a way to work around this limit without disrupting the NLP too much. Are there neutral vector words that can be appended to short phrases that would allow classifyText to process them anyway?

ex.

async function quickstart() {
    const language = require('@google-cloud/language');

    const client = new language.LanguageServiceClient();

    // Fewer than 20 words. What if I append some other neutral words
    // (a, of, it, to), or would it be better to repeat the phrase?
    const text = 'The Atlanta Braves is the best team.';

    const document = {
        content: text,
        type: 'PLAIN_TEXT',
    };

    const [classification] = await client.classifyText({document});

    console.log('Categories:');
    classification.categories.forEach(category => {
        console.log(`Name: ${category.name}, Confidence: ${category.confidence}`);
    });
}

quickstart();

Upvotes: 1

Views: 446

Answers (1)

Iñigo González

Reputation: 3955

The problem is that any padding you add introduces bias, no matter what kind of text you send.

Your only option is to pad your string up to the minimum word limit with filler words that will be removed by the preprocessor and tokenizer before the text reaches the neural network.

I would try appending a suffix made of NLTK stopwords to the end of the sentence, like this:

document.content += ". and ourselves as herself for each all above into through nor me and then by doing"

Why at the end? Because text usually carries more information at the beginning.

Even if Google does not filter stopwords behind the scenes (which I doubt), this would only add white noise that the network gives no focus or attention to.

Remember: DO NOT add this string when you already have enough words, because you are billed per 1,000-character block before any filtering happens.
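As a rough sketch of how that could look on the Node.js side (the padIfTooShort helper, the whitespace-based word count, and the MIN_WORDS constant are my own illustration, not anything Google's API provides):

const STOPWORD_SUFFIX = '. and ourselves as herself for each all above into through nor me and then by doing';
const MIN_WORDS = 20;

// Pad only when the text is below the minimum, so short texts pass the
// token check while longer texts are not billed for extra characters.
function padIfTooShort(text) {
    const wordCount = text.trim().split(/\s+/).length;
    return wordCount >= MIN_WORDS ? text : text + STOPWORD_SUFFIX;
}

// Usage inside the quickstart() from the question:
// const document = { content: padIfTooShort(text), type: 'PLAIN_TEXT' };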

I would also add that suffix to sentences in your train/test/validation set that have fewer than 20 words and see how it works. The network should learn to ignore the whole appended sentence.

Upvotes: 2
