Thomas Schafer

Reputation: 57

How can I send a batch of strings to the Google Cloud Natural Language API?

I have a Pandas dataframe containing a number of social media comments that I want to analyse using Google's NLP API. As far as I can see, Google's documentation only covers classifying individual strings, not multiple strings in a single request. Classifying one comment per request takes around half a second, which is very slow when I am trying to classify over 10,000 comments at a time. Is there a way to have a list of strings classified separately in one request? I'm sure that would be dramatically quicker.

This is the code I am currently using:

import numpy as np
import pandas as pd
from google.cloud import language

client = language.LanguageServiceClient()

def classify(string):
    document = language.types.Document(
        content=string, type=language.enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    return (sentiment.score, sentiment.magnitude)

def sentiment_analysis_df(df):
    df['sentiment_score'] = np.zeros(len(df))
    df['sentiment_magnitude'] = np.zeros(len(df))
    for i in range(len(df)):
        score, magnitude = classify(df['comment'].iloc[i])
        # .loc avoids pandas' chained-assignment warning, which can silently drop the write
        df.loc[df.index[i], 'sentiment_score'] = score
        df.loc[df.index[i], 'sentiment_magnitude'] = magnitude
    # Other steps including saving dataframe as CSV are done here

I have seen two other posts on here that ask similar questions, here and here, but the first assumes that full stops are used for string separation (not true in my case, as many strings are made up of multiple sentences) and the second only has answers discussing rate limiting and costs.

Upvotes: 3

Views: 932

Answers (1)

Albert Albesa

Reputation: 417

If you want to parallelise the requests, you can have a Spark job do it for you.

Here is a code snippet that I tried myself and that worked:

from pyspark.context import SparkContext
from pyspark import SparkConf

from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types


def comment_analysis(comment):
    # Create the client inside the function so it is instantiated on each
    # Spark executor rather than being pickled from the driver
    client = language.LanguageServiceClient()
    document = types.Document(
        content=comment,
        type=enums.Document.Type.PLAIN_TEXT)
    annotations = client.analyze_sentiment(document=document)
    total_score = annotations.document_sentiment.score
    return total_score


sc = SparkContext.getOrCreate(SparkConf())

# One comment per line; Spark splits the lines across partitions
expressions = sc.textFile("sentiment_lines.txt")

# The requests run in parallel across executors once an action is triggered
mapped_expressions = expressions.map(lambda comment: comment_analysis(comment))

(where sentiment_lines.txt is a plain text file with one comment per line)

Each element of mapped_expressions would be the overall sentiment score of the corresponding comment in expressions.
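
Note that map is lazy, so no requests are actually sent until a Spark action runs. As a minimal sketch, assuming the same sentiment_lines.txt input as above (the output filename is just an illustration), you could collect the scores on the driver and put them back into a dataframe like the one in the question:

import pandas as pd

# zip works here because mapped_expressions was derived from expressions by a
# map, so both RDDs share the same partitioning; collect() triggers the API calls
pairs = expressions.zip(mapped_expressions).collect()

results_df = pd.DataFrame(pairs, columns=["comment", "sentiment_score"])
results_df.to_csv("sentiment_scores.csv", index=False)  # hypothetical output path

Since a LanguageServiceClient is created for every comment, mapPartitions would be a natural next step, so that each partition reuses a single client.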

In addition, remember you can have Dataproc run the Spark job so everything stays managed inside Google Cloud.

Upvotes: 1
