Reputation: 57
I have a Pandas dataframe containing a number of social media comments that I want to analyse using Google's NLP API. Google's documentation only discusses (as far as I can see) how to classify individual strings, rather than multiple strings in one request. Each request to the API, classifying one comment at a time, takes around half a second, which is very slow when I am trying to classify over 10,000 comments at a time. Is there a way to have a list of strings classified separately in a single request, which I'm sure would be dramatically quicker?
This is the code I am currently using:
import numpy as np
import pandas as pd
from google.cloud import language

client = language.LanguageServiceClient()

def classify(string):
    document = language.types.Document(content=string, type=language.enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    return (sentiment.score, sentiment.magnitude)

def sentiment_analysis_df(df):
    df['sentiment_score'] = np.zeros(len(df))
    df['sentiment_magnitude'] = np.zeros(len(df))
    for i in range(len(df)):
        score, magnitude = classify(df['comment'].iloc[i])
        df['sentiment_score'].iloc[i] = score
        df['sentiment_magnitude'].iloc[i] = magnitude
    # Other steps including saving dataframe as CSV are done here
I have seen two other posts on here that ask similar questions, here and here, but the first assumes that full stops are used for string separation (not true in my case, as many strings are made up of multiple sentences) and the second only has answers discussing rate limiting and costs.
Upvotes: 3
Views: 932
Reputation: 417
If you want to parallelise the requests, you can have a Spark job do it for you.
Here is a code snippet I tried myself, and it worked:
from pyspark.context import SparkContext
from pyspark import SparkConf
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

def comment_analysis(comment):
    # The client is created inside the function so each Spark worker
    # builds its own (the client object itself is not serialisable)
    client = language.LanguageServiceClient()
    document = types.Document(
        content=comment,
        type=enums.Document.Type.PLAIN_TEXT)
    annotations = client.analyze_sentiment(document=document)
    total_score = annotations.document_sentiment.score
    return total_score

sc = SparkContext.getOrCreate(SparkConf())

# One RDD element per line of the input file
expressions = sc.textFile("sentiment_lines.txt")

# Analyse the comments in parallel across the Spark workers
mapped_expressions = expressions.map(lambda comment: comment_analysis(comment))
(where sentiment_lines.txt is a plain text file with one comment per line)
Each element of mapped_expressions is the overall sentiment score for the corresponding "comment" in expressions.
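If you then want the scores back in a pandas dataframe, as in the question, one option is to collect both RDDs and build the dataframe from them (map preserves element order, so the two lists line up). This is only a minimal sketch; the column names and output filename are my own examples, not anything from your data:

import pandas as pd

comments = expressions.collect()         # the original comments, in order
scores = mapped_expressions.collect()    # one sentiment score per comment

result_df = pd.DataFrame({'comment': comments, 'sentiment_score': scores})
result_df.to_csv('sentiment_scores.csv', index=False)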
In addition, remember you can have Dataproc run the Spark job so everything stays managed inside Google Cloud.
Upvotes: 1