Reputation: 705
I have rows of sentences in different languages and the language code is in a separate column. I am specifying to only process certain languages (en, es, fr, or de) since I know AWS Comprehend does not support 'nl' (Dutch). For some reason I continue to get an error that 'nl' is not supported even though it is not listed in my when condition and should therefore not be getting sent through the Comprehend udf. Any ideas on what might be wrong?
Here is my code:
import pyspark.sql.functions as F
def detect_sentiment(text,language):
comprehend = boto3.client(service_name='comprehend', region_name='us-west-2')
sentiment_analysis = comprehend.detect_sentiment(Text=text, LanguageCode=language)
return sentiment_analysis
detect_sentiment_udf = F.udf(detect_sentiment)
reviews_4 = reviews_3.withColumn('RAW_SENTIMENT_SCORE', \
F.when( (F.col('LANGUAGE')=='en') | (F.col('LANGUAGE')=='es') | (F.col('LANGUAGE')=='fr') | (F.col('LANGUAGE')=='de') , \
detect_sentiment_udf('SENTENCE', 'LANGUAGE')).otherwise(None) )
reviews_4.show(50)
I get this error:
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the DetectSentiment operation:
Value 'nl' at 'languageCode'failed to satisfy constraint: Member must satisfy enum value set: [ar, hi, ko, zh-TW, ja, zh, de, pt, en, it, fr, es]
Upvotes: 0
Views: 143
Reputation: 705
Still unsure why my code above wouldn't work but I found the following work around to be effective.
def detect_sentiment(text,language):
comprehend = boto3.client(service_name='comprehend', region_name='us-west-2')
if (language == 'en') | (language == 'es') | (language == 'fr') | (language == 'de') :
sentiment_analysis = comprehend.detect_sentiment(Text=text, LanguageCode=language)
return sentiment_analysis
else:
return None
detect_sentiment_udf = F.udf(detect_sentiment)
reviews_4 = reviews_3.withColumn('RAW_SENTIMENT_SCORE', detect_sentiment_udf('SENTENCE','LANGUAGE'))
reviews_4.show(50)
Upvotes: 0