Eric J

Reputation: 65

Google Translate outputs HTML Entities in Python 3

When I try to write French characters to a file, some characters look like

j&#39;ai

I didn't have any issues with Spanish characters. What could I be doing wrong?

"""Translates text into the target language.

Make sure your project is whitelisted.

Target must be an ISO 639-1 language code.
See https://g.co/cloud/translate/v2/translate-reference#supported_languages
"""
from google.cloud import translate

# Instantiates a client
translate_client = translate.Client()


# The target language
target = 'fr'

# Create a list of strings to translate.
test_list = []
for i in range(1):
    test_list.append("I said, you know what, something, I'm going to drop everything else off that I was doing and go through a period of a dry spell just to properly give it a chance when I started using it. ")

# Send 128 items per translation request and concatenate resulting translations into one list. (The max items per request for Google translate is 128.)
concat_result = []
for j in range(0, len(test_list), 128):
    new_result = translate_client.translate(
        test_list[j:j + 128], target_language=target)
    concat_result += new_result

# Print each translation with its index.
for count, result in enumerate(concat_result):
    print(count, result['translatedText'])

Print result:

0 J&#39;ai dit, vous savez quoi, quelque chose, je vais laisser tomber tout ce que je faisais et traverser une période de sécheresse simplement pour lui donner une chance de bien commencer à l&#39;utiliser.

Please ignore that I am translating a list of strings instead of a string. I was testing sending batch requests.

Upvotes: 1

Views: 1795

Answers (1)

DallaRosa

Reputation: 5815

EDIT


OK, as expected the problem was with the strings and not with the subtitle generation.

The Google Translate API specifies that it defaults its output to HTML. That's why you're getting HTML entities instead of the raw characters.

You need to specify in the call to the translate method that you want the format to be text instead of HTML. In the Python client the keyword argument is format_ (with a trailing underscore), not format.

Something like:

translate_client.translate(
        test_list[j:j + 128],
        target_language=target,
        format_="text")

You can find more info on the parameters at: https://cloud.google.com/translate/docs/reference/translate?hl=ja

and more details on the Python API itself reading its source code here: https://github.com/googleapis/google-cloud-python/blob/master/translate/google/cloud/translate_v2/client.py#L164
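
If the translations were already fetched in the default HTML format, one possible fallback is to decode the entities afterwards with html.unescape from the standard library. This is only a minimal sketch, assuming the results are a list of dicts with a 'translatedText' key as in the question; the sample data and the helper name unescape_translations are made up for illustration:

```python
import html


def unescape_translations(results):
    """Return the translated strings with HTML entities decoded."""
    return [html.unescape(item["translatedText"]) for item in results]


# Hypothetical sample shaped like the client's response (not real API output).
sample_results = [{"translatedText": "J&#39;ai dit, vous savez quoi"}]

for count, text in enumerate(unescape_translations(sample_results)):
    print(count, text)
```

Requesting plain text from the API is still the cleaner fix, since it avoids the extra decoding step altogether.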

END OF EDIT


Before I answer, I'm going to give you some advice, as you seem to be new here: if you need help with code, you should provide a fully working example. It's really hard to help someone when they don't provide all the context and information needed.

So, let's move to the answer...

I'm going to start with a wild guess here:

You are creating subtitle files with the srt library found at: https://github.com/cdown/srt

--

I've just tested it with the code below:

import srt

subtitle_generator = srt.parse('''\
1
00:31:37,894 --> 00:31:39,928
Je m'appelle Francisco

''')

subtitles = list(subtitle_generator)

with open("a_fr.srt" , "w", encoding='utf-8') as f:
    f.write(srt.compose(subtitles))

And it showed the apostrophe just fine.

You should check the contents of subs and the original text being passed to the parse function. There's a high probability that the problem is with the original text and not with how Python writes it, as nothing in the writing process automatically transforms characters into HTML entities.
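
One quick way to run that check is to scan the text for HTML entities before it ever reaches the writing step. A rough sketch using only the standard library; the helper name find_html_entities and the sample strings are assumptions for illustration, not part of srt or the translate client:

```python
import re

# Matches named (&amp;), decimal (&#39;) and hexadecimal (&#x27;) entities.
ENTITY_RE = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def find_html_entities(text):
    """Return every HTML entity found in text, in order of appearance."""
    return ENTITY_RE.findall(text)


# Hypothetical inputs: one escaped string and one clean string.
print(find_html_entities("Je m&#39;appelle Francisco"))
print(find_html_entities("Je m'appelle Francisco"))
```

If the first kind of string shows up in the input, the entities were already there before the file was written.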

Upvotes: 3
