Reputation: 18107
I'm doing i18n work in Python with Jinja2 and Pyramid, and the stack seems to have a problem knowing how it should translate %%. I was beginning to suspect the bug is in Jinja2.
After some more investigation, the problem appears to lie with gettext rather than with Jinja2, as this REPL session illustrates:
>>> gettext.gettext("98%% off %s sale") % ('holiday')
'98% off holiday sale'
>>> gettext.gettext("98%% off sale")
'98%% off sale'
>>> gettext.gettext("98% off %s sale") % ('holiday')
Traceback (most recent call last):
Python Shell, prompt 13, line 1
TypeError: %o format: a number is required, not str
It seems to be a chicken/egg problem.
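The three results above are consistent with one rule: gettext only looks the string up, and `%%` collapses to `%` only when the `%` operator is actually applied. A minimal sketch of that rule, assuming `gettext.NullTranslations` as a stand-in for a catalog that returns the message unchanged:

```python
import gettext

# Stand-in catalog (an assumption for this sketch): it returns every
# message unchanged, like a lookup that finds no translation.
catalog = gettext.NullTranslations()

# gettext never interprets '%' itself, so '%%' survives the lookup.
raw = catalog.gettext("98%% off sale")
print(raw)

# Applying the % operator, even with an empty tuple, collapses '%%'.
collapsed = raw % ()
print(collapsed)

# With a real argument the same collapse happens as a side effect.
formatted = catalog.gettext("98%% off %s sale") % ("holiday",)
print(formatted)
```

Under this reading, a message is either always formatted (and then always uses `%%`) or never formatted (and then uses a plain `%`); mixing the two conventions is what produces the inconsistent output.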
All this means the translators (most of whom are not computer programmers) have to be very careful in how they do the translation and everyone needs to be very careful with translations that include %.
Seems like we are doing this wrong (somehow) and there should be a more straightforward and uniform format for doing this. Right now we are coping by simply injecting a % as a format parameter.
Is there a better way to do this, or is this as good as it gets?
There is a .po file at the bottom
The unit test pretty much says it all: why is the last assertion failing? Is this a bug in Jinja2, or do I need to deal with this differently?
import gettext
from os import path
from unittest import TestCase

from jinja2 import Environment


class Jinja2Tests(TestCase):
    def test_percent_percent(self):
        """i18n (gettext) expresses 98% as 98%%; some versions of Jinja2 have not
        handled this as expected. This test makes sure that it works."""
        env = Environment(extensions=['jinja2.ext.i18n'])
        lang = gettext.translation('messages', path.abspath(path.join(path.dirname(__file__), 'data')))
        env.install_gettext_translations(lang)

        # With a format argument, '%%' collapses to '%' as expected.
        template = env.from_string(source="{{ _('98%% off %(name)s sale') | format(name='holiday') }}")
        result = template.render()
        self.assertEqual('98% off holiday sale(translated)', result)

        # Without the format filter, '%%' is left untouched.
        template = env.from_string(source="{{ _('98%% off sale') }}")
        result = template.render()
        # THE LINE BELOW FAILS WITH:
        # AssertionError: '98% off sale(translated)' != u'98%% off sale(translated)'
        self.assertEqual('98% off sale(translated)', result)
And here is the PO file, which you have to compile to an MO file to run the above code.
# This file is distributed under the same license as the Uniregistrar project.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2016.
#
msgid ""
msgstr ""
"Project-Id-Version: Uniregistrar 1.0\n"
"Report-Msgid-Bugs-To: [email protected]\n"
"POT-Creation-Date: 2016-12-22 15:22-0500\n"
"PO-Revision-Date: 2016-11-14 16:42-0500\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <[email protected]>\n"
"Plural-Forms: nplurals=2; plural=(n != 1)\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.3.3\n"

#: uniregistrar/constants.py:90
msgid "98%% off sale"
msgstr "98%% off sale(translated)"

#: uniregistrar/constants.py:90
msgid "98%% off %(name)s sale"
msgstr "98%% off %(name)s sale(translated)"
Upvotes: 6
Views: 1624
Reputation: 2340
As far as I understand the question, this is your main concern:
All this means the translators (most of whom are not computer programmers) have to be very careful in how they do the translation and everyone needs to be very careful with translations that include %.
tl;dr: This is why msgfmt has an option --check. That option causes msgfmt to check whether a translation is safe to run through the string interpolation facilities of the target language. The mother of all of these problems is C's printf(), which is easy to crash when called with the wrong parameters:
printf("Bonjour, %s!");
The printf() function is a variadic function. The %s causes it to fetch another argument from the stack. In the above example there is no argument other than the format string. That means that the string that gets interpolated into %s can be considered to come from an arbitrary address, for example 0. That would be a null pointer exception in most modern languages. In C it is undefined behavior, for example a null pointer dereference, and bugs of this kind can often be exploited to run arbitrary code. Bad.
Let's assume that the code looked like this:
printf(gettext("Hello, world!"));
That is safe as long as the translation that gettext() comes up with does not contain any "%" characters. But if the French translator translates "Hello, world!" with "Bonjour, %s!" the program will crash.
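The same failure mode can be reproduced in Python, where it surfaces as an exception instead of undefined behavior. The sketch below is only an analogue of the C call: `render` is a hypothetical helper that formats every message with no arguments, and the translation is hard-coded rather than loaded from a catalog.

```python
msgid = "Hello, world!"
msgstr = "Bonjour, %s!"  # hypothetical faulty translation


def render(message):
    # The application formats every message without supplying arguments,
    # which is only safe if the message contains no conversions.
    return message % ()


print(render(msgid))

error = None
try:
    render(msgstr)
except TypeError as exc:
    # The extra %s in the translation has no argument to consume.
    error = str(exc)
    print("translation broke formatting:", error)
```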
Well, it will not crash if the maintainer of the software uses the standard translation workflow. In that case, xgettext (in Python it is probably something like "pybabel extract") would produce this entry in the .po file:
#: filename.c:1
#, c-format
msgid "Hello, world!"
msgstr ""
Read the line "#, c-format" as "this is a printf format string"!
Say the French translator turns this into the following entry:
#: filename.c:1
#, c-format
msgid "Hello, world!"
msgstr "Bonjour, %s!"
If you run this through msgfmt whatever.po it will be accepted. But this is not the recommended workflow. You should run it through msgfmt --check whatever.po. And now you get an error:
messages.po:23: number of format specifications in 'msgid' and 'msgstr' does not match
msgfmt: found 1 fatal error
This is because for every language that GNU gettext supports, a format checker is implemented that checks exactly for that problem. It ensures that the translation will not cause run-time problems.
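To illustrate the idea behind such a checker, here is a deliberately simplified sketch for printf-style strings. It is not how msgfmt is implemented; the regular expression and the helper names `format_specs` and `check_entry` are assumptions made for this example, and it ignores ordering rules that a real checker would enforce.

```python
import re

# Simplified pattern for printf-style conversions, including %(name)s.
# '%%' is matched too, so that it can be recognised and skipped.
SPEC = re.compile(
    r"%(?:\((\w+)\))?[-#0 +]*(?:\*|\d+)?(?:\.(?:\*|\d+))?[hlL]?"
    r"([diouxXeEfFgGcrsa%])"
)


def format_specs(text):
    """Return (name, conversion) pairs for every real conversion in text."""
    specs = []
    for match in SPEC.finditer(text):
        name, conversion = match.group(1) or "", match.group(2)
        if conversion == "%":
            continue  # a literal percent sign consumes no argument
        specs.append((name, conversion))
    return specs


def check_entry(msgid, msgstr):
    """True if msgid and msgstr use the same set of conversions."""
    return sorted(format_specs(msgid)) == sorted(format_specs(msgstr))


print(check_entry("Hello, world!", "Bonjour, %s!"))
print(check_entry("98%% off %(name)s sale", "98%% off %(name)s sale(translated)"))
print(check_entry("98%% off %s sale", "98% off %s sale"))
```

Note that the last pair fails because "% o" in the unescaped translation is itself read as a conversion, which matches the "%o format" error shown in the question.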
You may now argue that a malicious translator would simply remove the "c-format" qualifier from the .po file. But your build system should ensure that translations coming back from external sources are always merged with the current set of messages, usually called YOURPROJECT.pot, and then such modifications of the .po file would simply be discarded.
So, in theory you don't have a point. In practice you may have one, because there are a lot of projects and software out there that use .po files directly for run-time translations. This is a bad idea; see my answer to a similar question.
I don't know to what extent that applies to your problem because you did not mention how you extract the strings into your .pot file and how you compile that into the binary .mo file. The above explanation should make it clear that this is crucial: the extraction step should add automatic comments to the .po file about the string interpolation method used, and the compilation of the .po file into an .mo file should enable format string checks. If you fail to do this, your build system has a flaw.
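As a rough sketch of what such a build step could look like when driven from Python: the helper names `build_commands` and `compile_catalog` and the file names are hypothetical, and the sketch assumes the GNU gettext tools msgmerge and msgfmt are installed and on the PATH.

```python
import subprocess


def build_commands(po_path, pot_path, mo_path):
    """Return the gettext commands for one language (hypothetical helper)."""
    return [
        # Merge the returned translation with the current template first.
        ["msgmerge", "--update", po_path, pot_path],
        # Compile with format string checks enabled.
        ["msgfmt", "--check", "-o", mo_path, po_path],
    ]


def compile_catalog(po_path, pot_path, mo_path):
    """Run both steps; a failed check aborts the build."""
    for command in build_commands(po_path, pot_path, mo_path):
        # check=True raises CalledProcessError when a tool exits non-zero.
        subprocess.run(command, check=True)


# Show the commands without running them (file names are placeholders).
for command in build_commands("fr.po", "YOURPROJECT.pot", "fr.mo"):
    print(" ".join(command))
```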
Upvotes: 0