Reputation: 1690

Internationalization of strings

Currently we store Country, City, and State names, plus a handful of lookup tables, in a SQL Server database. What are the best practices for internationalizing the strings stored in these tables?

One thought was to store them as flat JSON files, with a separate file for each locale's text, and pick the file based on the locale, but that would be a maintenance nightmare.

Upvotes: 2

Answers (1)

Paweł Dyda

Reputation: 18662

Actually, from a maintenance point of view, database localization could quickly become a nightmare. Clearly, you don't want to give translators access to your database (not even indirectly). I'll explain why in a minute.

The most typical pattern is to externalize all the strings to so-called resource files and load them based on the locale - the maintenance nightmare, as you call it.
The typical resource file formats are:

Java .properties files
Gettext .pot/.po files
XML / XLIFF

As you can see, JSON is not necessarily on my short list.
Anyway, if you stick to a standard file format, the translators can use their tools:

MT - machine translation to translate untranslated parts of the file (and then manually fix to match the context)
TM - translation memory to store previous translations and leverage them when required (this is actually what will be used first...)

In the typical process, the changes should be made only to the English resource file (no manual modification to language files). If this is the case, it is easy enough to recreate the language files using the TM tool I mentioned earlier.
Now, what if you need to change the translation (i.e. fix some nasty localization bug)?
Obviously, you want to change it in the translation tool (not the file!), so that changes will be leveraged with each new version of the English file.

Is it still going to be a nightmare? :)

The process I mentioned is the standard one. It follows the 80/20 rule: that kind of process is good for 80% of projects. However, there are 20% of projects that do not fit well into the ideal process - the ones that use so-called dynamic localization.
By dynamic localization, I mean that English strings change very often, and are usually supplied by the users of the system.

If this is the case, DB localization with lookup tables is simply the most straightforward approach. But unfortunately, there is always a catch.
The catch is that it is really hard to implement correctly. And if the users have any means of modifying the database contents using free-form text, your system is at risk. Never mind the typical SQL injection vulnerabilities; those you can prevent. But what if the DB engine itself has a critical zero-day flaw that lets users elevate their privileges and execute arbitrary SQL statements? You'll never know.
Of course, safety is only one concern. The other concerns are:

How do I give my users the ability to translate strings?
How do I track and ensure the completeness of the translation?
How do I ensure the correctness of the translation?
How do I motivate my users to actually provide those translations?
How do I implement the translation engine correctly?

These things are not to be taken lightly.
Facebook gives you the ability to translate the UI into your language. They created a special tool that lets you translate text on the screen (and you can use different forms based on gender and cardinality - i.e. multiple plural forms). And you know what? Even though they have so many contributors, the site is still not 100% translated into (I believe) most of the languages it supports. Is it translated correctly? Well, most of the time, yes.
The biggest problem with crowdsourcing is vandalism. There are people who will intentionally break the translations (or the contents). Wikipedia, anyone? You have to prevent it somehow.

Now on to implementation details. These problems may also be present in the typical localizability scenario, with the resource files. However, they are very common and in many cases the engine resolves them for you (Gettext is the best example). While implementing your own localizability engine, you need to consider these issues:

Language fallback. Let's say your system is translated to German and your default language is English. If the user from Austria (de-AT) comes in, she should be able to see the UI in German (de), not English. This one is simple.
The more problematic one would be Chinese Simplified and Chinese Traditional. If it happens that your system has translations for both of these languages (locales zh-Hans and zh-Hant respectively), you'll need to make sure that proper fallback is in place: zh-CN (China) and zh-SG (Singapore) should fall back to zh-Hans, whereas zh-TW (Taiwan), zh-HK (Hong Kong) and zh-MO (Macao) need to fall back to zh-Hant. In the case of plain zh, it should probably be Chinese Simplified again.
You want to reuse common strings (e.g. OK, Cancel as button captions), but at the same time you don't want to reuse strings across different contexts (you might be tempted to do so, but it will create an i18n bug). The first part is easy: you simply use the same resource key for each repeated string, provided that you have resource keys.
The most common bug I've seen when people try to implement a DB-based localization engine is that they use the English string as the key into the database.
Don't do that.
This won't allow for different translations of the same string in different contexts. Take the Save dialog as an example. In Polish, "Save" as a button caption is an action and should be translated in the imperative mood ("Zapisz"). The same "Save" in the window title informs about what is going to happen and should be translated as "Zapisywanie".
To make it worse, many languages have more than one plural form, so unless you rephrase the English sentence to avoid the problem, you'll have to take that into account. That means more than one (up to six) translation for the same key... It's not that hard: you would simply use a composite primary key (resource_id, locale_id, cardinality), with the cardinality being one of: zero, one, two, few, many, other.
Gender can be a source of problems. You may want to keep your system as gender-neutral as possible, but in some cases translations will actually differ based on gender. If you let your users translate the messages, this is something you will most likely have to handle.
On the other hand, if you want to use the services of professional translation providers, you can't just send out the SQL file for translation. You'll most likely need to create an import/export mechanism that produces one of the standard resource file formats that translators can use to practice their art.
Of course, you can send basically any file format, but that has its consequences. The most obvious is that a non-standard file format (an Excel file, for example) requires manual effort and as such is prone to errors. Since it requires manual effort, translators will charge you a premium and... it will take longer to translate your strings.
OK, you can integrate the DB directly with TM and MT systems (however you should never let text translated by the machine land on your UI before linguistic verification), but it will also be quite an effort.
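
To make the first few issues above more concrete (language fallback, context-specific resource keys, and the composite key with cardinality), here is a minimal Java sketch. It is only an illustration under stated assumptions: the class TranslationStore, its methods, and the resource keys are hypothetical names, a HashMap stands in for the database table, and falling back to the "other" form when a specific plural form is missing is my assumption, not a rule from any particular engine. Choosing the right cardinality for a given number is language-specific and is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical in-memory stand-in for a table keyed by (resource_id, locale_id, cardinality). */
final class TranslationStore {

    /** The six plural categories listed above. */
    enum Cardinality { ZERO, ONE, TWO, FEW, MANY, OTHER }

    // Script-level Chinese locales end the chain; they never fall back to each other.
    private static final Set<String> SCRIPT_ROOTS =
            new HashSet<>(Arrays.asList("zh-Hans", "zh-Hant"));

    // Fallbacks that plain subtag stripping would get wrong.
    private static final Map<String, String> EXPLICIT_PARENT = new HashMap<>();
    static {
        for (String tag : Arrays.asList("zh", "zh-CN", "zh-SG")) EXPLICIT_PARENT.put(tag, "zh-Hans");
        for (String tag : Arrays.asList("zh-TW", "zh-HK", "zh-MO")) EXPLICIT_PARENT.put(tag, "zh-Hant");
    }

    private final Map<List<String>, String> rows = new HashMap<>();
    private final String defaultTag;

    TranslationStore(String defaultTag) {
        this.defaultTag = defaultTag;
    }

    /** Locales to try, most specific first, ending with the default language. */
    static List<String> fallbackChain(String localeTag, String defaultTag) {
        List<String> chain = new ArrayList<>();
        for (String tag = localeTag; tag != null; tag = parent(tag)) {
            chain.add(tag);
        }
        if (!chain.contains(defaultTag)) {
            chain.add(defaultTag);
        }
        return chain;
    }

    private static String parent(String tag) {
        if (SCRIPT_ROOTS.contains(tag)) return null;
        String explicit = EXPLICIT_PARENT.get(tag);
        if (explicit != null) return explicit;
        int dash = tag.lastIndexOf('-');
        return dash > 0 ? tag.substring(0, dash) : null;   // "de-AT" -> "de"
    }

    /** The key is a resource id such as "dialog.save.button", never the English text. */
    void put(String resourceId, String localeTag, Cardinality cardinality, String text) {
        rows.put(Arrays.asList(resourceId, localeTag, cardinality.name()), text);
    }

    /** Walks the fallback chain; within a locale, tries the exact form, then OTHER (an assumption). */
    Optional<String> lookup(String resourceId, String localeTag, Cardinality cardinality) {
        for (String tag : fallbackChain(localeTag, defaultTag)) {
            String exact = rows.get(Arrays.asList(resourceId, tag, cardinality.name()));
            if (exact != null) return Optional.of(exact);
            String other = rows.get(Arrays.asList(resourceId, tag, Cardinality.OTHER.name()));
            if (other != null) return Optional.of(other);
        }
        return Optional.empty();
    }
}
```

With this sketch, the two Polish "Save" translations live under different keys (say, "dialog.save.button" and "dialog.save.title"), and a request for de-AT finds the German row before it ever reaches English.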

Does your project fit 80% of use cases? You have to answer this question yourself.

Edit: How to avoid redeployment when resources change

If resources change more frequently than code does, it is a clear indicator that one should implement dynamic localization (basically DB-driven one).

On the other hand, sometimes we don't want to redeploy the whole application just because resource files have changed. That's perfectly understandable.
There are many ways to handle such a situation. The easiest would probably be to create a microservice that reads the properties files and returns them on demand. It could be used by some part of the application's code to drive localization (e.g. to generate JSON files on demand). Of course, that means additional complexity and the need to redeploy the microservice, but the application's code (war, ear, jar?) would stay intact.
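
As a rough idea of what the core of such a service could do, here is a small Java sketch that turns an already loaded java.util.Properties object into a JSON object. The class name PropertiesToJson is hypothetical, the HTTP layer and the file loading (Properties.load) are left out, and the JSON is written by hand only to keep the example dependency-free; a real service would more likely use a JSON library.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical helper: serializes resource strings as a flat JSON object. */
final class PropertiesToJson {

    static String toJson(Properties props) {
        // Sort the keys so the output is stable between calls.
        Map<String, String> sorted = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            sorted.put(name, props.getProperty(name));
        }
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> entry : sorted.entrySet()) {
            if (!first) {
                json.append(',');
            }
            json.append(quote(entry.getKey())).append(':').append(quote(entry.getValue()));
            first = false;
        }
        return json.append('}').toString();
    }

    /** Wraps a string in double quotes and escapes what JSON requires. */
    private static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become four-digit hex escapes.
                        out.append('\\').append(String.format("u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }
}
```

The service would load one Properties object per locale and call toJson when a client asks for that locale's strings.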

In Java 8 a different approach is possible: it introduced a Service Provider Interface for ResourceBundle.Control (java.util.spi.ResourceBundleControlProvider), so in theory one could simply create a specific JAR file with a custom ResourceBundle.Control implementation that would read the resource files from different places (disk, the jar, a web service, anywhere in fact). This could be used to ensure that only the resource files would need to be redeployed, not the entire application.
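
For illustration, a custom ResourceBundle.Control that reads .properties files from a directory outside the application might look like the sketch below. The class name ExternalDirectoryControl and the directory layout are assumptions of mine; here the control is passed explicitly to ResourceBundle.getBundle rather than registered as a service provider, and UTF-8 encoded files are assumed. Note that on Java 9 and later, getBundle overloads that take a Control are not supported when called from a named module.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

/** Hypothetical Control that loads bundles from a directory on disk instead of the classpath. */
final class ExternalDirectoryControl extends ResourceBundle.Control {

    private final Path directory;

    ExternalDirectoryControl(Path directory) {
        this.directory = directory;
    }

    @Override
    public List<String> getFormats(String baseName) {
        return FORMAT_PROPERTIES;   // only look for .properties bundles
    }

    @Override
    public ResourceBundle newBundle(String baseName, Locale locale, String format,
                                    ClassLoader loader, boolean reload) throws IOException {
        // For base name "messages" and locale de this yields "messages_de.properties".
        String fileName = toResourceName(toBundleName(baseName, locale), "properties");
        Path file = directory.resolve(fileName);
        if (!Files.isRegularFile(file)) {
            return null;            // lets getBundle move on to the next candidate locale
        }
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return new PropertyResourceBundle(reader);
        }
    }

    @Override
    public long getTimeToLive(String baseName, Locale locale) {
        return TTL_DONT_CACHE;      // re-read the files on every getBundle call
    }
}
```

A call such as ResourceBundle.getBundle("messages", Locale.forLanguageTag("de-AT"), new ExternalDirectoryControl(dir)) would then pick up edited files from dir without touching the application archive. Disabling the cache keeps the sketch simple; a real deployment would probably prefer a time-to-live.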

Unfortunately, as always everything depends on context; in certain technologies different approaches would work. And as usual, avoiding one thing means increasing complexity of another one.

Upvotes: 4
