Blog Entry: Thoughts about internationalization (i18n)

It comes as no surprise that Oregami has to face internationalization (i18n) issues sooner or later, since we are a project that wants to become the authoritative and free data pool for the global video gaming community. While English is a good choice for the first language of the project, and German, as our native language, a good choice for the second, a complete documentation of video games requires spreading the work and expertise across more languages. Also, the results of our work need to be as accessible as possible, so we also need to add more languages to the pool when it comes to the UI and help pages of our website. And last, but in no way least, the textual data itself comes in many languages, too.

So, quite some time ago, we already discussed some internationalization (i18n) issues and now want to summarize what we talked about.

The main point we discussed is the difference between regional titles, text translations, and text transliterations, and how to implement them in our data model:

  1. Regional titles mean that a game or platform is released under different names in different regions.
  2. A translation means bringing a text from one language to another.
  3. A transliteration means bringing a text from one script to another without changing the language.

Let's start with an example to illustrate these issues: the game Secret of Mana, which is the US release of the Japanese original Seiken Densetsu 2.

As we can see in the English Wikipedia entry for the game, "Secret of Mana" is not the literal translation of "Seiken Densetsu 2". These are two different regional titles, which leads us to the following scheme for the game's regional naming:

| Release Name (Region) | English Translation | Japanese Translation | Latin Transliteration | Japanese Transliteration |
|---|---|---|---|---|
| Secret of Mana (USA) | Secret of Mana | マナの秘密 | Secret of Mana | シークリット・オブ・マーナー |
| 聖剣伝説2 (Japan) | Legend of the Sacred Sword 2 | 聖剣伝説2 | Seiken Densetsu 2 | 聖剣伝説2 |

As you can see here, we have two regional titles for the game. Both can be translated into every language imaginable, and both can be transliterated into each of the eight scripts that are, in our humble opinion, important for a video game database:

  1. Latin
  2. Arabic
  3. Cyrillic
  4. Japanese
  5. Chinese
  6. Korean
  7. Greek
  8. Hebrew

The first thing to note is that every text, whether it is a person's name, game title, game description, screenshot caption, or whatever, is written in a certain script and a certain language. The separation between script and language is very important, so let's take another look at the two titles of Secret of Mana to make this distinction clear:

| String | Script | Language |
|---|---|---|
| 聖剣伝説2 | Japanese | Japanese |
| Legend of the Sacred Sword 2 | Latin | English |
| Seiken Densetsu 2 | Latin | Japanese |
| Secret of Mana | Latin | English |
| マナの秘密 | Japanese | Japanese |
| シークリット・オブ・マーナー | Japanese | English |

For "normal" texts like a game description, a transliteration probably won't be needed. Conversely, one might think that personal or geographical names only need a transliteration (and thus only the script attribute), but that's not true. The "translation" to another language is important here, too. Let's take a look at two more examples:

| String | Script | Language |
|---|---|---|
| Михаил Сергеевич Горбачёв | Cyrillic | Russian |
| Mikhail Sergeyevich Gorbachev | Latin | English |
| Michail Sergeevič Gorbačëv | Latin | Russian |
| Michail Sergejewitsch Gorbatschow | Latin | German |
| 東京 | Japanese | Japanese |
| Tokyo | Latin | English |
| Tōkyō | Latin | Japanese |
| Tokio | Latin | German |

When we talk about translation here, we don't mean transferring the meaning of the name to the other language - Tokyo would be "Eastern Capital" in English, then - but using the official spelling of that other language.

So, technically, every text object of our database could be defined as a meta object consisting of n strings, each with the two attributes (script, language) assigned to it.
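To make this idea a bit more tangible, here is a minimal Python sketch of such a meta text object. The class names `Script`, `LocalizedString`, and `TextObject` are made up for illustration and are not part of any actual Oregami data model; languages are stored as plain names exactly as in the tables above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Script(Enum):
    """The eight scripts listed earlier in this entry."""
    LATIN = "Latin"
    ARABIC = "Arabic"
    CYRILLIC = "Cyrillic"
    JAPANESE = "Japanese"
    CHINESE = "Chinese"
    KOREAN = "Korean"
    GREEK = "Greek"
    HEBREW = "Hebrew"


@dataclass(frozen=True)
class LocalizedString:
    """One string together with its (script, language) attributes."""
    value: str
    script: Script
    language: str  # plain language name, e.g. "Japanese" (an assumption)


@dataclass
class TextObject:
    """Hypothetical meta text object: a bag of n localized strings."""
    strings: list[LocalizedString] = field(default_factory=list)

    def add(self, value: str, script: Script, language: str) -> None:
        self.strings.append(LocalizedString(value, script, language))

    def find(self, script: Script, language: str) -> list[LocalizedString]:
        """Return all strings that carry the given (script, language) pair."""
        return [
            s for s in self.strings
            if s.script == script and s.language == language
        ]


# The Tokyo example from the table above.
tokyo = TextObject()
tokyo.add("東京", Script.JAPANESE, "Japanese")
tokyo.add("Tokyo", Script.LATIN, "English")
tokyo.add("Tōkyō", Script.LATIN, "Japanese")
tokyo.add("Tokio", Script.LATIN, "German")

german_spelling = [s.value for s in tokyo.find(Script.LATIN, "German")]
```

Note that the object itself does not know which of its strings is the "right" one; it only stores them with their two attributes.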

The next problem is that we will have to pick one (or more) of these strings for display every time our meta text object is used. But some strings are needed for our video game documentation, some are only informal, and some are necessary to make a game more accessible within the database. So which one to pick, and how? Let's revisit the Secret of Mana example with this in mind:

| String | Script | Language |
|---|---|---|
| 聖剣伝説2 | Japanese | Japanese |
| Secret of Mana | Latin | English |

These are official release titles of the game for a certain region, so they need to be assigned to all releases using them.

| String | Script | Language |
|---|---|---|
| Legend of the Sacred Sword 2 | Latin | English |
| マナの秘密 | Japanese | Japanese |

These are only informal translations, which could be shown to users when hovering over the other language's title.

| String | Script | Language |
|---|---|---|
| Seiken Densetsu 2 | Latin | Japanese |
| シークリット・オブ・マーナー | Japanese | English |

These are transliterations of the official release titles, and therefore more important than the informal translations above. For example, Seiken Densetsu 2 is needed for users of the Latin script searching for Japanese games, or for game lists in Latin script.
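The three groups above follow directly from the definitions at the start of this entry: a transliteration keeps the language but changes the script, while a translation changes the language. As a rough sketch (the names `Entry`, `Kind`, and `classify` are hypothetical, and the rule is only our reading of those definitions), the grouping could be derived relative to an official release title like this:

```python
from enum import Enum
from typing import NamedTuple


class Kind(Enum):
    OFFICIAL = "official release title"
    TRANSLATION = "informal translation"
    TRANSLITERATION = "transliteration"


class Entry(NamedTuple):
    value: str
    script: str
    language: str


def classify(official: Entry, other: Entry) -> Kind:
    """Classify `other` relative to the official release title."""
    if other == official:
        return Kind.OFFICIAL
    if other.language != official.language:
        # The language changed, so this is a translation.
        return Kind.TRANSLATION
    if other.script != official.script:
        # Same language, different script: a transliteration.
        return Kind.TRANSLITERATION
    # Same script and language but a different string is not covered here.
    raise ValueError(f"cannot classify {other.value!r}")


japanese_title = Entry("聖剣伝説2", "Japanese", "Japanese")
us_title = Entry("Secret of Mana", "Latin", "English")

kind_romaji = classify(
    japanese_title, Entry("Seiken Densetsu 2", "Latin", "Japanese")
)
kind_english = classify(
    japanese_title, Entry("Legend of the Sacred Sword 2", "Latin", "English")
)
kind_katakana = classify(
    us_title, Entry("シークリット・オブ・マーナー", "Japanese", "English")
)
```

With the data from the tables, `kind_romaji` and `kind_katakana` come out as transliterations and `kind_english` as an informal translation, matching the grouping above.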

So, having said that, should we label some strings in the text object as "leading" or "important" to begin with, so those will show up when no other context is specified? Or shouldn't we do this, leaving the meta text object unaware of its content's importance, thus having to provide context every time we use the text object?

Not really sure, but the gut feeling is that labeling one string as "leading" is too inflexible to solve future problems. So, for now, let's assume the solution is to always use a text object in context, and thus manually pick the right string from its contents based on that context.

For example, if we connect the above text object

聖剣伝説2 (Japanese, Japanese)
Legend of the Sacred Sword 2 (Latin, English)
Seiken Densetsu 2 (Latin, Japanese)

as the Japanese release title to the respective game, we need to specify that "聖剣伝説2" is the actual string used for this release. On the other hand, if a user of the Latin script requests a list of all SNES games released in Japan, we would need to pick "Seiken Densetsu 2" as the string to show in this list.
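A small Python sketch of this context-based picking might look as follows. The function `pick` and the idea of expressing the context as a (script, language) pair are assumptions made for illustration only; a real implementation would probably need fallbacks when no matching string exists.

```python
from typing import NamedTuple, Optional, Sequence


class Entry(NamedTuple):
    value: str
    script: str
    language: str


def pick(
    entries: Sequence[Entry], script: str, language: Optional[str] = None
) -> Optional[str]:
    """Return the first string matching the requested context.

    The context is the wanted script and, optionally, the wanted language.
    Returns None when the text object has no matching string.
    """
    for entry in entries:
        if entry.script == script and (
            language is None or entry.language == language
        ):
            return entry.value
    return None


# The text object from above.
title = [
    Entry("聖剣伝説2", "Japanese", "Japanese"),
    Entry("Legend of the Sacred Sword 2", "Latin", "English"),
    Entry("Seiken Densetsu 2", "Latin", "Japanese"),
]

# Context 1: the string actually used for the Japanese release.
release_title = pick(title, script="Japanese", language="Japanese")

# Context 2: a list of Japanese releases for a user of the Latin script,
# so we keep the release's language but switch the script.
list_title = pick(title, script="Latin", language="Japanese")
```

The text object stays unaware of which string is "leading"; the caller supplies the context each time, which is exactly the trade-off described above.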

So much for some basics of this complex issue, but there's one important question about i18n we haven't touched yet: how to handle the different language versions of Oregami? While it may be rather easy to translate the (static) UI and help pages to another language, there's also the textual content (descriptions, screenshot captions, etc.), i.e. the data. We will only start a new data language once this language's Oregami community has grown to a critical mass of native contributors / approvers. But which way to go after we have started more languages besides English? There are two basic ways:

1) The Wikipedia way: every language grows alone, more or less based on common standards. The quality of the texts may differ severely from one language to another, nonetheless.

2) English is the central language, so every other language's text is translated from and to it, and common standards apply strictly. The quality level is comparable in every language.

Personally, I would prefer the second way, but this way also seems more difficult to go down. Time will tell.