
OpenKore internationalization support



This document explains how to handle internationalization (different character sets and text encodings) correctly in OpenKore. See also "Perl type definitions" and "Unicode processing issues in Perl and how to cope with it".

If you're not familiar with Unicode/UTF-8, read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".



The String type

In OpenKore, all variables that contain human-readable text (in other words, all character strings) must be of the String type. These strings are internally encoded as UTF-8.

The most important distinction is that strings are a sequence of characters, not a sequence of bytes.
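The difference between characters and bytes can be seen in a minimal sketch like the one below. It uses Perl's core Encode module rather than any OpenKore-specific helper, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Four Chinese characters, written as escapes so that the encoding of
# this source file does not matter.
my $string = "\x{6B63}\x{9AD4}\x{4E2D}\x{6587}";

# The same text encoded as UTF-8 is a byte string.
my $bytes = encode("UTF-8", $string);

my $char_count = length($string);   # counts characters
my $byte_count = length($bytes);    # counts bytes

print "characters: $char_count, bytes: $byte_count\n";
```

Each of these characters takes three bytes in UTF-8, so the two counts differ even though both variables hold "the same text".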


Converting Bytes to String

Because all human-readable text variables in OpenKore must be of the String type, you must convert data received from external sources (such as messages sent by the RO server) to Strings. To do that, you need to know which encoding the incoming bytes use.
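As a rough illustration of such a conversion, the sketch below decodes a byte sequence with the core Encode module. The choice of CP1251 and the sample bytes are assumptions made for the example; OpenKore's own conversion helpers are not shown here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they might arrive from the network: the Russian word
# for "hello", encoded in CP1251 (assumed encoding for this example).
my $received_bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";

# Decode the bytes into a character string (a String).
my $decoded_string = decode("cp1251", $received_bytes);

# $decoded_string now holds characters and can be used as text.
print "decoded ", length($decoded_string), " characters\n";
```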


Converting String to Bytes

Conversely, before you pass a String's data to an external entity, you must convert the String to bytes in the encoding that entity expects (for example UTF-8, or the server's own encoding).
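The reverse direction might look like the following sketch, again using the core Encode module. The target encodings here are chosen for illustration; the correct one depends on what the receiving side expects.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A String holding six Cyrillic characters.
my $outgoing_string = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";

# Encode for a receiver that expects CP1251: one byte per character.
my $cp1251_bytes = encode("cp1251", $outgoing_string);

# Encode for a receiver that expects UTF-8: two bytes per character here.
my $utf8_bytes = encode("UTF-8", $outgoing_string);

print "CP1251: ", length($cp1251_bytes), " bytes, ",
      "UTF-8: ", length($utf8_bytes), " bytes\n";
```

The same String produces different byte sequences depending on the target encoding, which is why the conversion has to happen at the boundary and not earlier.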


Data files

Data files, such as config.txt and items.txt, must be encoded in UTF-8, so users are required to write everything in UTF-8. This is a practical requirement: Notepad supports UTF-8, as do virtually all desktop text editors on Linux, and Linux has the iconv utility, which can convert text from other encodings to UTF-8.
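Where iconv is not available, a similar conversion can be sketched in Perl with PerlIO encoding layers. The helper name convert_to_utf8 and the file names in the usage note are hypothetical, not part of OpenKore.

```perl
use strict;
use warnings;

# Hypothetical helper: re-encode a text file as UTF-8.
# $source_encoding is any encoding name known to Perl's Encode module.
sub convert_to_utf8 {
    my ($source_file, $target_file, $source_encoding) = @_;

    # Read through a decoding layer, so each line is a character string.
    open(my $in, "<:encoding($source_encoding)", $source_file)
        or die "Cannot read $source_file: $!";
    # Write through an encoding layer, so characters become UTF-8 bytes.
    open(my $out, ">:encoding(UTF-8)", $target_file)
        or die "Cannot write $target_file: $!";

    while (my $line = <$in>) {
        print $out $line;
    }

    close($in);
    close($out) or die "Cannot finish writing $target_file: $!";
}
```

For example, convert_to_utf8("items-big5.txt", "items.txt", "big5") would rewrite a Big5-encoded file as UTF-8, assuming the input really is Big5.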

To read from data files, you must open the file in UTF-8 mode, by passing ':utf8' to open(). This is demonstrated by the following example:

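use encoding 'utf8';

open(my $fh, "<:utf8", "data.txt") or die "Cannot open data.txt: $!";
while (my $line = <$fh>) {
    # $line is a String (character string); process it here
}
close($fh);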

Writing strings to data files is similar. Pass ':utf8' to open(). For example:

use encoding 'utf8';

my $string = getStringToWrite(); # Type: String
open(my $fh, ">:utf8", "data.txt") or die "Cannot open data.txt: $!";
print $fh $string;
close($fh);


Setting the utf8 encoding pragma

In every Perl module that handles strings, you should set the utf8 encoding pragma:

use encoding 'utf8';

Otherwise, you may get annoying warnings like 'Wide character in print at xxxxx line yy.', and some strings may become garbled.
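To show where the 'Wide character' warning comes from, the sketch below prints the same String to two in-memory filehandles, one without and one with an encoding layer. It deliberately avoids the encoding pragma and sets the layer explicitly instead, which is an assumption of this example and not OpenKore's convention.

```perl
use strict;
use warnings;

my $wide_string = "\x{6B63}\x{9AD4}";   # two characters above U+00FF

# Collect warnings instead of printing them, so they can be inspected.
my @seen_warnings;
local $SIG{__WARN__} = sub { push @seen_warnings, $_[0] };

# Handle without an encoding layer: Perl does not know how to turn
# the characters into bytes and warns about it.
my $plain_buffer = "";
open(my $plain, ">", \$plain_buffer) or die "Cannot open buffer: $!";
print $plain $wide_string;
close($plain);

# Handle with an explicit UTF-8 layer: the conversion is well defined.
my $utf8_buffer = "";
open(my $layered, ">:encoding(UTF-8)", \$utf8_buffer) or die "Cannot open buffer: $!";
print $layered $wide_string;
close($layered);

print "warnings seen: ", scalar(@seen_warnings), "\n";
```

Only the print to the handle without a layer should trigger the warning.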


Predefined encoding name aliases

The I18N module defines a few encoding name aliases which are easier to remember.

Alias name            Maps to
Western               ISO-8859-1
Tagalog               ISO-8859-1
Simplified Chinese    GBK
Traditional Chinese   Big5
Korean                CP949
Russian               CP1251
Cyrillic              CP1251
Japanese              Shift_JIS
Thai                  ISO-8859-11

Chinese support note: Traditional Chinese is usually encoded as Big5, while Simplified Chinese is usually encoded as CP936 (GBK). Since 2000, all software products sold in mainland China must support the GB18030 standard. GB18030 is backward compatible with GBK, but supporting it requires a large third-party Perl module, so the alias maps to GBK instead.
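One way such aliases could be resolved is sketched below. The hash and the resolve_encoding helper are hypothetical and only mirror the table in this document; they are not the I18N module's actual implementation. The lookup itself relies on the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode);

# Hypothetical lookup table mirroring the aliases listed in this document.
my %ENCODING_ALIASES = (
    "Western"             => "ISO-8859-1",
    "Tagalog"             => "ISO-8859-1",
    "Simplified Chinese"  => "GBK",
    "Traditional Chinese" => "Big5",
    "Korean"              => "CP949",
    "Russian"             => "CP1251",
    "Cyrillic"            => "CP1251",
    "Japanese"            => "Shift_JIS",
    "Thai"                => "ISO-8859-11",
);

# Hypothetical helper: turn an alias (or a plain encoding name) into the
# canonical name that Encode uses, or die if Encode does not know it.
sub resolve_encoding {
    my ($name) = @_;
    my $real_name = exists $ENCODING_ALIASES{$name}
        ? $ENCODING_ALIASES{$name}
        : $name;
    my $encoding = find_encoding($real_name)
        or die "Unknown encoding: $name";
    return $encoding->name;
}

# Decode CP1251 bytes by referring to the encoding as "Russian".
my $alias_bytes  = "\xCF\xF0\xE8\xE2\xE5\xF2";
my $alias_string = decode(resolve_encoding("Russian"), $alias_bytes);
```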


Handling binary data

Sometimes you don't want to treat a variable as a string, but as binary data (for example, when you're dealing with raw socket data). Use the following pragmas to force Perl to do that:

use bytes;
no encoding 'utf8';

The first pragma forces Perl to treat all strings as byte strings instead of character strings. The second pragma ensures that any strings you create within the current lexical scope will not be marked as UTF-8 strings.
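The effect of the bytes pragma can be seen in a small sketch. The packet layout used here (a 16-bit little-endian ID followed by four payload bytes) is made up for the example and does not describe any real RO packet. The sketch does not load the encoding pragma, so it has no need to switch it off.

```perl
use strict;
use warnings;

my $text = "\x{6B63}\x{9AD4}";   # a String holding two characters

# Outside the pragma, length() counts characters.
my $length_in_chars = length($text);

# Inside the block, "use bytes" makes length() count the bytes of
# Perl's internal representation instead.
my $length_in_bytes;
{
    use bytes;
    $length_in_bytes = length($text);
}

# Raw data built with pack() is a byte string from the start.
# Hypothetical layout: 16-bit little-endian ID, then 4 payload bytes.
my $packet = pack("v a4", 0x0065, "abcd");
my ($packet_id, $payload) = unpack("v a4", $packet);

print "chars: $length_in_chars, bytes: $length_in_bytes, ",
      "packet ID: $packet_id, payload: $payload\n";
```

Because the pragma is lexically scoped, it can be confined to the block that handles raw data, leaving the rest of the module working with character strings.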