UTF-8 support in v2019

Happy holidays!

With the recent changes (https://github.com/fluffos/fluffos/pull/544 and https://github.com/fluffos/fluffos/pull/550) merged in, FluffOS is on track to have full UTF-8 support.

What this means:

  1. FluffOS depends on the ICU library, the most widely used and robust framework for handling Unicode data.
  2. The LPC compiler only accepts source code encoded in UTF-8. Anything else produces compiler errors about an invalid UTF-8 string.
  3. LPC strings are stored internally as UTF-8. Unlike older languages such as Java and JavaScript, FluffOS doesn't use UTF-16; we are more like Rust.
  4. FluffOS (unlike ldmud) fully supports extended grapheme clusters (EGCs): strlen() returns the number of EGCs in the string, and substring operations like str[0..1] slice at EGC boundaries, not codepoint boundaries. This means full support for multi-codepoint emojis!
  5. For maximum backward compatibility, string index operations like str[0] still work for single-codepoint EGCs, both as rvalue and lvalue. That means you can still write str[0] = 'a' and it will do the right thing (TM).
  6. sprintf has also been fixed to take character width into account, so padding and justification work as expected, treating wide characters as 2 columns rather than 1.

There are still some rough edges to be worked out. If you see a case not covered in https://github.com/fluffos/fluffos/blob/master/testsuite/single/tests/compiler/utf8.c, feel free to chime in!

Cheers and happy holidays.

Also, there is full support for transparent input/output transcoding. The default input/output encoding is no translation at all, which means UTF-8.

set_encoding

Sets the input/output encoding for the current user.

query_encoding

Queries the current encoding of the user.

And now we have

Common UTF-8 questions

In v2019, the driver is fully UTF-8 native, meaning that every "string" value is valid UTF-8.

LPC Source code

If you see Invalid UTF-8 string .... in your LPC errors, you should save your LPC code in UTF-8 format without BOM (which is the default).

Please look at https://github.com/fluffos/gbk2utf8 for a quick conversion tool.

If your lib code is in another encoding such as gb18030, you can use iconv to convert it:

iconv -f gb18030 -t utf-8 test.c > test_utf8.c

Unfortunately iconv doesn't support in-place conversion, so you will need some sort of script to do this in batch.

Make sure to use gb18030 instead of gbk or gb2312. gb18030 is a superset of both, so with the narrower encodings some characters may not be converted.
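Since iconv cannot convert in place, a small script can do the batch step. The following is a minimal sketch in Python, not an official FluffOS tool: the function names and the .c/.h suffix filter are assumptions made for illustration. It skips files that already decode as UTF-8 (which includes pure ASCII), so it is safe to run twice. It does not strip a BOM.

```python
import os
import shutil
import tempfile


def convert_file_in_place(path, src_encoding="gb18030"):
    """Rewrite one file as UTF-8. Returns True if the file was changed."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8 (this includes pure ASCII)
    except UnicodeDecodeError:
        pass
    text = raw.decode(src_encoding)  # raises if the bytes are not valid gb18030
    # Write to a temporary file first, then swap it in, so a failure
    # cannot leave a half-written source file behind.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(text.encode("utf-8"))
    shutil.copymode(path, tmp_path)  # keep the original permission bits
    os.replace(tmp_path, path)
    return True


def convert_tree(root, suffixes=(".c", ".h"), src_encoding="gb18030"):
    """Convert every matching file under root; returns the changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffixes):
                path = os.path.join(dirpath, name)
                if convert_file_in_place(path, src_encoding):
                    changed.append(path)
    return changed
```

Call convert_tree("/path/to/mudlib") on a backup copy first. A file that is neither valid UTF-8 nor valid gb18030 raises UnicodeDecodeError and is left untouched.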

String indexing

You can still use str[i] to get a character from a string, and you can still assign through it to set a single-codepoint character (1-4 bytes in UTF-8). This keeps full compatibility with old libs. The value returned by str[i] is a 32-bit int.

However, if your string has multi-codepoint characters (like most emoji), str[i] cannot give you such a character, since it consists of multiple codepoints. Doing this gives you a runtime error: string index doesn't work with multi codepoint characters. Use str[i..i] instead; the returned value is a string, which will contain the full character.
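To see why a single int cannot hold every character, it helps to look at codepoints and bytes directly. This is an illustration in Python, not LPC. Python indexes strings by codepoint rather than by EGC, so it shows the raw pieces that FluffOS groups together; the helper name utf8_profile is made up for this example.

```python
def utf8_profile(text):
    """Return (code point, UTF-8 byte length) for each code point in text."""
    return [(f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in text]


# One code point each, taking 1, 2, 3 and 4 bytes in UTF-8.
print(utf8_profile("a\u00e9\u4f60\U0001F600"))
# [('U+0061', 1), ('U+00E9', 2), ('U+4F60', 3), ('U+1F600', 4)]

# A family emoji: one visible character, but five code points joined by
# ZERO WIDTH JOINER (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family), len(family.encode("utf-8")))
# 5 18
```

Each of the first four characters fits in one 32-bit int, which is the case str[i] handles. The family emoji needs five of them, which is why only the string-valued str[i..i] can return it whole.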

strlen() and strwidth()

strlen() returns how many characters are in the string, regardless of how many bytes they actually occupy. And it supports emojis!

strwidth() returns how many columns the string will occupy on a terminal. By default it also ignores ANSI color codes, so it correctly calculates the width of the visible characters.
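The difference between character count and column width can be sketched in Python. This is only an approximation for illustration and not the driver's algorithm (the driver relies on ICU). The name display_width and the regular expression are assumptions, and only ANSI color/style sequences ending in "m" are stripped.

```python
import re
import unicodedata

# Matches ANSI SGR (color/style) sequences such as "\x1b[31m" or "\x1b[0m".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def display_width(text):
    """Approximate the number of terminal columns text occupies."""
    width = 0
    for ch in ANSI_SGR.sub("", text):
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # combining marks and format characters take no column
        # East Asian Wide and Fullwidth characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


red_hello = "\x1b[31m\u4f60\u597d\x1b[0m"  # two CJK characters wrapped in red
print(display_width(red_hello))  # 4
```

The two CJK characters count as 2 characters but 4 columns, and the color codes around them contribute nothing.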

printf/sprintf with ANSI codes

printf and sprintf by default ignore ANSI codes in the string when calculating layout, so they work as you would expect.

printf/sprintf "%c"

You can still use printf("%c", X) to print a character. However, it only supports a single codepoint that is valid in UTF-8.
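As a rough guide to which integers qualify, the Python sketch below treats "valid" as any Unicode scalar value: 0 through 0x10FFFF, excluding the surrogate range. That reading is an assumption, since the exact check the driver performs is not stated here, and encode_code_point is a hypothetical helper.

```python
def encode_code_point(cp):
    """Return the UTF-8 bytes for one code point, or None if it has none."""
    if not 0 <= cp <= 0x10FFFF:
        return None  # outside the Unicode range
    if 0xD800 <= cp <= 0xDFFF:
        return None  # surrogates cannot be encoded in UTF-8
    return chr(cp).encode("utf-8")


print(encode_code_point(0x4F60))    # b'\xe4\xbd\xa0'
print(encode_code_point(0xD800))    # None
print(encode_code_point(0x110000))  # None
```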