UTF-8 support in v2019

thefallentree · December 23, 2019, 11:20pm

Happy holidays!

With recent changes (https://github.com/fluffos/fluffos/pull/544 and https://github.com/fluffos/fluffos/pull/550) merged in, FluffOS is on track to have full UTF-8 support.

What does this mean:

FluffOS now depends on the ICU library, the most widely used and robust framework for handling Unicode data.
The LPC compiler only accepts source code encoded in UTF-8. Anything else produces compiler errors about an invalid UTF-8 string.
LPC strings are stored internally as UTF-8. Unlike older languages such as Java and JavaScript, FluffOS does not use UTF-16; in this respect it is closer to Rust.
FluffOS (unlike LDMud) fully supports extended grapheme clusters (EGCs): strlen() returns the number of EGCs in the string, and substring operations such as str[0..1] slice at EGC boundaries, not codepoint boundaries. This means full support for multi-codepoint emoji!
For maximum backward compatibility, string indexing such as str[0] still works when the EGC is a single codepoint, both as an rvalue and as an lvalue. That means you can still write str[0] = 'a' and it will do the right thing (TM).
There are also fixes to sprintf so that it takes character width into account: padding and justification work as expected, with wide characters counted as 2 columns rather than 1.

There are still some rough edges to be worked out. If you see a case not covered in https://github.com/fluffos/fluffos/blob/master/testsuite/single/tests/compiler/utf8.c feel free to chime in!

Cheers and happy holidays.

thefallentree · December 30, 2019, 4:41pm

Also, there is full support for transparent input/output transcoding. The default input/output encoding is no translation at all, which means UTF-8.

set_encoding

Sets the input/output encoding for the current user.

query_encoding

Queries the current encoding of the user.

thefallentree · January 14, 2020, 6:12pm

And now we have

thefallentree · April 19, 2020, 9:18pm

Common UTF-8 questions

In v2019 the driver is fully UTF-8 native, meaning that every “string” value is valid UTF-8.

LPC Source code

If you see Invalid UTF-8 string .... in your LPC errors, save your LPC code in UTF-8 format without a BOM (which is the default).

Please look at https://github.com/fluffos/gbk2utf8 for a quick conversion tool.

If your lib code is in another encoding such as gb18030, you can use iconv to convert it:

iconv -f gb18030 -t utf-8 test.c > test_utf8.c

Unfortunately iconv doesn’t support in-place conversion, so you will need some sort of script to do this in batch.

Make sure to use gb18030 instead of gbk or gb2312, otherwise some characters may not be converted.

String indexing

You can still use str[i] to get a character from a string, and you can still assign to it to insert a single-codepoint character (1-4 bytes in UTF-8). This keeps full compatibility with old libs. The value returned by str[i] is a 32-bit int.

However, if your string contains multi-codepoint characters (like most emoji), str[i] cannot return the character, because it needs more than one codepoint. Trying it gives you a runtime error: string index doesn't work with multi codepoint characters. Use str[i..i] instead; the returned value is a string that contains the full character.
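A minimal LPC sketch of the difference, assuming the behaviour described above. demo_indexing is a hypothetical function name, and the flag emoji is just one example of a character built from two codepoints:

```lpc
// Hypothetical demo function; the name is not part of FluffOS.
void demo_indexing() {
    string s = "你好";
    string flag = "🇯🇵";   // one character made of two codepoints
    string piece;
    mixed err;
    int c;

    c = s[0];          // single-codepoint character: an int holding the codepoint
    s[0] = 'a';        // assignment still works for single-codepoint characters
    piece = s[1..1];   // a one-character string, here "好"

    // Per the description above, indexing a multi-codepoint character raises
    // a runtime error; catch() hands back the error instead of aborting.
    err = catch(c = flag[0]);

    // Range indexing returns the whole character as a string.
    piece = flag[0..0];
}
```

Code that may receive arbitrary user input is safer with the str[i..i] form throughout, since it works for both kinds of character.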

strlen() and strwidth()

strlen() returns the number of characters in the string, regardless of how many bytes they actually occupy. It supports emoji too!

strwidth() returns the number of columns the string will occupy on a terminal. By default it ignores ANSI color codes, so it correctly calculates the width of the visible characters.

printf/sprintf with ANSI codes

By default, printf and sprintf ignore ANSI codes in the string when calculating layout, so padding works as you would expect.
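A small sketch of how character count, column width, and sprintf padding relate, assuming wide CJK characters count as 2 columns as stated in the first post. demo_width is a hypothetical name, and the exact numbers printed are left for you to check on your own driver:

```lpc
// Hypothetical demo function; the name is not part of FluffOS.
void demo_width() {
    string plain = "你好";
    // Build the ANSI color codes with %c (27 is ESC) instead of relying on
    // a particular escape syntax in string literals.
    string red = sprintf("%c[31m", 27);
    string reset = sprintf("%c[0m", 27);
    string colored = red + plain + reset;

    // Two characters, each expected to take 2 terminal columns.
    write(sprintf("strlen=%d strwidth=%d\n", strlen(plain), strwidth(plain)));

    // The color codes should not add any columns to strwidth().
    write(sprintf("strlen=%d strwidth=%d\n", strlen(colored), strwidth(colored)));

    // Both rows are padded to 10 columns, so the closing bars should line up.
    write(sprintf("|%-10s|\n", plain));
    write(sprintf("|%-10s|\n", colored));
}
```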

printf/sprintf “%c”

You can still use printf("%c", X) to print a character, but X must be a valid single Unicode codepoint.
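For illustration, a short sketch that passes codepoints as ints. demo_char is a hypothetical name, and 0x4F60 is the codepoint of 你:

```lpc
// Hypothetical demo function; the name is not part of FluffOS.
void demo_char() {
    printf("%c\n", 'A');      // an ASCII character literal is already an int
    printf("%c\n", 0x4F60);   // a codepoint outside ASCII, given as a hex int
}
```

A multi-codepoint character has no single int value, so it has to be printed with "%s" from a string instead.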

hhsiao · July 27, 2021, 4:47pm

Hi @thefallentree,

I am trying to use v2019 with NT7 to output to GBK/BIG5/UTF8 simultaneously. However, I can’t find the location to add the BIG5 conversion. Can you give me some pointers?

I tried to use read_buffer(string_encode(utf_string, "big5")) but that clearly doesn’t do what I think it does.

thefallentree · July 27, 2021, 5:00pm

If you want to change the input/output encoding for a given user (network connection), just call set_encoding("big5") in your login object, somewhere in your login logic. Note: this call affects the current network connection, so it has to be made by the object itself, not by some other object.
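A rough sketch of what that could look like, assuming it lives inside the login object itself. choose_encoding is a hypothetical helper name, and how the encoding name gets chosen (a menu, a login prompt) is up to your lib:

```lpc
// Hypothetical helper, defined in the login object so that the call comes
// from the object that owns the network connection.
void choose_encoding(string name) {
    // e.g. name == "big5" for a user on a BIG5 client
    set_encoding(name);

    // %O prints whatever query_encoding() returns, without assuming its type.
    write(sprintf("Encoding is now %O\n", query_encoding()));
}
```

Since the setting is per connection, each user can pick a different encoding while the lib itself keeps working in UTF-8.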

sfly · November 16, 2021, 8:51am

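Is there any chance of supporting Chinese in the code itself, for example
Or which part of the source would need to be modified to make it support Chinese? I hope you can point me in the right direction.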

sdong · May 24, 2023, 6:07pm

Is there any way to make it backward compatible with old LPC code in encodings such as GBK?