Text 36359, 192 lines
Written 2016-09-06 13:13:05 by Michiel van der Vlist (2:280/5555)
Subject: UTF-8 part 2
====================
A PLEA FOR UTF-8 IN FIDONET Part 2
By Michiel van der Vlist, 2:280/5555
Last week I discussed the various ways to have a computer deal with more than
just the ASCII character set. Now let us look at some of the technical details
regarding what the WWW has adopted as the preferred encoding scheme and what
I think FidoNet should evolve into as well: UTF-8.
As mentioned last week, Unicode is based on The Universal Character Set.
Characters in the Universal Character Set are identified by their Code Point.
This is a simple scalar unsigned integer value. The usual way of representing
it is in the form U+wxyz, where wxyz is a four, five or six digit hexadecimal
value. So the code point for the Roman letter 'A' is U+0041. The code point for
the Cyrillic capital 'A' is U+0410, and the code point for the Euro sign is
U+20AC.
Table 1 tells us how to transform the value of the code point into a variable
length byte stream of one to four bytes. In theory the transformation covers up
to six bytes, encoding a 31 bit integer, but for now it stops at four bytes,
coding for a 21 bit integer.
Table 1. UTF-8 bit distribution
----------------------------------------------------------------------
Scalar value               | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
----------------------------------------------------------------------
00000000 0xxxxxxx          | 0xxxxxxx |          |          |
----------------------------------------------------------------------
00000yyy yyxxxxxx          | 110yyyyy | 10xxxxxx |          |
----------------------------------------------------------------------
zzzzyyyy yyxxxxxx          | 1110zzzz | 10yyyyyy | 10xxxxxx |
----------------------------------------------------------------------
000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx
----------------------------------------------------------------------
As you can see, the transformation algorithm is relatively straightforward.
Code points U+0000 through U+007F are represented by a single byte. The first
128 characters of the Universal Character Set coincide with the ASCII set and
the coding is the same.
Code points U+0080 to U+07FF are coded into two bytes, code points U+0800 to
U+FFFF into three bytes, and everything above that into four bytes.
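As a rough illustration of Table 1, the transformation could be sketched in C
along the following lines. The function name Utf8Encode and its buffer
convention are invented for this example and are not part of any FidoNet
software. It follows the table only and does no further validation.

```c
/* Hypothetical helper: encode code point cp into buf (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is above U+10FFFF. */
int Utf8Encode(unsigned long cp, unsigned char *buf)
{
    if (cp <= 0x7FUL) {                 /* 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FFUL) {                /* 110yyyyy 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFFUL) {               /* 1110zzzz 10yyyyyy 10xxxxxx */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {             /* 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* outside the 21 bit range in use */
}
```

With this sketch U+0041 comes out as the single byte 41, U+0410 as D0 90 and
U+20AC as E2 82 AC.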
For the multibyte sequences the number of consecutive '1's in the most
significant bits of the first byte tells us the length of the sequence. The
first byte always has the two most significant bits set to '11'. Subsequent
bytes have the two most significant bits set to '10'. This makes it relatively
easy to keep track of the number of characters that are represented by a byte
stream and also easy to resync when synchronisation is lost due to a temporary
loss of bits.
For example when one needs to know the length of a UTF-8 string, not in terms
of the number of bytes, but in terms of the number of characters that will
appear on the screen, it suffices to just count the bytes with the most
significant bit set to 0 plus the bytes that have the two most significant bits
set to 1.
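#include <limits.h>

/* Count the characters, not the bytes, in a NUL terminated UTF-8 string. */
int Str8len(const unsigned char *s)
{
    int i=0, j=0;   /* i: characters counted, j: byte index */
    unsigned char c;

    while (s[j] && (j<INT_MAX))
    {
        c=s[j++];
        /* Count ASCII bytes (0xxxxxxx) and first bytes (11xxxxxx), */
        /* skip the subsequent bytes (10xxxxxx). */
        if (((c & 0x80)==0) || ((c & 0xC0)==0xC0)) i++;
    }
    return (i);
}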
This may be useful for screen wrapping in a message editor.
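The same bit patterns also give the length of a sequence and a way to resync
after a loss of synchronisation. A minimal sketch, in which Utf8SeqLen and
Utf8Resync are names invented here for illustration only:

```c
/* Hypothetical helper: length of the sequence announced by first byte c.
   Returns 0 when c is a subsequent byte or cannot start a sequence. */
int Utf8SeqLen(unsigned char c)
{
    if ((c & 0x80) == 0x00) return 1;   /* 0xxxxxxx: ASCII */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                           /* 10xxxxxx or not used in Table 1 */
}

/* Hypothetical helper: starting at s, skip subsequent bytes (10xxxxxx)
   and return a pointer to the next byte that can start a character.
   Stops at the terminating NUL, which is not of the form 10xxxxxx. */
const unsigned char *Utf8Resync(const unsigned char *s)
{
    while ((*s & 0xC0) == 0x80) s++;
    return s;
}
```

A reader that lands in the middle of a sequence has to skip at most three
bytes before it is back on a character boundary.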
So what else would we need to make FidoNet work with UTF-8? As far as the
transport layer is concerned, nothing really. FidoNet is fully 8 bit transparent
except for the NUL byte, which terminates strings. There is no conflict, as in
UTF-8 the NUL character has exactly the same meaning as in ASCII.
Oh wait, there is this tiny little snag: the archaic soft return. In their
infinite wisdom, the founding fathers decided that the character 0x8D had
special meaning; that of soft return. Probably a remnant from the Wordstar
days. In hindsight totally superfluous and a conflict with many code page
schemes that treat it as a printable character. It also conflicts with UTF-8:
0x8D is a valid subsequent byte in a well formed UTF-8 string. Fortunately most
bronze age
software allows configuring 0x8D as a printable character instead of soft
return, so this should no longer be a problem. Be sure however to configure
your tosser to not strip soft returns.
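To see why stripping matters, take the letter U+010D (c with caron). Following
Table 1 it is encoded as the two bytes C4 8D. The sketch below is a
hypothetical illustration of a naive soft return filter, not code taken from
any real tosser; the name StripSoftCR is made up for this example.

```c
#include <stddef.h>

/* Hypothetical illustration: copy n bytes from src to dst while dropping
   every 0x8D "soft return" byte. dst must have room for n bytes.
   Returns the number of bytes kept. */
size_t StripSoftCR(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t i, kept = 0;

    for (i = 0; i < n; i++)
    {
        /* This test cannot tell a soft return from the second byte
           of a UTF-8 sequence such as C4 8D. */
        if (src[i] != 0x8D) dst[kept++] = src[i];
    }
    return kept;
}
```

Fed the bytes 61 C4 8D 62, such a filter returns 61 C4 62: the first byte C4 is
left without its subsequent byte and the string is no longer well formed UTF-8.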
So the transport layer - mailers and tossers - is not a problem. But that is
where it ends. AFAIK there is no dedicated FidoNet message reader that can
properly handle UTF-8.
There is a solution: JAMNNTP, a programme by Jon Billings that converts a JAM
message base to something that can be read by a standard news reader.
Thunderbird, Internet Explorer, Firefox, you name it. All news readers, except
some from the previous millennium, support UTF-8, so there you go...
Now about FidoNet readers. It is not the OS or the console that is the problem.
Not from Windows NT/XP on anyway. The Windows XP console supports Unicode and
UTF-8 encoding. It even has a pseudo code page for it: code page 65001.
Be sure however to set the font for the cmd window to "Lucida Console". None of
what comes next will work otherwise.
Consider this simple C programme:
=== Cut ======= Begin HWT8.C ====
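#include <stdio.h>

int main(void)
{
    /* Plain ASCII */
    printf("Hello World\n");
    /* Swedish: U+00E5 and U+00F6, two bytes each */
    printf("Hall%c%c Bj%c%crn\n",0xC3,0xA5,0xC3,0xB6);
    /* Russian: two words of six Cyrillic letters, two bytes per letter */
    printf("%c%c%c%c%c%c%c%c%c%c%c%c %c%c%c%c%c%c%c%c%c%c%c%c\n",
           0xD0,0x9F,0xD1,0x80,0xD0,0xB8,0xD0,0xB2,0xD0,0xB5,0xD1,0x82,
           0xD1,0x82,0xD1,0x80,0xD0,0xBE,0xD0,0xB9,0xD0,0xBA,0xD0,0xB0);
    /* Esperanto, plus the Euro sign U+20AC in three bytes */
    printf("Saluton %c%c%c\n",0xE2,0x82,0xAC);
    return(0);
}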
=== Cut ===
Compile it and run it in a cmd window. Or if you do not have a compiler,
download http://www.vlist.eu/downloads/hwt8.bin and rename it to hwt8.exe.
You will see the line "Hello World" followed by some lines of garbage. It is a
DOS programme, so this works in any DOS or Windows command window.
But in XP or up we can do more. First do a:
chcp 65001
... and run it again. You may see some garbage on the screen the first time, I
haven't yet figured out why, but if you run it a second time, you will see four
lines of text, the first one being our famous "Hello World". The next three
lines will show something in Swedish, something in Russian and something in
Esperanto. And lo and behold: Cyrillic and Swedish characters on the same
screen. Plus a Euro sign as a bonus.
You can also divert the output of hwt8 to a text file with the redirect
command:
hwt8 >hwt8.utf
and then display it with the type command:
type hwt8.utf
So the limitation is not in the OS or the cmd console. They can handle UTF-8.
Unfortunately the trick of changing the code page to 65001 does not work with
any of the FidoNet readers/editors that I have tried. The InterMail editor and
Msged just go into a flat spin when run under cp 65001. Golded runs without
problems, but for some reason its output does not go into UTF-8 mode. It keeps
displaying a message containing UTF-8 as if the local code page was set to 850,
even if one disables all character translation. The odd thing however is that
if one writes the message containing garbage to a file with the Alt-W command,
and subsequently leaves Golded and displays the file with the type command,
then one gets the proper characters on the screen. So the problem is in how
Golded handles its output.
Well, enough for this week. Obviously some more research into the inner workings
of Golded and possibly other readers is required to get this to work. This may
be the subject of some future article. Or not. Time will tell and no promises.
References.
Jukka Korpela, A tutorial on character code issues.
http://www.cs.tut.fi/~jkorpela/chars.html
Roman Czyborra, czyborra.com
http://czyborra.com/
Roman Czyborra, Der Globalzeichensatz Unicode im Betriebssystem Unix
http://www.unicodecharacter.com/
Tom Jennings, An annotated history of some character codes or ASCII: American
Standard Code for Information Infiltration
http://wps.com/projects/codes/index.html
The Unicode Consortium, Official unicode website.
http://www.unicode.org/
© Michiel van der Vlist, all rights reserved.
Permission to publish in the FIDONEWS file echo and the FIDONEWS
discussion echo as originating from 2:2/2
--- GoldED+/W32-MSVC 1.1.5-b20130111
* Origin: http://www.vlist.org (2:280/5555)