Conference FIDONEWS_OLD3, 30874 messages
Message 24213, 207 lines
Written 2012-01-02 01:47:53 by FidoNews Robot (2:2/2.0)
Subject: FidoNews 29:01 [02/05]: General Articles
================================================
=================================================================
                        GENERAL ARTICLES
=================================================================

                        A PLEA FOR UTF-8 IN FIDONET   Part 2
                        By Michiel van der Vlist, 2:280/5555

Last week I discussed the various ways to have a computer deal with
more than just the ASCII character set. Now let us look at some of the
technical details regarding what the WWW has adopted as the
preferred encoding scheme and what I think FidoNet should evolve
towards as well: UTF-8.

As mentioned last week, Unicode is based on The Universal Character
Set. Characters in the Universal Character Set are identified by their
Code Point. This is a simple scalar unsigned integer value. The usual
way of representing it is in the form U+wxyz, where wxyz is a four,
five or six digit hexadecimal value. So the code point for the Roman
letter 'A' is U+0041. The code point for the Cyrillic capital 'A' is
U+0410, and the code point for the Euro sign is U+20AC.

Table 1 tells us how to transform the value of the code point into a
variable length byte stream of one to four bytes. In theory the
transformation covers up to six bytes, encoding a 31 bit integer, but
for now it stops at four bytes, coding for a 21 bit integer.

     Table 1.  UTF-8 bit distribution

----------------------------------------------------------------------
Scalar value               | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
----------------------------------------------------------------------
00000000 0xxxxxxx          | 0xxxxxxx |          |          |
----------------------------------------------------------------------
00000yyy yyxxxxxx          | 110yyyyy | 10xxxxxx |          |
----------------------------------------------------------------------
zzzzyyyy yyxxxxxx          | 1110zzzz | 10yyyyyy | 10xxxxxx |
----------------------------------------------------------------------
000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx
----------------------------------------------------------------------

As you can see, the transformation algorithm is relatively
straightforward. Code points U+0000 through U+007F are represented by
a single byte. The first 128 characters of the Universal Character Set
coincide with the ASCII set and the coding is the same.

Code points U+0080 to U+07FF are coded into two bytes, code points
U+0800 to U+FFFF into three bytes, and those above that, up to
U+10FFFF, into four bytes.
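
To make Table 1 concrete, here is a minimal sketch of an encoder that
follows it row by row. The name utf8_encode is my own choice for this
illustration. Refusing the surrogate range U+D800 to U+DFFF is an
addition that is not in the table; those values are reserved and are
not characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 following Table 1.
   Writes 1 to 4 bytes into out and returns the byte count,
   or 0 if cp cannot be encoded (surrogate or above U+10FFFF).
   utf8_encode is a name chosen for this sketch. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
        if (cp <= 0x7F) {                  /* 0xxxxxxx */
                out[0] = (unsigned char)cp;
                return 1;
        }
        if (cp <= 0x7FF) {                 /* 110yyyyy 10xxxxxx */
                out[0] = (unsigned char)(0xC0 | (cp >> 6));
                out[1] = (unsigned char)(0x80 | (cp & 0x3F));
                return 2;
        }
        if (cp >= 0xD800 && cp <= 0xDFFF)  /* surrogates are not characters */
                return 0;
        if (cp <= 0xFFFF) {                /* 1110zzzz 10yyyyyy 10xxxxxx */
                out[0] = (unsigned char)(0xE0 | (cp >> 12));
                out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                out[2] = (unsigned char)(0x80 | (cp & 0x3F));
                return 3;
        }
        if (cp <= 0x10FFFF) {              /* 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx */
                out[0] = (unsigned char)(0xF0 | (cp >> 18));
                out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
                out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                out[3] = (unsigned char)(0x80 | (cp & 0x3F));
                return 4;
        }
        return 0;
}
```

With this sketch the Cyrillic capital 'A' (U+0410) comes out as the
two bytes D0 90, and the Euro sign (U+20AC) as the three bytes
E2 82 AC.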

For the multibyte sequences the number of consecutive '1's in the most
significant bits of the first byte tells us the length of the
sequence. The first byte always has its two most significant bits set
to '11'. Subsequent bytes have their two most significant bits set to
'10'. This makes it relatively easy to keep track of the number of
characters represented by a byte stream, and also easy to resync when
synchronisation is lost due to a temporary loss of bits.
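
As a rough sketch of both properties: the first function reads the
sequence length off the lead byte, the second skips forward to the next
byte that can start a character. Both names are hypothetical, and the
sketch only looks at bit patterns; it does not check that a sequence is
otherwise well formed.

```c
#include <stddef.h>

/* Length of the sequence introduced by a lead byte, read from its
   leading 1 bits. Returns 0 for a continuation byte (10xxxxxx) or
   any byte that matches none of the lead patterns of Table 1.
   utf8_seq_len is a hypothetical name. */
size_t utf8_seq_len(unsigned char lead)
{
        if ((lead & 0x80) == 0x00) return 1;    /* 0xxxxxxx */
        if ((lead & 0xE0) == 0xC0) return 2;    /* 110xxxxx */
        if ((lead & 0xF0) == 0xE0) return 3;    /* 1110xxxx */
        if ((lead & 0xF8) == 0xF0) return 4;    /* 11110xxx */
        return 0;
}

/* Resynchronise: skip continuation bytes (10xxxxxx) until the next
   byte that can start a character, or until the end of the buffer. */
const unsigned char *utf8_resync(const unsigned char *p,
                                 const unsigned char *end)
{
        while (p < end && (*p & 0xC0) == 0x80)
                p++;
        return p;
}
```

If a reader drops into the middle of a Euro sign and sees 82 AC 41,
utf8_resync skips the two continuation bytes and stops at the 'A'.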

For example when one needs to know the length of a UTF-8 string, not
in terms of the number of bytes, but in terms of the number of
characters that will appear on the screen, it suffices to just count
the bytes with the most significant bit set to 0 plus the bytes that
have the two most significant bits set to 1.

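#include <limits.h>

/* Count the characters in a NUL-terminated UTF-8 string: every byte
   that is not a continuation byte (10xxxxxx) starts a new character. */
int Str8len(const unsigned char *s)
{
        int i=0, j=0;
        unsigned char c;

        while (s[j] && (j<INT_MAX))
        {
               c=s[j++];
               if (((c & 0x80)==0) || ((c & 0xC0)==0xC0)) i++;
        }
        return (i);
}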

This may be useful for screen wrapping in a message editor.
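
For the wrapping case the same bit test can tell an editor where to
cut a line. The sketch below is my own illustration, not code from any
existing FidoNet editor. It assumes one character per screen column,
which does not hold for combining or double-width characters.

```c
#include <stddef.h>

/* Return the byte offset at which a line of at most max_chars
   characters ends, never splitting a multibyte sequence.
   Assumes one code point per screen column; utf8_wrap_offset is a
   hypothetical helper, not part of any FidoNet editor. */
size_t utf8_wrap_offset(const unsigned char *s, size_t max_chars)
{
        size_t bytes = 0, chars = 0;

        while (s[bytes]) {
                /* A byte that is not a continuation byte starts a
                   new character. */
                if ((s[bytes] & 0xC0) != 0x80) {
                        if (chars == max_chars)
                                break;  /* next character would not fit */
                        chars++;
                }
                bytes++;
        }
        return bytes;
}
```

The string "Hallå Björn" takes 13 bytes in UTF-8 but only 11 screen
positions. Asking for a line of at most 5 characters gives a byte
offset of 6, because the 'å' occupies two bytes.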

So what else would we need to make FidoNet work with UTF-8? As far as
the transport layer is concerned, nothing really. FidoNet is fully 8
bit transparent except for the NUL character, which terminates
strings. There is no conflict, as in UTF-8 the NUL character has the
exact same meaning as in ASCII. Oh wait, there is this tiny little
snag: the archaic soft return. In their infinite wisdom, the founding
fathers decided that the character 0x8D had special meaning: that of
soft return. Probably a remnant from the WordStar days. In hindsight
totally superfluous and a conflict with many code page schemes that
treat it as a printable character. It also conflicts with UTF-8: 0x8D
is a valid continuation byte in a well formed UTF-8 string.
Fortunately most bronze age software allows configuring 0x8D as a
printable character instead of soft return, so this should no longer
be a problem. Be sure however to configure your tosser to not strip
soft returns.
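
To see why stripping is harmful, consider a naive soft return stripper
of the kind a tosser might apply. The function below is a hypothetical
illustration and is not taken from any actual tosser.

```c
#include <stddef.h>

/* Naive soft-return stripper: removes every 0x8D byte in place and
   returns the new length. strip_soft_returns is a hypothetical name
   used for illustration only. */
size_t strip_soft_returns(unsigned char *buf, size_t len)
{
        size_t in, out = 0;

        for (in = 0; in < len; in++)
                if (buf[in] != 0x8D)
                        buf[out++] = buf[in];
        return out;
}
```

The Cyrillic small letter U+044D is encoded as D1 8D. Run the Russian
word D1 8D D1 82 D0 BE through this function and the first character
is reduced to a lone lead byte D1, which is no longer well formed
UTF-8.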

So the transport layer - mailers and tossers - is not a problem. But
that is where it ends. AFAIK there is no dedicated FidoNet message
reader that can properly handle UTF-8.

There is a solution: JAMNNTP. A programme by Jon Billings that
converts a JAM message base to something that can be read by a
standard News Reader. Thunderbird, Internet Explorer, Firefox, you
name it. All News readers, except some from the previous millennium,
support UTF-8, so there you go...

Now about FidoNet readers. It is not the OS or the console that is the
problem. Not from Windows NT/XP on anyway. The Windows XP console
supports Unicode and UTF-8 encoding. It even has a pseudo code page
for it: code page 65001.

Be sure however to set the font for the cmd window to "Lucida
Console". None of what comes next will work otherwise.

Consider this simple C programme:

=== Cut ======= Begin HWT8.C  ====


#include <stdio.h>

int main(void)
{
   printf("Hello World\n");
   printf("Hall%c%c Bj%c%crn\n",0xC3,0xA5,0xC3,0xB6);
   printf("%c%c%c%c%c%c%c%c%c%c%c%c %c%c%c%c%c%c%c%c%c%c%c%c\n",
       0xD0,0x9F,0xD1,0x80,0xD0,0xB8,0xD0,0xB2,0xD0,0xB5,0xD1,0x82,
       0xD1,0x82,0xD1,0x80,0xD0,0xBE,0xD0,0xB9,0xD0,0xBA,0xD0,0xB0);
   printf("Saluton %c%c%c\n",0xE2,0x82,0xAC);
   return(0);
}

=== Cut ===

Compile it and run it in a cmd window. Or if you do not have a
compiler, download http://www.vlist.eu/downloads/hwt8.bin and rename
it to hwt8.exe.

You will see the line "Hello World" followed by some lines of garbage.
It is a DOS programme, so this works in any DOS or Windows command
window.

But in XP or up we can do more. First do a:

chcp 65001

 ... and run it again. You may see some garbage on the screen the first
time, I haven't yet figured out why, but if you run it a second time,
you will see four lines of text, the first one being our famous "Hello
World". The next three lines will show something in Swedish, something
in Russian and something in Esperanto. And lo and behold: Cyrillic and
Swedish characters on the same screen. Plus a Euro sign as a bonus.
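
Instead of typing chcp by hand, a programme can request the same code
page itself through the Win32 call SetConsoleOutputCP. The wrapper
below is only a sketch and its name is my own. I have not verified
whether it avoids the garbage seen on the first run. On non-Windows
systems it does nothing and simply assumes the terminal already
expects UTF-8.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Switch the console output code page to UTF-8 (65001) from inside
   the programme. Returns 1 on success, 0 on failure. On other
   systems this sketch does nothing and returns 1, assuming the
   terminal is already set up for UTF-8.
   enable_utf8_console is a hypothetical name. */
int enable_utf8_console(void)
{
#ifdef _WIN32
        return SetConsoleOutputCP(65001) ? 1 : 0;
#else
        return 1;
#endif
}
```

A call to enable_utf8_console() at the top of main() in HWT8.C would
take the place of the manual chcp step.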

You can also divert the output of hwt8 to a text file with the
redirect command:

hwt8 >hwt8.utf

and then display it with the type command:

type hwt8.utf

So the limitation is not in the OS or the cmd console. They can handle
UTF-8.

Unfortunately the trick of changing the code page to 65001 does not
work with any of the FidoNet readers/editors that I have tried. The
InterMail editor and Msged just go into a flat spin when run under cp
65001. Golded runs without problems, but for some reason its output
does not go into UTF-8 mode. It keeps displaying a message containing
UTF-8 as if the local code page was set to 850, even if one disables
all character translation. The odd thing however is that if one writes
the message containing garbage to a file with the Alt-W command, and
subsequently leaves Golded and displays the file with the type
command, then one gets the proper characters on the screen. So the
problem is in how Golded handles its output.

Well, enough for this week. Obviously some more research in the inner
workings of Golded and possibly other readers is required to get this
to work. This may be the subject of some future article. Or not. Time
will tell and no promises.


References.

Jukka Korpela, A tutorial on character code issues.
http://www.cs.tut.fi/~jkorpela/chars.html

Roman Czyborra, czyborra.com
http://czyborra.com/

Roman Czyborra, Der Globalzeichensatz Unicode im Betriebssystem Unix
http://www.unicodecharacter.com/

Tom Jennings, An annotated history of some character codes or ASCII:
American Standard Code for Information Infiltration
http://wps.com/projects/codes/index.html

The Unicode Consortium, Official unicode website.
http://www.unicode.org/



© Michiel van der Vlist, all rights reserved.
Permission to publish in the FIDONEWS file echo and the FIDONEWS
discussion echo as originating from 2:2/2

-----------------------------------------------------------------

--- Azure/NewsPrep 3.0
 * Origin: Home of the Fidonews (2:2/2.0)