Text 23986, 152 lines
Written 2011-12-26 03:34:25 by FidoNews Robot (2:2/2.0)
Subject: FidoNews 28:52 [02/05]: General Articles
================================================
=================================================================
GENERAL ARTICLES
=================================================================
A PLEA FOR UTF-8 IN FIDONET Part 1
By Michiel van der Vlist, 2:280/5555
First there was the spoken word. That was a long time ago, nobody
knows exactly how long, but it must have been in the order of some
hundred thousand years ago. Later, much, much later, came the written
word, in the order of five thousand years ago. To get a message from
one place to another, a messenger needed to physically transport an
object with the text written on it from A to B.
Forget about the semaphore and let us jump straight to transporting
messages over electric wire. With that came the need for an encoding
scheme. One of the first encoding schemes was Morse code, named after
its (co)inventor Samuel Morse. This was around 1840. Since it was
invented in the Western world, mostly the USA, it is no surprise that
Morse code only covers the digits 0-9, a few special characters such
as the question mark and the period, plus the 26 letters of the Latin
alphabet. Nowadays Morse code is used only by a small group of radio
amateurs, but for over a century it was a mainstream coding method
for telecommunication.
The next step was Baudot code, used in the telex communication
system: a five bit code that covered the 26 letters of the Roman
alphabet plus the digits 0-9 and some punctuation and control
signals. Like Morse code, it made no distinction between upper and
lower case.
In the fifties of the previous century, the first computers entered
the scene. At first these were bulky pieces of machinery filling an
entire room. They were programmed by entering the binary code
directly into memory with so-called sense switches. This was
cumbersome and error prone. Soon the need arose for a way to enter
the mnemonics used to memorise the instructions directly into the
computer and let the computer itself do the translation into binary
form, instead of the operator manually entering the binary code.
With that came the need for a character encoding scheme for
computers. Several encoding schemes were used in the beginning, but
in the end it converged into an 8 bit code that seemed to fit
computers like a glove. Or, to be more precise, a seven bit code,
used on 8 bit transport media: only the lower seven bits were used
for encoding text, and the highest bit was used as an error detection
mechanism, the parity bit. This was ASCII, the American Standard Code
for Information Interchange. Work on it started in 1960 and the first
edition of the standard was published in 1963.
The "A" is "ASCII" stands for "American". So it is no surprise that as
far as the letters go, once again it only covers the 26 letters found
in American English. ASCII is much richer that all of its
predecessors, it has many punctuation and special characters, 32 - now
mostly obsolete - control codes and as a new feature, the distinction
between upper and lower case.
That the character set is limited to what is found in American
English was no great limitation in the beginning of the history of
data processing. Computers, because of their bulk and cost, were only
to be found at government institutes, large companies and
universities. They were used by scientists and engineers. Those could
deal with ASCII-only machines.
What nobody could foresee when ASCII was devised, happened some two
decades later. Computers became small enough and cheap enough to
allow individuals to have their own private computer (a PC) all for
themselves in their own homes. With affordable home computers came
affordable printers, and that was the end of the classic typewriter.
Computer use was no longer limited to research workers whose
employers could afford tons of research equipment, but extended to
people who could afford typewriters. And when those "new typewriters"
spread around the world, the need arose for more than just ASCII.
While ASCII was enough for US Americans using typewriters, it was not
enough for the rest of the world. ASCII-only became a stranglehold.
Those new computer users wanted to write in their own language: a
language that used characters with accents, umlauts and slashes, or
even characters not at all resembling the Roman alphabet: Cyrillic,
or the even more complex Asian and Arabic scripts.
Microsoft and IBM were quick to respond. They introduced the concept
of code pages. ASCII is seven bit, but computers store information in
lumps of eight bits called a byte. The most significant bit,
originally meant as a parity bit but obsoleted by more robust error
checking mechanisms, was free to define another 128 characters. IBM
chose not only to include language specific characters in that set of
128, but also some 30+ so-called "graphic characters" for line
drawing. That may have been a good idea at the time, but in
retrospect it may have been a waste of valuable coding space.
Anyway, at the end of the DOS era there were dozens of code pages,
covering the needs of hundreds of languages. One could write in
German, Swedish, Russian and Greek without problems. Well, one could
not write in Greek and Russian in the same article, because one could
not change code pages in midstream. But who wanted that?
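To make the problem tangible, here is a minimal sketch in Python,
using the cp437 and cp866 codecs from its standard library: one and
the same byte value decodes to a Greek letter under the original IBM
PC code page and to a Cyrillic letter under the Russian one, which is
exactly why the two could not share a single eight bit document.

  raw = bytes([0xE0])          # one byte with the high bit set
  print(raw.decode("cp437"))   # Greek small alpha on the original IBM PC
  print(raw.decode("cp866"))   # Cyrillic small er under the Russian code page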
And then came the Internet. And with the Internet came the World Wide
Web. In the beginning the web just copied the solution to language
issues from DOS: code pages and more code pages. It did not take much
more than a decade to realise that the eight bit barrier was the
second stranglehold. Not being able to write Russian and Greek in one
and the same article was NOT acceptable. Eight bits for a character
set was NOT good enough.
Fortunately the price of memory had also dropped spectacularly, and
the price of transporting bits had dropped steadily as well. Memory
had become so cheap that it became affordable to store pictures in
digital form. Pictures take orders of magnitude more storage space
than text. So increasing the required storage space for text by a
factor of two, by going from a one byte character encoding scheme to
a multibyte encoding scheme, no longer met with economic
restrictions.
Enter Unicode.
Unicode introduces the concept of the Universal Character Set. It is
not a static entity, it is still growing. It provides room for over a
million code points, of which well over a hundred thousand characters
have presently been assigned. While in the code page concept
character set and character encoding scheme are one and the same, in
Unicode they are decoupled. There is ONE character set: the Universal
Character Set. There are several encoding schemes that all have their
merits.
First there is UTF-7, designed for stone age transport layers that
are 7 bits only. Next there is UTF-8. This is an 8 bit multibyte
encoding that takes one to four bytes to encode a character. Next
there is UTF-16. It is not suitable for byte oriented transport media
that use NULL as a special character, but it is used internally by
Windows from XP and up. And finally there is UTF-32.
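A minimal sketch in Python, using the encoders from its standard
library, shows the decoupling at work: one and the same character
from the Universal Character Set yields a different byte sequence
under each encoding scheme, and the NULL bytes that UTF-16 and UTF-32
produce show why they do not survive NULL terminated transport
layers.

  ch = "\u042F"                    # CYRILLIC CAPITAL LETTER YA, one UCS character
  print(ch.encode("utf-8"))        # two bytes:  0xD0 0xAF
  print(ch.encode("utf-16-be"))    # two bytes:  0x04 0x2F
  print(ch.encode("utf-32-be"))    # four bytes: 0x00 0x00 0x04 0x2F
  print("A".encode("utf-16-be"))   # two bytes:  0x00 0x41  <- contains a NULL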
The obvious choice for FidoNet is UTF-8. The transport layer of
FidoNet is fully 8 bit transparent, with the exception of the NULL
byte that is used as a termination character. UTF-8 is fully downward
compatible with ASCII: the first 128 characters of the Universal
Character Set are the same as the ASCII set and they are encoded in
exactly the same way. So the NULL in UTF-8 is the same as the NULL in
ASCII: no problem. Also, there will be no conflict with those that
have no need for anything other than good old 7 bit ASCII. They can
keep using the software that they have been using all the time and
everyone will see the same text on his/her screen.
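A small sketch in Python makes both properties visible: plain ASCII
text produces exactly the same bytes whether it is encoded as ASCII
or as UTF-8, and the multibyte sequences that UTF-8 generates for
non-ASCII characters (a few are picked here purely for illustration)
consist only of bytes with the high bit set, so a NULL byte can never
appear by accident.

  text = "Hello FidoNet"
  # byte for byte identical to the plain ASCII encoding
  assert text.encode("ascii") == text.encode("utf-8")

  # every byte of a multibyte UTF-8 sequence is 0x80 or higher,
  # so the NULL terminator of the transport layer stays unambiguous
  assert all(b >= 0x80 for b in "\u0416\u00EB\u00E9".encode("utf-8"))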
Next week we will go into some details on how to get UTF-8 encoded
FidoNet messages on your screen.
To be continued....
© Michiel van der Vlist, all rights reserved.
Permission to publish in the FIDONEWS file echo and the FIDONEWS
discussion echo as originating from 2:2/2.
-----------------------------------------------------------------
--- Azure/NewsPrep 3.0
* Origin: Home of the Fidonews (2:2/2.0)