Text 36358, 146 lines
Written 2016-09-06 13:10:48 by Michiel van der Vlist (2:280/5555)
Subject: A plea for UTF-8
========================
Hello All,
This is a rehash of a Fidonews article I wrote five years ago, at the request of Lee.
============= begin =====
A PLEA FOR UTF-8 IN FIDONET Part 1
By Michiel van der Vlist. 2:280/5555
First there was the spoken word. That was a long time ago; nobody knows exactly
how long, but it must have been on the order of a hundred thousand years ago.
Much, much later came the written word, on the order of five thousand years
ago. To get a message from one place to another, a messenger needed to
physically transport an object with the text written on it from A to B.
Forget about the semaphore and let us jump straight to transporting messages
over electric wire. With that came the need for an encoding scheme. One of the
first encoding schemes was Morse code, named after its co-inventor Samuel
Morse, around 1840. Since it was invented in the Western world, mostly the USA,
it is no surprise that Morse code covers only the digits 0-9, a few special
characters such as the question mark and the period, plus the 26 letters of the
Latin alphabet. Nowadays Morse code is used only by a small group of radio
amateurs, but for over a century it was a mainstream coding method for
telecommunication.
The next step was Baudot code, used in the telex communication system: a five
bit code that covered the 26 letters of the Roman alphabet plus the digits 0-9
and some punctuation and control signals. Like Morse code, it made no
distinction between upper and lower case.
In the fifties of the previous century, the first computers entered the scene.
At first these were bulky pieces of machinery filling an entire room. They were
programmed by entering the binary code directly into memory via so-called sense
switches. This was cumbersome and error prone. Soon the need arose for a way to
enter the mnemonics used to memorise the instructions directly into the
computer and let the computer itself do the translation into binary form,
instead of the operator manually entering the binary code.
With that came the need for a character encoding scheme for computers. Several
encoding schemes were used in the beginning, but in the end it converged on an
8 bit code that seemed to fit computers like a glove. Or, to be more precise, a
seven bit code, used on 8 bit transport media, with only the lower seven bits
used for encoding text. The highest bit served as an error detection mechanism:
the parity bit. This was ASCII, the American Standard Code for Information
Interchange. Work on it started around 1960, and the first standard was
published in 1963.
The "A" is "ASCII" stands for "American". So it is no surprise that as far as
the letters go, once again it only covers the 26 letters found in American
English. ASCII is much richer that all of its predecessors, it has many
punctuation and special characters, 32 - now mostly obsolete - control codes
and as a new feature, the distinction between upper and lower case.
That the character set is limited to what is found in American English was no
great limitation in the early days of data processing. Computers, because of
their bulk and cost, were only found at government institutes, large companies
and universities. They were used by scientists and engineers, who could deal
with ASCII-only machines.
What nobody could foresee when ASCII was devised happened some two decades
later. Computers became small enough and cheap enough to allow individuals to
have their own private computer ( a PC ) all to themselves in their own homes.
With affordable home computers came affordable printers, and that was the end
of the classic typewriter. Computer use was no longer limited to research
workers whose employers could afford tons of research equipment; it extended to
everyone who could afford a typewriter. And when those "new typewriters" spread
around the world, the need arose for more than just ASCII. While ASCII was
enough for US Americans using typewriters, it was not enough for the rest of
the world. ASCII-only became a stranglehold. Those new computer users wanted to
write in their own language. A language that used characters with accents,
umlauts and slashes, or even characters not at all resembling the Roman
alphabet: Cyrillic, or even more complex Asian and Arabic scripts.
Microsoft and IBM were quick to respond. They introduced the concept of code
pages. ASCII is seven bit, but computers store information in lumps of eight
bits called a byte. The most significant bit, originally meant as a parity bit
but obsoleted by more robust error checking mechanisms, was free to define
another 128 characters. IBM chose not only to include language-specific
characters in that set of 128, but also to include some 30+ so-called "graphic
characters" for line drawing. That may have been a good idea at the time, but
in retrospect it may have been a waste of valuable coding space.
Anyway, at the end of the DOS era there were dozens of code pages, covering the
needs of hundreds of languages. One could write in German, Swedish, Russian and
Greek without problems. Well, one could not write in Greek and Russian in the
same article, because one could not change code pages midstream. But who wanted
that?
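To illustrate why mixing was impossible, here is a small Python sketch (my own
illustration; the byte value and the code pages chosen are arbitrary examples)
that decodes one and the same byte under four different code pages:

  # One and the same byte value means a different character under each
  # code page; the code pages below are just examples.
  b = bytes([0xE4])
  for codepage in ("cp437", "cp866", "iso8859-7", "latin-1"):
      print(codepage, "->", b.decode(codepage))

On a terminal that can display them, this prints a different character for each
code page, for the very same byte value. That is exactly why Greek and Russian
could not live together in one article.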
And then came the Internet. And with the Internet came the World Wide Web. In
the beginning the web just copied the solution to language issues from DOS:
code pages and more code pages. It did not take much more than a decade to
realise that the eight bit barrier was the second stranglehold. Not being able
to write Russian and Greek in one and the same article was NOT acceptable.
Eight bits for a character set was NOT good enough.
Fortunately the price of memory had dropped spectacularly, and the price of
transporting bits had dropped steadily as well. Memory had become so cheap that
it was affordable to store pictures in digital form. Pictures take orders of
magnitude more storage space than text. So increasing the required storage
space for text by a factor of two, by going from a one byte character encoding
scheme to a multi byte encoding scheme, no longer met with economic
restrictions.
Enter Unicode.
Unicode introduces the concept of the Universal Character Set. It is not a
static entity; it is still growing. Its code space provides room for over a
million code points. While in the code page concept character set and character
encoding scheme are one and the same, in Unicode they are decoupled. There is
ONE character set, the Universal Character Set, and there are several encoding
schemes that all have their merits.
First there is UTF-7, designed for stone-age transport layers that are 7 bits
only. Next there is UTF-8, an 8 bit multibyte encoding that takes one to four
bytes to encode a character. Next there is UTF-16, not suitable for byte
orientated transport media that use NULL as a special character, but used
internally by Windows from XP and up. And finally there is UTF-32.
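To give an idea of what "one to four bytes" means in practice, here is a small
Python sketch (again just an illustration of mine; the sample characters are
arbitrary) showing how many bytes UTF-8 needs per character:

  # 1 byte for ASCII, 2 for most accented Latin, Greek and Cyrillic letters,
  # 3 for most other characters in the Basic Multilingual Plane,
  # 4 for everything beyond it.
  # Samples: A, e-acute, Cyrillic Zhe, the euro sign, Gothic hwair.
  for ch in ("A", "\u00e9", "\u0416", "\u20ac", "\U00010348"):
      encoded = ch.encode("utf-8")
      hexbytes = " ".join(f"{byte:02X}" for byte in encoded)
      print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {hexbytes}")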
The obvious choice for FidoNet is UTF-8. The transport layer of FidoNet is
fully 8 bit transparent, with the exception of the NULL byte that is used as a
termination character. UTF-8 is fully backward compatible with ASCII: the first
128 characters in the Universal Character Set are the same as the ASCII set and
they are encoded in exactly the same way. So the NULL in UTF-8 is the same as
the NULL in ASCII; no problem there. Also there will be no conflict with those
who have no need for anything other than good old 7 bit ASCII. They can keep
using the software they have been using all along, and everyone will see the
same text on his/her screen.
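For the sceptics, both claims are easy to verify. A short Python sketch (once
more just an illustration of mine; the sample strings are arbitrary):

  # Plain ASCII text encodes to exactly the same bytes under ASCII and UTF-8,
  # and a UTF-8 stream never contains a 0x00 byte unless the text itself
  # contains a NULL character.
  ascii_text = "Hello FidoNet"
  assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")
  mixed_text = "Γεια σου, здравствуй!"   # Greek and Russian in one line
  assert b"\x00" not in mixed_text.encode("utf-8")
  print("ASCII bytes unchanged, no stray NULLs in the UTF-8 stream.")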
Next week we will go into some details on how to get UTF-8 encoded FidoNet
messages on your screen.
To be continued....
© Michiel van der Vlist, all rights reserved.
Permission to publish in the FIDONEWS file echo and the FIDONEWS
discussion echo as originating from 2:2/2
======= end ======
Cheers, Michiel
--- GoldED+/W32-MSVC 1.1.5-b20130111
* Origin: http://www.vlist.org (2:280/5555)