Text 2641, 590 lines
Written 2005-02-20 13:10:40 by Rich (1:379/45)
Comment on text 2639 by Ellen K. (1:379/45)
Subject: Re: ESB / XML / Unicode vs 8-bit characters ?
=====================================================
From: "Rich" <@>
This is a multi-part message in MIME format.
------=_NextPart_000_085D_01C5174D.937C1370
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
The Spanish accented characters are not part of ASCII. They are part of
what Windows calls ANSI, of which ASCII is the subset (0x00 to 0x7F). Any
character in the 0x80 to 0xFF range is not compatible between ANSI and UTF-8.
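
As a rough illustration of that incompatibility, the sketch below prints the bytes a few Spanish characters occupy in Windows-1252 (what Windows calls ANSI for Western European locales) and in UTF-8. The sample string and the helper name `byte_table` are made up for this example; Python's `cp1252` codec stands in for the ANSI code page.

```python
# Spanish letters and punctuation outside 7-bit ASCII (illustrative sample).
SAMPLE = "ñáéíóúü¿¡"

def byte_table(text):
    """Return (char, cp1252 bytes, utf-8 bytes) for each character."""
    return [(ch, ch.encode("cp1252"), ch.encode("utf-8")) for ch in text]

# Append a plain ASCII "n" to show the range where both encodings agree.
for ch, ansi, utf8 in byte_table(SAMPLE + "n"):
    print(f"U+{ord(ch):04X}: cp1252={ansi.hex()} utf-8={utf8.hex()}")

# A single ANSI byte in the 0x80 to 0xFF range is not valid UTF-8 by itself.
try:
    b"\xf1".decode("utf-8")
except UnicodeDecodeError as err:
    print("0xF1 read as UTF-8 fails:", err.reason)
```

Every accented character takes one byte above 0x7F in Windows-1252 and two bytes in UTF-8, while the ASCII letter is the same single byte in both.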
Rich
"Ellen K." <72322.1016@compuserve.com> wrote in message news:7ouh119ivmuk26icg3mqqqk2ss1lfm5c10@4ax.com...
Should not have any non-ASCII characters, as previously noted all the
special Spanish characters are available in the ASCII character set.
And since the company is built on our understanding of the Hispanic
market, I don't see any use of, say, pictograph-based languages in the
foreseeable future. If 10 years down the road something like that
happens, well, by then we will no longer need compatibility with the
current legacy system because it will long since have been replaced.
On Sun, 20 Feb 2005 12:52:25 -0800, "Rich" <@> wrote in message
<4218f849$1@w3.nls.net>:
> From what you describe below, if the values you emit to XML have
> non-ASCII characters I would expect you to have a problem.
>
>Rich
>
> "Ellen K." <72322.1016@compuserve.com> wrote in message news:eanh11h4vv6b9v21fiaounii3f5dunjl3g@4ax.com...
> On Sat, 19 Feb 2005 23:32:37 -0800, "Rich" <@> wrote in message
> <42183ccd@w3.nls.net>:
>
> > The UTF in UTF-8/16/32 stands for Unicode Transformation Format. You
> > can find these defined in section 2.5 of
> > http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf.
>
> THANK YOU SO MUCH!!! :)
> >
> > It's not clear to me how you are creating the XML from the templates.
> > If ANSI data is emitted into an XML document declared as UTF-8, then you
> > would have problems only for non-ASCII characters. UTF-8 and Windows-1252
> > are identical for 0x00 to 0x7F, which is ASCII in both.
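
The failure mode described here can be sketched with Python's standard-library XML parser. This is only an illustration: the element name `A` and the sample value are invented, and other parsers (including SQL Server's) may report the problem differently.

```python
import xml.etree.ElementTree as ET

HEADER = b'<?xml version="1.0" encoding="UTF-8"?>'

def parses(doc):
    """Return True if the byte string is accepted as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

value = "Peña"  # hypothetical field value with one non-ASCII character

# ANSI (Windows-1252) bytes dropped into a document that claims UTF-8.
ansi_doc = HEADER + b"<A>" + value.encode("cp1252") + b"</A>"
# The same value transcoded to UTF-8 before it is emitted.
utf8_doc = HEADER + b"<A>" + value.encode("utf-8") + b"</A>"
# A pure ASCII value is byte-for-byte identical in both encodings.
ascii_doc = HEADER + b"<A>Pena</A>"

print("ANSI bytes declared as UTF-8:", parses(ansi_doc))
print("UTF-8 bytes declared as UTF-8:", parses(utf8_doc))
print("ASCII only:", parses(ascii_doc))
```

The pure ASCII document parses either way, which is why the mismatch stays hidden until the first accented character appears in the data.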
>
> I don't have a copy of a template here at home, but I have them create
> it by string concatenation because that seems to be the only way to be
> able to have CDATA sections, which I have to have because in the
> legacy data numeric-appearing identifiers are actually 10-character
> strings with leading spaces, and if these are not specified as CDATA
> the spaces get lost even with xml:space="preserve" included in the
> header. Here is a code snippet from one of my apps that creates an XML
> document which is passed as a parameter to a SQL Server stored
> procedure:
>
> > strXM = "<?xml version =" & Chr(34) & "1.0" & Chr(34) & " encoding=" & Chr(34) & "UTF-8" & Chr(34) & "?>" & vbCrLf _
> >     & "<ROOT xml:space=" & Chr(34) & "preserve" & Chr(34) & ">" & vbCrLf
> >
> > Do While Not .EOF
> >     strXM = strXM & "<M><A>" & !Ofc & "</A><B><![CDATA[" & !Contract & "]]></B><C>" & !TCode & "</C><D>" & !Date & "</D>" _
> >         & "<E><![CDATA[" & !TransNo & "]]></E></M>" & vbCrLf
> >     .MoveNext
> > Loop
> >
> > strXM = strXM & "</ROOT>"
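
For comparison, here is a minimal sketch of building the same document shape with an XML library instead of string concatenation. It is written in Python rather than VB, the row values are invented stand-ins for the recordset fields, and it says nothing about how SQL Server itself treats whitespace; it only shows that a serializer escapes markup characters and that this particular parser keeps leading spaces on the round trip.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows standing in for the recordset fields in the snippet above.
rows = [
    {"Ofc": "01", "Contract": "     12345", "TCode": "A&B",
     "Date": "2005-02-20", "TransNo": "       678"},
]

# Single-character element aliases, mirroring the original snippet.
ALIASES = (("A", "Ofc"), ("B", "Contract"), ("C", "TCode"),
           ("D", "Date"), ("E", "TransNo"))

def build_document(rows):
    """Serialize rows as <ROOT><M>...</M></ROOT>, encoded as UTF-8 bytes."""
    root = ET.Element("ROOT")
    for row in rows:
        m = ET.SubElement(root, "M")
        for tag, field in ALIASES:
            ET.SubElement(m, tag).text = row[field]
    body = ET.tostring(root, encoding="utf-8")
    return b'<?xml version="1.0" encoding="UTF-8"?>\n' + body

doc = build_document(rows)
parsed = ET.fromstring(doc)
print(doc.decode("utf-8"))
print(repr(parsed.find("M/B").text))  # leading spaces survive in this parser
```

The `&` in the sample `TCode` value is escaped automatically, which plain concatenation would not do outside a CDATA section.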
>
> (The vbCrLf's are there so if there is a problem the document can be
> printed to a text file and be easier for humans to read -- SQL Server
> ignores them. The single-character aliases for entity and attribute
> names are for performance -- for most of the stuff we use these for it
> doesn't really matter because we are only sending a few rows, but the
> first time I did it it was for something that was sending about 5000
> rows and there it made a huge difference, so I stuck with it. We
> comment both the front-end code and the stored procedure with the
> mappings of these aliases.)
>
> > I do not know how SQL Server maps from char to nchar, specifically what
> > conversion is performed. Also, in some (maybe all released) versions of
> > SQL Server nchar and nvarchar are encoded in UCS-2. UCS-2 is a 16-bit
> > encoding like UTF-16. It dates back to when Unicode was defined as having
> > 2**16 characters instead of the 2**20+ that it has now. You cannot
> > express characters >= U+10000 in UCS-2, not that you care about these.
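
The UCS-2 limit can be seen by counting 16-bit code units. In this sketch the helper name `utf16_units` is made up, and Python's `utf-16-be` codec is used because Python has no separate UCS-2 codec; for characters below U+10000 the two encodings produce the same single unit.

```python
def utf16_units(ch):
    """Return the 16-bit code units UTF-16 needs for one character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# A Spanish letter fits in one 16-bit unit, so UCS-2 can hold it.
print([hex(u) for u in utf16_units("ñ")])

# U+10000 needs a surrogate pair, which UCS-2 cannot express as one character.
print([hex(u) for u in utf16_units("\U00010000")])
```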
>
> Thankfully, no. :)
> >
> > I don't know whether those systems you describe being written in Java
> > makes a difference. They can do what they want. The native Java string
> > is Unicode, though I don't remember if it is UCS-2 or UTF-16. My guess is
> > that it was once the former and is now the latter. One of the documents
> > on this on Sun's site suggests that Java used UCS-2 until the recently
> > released 1.5, which is the first to use UTF-16.
>
> The Java native string being unicode is exactly what made me start
> worrying -- when I was learning Java a couple of years ago (because I
> wanted to port an app to it so as to be able to run it right on the Unix
> box where the Oracle database was) I was horrified the first time I
> tried reading back what I had written to a text file when I saw spaces
> between all the characters.
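
One plausible cause of those "spaces" is that the file was written as 16-bit characters and then viewed as 8-bit text. The Java code is not shown in the thread, so this is an assumption; the sketch below only reproduces the symptom in Python.

```python
text = "hello"

# Each ASCII character becomes two bytes in UTF-16: a zero byte plus the letter.
as_utf16 = text.encode("utf-16-be")
print(as_utf16)

# An 8-bit viewer shows the zero bytes as gaps between the letters.
viewed = as_utf16.decode("latin-1").replace("\x00", " ")
print(repr(viewed))

# Written as UTF-8 (or any ASCII-compatible 8-bit encoding), there are no gaps.
print(text.encode("utf-8"))
```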
> >
> >Rich
> >
> > "Ellen K." <72322.1016@compuserve.com> wrote in message news:aqag115606i9g8bmh3lst66une1f1sotth@4ax.com...
> > UTF-8 is unicode?!? Sheesh, all this time I thought it meant 8-bit.
> > In fact I could swear I read that somewhere.
> >
> > My question was coming from the database perspective, where I always use
> > char and varchar, as opposed to nchar and nvarchar. I give the
> > front-end guys little templates for creating the XML documents for all
> > my SQL Server stored procedures that take XML input, and I always
> > specify UTF-8 in the header... and my char and varchar columns always
> > end up normal, so since you're now telling me UTF-8 is really unicode, I
> > guess that would answer my question for XML data I would be getting from
> > the apps...? Or would the answer be different if the incoming XML is
> > some other encoding?
> >
> > To simulate getting nvarchar data from somewhere, I just tried creating
> > two dummy tables, one with an nvarchar column and the other with a
> > varchar column, typed stuff into the nvarchar one, then inserted to the
> > varchar one select from the nvarchar one and it looks normal.
> >
> > If all this means I was worrying about nothing, excellent! OTOH, is
> > there something I should be worrying about that I didn't ask?
> >
> > The only pieces whose names I know so far are Sonic and SalesForce, both
> > of which are written in Java, if that makes any difference. I know
> > there is at least one other external piece but I think that is the next
> > phase.
> >
> > On Sat, 19 Feb 2005 21:37:15 -0800, "Rich" <@> wrote in message
> > <421821c1$1@w3.nls.net>:
> >
> > > You need to be more specific than "8-bit characters". There are many
> > > 8-bit character encodings. If you are using Windows to generate your
> > > data you most likely are using Windows-1252, which is the default 8-bit
> > > character set for U.S. English in Windows. Windows supports many 8-bit
> > > encodings, so you could be using something else too.
> > >
> > > Unicode is a character set, not an encoding. There are multiple
> > > encodings, the main ones being UTF-8, UTF-16, and UTF-32. You can use
> > > any of these for XML, as well as non-Unicode encodings. For
> > > interoperability you should use Unicode, preferably UTF-8.
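
The distinction between the character set and its encodings can be sketched as follows: one code point, three byte sequences. The helper name `encodings_of` is invented for this example, and the big-endian variants are used so that no byte order mark is added.

```python
def encodings_of(ch):
    """Map each Unicode encoding name to the bytes it uses for one character."""
    return {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

ch = "ñ"  # U+00F1, a single Unicode code point
print(f"U+{ord(ch):04X}")
for name, data in encodings_of(ch).items():
    print(f"{name:10} {data.hex()} ({len(data)} bytes)")
```

All three byte sequences decode back to the same character, which is what makes them encodings of one character set rather than different character sets.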
> > >
> > > What comes out when the XML is parsed depends on the XML parser. XML
> > > is logically expressed in Unicode. The Windows XML parsers provide a
> > > Unicode interface. Other parsers could do differently.
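
As one example of a parser with a Unicode interface, Python's standard-library parser returns the same string whether the document was encoded as ISO-8859-1 or UTF-8, provided the declaration matches the bytes. The helper `make_doc` and the element name are assumptions for this sketch; converting to an 8-bit encoding for storage is then a separate, explicit step.

```python
import xml.etree.ElementTree as ET

def make_doc(value, encoding):
    """Build a tiny XML document whose declaration matches its byte encoding."""
    text = f'<?xml version="1.0" encoding="{encoding}"?><A>{value}</A>'
    return text.encode(encoding)

latin1_doc = make_doc("España", "ISO-8859-1")
utf8_doc = make_doc("España", "UTF-8")

# The raw bytes differ, but the parsed text is the same Unicode string.
from_latin1 = ET.fromstring(latin1_doc).text
from_utf8 = ET.fromstring(utf8_doc).text
print(latin1_doc != utf8_doc, from_latin1 == from_utf8)

# Storing into an 8-bit (Windows-1252) column is an explicit conversion.
stored = from_utf8.encode("cp1252")
print(stored)
```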
> > >
> > >Rich
> > >
> > >
> > > "Ellen K." <72322.1016@compuserve.com> wrote in message news:4o2g11pu048kafbdilg46u77vs5ls0be55@4ax.com...
> > > Our new enterprise system is going to be built around an Enterprise
> > > Service Bus. I don't have the full specs yet but as I understand it the
> > > main apps (starting with SalesForce) are going to be out on the internet
> > > and the Sonic ESB will be the messaging piece. There will be an
> > > Operational Data Store in house that will get updated every night on a
> > > batch basis from the main apps.
> > >
> > > My data warehouse will continue to be the data warehouse and will remain
> > > in house. The dimensions will stay the same but I might have to create
> > > separate measures for the data from the new apps and then create views
> > > to keep everything transparent to the users.
> > >
> > > I'm thinking if we're going to have an ODS in house already, I may as
> > > well do the ETL from there. But I'm worrying that the new data will
> > > probably be unicode (because Java defaults to that and SalesForce is
> > > written in Java). Right now I am storing everything (except our blobs
> > > of course) in 8-bit characters.
> > >
> > > Anyone here who's up on this stuff, can the XML that goes back and forth
> > > convert between unicode and 8-bit characters, or am I gonna have to
> > > redefine all my data? For example, if unicode data is put into an XML
> > > document that specifies UTF-8, what comes out when the document is
> > > parsed? How about vice versa? If this is too simplistic to work, what
> > > is needed?
> > >
> > > (We actually have no substantive need for unicode -- we are bilingual
> > > Spanish but all the special Spanish characters exist in the ascii
> > > character set.)
------=_NextPart_000_085D_01C5174D.937C1370--
--- BBBS/NT v4.01 Flag-5
* Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)