Text 2636, 495 lines
Written 2005-02-20 12:52:24 by Rich (1:379/45)
Comment on text 2629 by Ellen K. (1:379/45)
Subject: Re: ESB / XML / Unicode vs 8-bit characters ?
=====================================================
From: "Rich" <@>
From what you describe below, if the values you emit to XML have
non-ASCII characters I would expect you to have a problem.
Rich
"Ellen K." <72322.1016@compuserve.com> wrote in message
news:eanh11h4vv6b9v21fiaounii3f5dunjl3g@4ax.com...
On Sat, 19 Feb 2005 23:32:37 -0800, "Rich" <@> wrote in message
<42183ccd@w3.nls.net>:
> The UTF in UTF-8/16/32 stands for Unicode Transformation Format.
> You can find these defined in section 2.5 of
> http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf.
THANK YOU SO MUCH!!! :)
>
> It's not clear to me how you are creating the XML from the
> templates. If ANSI data is emitted into an XML document declared as
> UTF-8 then you would have problems only for non-ASCII characters.
> UTF-8 and Windows-1252 are identical for 0x00 to 0x7F which is ASCII
> in both.
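
The point above can be checked with a short sketch. It is written in Python rather than the VB used elsewhere in this thread, and the sample strings and the helper name parses_as_declared_utf8 are made up for illustration. It assumes the front end emits Windows-1252 bytes into a document whose header says UTF-8.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample values, not taken from the thread.
ascii_text = "Contract 12345"
spanish_text = "Se\u00f1or Mu\u00f1oz"  # contains U+00F1 (n with tilde)

# ASCII-only text is byte-for-byte identical in Windows-1252 and UTF-8.
same_for_ascii = ascii_text.encode("cp1252") == ascii_text.encode("utf-8")

# Non-ASCII text is not: U+00F1 is one byte in Windows-1252, two in UTF-8.
same_for_spanish = spanish_text.encode("cp1252") == spanish_text.encode("utf-8")


def parses_as_declared_utf8(payload: bytes) -> bool:
    """Wrap payload in a document declared as UTF-8 and try to parse it."""
    doc = b'<?xml version="1.0" encoding="UTF-8"?><A>' + payload + b"</A>"
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


print(same_for_ascii, same_for_spanish)
print(parses_as_declared_utf8(spanish_text.encode("cp1252")))
print(parses_as_declared_utf8(spanish_text.encode("utf-8")))
```

This only shows how one strict parser reacts to the mismatch. A more lenient consumer might accept the bytes and silently produce wrong characters instead of an error.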
I don't have a copy of a template here at home, but I have them create
it by string concatenation because that seems to be the only way to be
able to have CDATA attributes, which I have to have because in the
legacy data numeric-appearing identifiers are actually 10-character
strings with leading spaces, and if these are not specified as CDATA
the spaces go lost even with "xml:space="preserve"" included in the
header. Here is a code snippet from one of my apps that creates an XML
document which is passed as a parameter to a SQL Server stored
procedure:
>       strXM = "<?xml version =" & Chr(34) & "1.0" & Chr(34) & " encoding=" & Chr(34) & "UTF-8" & Chr(34) & "?>" & vbCrLf _
>             & "<ROOT xml:space=" & Chr(34) & "preserve" & Chr(34) & ">" & vbCrLf
>
>       Do While Not .EOF
>           strXM = strXM & "<M><A>" & !Ofc & "</A><B><![CDATA[" & !Contract & "]]></B><C>" & !TCode & "</C><D>" & !Date & "</D>" _
>                 & "<E><![CDATA[" & !TransNo & "]]></E></M>" & vbCrLf
>           .MoveNext
>       Loop
>
>       strXM = strXM & "</ROOT>"
(The vbCrLf's are there so if there is a problem the document can be
printed to a text file and be easier for humans to read -- SQL Server
ignores them. The single-character aliases for entity and attribute
names are for performance -- for most of the stuff we use these for it
doesn't really matter because we are only sending a few rows, but the
first time I did it it was for something that was sending about 5000
rows and there it made a huge difference, so I stuck with it. We
comment both the front-end code and the stored procedure with the
mappings of these aliases.)
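
A minimal sketch of the same concatenation approach, in Python instead of VB, with two guards the snippet above does not show: escaping markup characters in the plain fields, and splitting any "]]>" inside a CDATA value so it cannot end the section early. The helper names cdata and build_row and the sample row are hypothetical. The check at the end uses Python's parser, so it says nothing about how SQL Server treats the same document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def cdata(value: str) -> str:
    """Wrap value in a CDATA section, splitting any ']]>' it contains."""
    return "<![CDATA[" + value.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def build_row(ofc: str, contract: str, tcode: str, date: str, trans_no: str) -> str:
    """Build one <M> row with the same single-letter element names as above."""
    return (
        "<M><A>" + escape(ofc) + "</A>"
        + "<B>" + cdata(contract) + "</B>"
        + "<C>" + escape(tcode) + "</C>"
        + "<D>" + escape(date) + "</D>"
        + "<E>" + cdata(trans_no) + "</E></M>"
    )


# Hypothetical data: identifiers are 10-character strings with leading spaces.
rows = [("LA", "     12345", "A&B", "2005-02-20", "       789")]

doc = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<ROOT xml:space="preserve">\n'
    + "\n".join(build_row(*row) for row in rows)
    + "\n</ROOT>"
)

# Encode with the same encoding the header declares before handing it on.
root = ET.fromstring(doc.encode("utf-8"))
first_row = root[0]
print(repr(first_row.find("B").text), repr(first_row.find("C").text))
```

The last step matters for the encoding question in this thread: the bytes sent should be produced with the encoding named in the XML declaration.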
> I do not know how SQL Server maps from char to nchar, specifically
> what conversion is performed. Also, in some (maybe all released)
> versions of SQL Server nchar and nvarchar are encoded in UCS-2. UCS-2
> is a 16-bit encoding like UTF-16. It dates back to when Unicode was
> defined as having 2**16 characters instead of the 2**20+ that it has
> now. You can not express characters >= U+10000 in UCS-2 not that you
> care about these.
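
To make the UCS-2 limit concrete, here is a small Python sketch that lists the 16-bit code units UTF-16 uses for a character. The function name utf16_code_units and the two sample characters are chosen only for illustration. A character below U+10000 fits in one unit, which UCS-2 can also hold; a character at or above U+10000 needs a surrogate pair, which UCS-2 has no way to express.

```python
def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_char = "\u00f1"         # U+00F1, below U+10000
astral_char = "\U00010400"  # U+10400, above U+FFFF

print([hex(unit) for unit in utf16_code_units(bmp_char)])
print([hex(unit) for unit in utf16_code_units(astral_char)])
```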
Thankfully, no. :)
>
> I don't know whether those systems you describe being written in
> Java makes a difference. They can do what they want. The native Java
> string is Unicode though I don't remember if it is UCS-2 or UTF-16.
> My guess is that it was once the former and is now the latter. One of
> the documents on this on Sun's site suggests that Java used UCS-2
> until the recently released 1.5 which is the first to use UTF-16.
The Java native string being unicode is exactly what made me start
worrying -- when I was learning Java a couple of years ago (because I
wanted to port an app to it so as to be able to run it right on the Unix
box where the Oracle database was) I was horrified the first time I
tried reading back what I had written to a text file when I saw spaces
between all the characters.
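
One plausible explanation for the "spaces", offered only as an assumption since the thread does not show the Java code: the file was written as 16-bit characters and then viewed as 8-bit text. Every ASCII character then carries a zero byte, which many viewers show as a blank. A Python sketch of that effect:

```python
text = "abc"

# Write the text as 16-bit code units (big-endian here, chosen arbitrarily).
as_utf16 = text.encode("utf-16-be")

# Misread those bytes as an 8-bit encoding: each letter gains a NUL neighbour.
misread = as_utf16.decode("latin-1")

# Reading with the encoding that was used to write gives the text back.
round_trip = as_utf16.decode("utf-16-be")

print(repr(misread), repr(round_trip))
```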
>
>Rich
>
> "Ellen K." <72322.1016@compuserve.com> wrote in message news:aqag115606i9g8bmh3lst66une1f1sotth@4ax.com...
> UTF-8 is unicode?!? Sheesh, all this time I thought it meant 8-bit.
> In fact I could swear I read that somewhere.
>
> My question was coming from the database perspective, where I always use
> char and varchar, as opposed to nchar and nvarchar. I give the
> front-end guys little templates for creating the XML documents for all
> my SQL Server stored procedures that take XML input, and I always
> specify UTF-8 in the header... and my char and varchar columns always
> end up normal, so since you're now telling me UTF-8 is really unicode, I
> guess that would answer my question for XML data I would be getting from
> the apps...? Or would the answer be different if the incoming XML is
> some other encoding?
>
> To simulate getting nvarchar data from somewhere, I just tried creating
> two dummy tables, one with an nvarchar column and the other with a
> varchar column, typed stuff into the nvarchar one, then inserted to the
> varchar one select from the nvarchar one and it looks normal.
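
A test like this looks normal whenever every typed character also exists in the 8-bit code page. The Python sketch below is an analogy only: it uses Python's codec with replacement on error, and assumes a Windows-1252 target, neither of which is confirmed for SQL Server in this thread. The helper name to_8bit and the sample strings are hypothetical.

```python
def to_8bit(text: str, codepage: str = "cp1252") -> bytes:
    """Convert Unicode text to an 8-bit code page, replacing what does not fit."""
    return text.encode(codepage, errors="replace")


fits = "Mu\u00f1oz"          # n with tilde exists in Windows-1252
does_not_fit = "\u03a9mega"  # Greek capital omega does not

print(to_8bit(fits))
print(to_8bit(does_not_fit))
```

The first value survives the conversion and the second loses its first character, so a fair test of the varchar path needs input from outside the target code page.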
>
> If all this means I was worrying about nothing, excellent! OTOH, is
> there something I should be worrying about that I didn't ask?
>
> The only pieces whose names I know so far are Sonic and SalesForce, both
> of which are written in Java, if that makes any difference. I know
> there is at least one other external piece but I think that is the next
> phase.
>
> On Sat, 19 Feb 2005 21:37:15 -0800, "Rich" <@> wrote in message
> <421821c1$1@w3.nls.net>:
>
> > You need to be more specific than "8-bit characters". There are
> > many 8-bit character encodings. If you are using Windows to generate
> > your data you most likely are using Windows-1252 which is the default
> > 8-bit character set for U.S. English in Windows. Windows supports
> > many 8-bit encodings so you could be using something else too.
> >
> > Unicode is a character set not an encoding. There are multiple
> > encodings the main ones being UTF-8, UTF-16, and UTF-32. You can use
> > any of these for XML as well as non-Unicode encodings. For
> > interoperability you should use Unicode preferably UTF-8.
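
The distinction between the character set and its encodings can be seen by encoding one character several ways. This Python sketch uses U+00F1 as an arbitrary sample, and the little-endian forms of UTF-16 and UTF-32 are picked only to keep byte-order marks out of the output.

```python
ch = "\u00f1"  # one character: LATIN SMALL LETTER N WITH TILDE

# One code point, several byte representations.
code_point = ord(ch)
encoded = {
    name: ch.encode(name)
    for name in ("cp1252", "utf-8", "utf-16-le", "utf-32-le")
}

print(hex(code_point))
for name, data in encoded.items():
    print(name, data, len(data))
```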
> >
> > What comes out when the XML is parsed depends on the XML parser.
> > XML is logically expressed in Unicode. The Windows XML parsers
> > provide a Unicode interface. Other parsers could do differently.
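
As an illustration with one particular parser (Python's, not the Windows parsers mentioned above), two documents that carry the same text in different declared encodings parse to the same Unicode string. Converting to 8-bit characters is then a separate, explicit step after parsing. The sample text and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The same element content, serialized in two different encodings.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><A>Mu\u00f1oz</A>'
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?><A>Mu\u00f1oz</A>'
).encode("utf-8")

# The parser reads the declaration and hands back Unicode text either way.
from_latin1 = ET.fromstring(latin1_doc).text
from_utf8 = ET.fromstring(utf8_doc).text

# Storing the value in an 8-bit column is a separate conversion step.
stored = from_utf8.encode("cp1252")

print(from_latin1 == from_utf8, stored)
```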
> >
> >Rich
> >
> >
> > "Ellen K." <72322.1016@compuserve.com> wrote in message news:4o2g11pu048kafbdilg46u77vs5ls0be55@4ax.com...
> > Our new enterprise system is going to be built around an Enterprise
> > Service Bus. I don't have the full specs yet but as I understand it the
> > main apps (starting with SalesForce) are going to be out on the internet
> > and the Sonic ESB will be the messaging piece. There will be an
> > Operational Data Store in house that will get updated every night on a
> > batch basis from the main apps.
> >
> > My data warehouse will continue to be the data warehouse and will remain
> > in house. The dimensions will stay the same but I might have to create
> > separate measures for the data from the new apps and then create views
> > to keep everything transparent to the users.
> >
> > I'm thinking if we're going to have an ODS in house already, I may as
> > well do the ETL from there. But I'm worrying that the new data will
> > probably be unicode (because Java defaults to that and SalesForce is
> > written in Java). Right now I am storing everything (except our blobs
> > of course) in 8-bit characters.
> >
> > Anyone here who's up on this stuff, can the XML that goes back and forth
> > convert between unicode and 8-bit characters, or am I gonna have to
> > redefine all my data? For example, if unicode data is put into an XML
> > document that specifies UTF-8, what comes out when the document is
> > parsed? How about vice versa? If this is too simplistic to work, what
> > is needed?
> >
> > (We actually have no substantive need for unicode -- we are bilingual
> > Spanish but all the special Spanish characters exist in the ascii
> > character set.)
--- BBBS/NT v4.01 Flag-5
* Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)