Text 2625, 325 lines
Written 2005-02-19 23:32:36 by Rich (1:379/45)
Comment on text 2624 by Ellen K. (1:379/45)
Subject: Re: ESB / XML / Unicode vs 8-bit characters ?
=====================================================
From: "Rich" <@>
The UTF in UTF-8/16/32 stands for Unicode Transformation Format. You
can find these defined in section 2.5 of
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf.

It's not clear to me how you are creating the XML from the templates.
If ANSI data is emitted into an XML document declared as UTF-8 then you
would have problems only for non-ASCII characters. UTF-8 and
Windows-1252 are identical for 0x00 to 0x7F, which is ASCII in both.

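To make the ASCII versus non-ASCII point concrete, here is a minimal sketch in Python (my choice of language for illustration; nothing in the thread uses it). The sample strings and the `<name>` element are hypothetical. It shows that ASCII-only text has the same bytes in Windows-1252 and UTF-8, while a character like "ñ" does not, and that a strict parser rejects Windows-1252 bytes inside a document declared as UTF-8.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text; "ñ" is U+00F1, which is outside ASCII.
ascii_text = "Hola"
spanish_text = "año"

# ASCII-only text produces identical bytes in both encodings.
same_for_ascii = ascii_text.encode("cp1252") == ascii_text.encode("utf-8")

# Non-ASCII text differs: one byte in Windows-1252, two bytes in UTF-8.
cp1252_bytes = spanish_text.encode("cp1252")  # b'a\xf1o'
utf8_bytes = spanish_text.encode("utf-8")     # b'a\xc3\xb1o'

header = b'<?xml version="1.0" encoding="UTF-8"?>'

# Correctly encoded document: the parser hands back the original text.
good_doc = header + b"<name>" + utf8_bytes + b"</name>"
parsed_text = ET.fromstring(good_doc).text

# Windows-1252 bytes in a document that claims to be UTF-8:
# the byte 0xF1 is not valid UTF-8 here, so parsing fails.
bad_doc = header + b"<name>" + cp1252_bytes + b"</name>"
try:
    ET.fromstring(bad_doc)
    bad_doc_parses = True
except ET.ParseError:
    bad_doc_parses = False

print(same_for_ascii, cp1252_bytes, utf8_bytes, parsed_text, bad_doc_parses)
```

Other parsers may be more lenient than this one, for example by substituting a replacement character instead of failing, so treat the failure mode as parser-dependent.
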
I do not know how SQL Server maps from char to nchar, specifically
what conversion is performed. Also, in some (maybe all released)
versions of SQL Server nchar and nvarchar are encoded in UCS-2. UCS-2 is
a 16-bit encoding like UTF-16. It dates back to when Unicode was defined
as having 2**16 characters instead of the 2**20+ that it has now. You
cannot express characters >= U+10000 in UCS-2, not that you care about
these.

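As a rough illustration of the UCS-2 limit, the sketch below (Python again, with hypothetical helper names `utf16_code_units` and `fits_in_ucs2`) lists the 16-bit code units that UTF-16 uses for a character. Anything below U+10000 takes one unit, which is also what UCS-2 would store. Anything at or above U+10000 needs a surrogate pair, which UCS-2 has no way to represent as a single character.

```python
def utf16_code_units(ch):
    """Return the 16-bit code units UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fits_in_ucs2(ch):
    """True if the character is below U+10000 and so fits in one 16-bit unit."""
    return ord(ch) < 0x10000


n_tilde = "\u00f1"            # ñ, U+00F1
g_clef = "\U0001D11E"         # musical G clef, U+1D11E

n_tilde_units = utf16_code_units(n_tilde)  # [0x00F1]
g_clef_units = utf16_code_units(g_clef)    # [0xD834, 0xDD1E], a surrogate pair

for ch in (n_tilde, g_clef):
    units = ", ".join(f"0x{u:04X}" for u in utf16_code_units(ch))
    print(f"U+{ord(ch):04X}: fits in UCS-2 = {fits_in_ucs2(ch)}, UTF-16 units = {units}")
```

For Spanish text every character falls in the first case, which is why the limit does not matter for your data.
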
I don't know whether those systems you describe being written in Java
makes a difference. They can do what they want. The native Java string
is Unicode, though I don't remember if it is UCS-2 or UTF-16. My guess
is that it was once the former and is now the latter. One of the
documents on this on Sun's site suggests that Java used UCS-2 until the
recently released 1.5, which is the first to use UTF-16.

Rich

"Ellen K." <72322.1016@compuserve.com> wrote in message
news:aqag115606i9g8bmh3lst66une1f1sotth@4ax.com...

UTF-8 is unicode?!? Sheesh, all this time I thought it meant 8-bit.
In fact I could swear I read that somewhere.

My question was coming from the database perspective, where I always use
char and varchar, as opposed to nchar and nvarchar. I give the
front-end guys little templates for creating the XML documents for all
my SQL Server stored procedures that take XML input, and I always
specify UTF-8 in the header... and my char and varchar columns always
end up normal, so since you're now telling me UTF-8 is really unicode, I
guess that would answer my question for XML data I would be getting from
the apps...? Or would the answer be different if the incoming XML is
some other encoding?

To simulate getting nvarchar data from somewhere, I just tried creating
two dummy tables, one with an nvarchar column and the other with a
varchar column, typed stuff into the nvarchar one, then inserted to the
varchar one select from the nvarchar one and it looks normal.

If all this means I was worrying about nothing, excellent! OTOH, is
there something I should be worrying about that I didn't ask?

The only pieces whose names I know so far are Sonic and SalesForce, both
of which are written in Java, if that makes any difference. I know
there is at least one other external piece but I think that is the next
phase.

On Sat, 19 Feb 2005 21:37:15 -0800, "Rich" <@> wrote in message
<421821c1$1@w3.nls.net>:
> You need to be more specific than "8-bit characters". There are many
> 8-bit character encodings. If you are using Windows to generate your
> data you most likely are using Windows-1252, which is the default 8-bit
> character set for U.S. English in Windows. Windows supports many 8-bit
> encodings so you could be using something else too.
>
> Unicode is a character set, not an encoding. There are multiple
> encodings, the main ones being UTF-8, UTF-16, and UTF-32. You can use
> any of these for XML as well as non-Unicode encodings. For
> interoperability you should use Unicode, preferably UTF-8.
>
> What comes out when the XML is parsed depends on the XML parser. XML
> is logically expressed in Unicode. The Windows XML parsers provide a
> Unicode interface. Other parsers could do differently.
>
> Rich
>
>
> "Ellen K." <72322.1016@compuserve.com> wrote in message
> news:4o2g11pu048kafbdilg46u77vs5ls0be55@4ax.com...
> Our new enterprise system is going to be built around an Enterprise
> Service Bus. I don't have the full specs yet but as I understand it the
> main apps (starting with SalesForce) are going to be out on the internet
> and the Sonic ESB will be the messaging piece. There will be an
> Operational Data Store in house that will get updated every night on a
> batch basis from the main apps.
>
> My data warehouse will continue to be the data warehouse and will remain
> in house. The dimensions will stay the same but I might have to create
> separate measures for the data from the new apps and then create views
> to keep everything transparent to the users.
>
> I'm thinking if we're going to have an ODS in house already, I may as
> well do the ETL from there. But I'm worrying that the new data will
> probably be unicode (because Java defaults to that and SalesForce is
> written in Java). Right now I am storing everything (except our blobs
> of course) in 8-bit characters.
>
> Anyone here who's up on this stuff, can the XML that goes back and forth
> convert between unicode and 8-bit characters, or am I gonna have to
> redefine all my data? For example, if unicode data is put into an XML
> document that specifies UTF-8, what comes out when the document is
> parsed? How about vice versa? If this is too simplistic to work, what
> is needed?
>
> (We actually have no substantive need for unicode -- we are bilingual
> Spanish but all the special Spanish characters exist in the ascii
> character set.)
--- BBBS/NT v4.01 Flag-5
* Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)