Text 2677, 1173 rader
Skriven 2005-02-22 08:51:44 av Rich (1:379/45)
Kommentar till text 2675 av Ellen K. (1:379/45)
Ärende: Re: ESB / XML / Unicode vs 8-bit characters ?
=====================================================
From: "Rich" <@>
This is a multi-part message in MIME format.
------=_NextPart_000_0A9F_01C518BB.BCF745C0
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
I don't know where SQL Server supported encodings are documented. =
Those supported by MSXML can be found at =
http://support.microsoft.com/default.aspx?scid=3Dkb;en-us;275883. I = believe
many more actually work but this may be the supportd subset. = The easiest way
to see if something works is to try it.
Rich
"Ellen K." <72322.1016@compuserve.com> wrote in message =
news:8fkm111cj6anlhe0p1qi8mta9pflapbgav@4ax.com...
On Mon, 21 Feb 2005 20:52:26 -0800, "Rich" <@> wrote in message
<421aba24@w3.nls.net>:
> I don't understand why oracle using UTF-8 wouldn't work and oracle =
using UTF-16 would. That sounds like an oracle problem.
That would be no surprise to me, I am not an Oracle fan. It drove me
nuts when I had to use it at Kaiser. That parallel query processor =
they
are so proud of crashed if you looked at it cross-eyed. My SQL =
Server
OLTP databases each have over 600 users now, one on a single-processor
box, the other on a dual-processor one, zero performance problems and
almost no tuning ever required. If they were on Oracle we would need
more CPUs and a full-time DBA just to babysit it. If I had a monster
multi-terabyte database that was too big for SQL Server I would go =
with
DB2 or possibly Terabyte. =20
>
> There are two XML issues I see.
>
> One, and the only bug in what we have discussed, is that the =
encoding declared in the ?xml PI must match the actual encoding of the = file.
You can use any encoding that is supported as long as your XML = document is
correctly encoded in whatever is declared.
>
> The second issue is what encodings does your software support. I =
believe all XML implementations to claim compliance must support UTF-8 = and
UTF-16. Most should support more. I haven't tried Microsoft SQL = Server but
I would expect it to handle any of the encodings with support = installed on
that host machine. Windows supports many encodings. I = just checked my
machine and see more than 100 supported encodings.
Where do I look to see what encodings are supported?
>
> My suggestion is to stick with UTF-8 for the stuff you generate if =
you can. For what you have to consume from other software simply has to =
work. If not you can go back to the folks responsible for creating it = and
find a way to make it valid.
For stuff I will have to generate, the FOR XML clause doesn't have a
parameter for the encoding and I'm not finding anything in BooksOnline
about how to set the encoding, or about a default encoding for output
generated using FOR XML. ??? Clearly this is something about which =
I
need to learn much more, until now I have only been consuming XML, not
outputting it. Somewhere I saw that SS2005 can automatically create
SOAP objects, maybe that is something to investigate.
Based on Sunday's experiments and the fact that Oracle's default is
UTF-16, it's looking to me like that would be the most foolproof, but
OTOH being that every character will take 2 bytes it could slow things
down. I had originally planned to communicate directly with Oracle =
but
I will likely have to use Sonic for situations requiring the =
equivalent
of a distributed transaction where related data need to be written to
SQL Server, Oracle and one or more of the purchased apps. Our Oracle
guy's strength is the Oracle Financials apps as opposed to stuff like
interoperability. Sonic and the purchased apps are being installed =
and
configured by an outside consulting outfit -- I don't wanna get =
involved
with trying to make either the outside consultants or the vendors fix
stuff that doesn't work, I want it to work from day one and continue =
to
work going forward. I am going to try to talk with them about this =
but
based on what I've seen so far I can't say my expectations are very
high.
=20
>
>Rich
>
> "Ellen K." <72322.1016@compuserve.com> wrote in message =
news:qt9l115cio5kc8ijjumaubpvb6n9ig50ir@4ax.com...
> Argggh, I replied from work using Dale's mirror, it doesn't show up
> here, waitaminnit...
>
> OK, here we go:
>
> First of all I have to say you have once again earned my eternal
> gratitude, I am so glad you are here. :)
>
> I still have to make a decision regarding the standard I want to
> request. Are you saying that if for example I get XML generated by =
say
> SalesForce with UTF-8 specified and it includes these characters, I
> would not have a problem? Stuff I write myself I'm not worried =
about
> because I have the opportunity to tweak it, what I'm worried about =
is
> what the ESB will try to feed my databases. Since even the =
hand-typed
> XML was accepted when UTF-16 was specified, I'm kind of leaning =
toward
> that, especially since the Oracle guy told me that the TJ =
accounting
> manager used to complain that the Spanish characters weren't coming
> out on reports when we were on Oracle 8.x which used UTF-8 but as =
soon
> as we moved to 9.x which uses UTF-16 there were no more problems. =
What
> (if any) do you see as a potential downside to UTF-16?
>
> Thanks again. :)
> =20
> =20
>
> On Mon, 21 Feb 2005 00:24:53 -0800, "Rich" <@> wrote in message
> <42199a9e@w3.nls.net>:
>
> > The problem is not that UTF-8 won't work for what you call =
Spanish characters. UTF-8 can encode anything that you consider a = character.
The problem is that when these characters are present they = are not being
encoded in UTF-8. There are two straight forward = solutions I see. One is to
encode the XML correctly in UTF-8. The = other is to encode the XML in the
Windows ANSI encoding I suspect you = are using and to tag it correctly. On a
Windows U.S. English system = this would be "Windows-1252". I would expect SQL
Server to support = this. Other applications may or may not. You could also
use UTF-16 as = long as you generate your XML in UTF-16. I don't know if this
is simple = for your application or not. If you are using VB 7.0 it should be.
> >
> >Rich
> >
> > "Ellen K." <72322.1016@compuserve.com> wrote in message =
news:of2j11lqf4r4h1dsnvok8i1dv7c9e3rlb2@4ax.com...
> > Well. You just helped me learn a lot. For one thing, I guess =
I
> > thought that because the Spanish characters can be expressed as =
8 bits
> > they were ASCII. =20
> >
> > I just made a little test stored procedure taking an XML =
document as a
> > parameter, created the document manually in Notepad with UTF-8 =
specified
> > in the header and tried including some of the Spanish =
characters... and
> > it failed. SQL Server could not execute sp_xml_preparedocument =
because
> > "an invalid character was found in text content". Just to make =
sure
> > that was the problem I substituted non-Spanish characters for =
the
> > Spanish ones and it executed fine. However, I can manually type =
the
> > text with the Spanish characters into the varchar field if I =
open the
> > table in EnterpriseManager and SQL Server is perfectly happy. =
OTOH, if
> > I specify UTF-16, I get "Switch from current encoding to =
specified
> > encoding not supported." Next thing I tried was cloning the =
sproc to
> > write to the table with the nvarchar column, still no joy, same =
error
> > message... but on changing the datatype of the input parameter =
from text
> > to ntext it worked fine. BUT here's the surprise (OK, to me it =
was a
> > surprise): If I again clone the sproc to point to the table =
with the
> > varchar column, but leave the input parameter as ntext and =
specify
> > UTF-16 in the document header, it works. In other words, a =
varchar (and
> > presumably a char) column can successfully accept unicode data =
even
> > though char and varchar are explicitly defined as non-unicode =
datatypes!
> >
> > Now I have to understand whether I have a problem at work. I =
never
> > experienced this problem in real life because none of the data =
we
> > currently send using XML includes any of the Spanish characters. =
Is
> > the problem only going to occur if the XML document is =
constructed using
> > the concatenated-string method? Or would it happen any time an =
XML
> > document specified as UTF-8 included Spanish characters? (The =
data
> > sent by SalesForce to the ODS is likely to include Spanish =
characters,
> > but it probably creates the XML some other way.) Do I need to =
tell the
> > consulting outfit to specify all XML as UTF-16? =20
> >
> > For the ETL from the ODS to the data warehouse I am not planning =
to use
> > Sonic, but rather probably to link the databases and use a bunch =
of
> > stored procedures controlled by some VB code, IOW I will not =
need XML
> > because all the extract sprocs will look like INSERT =
INTO....SELECT
> > FROM.
> >
> > I don't yet understand why UTF-8 can't work for the Spanish
> > characters... (unless it only doesn't work when the characters =
are
> > manually typed into the document). If I correctly understand =
the
> > document to which you referred me, an 8-bit character can't fit =
in one
> > UTF-8 byte because the first bit is reserved for indicating =
which is the
> > first byte of a UTF-8 multi-byte character. (This was your =
point about
> > not greater than 0x7F.) But why wouldn't it just make two =
bytes out of
> > the Spanish characters then? The documentation says UTF-8 uses
> > multiple bytes for the characters that it can't fit into one =
byte. =20
> >
> > ???
> >
> >
> >
> > On Sun, 20 Feb 2005 13:10:40 -0800, "Rich" <@> wrote in message
> > <4218fc91@w3.nls.net>:
> >
> > > The Spanish accented characters are not part of ASCII. They =
are part of Windows calls ANSI of which ASCII is the subset (0x00 to = 0x7F).
Any character in the 0x80 to 0xFF range is not compatible = between ANSI and
UTF-8.
> > >
> > >Rich
> > >
> > > "Ellen K." <72322.1016@compuserve.com> wrote in message =
news:7ouh119ivmuk26icg3mqqqk2ss1lfm5c10@4ax.com...
> > > Should not have any non-ASCII characters, as previously noted =
all the
> > > special Spanish characters are available in the ASCII =
character set.
> > > And since the company is built on our understanding of the =
Hispanic
> > > market, I don't see any use of, say, pictograph-based =
languages in the
> > > foreseeable future. If 10 years down the road something =
like that
> > > happens, well, by then we will no longer need compatibility =
with the
> > > current legacy system because it will long since have been =
replaced.
> > >
> > > On Sun, 20 Feb 2005 12:52:25 -0800, "Rich" <@> wrote in =
message
> > > <4218f849$1@w3.nls.net>:
> > >
> > > > From what you describe below, if the values you emit to =
XML have non-ASCII characters I would expect you to have a problem.
> > > >
> > > >Rich
> > > >
> > > > "Ellen K." <72322.1016@compuserve.com> wrote in message =
news:eanh11h4vv6b9v21fiaounii3f5dunjl3g@4ax.com...
> > > > On Sat, 19 Feb 2005 23:32:37 -0800, "Rich" <@> wrote in =
message
> > > > <42183ccd@w3.nls.net>:
> > > >
> > > > > The UTF in UTF-8/16/32 stands for Unicode =
Transformation Format. You can find these defined in section 2.5 of =
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf.
> > > >
> > > > THANK YOU SO MUCH!!! :)
> > > > >
> > > > > It's not clear to me how you are creating the XML from =
the templates. If ANSI data is emitted into an XML document declared as =
UTF-8 then you would have problems only for non-ASCII characters. UTF-8 = and
Windows-1252 are identical for 0x00 to 0x7F which is ASCII in both.
> > > >
> > > > I don't have a copy of a template here at home, but I have =
them create
> > > > it by string concatenation because that seems to be the =
only way to be
> > > > able to have CDATA attributes, which I have to have =
because in the
> > > > legacy data numeric-appearing identifiers are actually =
10-character
> > > > strings with leading spaces, and if these are not =
specified as CDATA
> > > > the spaces go lost even with "xml:space=3D"preserve"" =
included in the
> > > > header. Here is a code snippet from one of my apps that =
creates an XML
> > > > document which is passed as a parameter to a SQL Server =
stored
> > > > procedure:
> > > >
> > > > > strXM =3D "<?xml version =3D" & Chr(34) & "1.0" & =
Chr(34) & " encoding=3D" & Chr(34) & "UTF-8" & Chr(34) & "?>" & vbCrLf = _
> > > > > & "<ROOT xml:space=3D" & Chr(34) & "preserve" & =
Chr(34) & ">" & vbCrLf
> > > > >
> > > > > Do While Not .EOF
> > > > > strXM =3D strXM & "<M><A>" & !Ofc & =
"</A><B><![CDATA[" & !Contract & "]]></B><C>" & !TCode & "</C><D>" & =
!Date & "</D>" _
> > > > > & "<E><![CDATA[" & !TransNo & =
"]]></E></M>" & vbCrLf
> > > > > .MoveNext
> > > > > Loop
> > > > >
> > > > > strXM =3D strXM & "</ROOT>"
> > > >
> > > > (The vbCrLf's are there so if there is a problem the =
document can be
> > > > printed to a text file and be easier for humans to read -- =
SQL Server
> > > > ignores them. The single-character aliases for entity and =
attribute
> > > > names are for performance -- for most of the stuff we use =
these for it
> > > > doesn't really matter because we are only sending a few =
rows, but the
> > > > first time I did it it was for something that was sending =
about 5000
> > > > rows and there it made a huge difference, so I stuck with =
it. We
> > > > comment both the front-end code and the stored procedure =
with the
> > > > mappings of these aliases.)
> > > >
> > > > > I do not know how SQL Server maps from char to nchar, =
specifically what conversion is performed. Also, in some (maybe all =
released) versions of SQL Server nchar and nvarchar are encoded in = UCS-2.
UCS-2 is a 16-bit encoding like UTF-16. It dates back to when = Unicode was
defined as having 2**16 characters instead of the 2**20+ = that it has now.
You can not express characters >=3D U+10000 in UCS-2 = not that you care about
these.
> > > >
> > > > Thankfully, no. :)
> > > > >
> > > > > I don't know if whether those systems you describe =
being written in java make a difference. They can do what they want. = The
native java string is Unicode though I don't remember if it is UCS-2 = or
UTF-16. My guess is that it was once the former and is now the = latter. One
of the documents on this on sun's site suggests that java = used UCS-2 until
the recently released 1.5 which is the first to use = UTF-16.
> > > >
> > > > The Java native string being unicode is exactly what made =
me start
> > > > worrying -- when I was learning Java a couple of years ago =
(because I
> > > > wanted to port an app to it so as to be able to run it =
right on the Unix
> > > > box where the Oracle database was) I was horrified the =
first time I
> > > > tried reading back what I had written to a text file when =
I saw spaces
> > > > between all the characters.
> > > > >
> > > > >Rich
> > > > >
> > > > > "Ellen K." <72322.1016@compuserve.com> wrote in message =
news:aqag115606i9g8bmh3lst66une1f1sotth@4ax.com...
> > > > > UTF-8 is unicode?!? Sheesh, all this time I thought =
it meant 8-bit.
> > > > > In fact I could swear I read that somewhere.
> > > > >
> > > > > My question was coming from the database perspective, =
where I always use
> > > > > char and varchar, as opposed to nchar and nvarchar. I =
give the
> > > > > front-end guys little templates for creating the XML =
documents for all
> > > > > my SQL Server stored procedures that take XML input, =
and I always
> > > > > specify UTF-8 in the header... and my char and varchar =
columns always
> > > > > end up normal, so since you're now telling me UTF-8 is =
really unicode, I
> > > > > guess that would answer my question for XML data I =
would be getting from
> > > > > the apps...? Or would the answer be different if the =
incoming XML is
> > > > > some other encoding?
> > > > >
> > > > > To simulate getting nvarchar data from somewhere, I =
just tried creating
> > > > > two dummy tables, one with an nvarchar column and the =
other with a
> > > > > varchar column, typed stuff into the nvarchar one, then =
inserted to the
> > > > > varchar one select from the nvarchar one and it looks =
normal. =20
> > > > >
> > > > > If all this means I was worrying about nothing, =
excellent! OTOH, is
> > > > > there something I should be worrying about that I =
didn't ask?
> > > > >
> > > > > The only pieces whose names I know so far are Sonic and =
SalesForce, both
> > > > > of which are written in Java, if that makes any =
difference. I know
> > > > > there is at least one other external piece but I think =
that is the next
> > > > > phase.
> > > > >
> > > > > On Sat, 19 Feb 2005 21:37:15 -0800, "Rich" <@> wrote in =
message
> > > > > <421821c1$1@w3.nls.net>:
> > > > >
> > > > > > You need to be more specific than "8-bit =
characters". There are many 8-bit character encodings. If you are = using
Windows to generate your data you most likely are using = Windows-1252 which is
the default 8-bit character set for U.S. English = in Windows. Windows
supports many 8-bit encodings so you could be using = something else too.
> > > > > >
> > > > > > Unicode is a character set not an encoding. There =
are multiple encodings the main ones being UTF-8, UTF-16, and UTF-32. = You
can use any of these for XML as well as non-Unicode encodings. For =
interoperability you should use Unicode preferably UTF-8.
> > > > > >
> > > > > > What comes out when the XML is parsed depends on =
the XML parser. XML is logically expressed in Unicode. The Windows XML =
parsers provide a Unicode interface. Other parsers could do = differently.
> > > > > >
> > > > > >Rich
> > > > > >
> > > > > >
> > > > > > "Ellen K." <72322.1016@compuserve.com> wrote in =
message news:4o2g11pu048kafbdilg46u77vs5ls0be55@4ax.com...
> > > > > > Our new enterprise system is going to be built =
around an Enterprise
> > > > > > Service Bus. I don't have the full specs yet but as =
I understand it the
> > > > > > main apps (starting with SalesForce) are going to be =
out on the internet
> > > > > > and the Sonic ESB will be the messaging piece. =
There will be an
> > > > > > Operational Data Store in house that will get =
updated every night on a
> > > > > > batch basis from the main apps. =20
> > > > > >
> > > > > > My data warehouse will continue to be the data =
warehouse and will remain
> > > > > > in house. The dimensions will stay the same but I =
might have to create
> > > > > > separate measures for the data from the new apps and =
then create views
> > > > > > to keep everything transparent to the users. =20
> > > > > >
> > > > > > I'm thinking if we're going to have an ODS in house =
already, I may as
> > > > > > well do the ETL from there. But I'm worrying that =
the new data will
> > > > > > probably be unicode (because Java defaults to that =
and SalesForce is
> > > > > > written in Java). Right now I am storing everything =
(except our blobs
> > > > > > of course) in 8-bit characters. =20
> > > > > >
> > > > > > Anyone here who's up on this stuff, can the XML that =
goes back and forth
> > > > > > convert between unicode and 8-bit characters, or am =
I gonna have to
> > > > > > redefine all my data? For example, if unicode data =
is put into an XML
> > > > > > document that specifies UTF-8, what comes out when =
the document is
> > > > > > parsed? How about vice versa? If this is too =
simplistic to work, what
> > > > > > is needed?
> > > > > >
> > > > > > (We actually have no substantive need for unicode -- =
we are bilingual
> > > > > > Spanish but all the special Spanish characters exist =
in the ascii
> > > > > > character set.)
------=_NextPart_000_0A9F_01C518BB.BCF745C0
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-8859-1">
<META content=3D"MSHTML 6.00.3790.1289" name=3DGENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV><FONT face=3DArial size=3D2> I don't know where SQL =
Server=20
supported encodings are documented. Those supported by MSXML can = be
found=20
at <A=20
href=3D"http://support.microsoft.com/default.aspx?scid=3Dkb;en-us;275883"=
>http://support.microsoft.com/default.aspx?scid=3Dkb;en-us;275883</A>.&nb=
sp;=20
I believe many more actually work but this may be the supportd = subset.
The=20
easiest way to see if something works is to try it.</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT> </DIV>
<DIV><FONT face=3DArial size=3D2>Rich</FONT></DIV>
<DIV> </DIV>
<BLOCKQUOTE=20
style=3D"PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; =
BORDER-LEFT: #000000 2px solid; MARGIN-RIGHT: 0px">
<DIV>"Ellen K." <<A=20
=
href=3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>&g=
t;=20
wrote in message <A=20
=
href=3D"news:8fkm111cj6anlhe0p1qi8mta9pflapbgav@4ax.com">news:8fkm111cj6a=
nlhe0p1qi8mta9pflapbgav@4ax.com</A>...</DIV>On=20
Mon, 21 Feb 2005 20:52:26 -0800, "Rich" <@> wrote in =
message<BR><<A=20
=
href=3D"mailto:421aba24@w3.nls.net">421aba24@w3.nls.net</A>>:<BR><BR>&=
gt; =20
I don't understand why oracle using UTF-8 wouldn't work and oracle =
using=20
UTF-16 would. That sounds like an oracle problem.<BR><BR>That =
would be=20
no surprise to me, I am not an Oracle fan. It drove me<BR>nuts =
when I=20
had to use it at Kaiser. That parallel query processor =
they<BR>are so=20
proud of crashed if you looked at it cross-eyed. My SQL=20
Server<BR>OLTP databases each have over 600 users now, one on a=20
single-processor<BR>box, the other on a dual-processor one, zero =
performance=20
problems and<BR>almost no tuning ever required. If they were on =
Oracle=20
we would need<BR>more CPUs and a full-time DBA just to babysit =
it. =20
If I had a monster<BR>multi-terabyte database that was too big for SQL =
Server=20
I would go with<BR>DB2 or possibly Terabyte. =20
<BR><BR>><BR>> There are two XML issues I=20
see.<BR>><BR>> One, and the only bug in what we have =
discussed, is that the encoding declared in the ?xml PI must match the =
actual=20
encoding of the file. You can use any encoding that is supported =
as long=20
as your XML document is correctly encoded in whatever is=20
declared.<BR>><BR>> The second issue is what =
encodings does=20
your software support. I believe all XML implementations to =
claim=20
compliance must support UTF-8 and UTF-16. Most should support=20
more. I haven't tried Microsoft SQL Server but I would expect it =
to=20
handle any of the encodings with support installed on that host =
machine. =20
Windows supports many encodings. I just checked my machine and =
see more=20
than 100 supported encodings.<BR><BR>Where do I look to see what =
encodings are=20
supported?<BR>><BR>> My suggestion is to stick with =
UTF-8=20
for the stuff you generate if you can. For what you have to =
consume from=20
other software simply has to work. If not you can go back to the =
folks=20
responsible for creating it and find a way to make it =
valid.<BR><BR>For stuff=20
I will have to generate, the FOR XML clause doesn't have =
a<BR>parameter for=20
the encoding and I'm not finding anything in BooksOnline<BR>about how =
to set=20
the encoding, or about a default encoding for output<BR>generated =
using FOR=20
XML. ??? Clearly this is something about which =
I<BR>need to=20
learn much more, until now I have only been consuming XML, =
not<BR>outputting=20
it. Somewhere I saw that SS2005 can automatically =
create<BR>SOAP=20
objects, maybe that is something to investigate.<BR><BR>Based on =
Sunday's=20
experiments and the fact that Oracle's default is<BR>UTF-16, it's =
looking to=20
me like that would be the most foolproof, but<BR>OTOH being that every =
character will take 2 bytes it could slow things<BR>down. =
I had=20
originally planned to communicate directly with Oracle but<BR>I will =
likely=20
have to use Sonic for situations requiring the equivalent<BR>of a =
distributed=20
transaction where related data need to be written to<BR>SQL Server, =
Oracle and=20
one or more of the purchased apps. Our Oracle<BR>guy's strength =
is the=20
Oracle Financials apps as opposed to stuff =
like<BR>interoperability. =20
Sonic and the purchased apps are being installed and<BR>configured by =
an=20
outside consulting outfit -- I don't wanna get involved<BR>with trying =
to make=20
either the outside consultants or the vendors fix<BR>stuff that =
doesn't work,=20
I want it to work from day one and continue to<BR>work going =
forward. I=20
am going to try to talk with them about this but<BR>based on what I've =
seen so=20
far I can't say my expectations are =
very<BR>high.<BR> =20
<BR>><BR>>Rich<BR>><BR>> "Ellen K." <<A=20
=
href=3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>&g=
t;=20
wrote in message <A=20
=
href=3D"news:qt9l115cio5kc8ijjumaubpvb6n9ig50ir@4ax.com">news:qt9l115cio5=
kc8ijjumaubpvb6n9ig50ir@4ax.com</A>...<BR>> =20
Argggh, I replied from work using Dale's mirror, it doesn't show=20
up<BR>> here, waitaminnit...<BR>><BR>> OK, here =
we=20
go:<BR>><BR>> First of all I have to say you have once =
again=20
earned my eternal<BR>> gratitude, I am so glad you are=20
here. :)<BR>><BR>> I still have to make a =
decision=20
regarding the standard I want to<BR>> request. Are you =
saying=20
that if for example I get XML generated by say<BR>> =
SalesForce with=20
UTF-8 specified and it includes these characters, I<BR>> =
would not=20
have a problem? Stuff I write myself I'm not worried =
about<BR>> =20
because I have the opportunity to tweak it, what I'm worried about=20
is<BR>> what the ESB will try to feed my databases. =
Since even=20
the hand-typed<BR>> XML was accepted when UTF-16 was =
specified, I'm=20
kind of leaning toward<BR>> that, especially since the Oracle =
guy=20
told me that the TJ accounting<BR>> manager used to complain =
that the=20
Spanish characters weren't coming<BR>> out on reports when we =
were on=20
Oracle 8.x which used UTF-8 but as soon<BR>> as we moved to =
9.x which=20
uses UTF-16 there were no more problems. What<BR>> (if =
any) do=20
you see as a potential downside to UTF-16?<BR>><BR>> =
Thanks=20
again. :)<BR>> <BR>> =20
<BR>><BR>> On Mon, 21 Feb 2005 00:24:53 -0800, "Rich" =
<@>=20
wrote in message<BR>> <<A=20
=
href=3D"mailto:42199a9e@w3.nls.net">42199a9e@w3.nls.net</A>>:<BR>><=
BR>> =20
> The problem is not that UTF-8 won't work for what you =
call=20
Spanish characters. UTF-8 can encode anything that you consider =
a=20
character. The problem is that when these characters are present =
they=20
are not being encoded in UTF-8. There are two straight forward =
solutions=20
I see. One is to encode the XML correctly in UTF-8. The =
other is=20
to encode the XML in the Windows ANSI encoding I suspect you are using =
and to=20
tag it correctly. On a Windows U.S. English system this would be =
"Windows-1252". I would expect SQL Server to support this. =
Other=20
applications may or may not. You could also use UTF-16 as long =
as you=20
generate your XML in UTF-16. I don't know if this is simple for =
your=20
application or not. If you are using VB 7.0 it should =
be.<BR>> =20
><BR>> >Rich<BR>> ><BR>> =
> "Ellen=20
K." <<A=20
=
href=3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>&g=
t;=20
wrote in message <A=20
=
href=3D"news:of2j11lqf4r4h1dsnvok8i1dv7c9e3rlb2@4ax.com">news:of2j11lqf4r=
4h1dsnvok8i1dv7c9e3rlb2@4ax.com</A>...<BR>> =20
> Well. You just helped me learn a =
lot. For=20
one thing, I guess I<BR>> > thought that because the =
Spanish=20
characters can be expressed as 8 bits<BR>> > they =
were=20
ASCII. <BR>> ><BR>> > I just made =
a little=20
test stored procedure taking an XML document as a<BR>> =
> =20
parameter, created the document manually in Notepad with UTF-8=20
specified<BR>> > in the header and tried including =
some of=20
the Spanish characters... and<BR>> > it =
failed. SQL=20
Server could not execute sp_xml_preparedocument because<BR>> =20
> "an invalid character was found in text =
content". Just=20
to make sure<BR>> > that was the problem I =
substituted=20
non-Spanish characters for the<BR>> > Spanish ones =
and it=20
executed fine. However, I can manually type the<BR>> =
> =20
text with the Spanish characters into the varchar field if I open=20
the<BR>> > table in EnterpriseManager and SQL Server =
is=20
perfectly happy. OTOH, if<BR>> > I specify =
UTF-16, I=20
get "Switch from current encoding to specified<BR>> =
> =20
encoding not supported." Next thing I tried was cloning =
the sproc=20
to<BR>> > write to the table with the nvarchar =
column, still=20
no joy, same error<BR>> > message... but on changing =
the=20
datatype of the input parameter from text<BR>> > to =
ntext it=20
worked fine. BUT here's the surprise (OK, to me it was =
a<BR>> =20
> surprise): If I again clone the sproc to point to the =
table=20
with the<BR>> > varchar column, but leave the input=20
parameter as ntext and specify<BR>> > UTF-16 in the =
document=20
header, it works. In other words, a varchar (and<BR>> =20
> presumably a char) column can successfully accept unicode =
data=20
even<BR>> > though char and varchar are explicitly =
defined=20
as non-unicode datatypes!<BR>> ><BR>> > =
Now I=20
have to understand whether I have a problem at work. I=20
never<BR>> > experienced this problem in real life =
because=20
none of the data we<BR>> > currently send using XML =
includes=20
any of the Spanish characters. Is<BR>> > the =
problem=20
only going to occur if the XML document is constructed =
using<BR>> =20
> the concatenated-string method? Or would it =
happen any=20
time an XML<BR>> > document specified as UTF-8 =
included=20
Spanish characters? (The data<BR>> > =
sent by=20
SalesForce to the ODS is likely to include Spanish =
characters,<BR>> =20
> but it probably creates the XML some other way.) Do I =
need to=20
tell the<BR>> > consulting outfit to specify all XML =
as=20
UTF-16? <BR>> ><BR>> > For the ETL =
from=20
the ODS to the data warehouse I am not planning to use<BR>> =20
> Sonic, but rather probably to link the databases and use a =
bunch=20
of<BR>> > stored procedures controlled by some VB =
code, IOW=20
I will not need XML<BR>> > because all the extract =
sprocs=20
will look like INSERT INTO....SELECT<BR>> > =20
FROM.<BR>> ><BR>> > I don't yet =
understand why=20
UTF-8 can't work for the Spanish<BR>> > =
characters...=20
(unless it only doesn't work when the characters are<BR>> =
> =20
manually typed into the document). If I correctly understand=20
the<BR>> > document to which you referred me, an =
8-bit=20
character can't fit in one<BR>> > UTF-8 byte because =
the=20
first bit is reserved for indicating which is the<BR>> =
> =20
first byte of a UTF-8 multi-byte character. (This was your point =
about<BR>> > not greater than 0x7F.) But =
why=20
wouldn't it just make two bytes out of<BR>> > the =
Spanish=20
characters then? The documentation says UTF-8 =
uses<BR>> =20
> multiple bytes for the characters that it can't fit into =
one=20
byte. <BR>> ><BR>> > =
???<BR>> =20
><BR>> ><BR>> ><BR>> > On =
Sun, 20=20
Feb 2005 13:10:40 -0800, "Rich" <@> wrote in =
message<BR>> =20
> <<A=20
=
href=3D"mailto:4218fc91@w3.nls.net">4218fc91@w3.nls.net</A>>:<BR>>&=
nbsp;=20
><BR>> > > The Spanish accented =
characters=20
are not part of ASCII. They are part of Windows calls ANSI of =
which=20
ASCII is the subset (0x00 to 0x7F). Any character in the 0x80 to =
0xFF=20
range is not compatible between ANSI and UTF-8.<BR>> =
> =20
><BR>> > >Rich<BR>> > =20
><BR>> > > "Ellen K." <<A=20
=
href=3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>&g=
t;=20
wrote in message <A=20
=
href=3D"news:7ouh119ivmuk26icg3mqqqk2ss1lfm5c10@4ax.com">news:7ouh119ivmu=
k26icg3mqqqk2ss1lfm5c10@4ax.com</A>...<BR>> =20
> > Should not have any non-ASCII characters, as =
previously=20
noted all the<BR>> > > special Spanish =
characters=20
are available in the ASCII character set.<BR>> > =
> =20
And since the company is built on our understanding of the=20
Hispanic<BR>> > > market, I don't see any =
use of,=20
say, pictograph-based languages in the<BR>> > =
> =20
foreseeable future. If 10 years down the road something =
like=20
that<BR>> > > happens, well, by then we =
will no=20
longer need compatibility with the<BR>> > > =
current=20
legacy system because it will long since have been =
replaced.<BR>> =20
> ><BR>> > > On Sun, 20 Feb =
2005=20
12:52:25 -0800, "Rich" <@> wrote in message<BR>> =
> =20
> <<A=20
=
href=3D"mailto:4218f849$1@w3.nls.net">4218f849$1@w3.nls.net</A>>:<BR>&=
gt; =20
> ><BR>> > > > =
From what=20
you describe below, if the values you emit to XML have non-ASCII =
characters I=20
would expect you to have a problem.<BR>> > =
> =20
><BR>> > > >Rich<BR>> =
> =20
> ><BR>> > > > "Ellen =
K."=20
<<A=20
=
href=3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>&g=
t;=20
wrote in message <A=20
=
href=3D"news:eanh11h4vv6b9v21fiaounii3f5dunjl3g@4ax.com">news:eanh11h4vv6=
b9v21fiaounii3f5dunjl3g@4ax.com</A>...<BR>> =20
> > > On Sat, 19 Feb 2005 23:32:37 -0800, =
"Rich"=20
<@> wrote in message<BR>> > > =
> =20
<<A=20
=
href=3D"mailto:42183ccd@w3.nls.net">42183ccd@w3.nls.net</A>>:<BR>>&=
nbsp;=20
> > ><BR>> > > =
> =20
> The UTF in UTF-8/16/32 stands for Unicode =
Transformation=20
Format. You can find these defined in section 2.5 of <A=20
=
href=3D"http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf">http://www=
.unicode.org/versions/Unicode4.0.0/ch02.pdf</A>.<BR>> =20
> > ><BR>> > > =
> =20
THANK YOU SO MUCH!!! :)<BR>> > =
> =20
> ><BR>> > > > =20
> It's not clear to me how you are creating the XML =
from the=20
templates. If ANSI data is emitted into an XML document declared =
as=20
UTF-8 then you would have problems only for non-ASCII =
characters. UTF-8=20
and Windows-1252 are identical for 0x00 to 0x7F which is ASCII in=20
both.<BR>> > > ><BR>> =
> =20
> > I don't have a copy of a template here at home, =
but I=20
have them create<BR>> > > > it by =
string=20
concatenation because that seems to be the only way to =
be<BR>> =20
> > > able to have CDATA attributes, which =
I have=20
to have because in the<BR>> > > > =
legacy=20
data numeric-appearing identifiers are actually =
10-character<BR>> =20
> > > strings with leading spaces, and if =
these are=20
not specified as CDATA<BR>> > > > =
the=20
spaces go lost even with "xml:space=3D"preserve"" included in =
the<BR>> =20
> > > header. Here is a code snippet =
from one=20
of my apps that creates an XML<BR>> > > =
> =20
document which is passed as a parameter to a SQL Server =
stored<BR>> =20
> > > procedure:<BR>> > =20
> ><BR>> > > > =20
> strXM =3D "<?xml version =3D" =
& Chr(34)=20
& "1.0" & Chr(34) & " encoding=3D" & Chr(34) =
& "UTF-8"=20
& Chr(34) & "?>" & vbCrLf _<BR>> > =
> =20
> > & =
"<ROOT=20
xml:space=3D" & Chr(34) & "preserve" & Chr(34) & =
">" &=20
vbCrLf<BR>> > > > =
><BR>> =20
> > > > Do =
While=20
Not .EOF<BR>> > > > =20
> strXM =3D strXM &=20
"<M><A>" & !Ofc & =
"</A><B><![CDATA[" &=20
!Contract & "]]></B><C>" & !TCode &=20
"</C><D>" & !Date & "</D>" _<BR>> =20
> > > =20
=
> &nb=
sp; =20
& "<E><![CDATA[" & !TransNo &=20
"]]></E></M>" & vbCrLf<BR>> > =
> =20
> > =20
.MoveNext<BR>> > > > =20
> Loop<BR>> > =
> =20
> ><BR>> > > > =20
> strXM =3D strXM &=20
"</ROOT>"<BR>> > > ><BR>> =
> > > (The vbCrLf's are there so if there =
is a=20
problem the document can be<BR>> > > =
> =20
printed to a text file and be easier for humans to read -- SQL=20
Server<BR>> > > > ignores =
them. The=20
single-character aliases for entity and attribute<BR>> =
> =20
> > names are for performance -- for most of the =
stuff we=20
use these for it<BR>> > > > =
doesn't really=20
matter because we are only sending a few rows, but the<BR>> =20
> > > first time I did it it was for =
something that=20
was sending about 5000<BR>> > > > =
rows and=20
there it made a huge difference, so I stuck with it. =
We<BR>> =20
> > > comment both the front-end code and =
the=20
stored procedure with the<BR>> > > =
> =20
mappings of these aliases.)<BR>> > > =20
><BR>> > > > > I =
do not=20
know how SQL Server maps from char to nchar, specifically what =
conversion is=20
performed. Also, in some (maybe all released) versions of SQL =
Server=20
nchar and nvarchar are encoded in UCS-2. UCS-2 is a 16-bit =
encoding like=20
UTF-16. It dates back to when Unicode was defined as having =
2**16=20
characters instead of the 2**20+ that it has now. You can not =
express=20
characters >=3D U+10000 in UCS-2 not that you care about =
these.<BR>> =20
> > ><BR>> > > =
> =20
Thankfully, no. :)<BR>> > > =
> =20
><BR>> > > > > I =
don't=20
know if whether those systems you describe being written in java make =
a=20
difference. They can do what they want. The native java =
string is=20
Unicode though I don't remember if it is UCS-2 or UTF-16. My =
guess is=20
that it was once the former and is now the latter. One of the =
documents=20
on this on sun's site suggests that java used UCS-2 until the recently =
released 1.5 which is the first to use UTF-16.<BR>> =
> =20
> ><BR>> > > > The =
Java native=20
string being unicode is exactly what made me start<BR>> =
> =20
> > worrying -- when I was learning Java a couple of =
years=20
ago (because I<BR>> > > > wanted =
to port=20
an app to it so as to be able to run it right on the =
Unix<BR>> =20
> > > box where the Oracle database was) I =
was=20
horrified the first time I<BR>> > > =
> =20
tried reading back what I had written to a text file when I saw=20
spaces<BR>> > > > between all the=20
characters.<BR>> > > > =
><BR>> =20
> > > >Rich<BR>> > =
> =20
> ><BR>> > > > =
> =20
"Ellen K." <<A=20
=
href=3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>&g=
t;=20
wrote in message <A=20
=
href=3D"news:aqag115606i9g8bmh3lst66une1f1sotth@4ax.com">news:aqag115606i=
9g8bmh3lst66une1f1sotth@4ax.com</A>...<BR>> =20
> > > > UTF-8 is =
unicode?!? =20
Sheesh, all this time I thought it meant 8-bit.<BR>> =
> =20
> > > In fact I could swear I read that=20
somewhere.<BR>> > > > =
><BR>> =20
> > > > My question was coming =
from the=20
database perspective, where I always use<BR>> > =
> =20
> > char and varchar, as opposed to nchar and=20
nvarchar. I give the<BR>> > > =
> =20
> front-end guys little templates for creating the XML =
documents for=20
all<BR>> > > > > my SQL =
Server=20
stored procedures that take XML input, and I always<BR>> =
> =20
> > > specify UTF-8 in the header... and my =
char=20
and varchar columns always<BR>> > > =
> =20
> end up normal, so since you're now telling me UTF-8 is =
really=20
unicode, I<BR>> > > > > =
guess=20
that would answer my question for XML data I would be getting=20
from<BR>> > > > > the=20
apps...? Or would the answer be different if the =
incoming=20
XML is<BR>> > > > > some =
other=20
encoding?<BR>> > > > =
><BR>> =20
> > > > To simulate getting =
nvarchar data=20
from somewhere, I just tried creating<BR>> > =
> =20
> > two dummy tables, one with an nvarchar column =
and the=20
other with a<BR>> > > > > =
varchar=20
column, typed stuff into the nvarchar one, then inserted to =
the<BR>> =20
> > > > varchar one select from =
the=20
nvarchar one and it looks normal. <BR>> > =
> =20
> ><BR>> > > > =
> If=20
all this means I was worrying about nothing, excellent! =
OTOH,=20
is<BR>> > > > > there =
something I=20
should be worrying about that I didn't ask?<BR>> > =20
> > ><BR>> > > =
> =20
> The only pieces whose names I know so far are Sonic and =
SalesForce,=20
both<BR>> > > > > of =
which are=20
written in Java, if that makes any difference. I =
know<BR>> =20
> > > > there is at least one =
other=20
external piece but I think that is the next<BR>> > =20
> > > phase.<BR>> > =
> =20
> ><BR>> > > > =
> On=20
Sat, 19 Feb 2005 21:37:15 -0800, "Rich" <@> wrote in=20
message<BR>> > > > > =
<<A=20
=
href=3D"mailto:421821c1$1@w3.nls.net">421821c1$1@w3.nls.net</A>>:<BR>&=
gt; =20
> > > ><BR>> > =
> =20
> > > You need to be more specific =
than=20
"8-bit characters". There are many 8-bit character =
encodings. If=20
you are using Windows to generate your data you most likely are using=20
Windows-1252 which is the default 8-bit character set for U.S. English =
in=20
Windows. Windows supports many 8-bit encodings so you could be =
using=20
something else too.<BR>> > > > =
> =20
><BR>> > > > > =20
> Unicode is a character set not an encoding. =
There are=20
multiple encodings the main ones being UTF-8, UTF-16, and =
UTF-32. You=20
can use any of these for XML as well as non-Unicode encodings. =
For=20
interoperability you should use Unicode preferably =
UTF-8.<BR>> =20
> > > > ><BR>> =
> =20
> > > > What comes out when =
the XML=20
is parsed depends on the XML parser. XML is logically expressed =
in=20
Unico
|