Text 2668, 1342 rader
Skriven 2005-02-21 21:58:24 av Rich (1:379/45)
Kommentar till text 2653 av Ellen K (1:379/45)
Ärende: Re: ESB / XML / Unicode vs 8-bit characters ?
=====================================================
From: "Rich" <@>
This is a multi-part message in MIME format.
------=_NextPart_000_0A6B_01C51860.770EC1B0
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
If SalesForce creates UTF-8 encoded XML, and it really is UTF-8 =
encoded, then you should have no problem. The XML is valid and any = software
you have that consumes XML should accept it just fine.
I don't see a problem with UTF-16 if you prefer this to UTF-8. Both =
can represent exactly the same characters. The only difference is the =
mechanics of the encoding.
Rich
"Ellen K" <Ellen.K@harborwebs.com> wrote in message =
news:908e44.556f2c@harborwebs.com...
First of all I have to say you have once again earned my eternal =
gratitude, I
am so glad you are here. :)
I still have to make a decision regarding the standard I want to =
request. Are
you saying that if for example I get XML generated by say SalesForce =
with UTF-8
specified and it includes these characters, I would not have a =
problem? Stuff
I write myself I'm not worried about because I have the opportunity to =
tweak
it, what I'm worried about is what the ESB will try to feed my =
databases.=20
Since even the hand-typed XML was accepted when UTF-16 was specified, =
I'm kind
of leaning toward that, especially since the Oracle guy told me that =
the TJ
accounting manager used to complain that the Spanish characters =
weren't coming
out on reports when we were on Oracle 8.x which used UTF-8 but as soon =
as we
moved to 9.x which uses UTF-16 there were no more problems. What (if =
any) do
you see as a potential downside to UTF-16?
> From: "Rich" <@>
> This is a multi-part message in MIME format.
> ------=3D_NextPart_000_08FA_01C517AB.C3617E00
> Content-Type: text/plain;
> charset=3D"iso-8859-1"
> Content-Transfer-Encoding: quoted-printable
> The problem is not that UTF-8 won't work for what you call Spanish =
=3D
> characters. UTF-8 can encode anything that you consider a =
character. =3D
> The problem is that when these characters are present they are not =
being =3D
> encoded in UTF-8. There are two straight forward solutions I see. =
One =3D
> is to encode the XML correctly in UTF-8. The other is to encode the =
XML =3D
> in the Windows ANSI encoding I suspect you are using and to tag it =
=3D
> correctly. On a Windows U.S. English system this would be =3D
> "Windows-1252". I would expect SQL Server to support this. Other =
=3D
> applications may or may not. You could also use UTF-16 as long as =
you =3D
> generate your XML in UTF-16. I don't know if this is simple for =
your =3D
> application or not. If you are using VB 7.0 it should be.
> Rich
> "Ellen K." <72322.1016@compuserve.com> wrote in message =3D
> news:of2j11lqf4r4h1dsnvok8i1dv7c9e3rlb2@4ax.com...
> Well. You just helped me learn a lot. For one thing, I guess I
> thought that because the Spanish characters can be expressed as 8 =
bits
> they were ASCII. =3D20
> I just made a little test stored procedure taking an XML document as =
a
> parameter, created the document manually in Notepad with UTF-8 =3D
> specified
> in the header and tried including some of the Spanish characters... =
=3D
> and
> it failed. SQL Server could not execute sp_xml_preparedocument =3D
> because
> "an invalid character was found in text content". Just to make =
sure
> that was the problem I substituted non-Spanish characters for the
> Spanish ones and it executed fine. However, I can manually type the
> text with the Spanish characters into the varchar field if I open =
the
> table in EnterpriseManager and SQL Server is perfectly happy. OTOH, =
=3D
> if
> I specify UTF-16, I get "Switch from current encoding to specified
> encoding not supported." Next thing I tried was cloning the sproc =
to
> write to the table with the nvarchar column, still no joy, same =
error
> message... but on changing the datatype of the input parameter from =
=3D
> text
> to ntext it worked fine. BUT here's the surprise (OK, to me it was =
a
> surprise): If I again clone the sproc to point to the table with =
the
> varchar column, but leave the input parameter as ntext and specify
> UTF-16 in the document header, it works. In other words, a varchar =
=3D
> (and
> presumably a char) column can successfully accept unicode data even
> though char and varchar are explicitly defined as non-unicode =3D
> datatypes!
> Now I have to understand whether I have a problem at work. I never
> experienced this problem in real life because none of the data we
> currently send using XML includes any of the Spanish characters. Is
> the problem only going to occur if the XML document is constructed =
=3D
> using
> the concatenated-string method? Or would it happen any time an XML
> document specified as UTF-8 included Spanish characters? (The data
> sent by SalesForce to the ODS is likely to include Spanish =
characters,
> but it probably creates the XML some other way.) Do I need to tell =
=3D
> the
> consulting outfit to specify all XML as UTF-16? =3D20
> For the ETL from the ODS to the data warehouse I am not planning to =
=3D
> use
> Sonic, but rather probably to link the databases and use a bunch of
> stored procedures controlled by some VB code, IOW I will not need =
XML
> because all the extract sprocs will look like INSERT INTO....SELECT
> FROM.
> I don't yet understand why UTF-8 can't work for the Spanish
> characters... (unless it only doesn't work when the characters are
> manually typed into the document). If I correctly understand the
> document to which you referred me, an 8-bit character can't fit in =
one
> UTF-8 byte because the first bit is reserved for indicating which is =
=3D
> the
> first byte of a UTF-8 multi-byte character. (This was your point =
=3D
> about
> not greater than 0x7F.) But why wouldn't it just make two bytes =
out =3D
> of
> the Spanish characters then? The documentation says UTF-8 uses
> multiple bytes for the characters that it can't fit into one byte. =
=3D20
> ???
> On Sun, 20 Feb 2005 13:10:40 -0800, "Rich" <@> wrote in message
> <4218fc91@w3.nls.net>:
>> The Spanish accented characters are not part of ASCII. They are =
=3D
> part of Windows calls ANSI of which ASCII is the subset (0x00 to =
0x7F). =3D
> Any character in the 0x80 to 0xFF range is not compatible between =
ANSI =3D
> and UTF-8.
>>=20
>> Rich
>>=20
>> "Ellen K." <72322.1016@compuserve.com> wrote in message =3D
> news:7ouh119ivmuk26icg3mqqqk2ss1lfm5c10@4ax.com...
>> Should not have any non-ASCII characters, as previously noted all =
=3D
> the
>> special Spanish characters are available in the ASCII character =3D
> set.
>> And since the company is built on our understanding of the Hispanic
>> market, I don't see any use of, say, pictograph-based languages in =
=3D
> the
>> foreseeable future. If 10 years down the road something like that
>> happens, well, by then we will no longer need compatibility with =
=3D
> the
>> current legacy system because it will long since have been =3D
> replaced.
>>=20
>> On Sun, 20 Feb 2005 12:52:25 -0800, "Rich" <@> wrote in message
>> <4218f849$1@w3.nls.net>:
>>=20
>> > From what you describe below, if the values you emit to XML =3D
> have non-ASCII characters I would expect you to have a problem.
>> >
>> >Rich
>> >
>> > "Ellen K." <72322.1016@compuserve.com> wrote in message =3D
> news:eanh11h4vv6b9v21fiaounii3f5dunjl3g@4ax.com...
>> > On Sat, 19 Feb 2005 23:32:37 -0800, "Rich" <@> wrote in message
>> > <42183ccd@w3.nls.net>:
>> >
>> > > The UTF in UTF-8/16/32 stands for Unicode Transformation =3D
> Format. You can find these defined in section 2.5 of =3D
> http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf.
>> >
>> > THANK YOU SO MUCH!!! :)
>> > >
>> > > It's not clear to me how you are creating the XML from the =
=3D
> templates. If ANSI data is emitted into an XML document declared as =
=3D
> UTF-8 then you would have problems only for non-ASCII characters. =
UTF-8 =3D
> and Windows-1252 are identical for 0x00 to 0x7F which is ASCII in =
both.
>> >
>> > I don't have a copy of a template here at home, but I have them =
=3D
> create
>> > it by string concatenation because that seems to be the only way =
=3D
> to be
>> > able to have CDATA attributes, which I have to have because in =
=3D
> the
>> > legacy data numeric-appearing identifiers are actually =3D
> 10-character
>> > strings with leading spaces, and if these are not specified as =
=3D
> CDATA
>> > the spaces go lost even with "xml:space=3D3D"preserve"" included =
=3D
> in the
>> > header. Here is a code snippet from one of my apps that creates =
=3D
> an XML
>> > document which is passed as a parameter to a SQL Server stored
>> > procedure:
>> >
>> > > strXM =3D3D "<?xml version =3D3D" & Chr(34) & "1.0" & =
Chr(34) =3D
> & " encoding=3D3D" & Chr(34) & "UTF-8" & Chr(34) & "?>" & vbCrLf _
>> > > & "<ROOT xml:space=3D3D" & Chr(34) & "preserve" & =
Chr(34) =3D
> & ">" & vbCrLf
>> > >
>> > > Do While Not .EOF
>> > > strXM =3D3D strXM & "<M><A>" & !Ofc & =
"</A><B><![CDATA[" =3D
> & !Contract & "]]></B><C>" & !TCode & "</C><D>" & !Date & "</D>" _
>> > > & "<E><![CDATA[" & !TransNo & "]]></E></M>" & =
=3D
> vbCrLf
>> > > .MoveNext
>> > > Loop
>> > >
>> > > strXM =3D3D strXM & "</ROOT>"
>> >
>> > (The vbCrLf's are there so if there is a problem the document =
=3D
> can be
>> > printed to a text file and be easier for humans to read -- SQL =
=3D
> Server
>> > ignores them. The single-character aliases for entity and =3D
> attribute
>> > names are for performance -- for most of the stuff we use these =
=3D
> for it
>> > doesn't really matter because we are only sending a few rows, =
=3D
> but the
>> > first time I did it it was for something that was sending about =
=3D
> 5000
>> > rows and there it made a huge difference, so I stuck with it. =
=3D
> We
>> > comment both the front-end code and the stored procedure with =
=3D
> the
>> > mappings of these aliases.)
>> >
>> > > I do not know how SQL Server maps from char to nchar, =3D
> specifically what conversion is performed. Also, in some (maybe all =
=3D
> released) versions of SQL Server nchar and nvarchar are encoded in =
=3D
> UCS-2. UCS-2 is a 16-bit encoding like UTF-16. It dates back to =
when =3D
> Unicode was defined as having 2**16 characters instead of the 2**20+ =
=3D
> that it has now. You can not express characters >=3D3D U+10000 in =
UCS-2 =3D
> not that you care about these.
>> >
>> > Thankfully, no. :)
>> > >
>> > > I don't know if whether those systems you describe being =3D
> written in java make a difference. They can do what they want. The =
=3D
> native java string is Unicode though I don't remember if it is UCS-2 =
or =3D
> UTF-16. My guess is that it was once the former and is now the =
latter. =3D
> One of the documents on this on sun's site suggests that java used =
UCS-2 =3D
> until the recently released 1.5 which is the first to use UTF-16.
>> >
>> > The Java native string being unicode is exactly what made me =3D
> start
>> > worrying -- when I was learning Java a couple of years ago =3D
> (because I
>> > wanted to port an app to it so as to be able to run it right on =
=3D
> the Unix
>> > box where the Oracle database was) I was horrified the first =3D
> time I
>> > tried reading back what I had written to a text file when I saw =
=3D
> spaces
>> > between all the characters.
>> > >
>> > >Rich
>> > >
>> > > "Ellen K." <72322.1016@compuserve.com> wrote in message =3D
> news:aqag115606i9g8bmh3lst66une1f1sotth@4ax.com...
>> > > UTF-8 is unicode?!? Sheesh, all this time I thought it =3D
> meant 8-bit.
>> > > In fact I could swear I read that somewhere.
>> > >
>> > > My question was coming from the database perspective, where I =
=3D
> always use
>> > > char and varchar, as opposed to nchar and nvarchar. I give =
=3D
> the
>> > > front-end guys little templates for creating the XML =3D
> documents for all
>> > > my SQL Server stored procedures that take XML input, and I =
=3D
> always
>> > > specify UTF-8 in the header... and my char and varchar =3D
> columns always
>> > > end up normal, so since you're now telling me UTF-8 is really =
=3D
> unicode, I
>> > > guess that would answer my question for XML data I would be =
=3D
> getting from
>> > > the apps...? Or would the answer be different if the =3D
> incoming XML is
>> > > some other encoding?
>> > >
>> > > To simulate getting nvarchar data from somewhere, I just =3D
> tried creating
>> > > two dummy tables, one with an nvarchar column and the other =
=3D
> with a
>> > > varchar column, typed stuff into the nvarchar one, then =3D
> inserted to the
>> > > varchar one select from the nvarchar one and it looks normal. =
=3D
> =3D20
>> > >
>> > > If all this means I was worrying about nothing, excellent! =
=3D
> OTOH, is
>> > > there something I should be worrying about that I didn't ask?
>> > >
>> > > The only pieces whose names I know so far are Sonic and =3D
> SalesForce, both
>> > > of which are written in Java, if that makes any difference. =
=3D
> I know
>> > > there is at least one other external piece but I think that =
=3D
> is the next
>> > > phase.
>> > >
>> > > On Sat, 19 Feb 2005 21:37:15 -0800, "Rich" <@> wrote in =3D
> message
>> > > <421821c1$1@w3.nls.net>:
>> > >
>> > > > You need to be more specific than "8-bit characters". =
=3D
> There are many 8-bit character encodings. If you are using Windows =
to =3D
> generate your data you most likely are using Windows-1252 which is =
the =3D
> default 8-bit character set for U.S. English in Windows. Windows =
=3D
> supports many 8-bit encodings so you could be using something else =
too.
>> > > >
>> > > > Unicode is a character set not an encoding. There are =
=3D
> multiple encodings the main ones being UTF-8, UTF-16, and UTF-32. =
You =3D
> can use any of these for XML as well as non-Unicode encodings. For =
=3D
> interoperability you should use Unicode preferably UTF-8.
>> > > >
>> > > > What comes out when the XML is parsed depends on the XML =
=3D
> parser. XML is logically expressed in Unicode. The Windows XML =
parsers =3D
> provide a Unicode interface. Other parsers could do differently.
>> > > >
>> > > >Rich
>> > > >
>> > > >
>> > > > "Ellen K." <72322.1016@compuserve.com> wrote in message =
=3D
> news:4o2g11pu048kafbdilg46u77vs5ls0be55@4ax.com...
>> > > > Our new enterprise system is going to be built around an =
=3D
> Enterprise
>> > > > Service Bus. I don't have the full specs yet but as I =3D
> understand it the
>> > > > main apps (starting with SalesForce) are going to be out =
=3D
> on the internet
>> > > > and the Sonic ESB will be the messaging piece. There will =
=3D
> be an
>> > > > Operational Data Store in house that will get updated =3D
> every night on a
>> > > > batch basis from the main apps. =3D20
>> > > >
>> > > > My data warehouse will continue to be the data warehouse =
=3D
> and will remain
>> > > > in house. The dimensions will stay the same but I might =
=3D
> have to create
>> > > > separate measures for the data from the new apps and then =
=3D
> create views
>> > > > to keep everything transparent to the users. =3D20
>> > > >
>> > > > I'm thinking if we're going to have an ODS in house =3D
> already, I may as
>> > > > well do the ETL from there. But I'm worrying that the =
=3D
> new data will
>> > > > probably be unicode (because Java defaults to that and =3D
> SalesForce is
>> > > > written in Java). Right now I am storing everything =3D
> (except our blobs
>> > > > of course) in 8-bit characters. =3D20
>> > > >
>> > > > Anyone here who's up on this stuff, can the XML that goes =
=3D
> back and forth
>> > > > convert between unicode and 8-bit characters, or am I =3D
> gonna have to
>> > > > redefine all my data? For example, if unicode data is =
=3D
> put into an XML
>> > > > document that specifies UTF-8, what comes out when the =3D
> document is
>> > > > parsed? How about vice versa? If this is too simplistic =
=3D
> to work, what
>> > > > is needed?
>> > > >
>> > > > (We actually have no substantive need for unicode -- we =
=3D
> are bilingual
>> > > > Spanish but all the special Spanish characters exist in =
=3D
> the ascii
>> > > > character set.)
> ------=3D_NextPart_000_08FA_01C517AB.C3617E00
> Content-Type: text/html;
> charset=3D"iso-8859-1"
> Content-Transfer-Encoding: quoted-printable
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
> <HTML><HEAD>
> <META http-equiv=3D3DContent-Type content=3D3D"text/html; =3D
> charset=3D3Diso-8859-1">
> <META content=3D3D"MSHTML 6.00.3790.1289" name=3D3DGENERATOR>
> <STYLE></STYLE>
> </HEAD>
> <BODY bgColor=3D3D#ffffff>
> <DIV><FONT face=3D3DArial size=3D3D2> The problem is not =
that =3D
> UTF-8 won't=3D20
> work for what you call Spanish characters. UTF-8 can encode =
=3D
> anything that=3D20
> you consider a character. The problem is that when these =3D
> characters are=3D20
> present they are not being encoded in UTF-8. There are two =3D
> straight=3D20
> forward solutions I see. One is to encode the XML correctly =
in=3D20
> UTF-8. The other is to encode the XML in the Windows ANSI =
encoding =3D
> I=3D20
> suspect you are using and to tag it correctly. On a Windows =
U.S. =3D
> English=3D20
> system this would be "Windows-1252". I would expect SQL Server =
to =3D
> support=3D20
> this. Other applications may or may not. You could also =
use =3D
> UTF-16=3D20
> as long as you generate your XML in UTF-16. I don't know if =
this =3D
> is simple=3D20
> for your application or not. If you are using VB 7.0 it =
should=3D20
> be.</FONT></DIV>
> <DIV><FONT face=3D3DArial size=3D3D2></FONT> </DIV>
> <DIV><FONT face=3D3DArial size=3D3D2>Rich</FONT></DIV>
> <DIV><FONT face=3D3DArial size=3D3D2></FONT> </DIV>
> <BLOCKQUOTE=3D20
> style=3D3D"PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; =
=3D
> BORDER-LEFT: #000000 2px solid; MARGIN-RIGHT: 0px">
> <DIV>"Ellen K." <<A=3D20
> =3D
> =
href=3D3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>=
&g=3D
> t;=3D20
> wrote in message <A=3D20
> =3D
> =
href=3D3D"news:of2j11lqf4r4h1dsnvok8i1dv7c9e3rlb2@4ax.com">news:of2j11lqf=
4r=3D
> 4h1dsnvok8i1dv7c9e3rlb2@4ax.com</A>...</DIV>Well. =3D20
> You just helped me learn a lot. For one thing, I =
guess=3D20
> I<BR>thought that because the Spanish characters can be expressed as =
8 =3D
> bits<BR>they were ASCII. <BR><BR>I just made a little test =3D
> stored=3D20
> procedure taking an XML document as a<BR>parameter, created the =3D
> document=3D20
> manually in Notepad with UTF-8 specified<BR>in the header and tried =
=3D
> including=3D20
> some of the Spanish characters... and<BR>it failed. SQL Server =
=3D
> could not=3D20
> execute sp_xml_preparedocument because<BR>"an invalid character was =
=3D
> found in=3D20
> text content". Just to make sure<BR>that was the problem =
I =3D
> substituted non-Spanish characters for the<BR>Spanish ones and it =
=3D
> executed=3D20
> fine. However, I can manually type the<BR>text with the =
Spanish=3D20
> characters into the varchar field if I open the<BR>table in =3D
> EnterpriseManager=3D20
> and SQL Server is perfectly happy. OTOH, if<BR>I specify =
UTF-16, =3D
> I get=3D20
> "Switch from current encoding to specified<BR>encoding not=3D20
> supported." Next thing I tried was cloning the sproc =3D
> to<BR>write=3D20
> to the table with the nvarchar column, still no joy, same =3D
> error<BR>message...=3D20
> but on changing the datatype of the input parameter from text<BR>to =
=3D
> ntext it=3D20
> worked fine. BUT here's the surprise (OK, to me it was=3D20
> a<BR>surprise): If I again clone the sproc to point to the =
table =3D
> with=3D20
> the<BR>varchar column, but leave the input parameter as ntext =
and=3D20
> specify<BR>UTF-16 in the document header, it works. In other =
=3D
> words, a=3D20
> varchar (and<BR>presumably a char) column can successfully accept =
=3D
> unicode data=3D20
> even<BR>though char and varchar are explicitly defined as =
non-unicode=3D20
> datatypes!<BR><BR>Now I have to understand whether I have a problem =
at =3D
> work. I never<BR>experienced this problem in real life =
=3D
> because=3D20
> none of the data we<BR>currently send using XML includes any of the =
=3D
> Spanish=3D20
> characters. Is<BR>the problem only going to occur if the XML =
=3D
> document is=3D20
> constructed using<BR>the concatenated-string method? Or =
=3D
> would it=3D20
> happen any time an XML<BR>document specified as UTF-8 included =
Spanish =3D
> characters? (The data<BR>sent by SalesForce to the ODS =
is =3D
> likely=3D20
> to include Spanish characters,<BR>but it probably creates the XML =
some =3D
> other=3D20
> way.) Do I need to tell the<BR>consulting outfit to specify =
all =3D
> XML as=3D20
> UTF-16? <BR><BR>For the ETL from the ODS to the data warehouse =
I =3D
> am not=3D20
> planning to use<BR>Sonic, but rather probably to link the databases =
=3D
> and use a=3D20
> bunch of<BR>stored procedures controlled by some VB code, IOW I will =
=3D
> not need=3D20
> XML<BR>because all the extract sprocs will look like INSERT=3D20
> INTO....SELECT<BR>FROM.<BR><BR>I don't yet understand why UTF-8 =
can't =3D
> work for=3D20
> the Spanish<BR>characters... (unless it only doesn't work when the =
=3D
> characters=3D20
> are<BR>manually typed into the document). If I correctly =3D
> understand=3D20
> the<BR>document to which you referred me, an 8-bit character can't =
fit =3D
> in=3D20
> one<BR>UTF-8 byte because the first bit is reserved for indicating =
=3D
> which is=3D20
> the<BR>first byte of a UTF-8 multi-byte character. (This was =
=3D
> your point=3D20
> about<BR>not greater than 0x7F.) But why wouldn't it =
just =3D
> make two=3D20
> bytes out of<BR>the Spanish characters then? The =3D
> documentation=3D20
> says UTF-8 uses<BR>multiple bytes for the characters that it can't =
fit =3D
> into=3D20
> one byte. <BR><BR>???<BR><BR><BR><BR>On Sun, 20 Feb 2005 =3D
> 13:10:40 -0800,=3D20
> "Rich" <@> wrote in message<BR><<A=3D20
> =3D
> =
href=3D3D"mailto:4218fc91@w3.nls.net">4218fc91@w3.nls.net</A>>:<BR><BR=
>&=3D
> gt; =3D20
> The Spanish accented characters are not part of ASCII. They =
are =3D
> part of=3D20
> Windows calls ANSI of which ASCII is the subset (0x00 to =
0x7F). =3D
> Any=3D20
> character in the 0x80 to 0xFF range is not compatible between ANSI =
and =3D
> UTF-8.<BR>><BR>>Rich<BR>><BR>> "Ellen K." =
<<A=3D20
> =3D
> =
href=3D3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>=
&g=3D
> t;=3D20
> wrote in message <A=3D20
> =3D
> =
href=3D3D"news:7ouh119ivmuk26icg3mqqqk2ss1lfm5c10@4ax.com">news:7ouh119iv=
mu=3D
> k26icg3mqqqk2ss1lfm5c10@4ax.com</A>...<BR>> =3D20
> Should not have any non-ASCII characters, as previously noted =
all=3D20
> the<BR>> special Spanish characters are available in the =
=3D
> ASCII=3D20
> character set.<BR>> And since the company is built on =
our=3D20
> understanding of the Hispanic<BR>> market, I don't see any =
=3D
> use of,=3D20
> say, pictograph-based languages in the<BR>> =
foreseeable=3D20
> future. If 10 years down the road something like=3D20
> that<BR>> happens, well, by then we will no longer need =3D
> compatibility=3D20
> with the<BR>> current legacy system because it will long =
=3D
> since have=3D20
> been replaced.<BR>><BR>> On Sun, 20 Feb 2005 12:52:25 =
=3D
> -0800,=3D20
> "Rich" <@> wrote in message<BR>> <<A=3D20
> =3D
> =
href=3D3D"mailto:4218f849$1@w3.nls.net">4218f849$1@w3.nls.net</A>>:<BR=
>&=3D
> gt;<BR>> =3D20
> > From what you describe below, if the values you =
emit =3D
> to XML=3D20
> have non-ASCII characters I would expect you to have a =3D
> problem.<BR>> =3D20
> ><BR>> >Rich<BR>> ><BR>> =3D
> > "Ellen=3D20
> K." <<A=3D20
> =3D
> =
href=3D3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>=
&g=3D
> t;=3D20
> wrote in message <A=3D20
> =3D
> =
href=3D3D"news:eanh11h4vv6b9v21fiaounii3f5dunjl3g@4ax.com">news:eanh11h4v=
v6=3D
> b9v21fiaounii3f5dunjl3g@4ax.com</A>...<BR>> =3D20
> > On Sat, 19 Feb 2005 23:32:37 -0800, "Rich" <@> =
wrote =3D
> in=3D20
> message<BR>> > <<A=3D20
> =3D
> =
href=3D3D"mailto:42183ccd@w3.nls.net">42183ccd@w3.nls.net</A>>:<BR>>=
;&=3D
> nbsp;=3D20
> ><BR>> > > The UTF in =
UTF-8/16/32 =3D
> stands=3D20
> for Unicode Transformation Format. You can find these defined =
in =3D
> section=3D20
> 2.5 of <A=3D20
> =3D
> =
href=3D3D"http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf">http://w=
ww=3D
> .unicode.org/versions/Unicode4.0.0/ch02.pdf</A>.<BR>> =3D20
> ><BR>> > THANK YOU SO =
MUCH!!! =3D20
> :)<BR>> > ><BR>> > =3D
> > =3D20
> It's not clear to me how you are creating the XML from the =3D
> templates. If=3D20
> ANSI data is emitted into an XML document declared as UTF-8 then you =
=3D
> would=3D20
> have problems only for non-ASCII characters. UTF-8 and =3D
> Windows-1252 are=3D20
> identical for 0x00 to 0x7F which is ASCII in =
both.<BR>> =3D20
> ><BR>> > I don't have a copy of a template here =
=3D
> at home,=3D20
> but I have them create<BR>> > it by string =3D
> concatenation=3D20
> because that seems to be the only way to be<BR>> > =
=3D
> able to=3D20
> have CDATA attributes, which I have to have because in =3D
> the<BR>> =3D20
> > legacy data numeric-appearing identifiers are =
actually=3D20
> 10-character<BR>> > strings with leading spaces, =
and =3D
> if=3D20
> these are not specified as CDATA<BR>> > the spaces =
=3D
> go lost=3D20
> even with "xml:space=3D3D"preserve"" included in the<BR>> =
=3D
> > =3D20
> header. Here is a code snippet from one of my apps that =
creates =3D
> an=3D20
> XML<BR>> > document which is passed as a parameter =
=3D
> to a SQL=3D20
> Server stored<BR>> > procedure:<BR>> =3D20
> ><BR>> > > =
strXM =3D
> =3D3D=3D20
> "<?xml version =3D3D" & Chr(34) & "1.0" & Chr(34) =
& =3D
> " =3D20
> encoding=3D3D" & Chr(34) & "UTF-8" & Chr(34) & =
"?>" =3D
> &=3D20
> vbCrLf _<BR>> > =3D20
> > & "<ROOT =3D
> xml:space=3D3D"=3D20
> & Chr(34) & "preserve" & Chr(34) & ">" &=3D20
> vbCrLf<BR>> > ><BR>> > =3D20
> > Do While Not .EOF<BR>> =
=3D
> > =3D20
> > strXM =3D3D strXM =
&=3D20
> "<M><A>" & !Ofc & =3D
> "</A><B><![CDATA[" &=3D20
> !Contract & "]]></B><C>" & !TCode &=3D20
> "</C><D>" & !Date & "</D>" =
_<BR>> =3D20
> > =3D20
> =3D
> =
> &nb=
=3D
> sp; =3D20
> & "<E><![CDATA[" & !TransNo &=3D20
> "]]></E></M>" & vbCrLf<BR>> =
> =3D20
> > =
.MoveNext<BR>> =3D
> > > Loop<BR>> =3D
> > =3D20
> ><BR>> > > =
strXM =3D
> =3D3D strXM=3D20
> & "</ROOT>"<BR>> ><BR>> > =
(The =3D
> vbCrLf's are there so if there is a problem the document can =3D
> be<BR>> =3D20
> > printed to a text file and be easier for humans to read =
-- =3D
> SQL=3D20
> Server<BR>> > ignores them. The =3D
> single-character=3D20
> aliases for entity and attribute<BR>> > names are =
=3D
> for=3D20
> performance -- for most of the stuff we use these for =
it<BR>> =3D
> > doesn't really matter because we are only sending a few =
=3D
> rows, but=3D20
> the<BR>> > first time I did it it was for =
something =3D
> that was=3D20
> sending about 5000<BR>> > rows and there it made a =
=3D
> huge=3D20
> difference, so I stuck with it. We<BR>> > =
=3D
> comment both=3D20
> the front-end code and the stored procedure with the<BR>> =
=3D
> > =3D20
> mappings of these aliases.)<BR>> ><BR>> =3D
> > =3D20
> > I do not know how SQL Server maps from char to =
nchar, =3D
> specifically what conversion is performed. Also, in some =
(maybe =3D
> all=3D20
> released) versions of SQL Server nchar and nvarchar are encoded =
in=3D20
> UCS-2. UCS-2 is a 16-bit encoding like UTF-16. It dates =
=3D
> back to=3D20
> when Unicode was defined as having 2**16 characters instead of the =
=3D
> 2**20+ that=3D20
> it has now. You can not express characters >=3D3D U+10000 =
in =3D
> UCS-2 not=3D20
> that you care about these.<BR>> ><BR>> =
> =3D
> Thankfully, no. :)<BR>> > =3D
> ><BR>> =3D20
> > > I don't know if whether those systems =
you =3D
> describe=3D20
> being written in java make a difference. They can do what =
they=3D20
> want. The native java string is Unicode though I don't =
remember =3D
> if it is=3D20
> UCS-2 or UTF-16. My guess is that it was once the former and =
is =3D
> now the=3D20
> latter. One of the documents on this on sun's site suggests =
that =3D
> java=3D20
> used UCS-2 until the recently released 1.5 which is the first to =
use=3D20
> UTF-16.<BR>> ><BR>> > The Java native =
=3D
> string=3D20
> being unicode is exactly what made me start<BR>> > =
=3D
> worrying=3D20
> -- when I was learning Java a couple of years ago (because =3D
> I<BR>> =3D20
> > wanted to port an app to it so as to be able to run it =
=3D
> right on the=3D20
> Unix<BR>> > box where the Oracle database was) I =
was =3D
> horrified the first time I<BR>> > tried reading =
back =3D
> what I=3D20
> had written to a text file when I saw spaces<BR>> =
> =3D
> between=3D20
> all the characters.<BR>> > ><BR>> =3D
> > =3D20
> >Rich<BR>> > ><BR>> > =3D
> > =3D20
> "Ellen K." <<A=3D20
> =3D
> =
href=3D3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>=
&g=3D
> t;=3D20
> wrote in message <A=3D20
> =3D
> =
href=3D3D"news:aqag115606i9g8bmh3lst66une1f1sotth@4ax.com">news:aqag11560=
6i=3D
> 9g8bmh3lst66une1f1sotth@4ax.com</A>...<BR>> =3D20
> > > UTF-8 is unicode?!? Sheesh, all =
this =3D
> time I=3D20
> thought it meant 8-bit.<BR>> > > In fact =
I =3D
> could=3D20
> swear I read that somewhere.<BR>> > =3D
> ><BR>> =3D20
> > > My question was coming from the database =3D
> perspective,=3D20
> where I always use<BR>> > > char and =3D
> varchar, as=3D20
> opposed to nchar and nvarchar. I give the<BR>> =3D
> > =3D20
> > front-end guys little templates for creating the XML =3D
> documents for=3D20
> all<BR>> > > my SQL Server stored =3D
> procedures that=3D20
> take XML input, and I always<BR>> > > =3D
> specify UTF-8=3D20
> in the header... and my char and varchar columns =
always<BR>> =3D20
> > > end up normal, so since you're now telling me =
=3D
> UTF-8 is=3D20
> really unicode, I<BR>> > > guess that =
would =3D
> answer=3D20
> my question for XML data I would be getting from<BR>> =3D
> > =3D20
> > the apps...? Or would the answer be =3D
> different if=3D20
> the incoming XML is<BR>> > > some =
other=3D20
> encoding?<BR>> > ><BR>> > =3D
> > To=3D20
> simulate getting nvarchar data from somewhere, I just tried=3D20
> creating<BR>> > > two dummy tables, one =
=3D
> with an=3D20
> nvarchar column and the other with a<BR>> > =3D
> > =3D20
> varchar column, typed stuff into the nvarchar one, then inserted =
to=3D20
> the<BR>> > > varchar one select from the =
=3D
> nvarchar=3D20
> one and it looks normal. <BR>> > =3D
> ><BR>> =3D20
> > > If all this means I was worrying about =
nothing,=3D20
> excellent! OTOH, is<BR>> > > =
=3D
> there=3D20
> something I should be worrying about that I didn't =
ask?<BR>> =3D20
> > ><BR>> > > The only pieces =
=3D
> whose=3D20
> names I know so far are Sonic and SalesForce, both<BR>> =3D
> > =3D20
> > of which are written in Java, if that makes any =3D
> difference. I=3D20
> know<BR>> > > there is at least one other =
=3D
> external=3D20
> piece but I think that is the next<BR>> > =
> =3D
> phase.<BR>> > ><BR>> > =3D
> > On=3D20
> Sat, 19 Feb 2005 21:37:15 -0800, "Rich" <@> wrote in=3D20
> message<BR>> > > <<A=3D20
> =3D
> =
href=3D3D"mailto:421821c1$1@w3.nls.net">421821c1$1@w3.nls.net</A>>:<BR=
>&=3D
> gt; =3D20
> > ><BR>> > > > =
=3D
> You need=3D20
> to be more specific than "8-bit characters". There are many =
=3D
> 8-bit=3D20
> character encodings. If you are using Windows to generate your =
=3D
> data you=3D20
> most likely are using Windows-1252 which is the default 8-bit =3D
> character set=3D20
> for U.S. English in Windows. Windows supports many 8-bit =3D
> encodings so=3D20
> you could be using something else too.<BR>> > =3D
> > =3D20
> ><BR>> > > > Unicode is =
a =3D
> character set not an encoding. There are multiple encodings =
the =3D
> main=3D20
> ones being UTF-8, UTF-16, and UTF-32. You can use any of these =
=3D
> for XML=3D20
> as well as non-Unicode encodings. For interoperability you =3D
> should use=3D20
> Unicode preferably UTF-8.<BR>> > > =3D20
> ><BR>> > > > What comes =
=3D
> out when=3D20
> the XML is parsed depends on the XML parser. XML is logically =
=3D
> expressed=3D20
> in Unicode. The Windows XML parsers provide a Unicode =3D
> interface. =3D20
> Other parsers could do differently.<BR>> > =3D
> > =3D20
> ><BR>> > > >Rich<BR>> =3D
> > =3D20
> > ><BR>> > > =
><BR>> =3D
> > > > "Ellen K." <<A=3D20
> =3D
> =
href=3D3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>=
&g=3D
> t;=3D20
> wrote in message <A=3D20
> =3D
> =
href=3D3D"news:4o2g11pu048kafbdilg46u77vs5ls0be55@4ax.com">news:4o2g11pu0=
48=3D
> kafbdilg46u77vs5ls0be55@4ax.com</A>...<BR>> =3D20
> > > > Our new enterprise system is going =
to =3D
> be=3D20
> built around an Enterprise<BR>> > > =3D
> > =3D20
> Service Bus. I don't have the full specs yet but as I =
understand =3D
> it=3D20
> the<BR>> > > > main apps =
(starting =3D
> with=3D20
> SalesForce) are going to be out on the internet<BR>> =3D
> > =3D20
> > > and the Sonic ESB will be the messaging =3D
> piece. =3D20
> There will be an<BR>> > > =
> =3D20
> Operational Data Store in house that will get updated every night =
on=3D20
> a<BR>> > > > batch basis from =
the =3D
> main=3D20
> apps. <BR>> > > ><BR>> =
=3D
> > =3D20
> > > My data warehouse will continue to be the data =
=3D
> warehouse=3D20
> and will remain<BR>> > > > in =
=3D
> house. =3D20
> The dimensions will stay the same but I might have to =3D
> create<BR>> =3D20
> > > > separate measures for the data from =
=3D
> the new=3D20
> apps and then create views<BR>> > > =3D
> > to=3D20
> keep everything transparent to the users. =
<BR>> =3D20
> > > ><BR>> > > =3D
> > I'm=3D20
> thinking if we're going to have an ODS in house already, I may=3D20
> as<BR>> > > > well do the ETL =
from =3D
> there. But I'm worrying that the new data =3D
> will<BR>> =3D20
> > > > probably be unicode (because Java =
=3D
> defaults to=3D20
> that and SalesForce is<BR>> > > =
> =3D
> written=3D20
> in Java). Right now I am storing everything (except our=3D20
> blobs<BR>> > > > of course) in =
=3D
> 8-bit=3D20
> characters. <BR>> > > =3D20
> ><BR>> > > > Anyone here =
who's =3D
> up on=3D20
> this stuff, can the XML that goes back and forth<BR>> =3D
> > =3D20
> > > convert between unicode and 8-bit characters, =
or =3D
> am I=3D20
> gonna have to<BR>> > > > =
redefine =3D
> all my=3D20
> data? For example, if unicode data is put into an=3D20
> XML<BR>> > > > document that =3D
> specifies=3D20
> UTF-8, what comes out when the document is<BR>> > =
=3D
> > =3D20
> > parsed? How about vice versa? If this is too =
=3D
> simplistic=3D20
> to work, what<BR>> > > > is=3D20
> needed?<BR>> > > ><BR>> =3D
> > =3D20
> > > (We actually have no substantive need for =3D
> unicode -- we=3D20
> are bilingual<BR>> > > > Spanish =
=3D
> but all=3D20
> the special Spanish characters exist in the ascii<BR>> =3D
> > =3D20
> > > character set.)<BR></BLOCKQUOTE></BODY></HTML>
------=_NextPart_000_0A6B_01C51860.770EC1B0
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-8859-1">
<META content=3D"MSHTML 6.00.3790.1289" name=3DGENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV><FONT face=3DArial size=3D2> If SalesForce creates =
UTF-8 encoded=20
XML, and it really is UTF-8 encoded, then you should have no = problem.
The=20
XML is valid and any software you have that consumes XML should accept = it
just=20
fine.</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT> </DIV>
<DIV><FONT face=3DArial size=3D2> I don't see a problem with =
UTF-16 if=20
you prefer this to UTF-8. Both can represent exactly the same=20
characters. The only difference is the mechanics of the=20
encoding.</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT> </DIV>
<DIV><FONT face=3DArial size=3D2>Rich</FONT></DIV>
<DIV> </DIV>
<BLOCKQUOTE=20
style=3D"PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; =
BORDER-LEFT: #000000 2px solid; MARGIN-RIGHT: 0px">
<DIV>"Ellen K" <<A=20
href=3D"mailto:Ellen.K@harborwebs.com">Ellen.K@harborwebs.com</A>> =
wrote in=20
message <A=20
=
href=3D"news:908e44.556f2c@harborwebs.com">news:908e44.556f2c@harborwebs.=
com</A>...</DIV>First=20
of all I have to say you have once again earned my eternal gratitude, =
I<BR>am=20
so glad you are here. :)<BR><BR>I still have to make a =
decision=20
regarding the standard I want to request. Are<BR>you saying that =
if for=20
example I get XML generated by say SalesForce with UTF-8<BR>specified =
and it=20
includes these characters, I would not have a problem? =
Stuff<BR>I write=20
myself I'm not worried about because I have the opportunity to =
tweak<BR>it,=20
what I'm worried about is what the ESB will try to feed my databases.=20
<BR>Since even the hand-typed XML was accepted when UTF-16 was =
specified, I'm=20
kind<BR>of leaning toward that, especially since the Oracle guy told =
me that=20
the TJ<BR>accounting manager used to complain that the Spanish =
characters=20
weren't coming<BR>out on reports when we were on Oracle 8.x which used =
UTF-8=20
but as soon as we<BR>moved to 9.x which uses UTF-16 there were no more =
problems. What (if any) do<BR>you see as a potential downside to =
UTF-16?<BR><BR><BR>> From: "Rich" <@><BR>> This is a =
multi-part=20
message in MIME format.<BR>>=20
------=3D_NextPart_000_08FA_01C517AB.C3617E00<BR>> Content-Type:=20
text/plain;<BR>> charset=3D"iso-8859-1"<BR>> =
Content-Transfer-Encoding:=20
quoted-printable<BR>> The problem is not that UTF-8 won't work for =
what you=20
call Spanish =3D<BR>> characters. UTF-8 can encode anything =
that you=20
consider a character. =3D<BR>> The problem is that when these =
characters are present they are not being =3D<BR>> encoded in =
UTF-8. =20
There are two straight forward solutions I see. One =3D<BR>> =
is to=20
encode the XML correctly in UTF-8. The other is to encode the =
XML=20
=3D<BR>> in the Windows ANSI encoding I suspect you are using and =
to tag it=20
=3D<BR>> correctly. On a Windows U.S. English system this =
would be=20
=3D<BR>> "Windows-1252". I would expect SQL Server to support =
this. Other =3D<BR>> applications may or may not. You =
could also=20
use UTF-16 as long as you =3D<BR>> generate your XML in =
UTF-16. I don't=20
know if this is simple for your =3D<BR>> application or not. =
If you are=20
using VB 7.0 it should be.<BR>> Rich<BR>> "Ellen K." <<A=20
=
href=3D"mailto:72322.1016@compuserve.com">72322.1016@compuserve.com</A>&g=
t;=20
wrote in message =3D<BR>> <A=20
=
href=3D"news:of2j11lqf4r4h1dsnvok8i1dv7c9e3rlb2@4ax.com">news:of2j11lqf4r=
4h1dsnvok8i1dv7c9e3rlb2@4ax.com</A>...<BR>>=20
Well. You just helped me learn a lot. For one =
thing, I=20
guess I<BR>> thought that because the Spanish characters can be =
expressed=20
as 8 bits<BR>> they were ASCII. =3D20<BR>> I just made a little =
test=20
stored procedure taking an XML document as a<BR>> parameter, =
created the=20
document manually in Notepad with UTF-8 =3D<BR>> specified<BR>> =
in the=20
header and tried including some of the Spanish characters... =
=3D<BR>>=20
and<BR>> it failed. SQL Server could not execute=20
sp_xml_preparedocument =3D<BR>> because<BR>> "an invalid =
character was=20
found in text content". Just to make sure<BR>> that was =
the=20
problem I substituted non-Spanish characters for the<BR>> Spanish =
ones and=20
it executed fine. However, I can manually type the<BR>> text =
with the=20
Spanish characters into the varchar field if I open the<BR>> table =
in=20
EnterpriseManager and SQL Server is perfectly happy. OTOH, =
=3D<BR>>=20
if<BR>> I specify UTF-16, I get "Switch from current encoding to=20
specified<BR>> encoding not supported." Next thing I =
tried was=20
cloning the sproc to<BR>> write to the table with the nvarchar =
column,=20
still no joy, same error<BR>> message... but on changing the =
datatype of=20
the input parameter from =3D<BR>> text<BR>> to ntext it worked =
fine. =20
BUT here's the surprise (OK, to me it was a<BR>> surprise): =
If I=20
again clone the sproc to point to the table with the<BR>> varchar =
column,=20
but leave the input parameter as ntext and specify<BR>> UTF-16 in =
the=20
document header, it works. In other words, a varchar =3D<BR>> =
(and<BR>> presumably a char) column can successfully accept unicode =
data=20
even<BR>> though char and varchar are explicitly defined as =
non-unicode=20
=3D<BR>> datatypes!<BR>> Now I have to understand whether I have =
a problem=20
at work. I never<BR>> experienced this problem in real =
life=20
because none of the data we<BR>> currently send using XML includes =
any of=20
the Spanish characters. Is<BR>> the problem only going to =
occur if=20
the XML document is constructed =3D<BR>> using<BR>> the=20
concatenated-string method? Or would it happen any time an =
XML<BR>> document specified as UTF-8 included Spanish=20
characters? (The data<BR>> sent by SalesForce to the =
ODS is=20
likely to include Spanish characters,<BR>> but it probably creates =
the XML=20
some other way.) Do I need to tell =3D<BR>> the<BR>> =
consulting=20
outfit to specify all XML as UTF-16? =3D20<BR>> For the ETL from =
the ODS to=20
the data warehouse I am not planning to =3D<BR>> use<BR>> Sonic, =
but=20
rather probably to link the databases and use a bunch of<BR>> =
stored=20
procedures controlled by some VB code, IOW I will not need XML<BR>> =
because=20
all the extract sprocs will look like INSERT INTO....SELECT<BR>>=20
FROM.<BR>> I don't yet understand why UTF-8 can't work for the=20
Spanish<BR>> characters... (unless it only doesn't work when the =
characters=20
are<BR>> manually typed into the document). If I correctly =
understand=20
the<BR>> document to which you referred me, an 8-bit character =
can't fit in=20
one<BR>> UTF-8 byte because the first bit is reserved for =
indicating which=20
is =3D<BR>> the<BR>> first byte of a UTF-8 multi-byte =
character. =20
(This was your point =3D<BR>> about<BR>> not greater than=20
0x7F.) But why wouldn't it just make two bytes out =
=3D<BR>>=20
of<BR>> the Spanish characters then? The documentation =
says=20
UTF-8 uses<BR>> multiple bytes for the characters that it can't fit =
into=20
one byte. =3D20<BR>> ???<BR>> On Sun, 20 Feb 2005 13:10:40 =
-0800, "Rich"=20
<@> wrote in message<BR>> <<A=20
=
href=3D"mailto:4218fc91@w3.nls
|