Text 2648, 261 rader
Skriven 2005-02-20 23:16:16 av Ellen K. (1:379/45)
Kommentar till text 2641 av Rich (1:379/45)
Ärende: Re: ESB / XML / Unicode vs 8-bit characters ?
=====================================================
From: Ellen K. <72322.1016@compuserve.com>
Well. You just helped me learn a lot. For one thing, I guess I
thought that because the Spanish characters can be expressed as 8 bits they
were ASCII.
I just made a little test stored procedure taking an XML document as a
parameter, created the document manually in Notepad with UTF-8 specified in the
header and tried including some of the Spanish characters... and it failed.
SQL Server could not execute sp_xml_preparedocument because "an invalid
character was found in text content". Just to make sure
that was the problem I substituted non-Spanish characters for the Spanish ones
and it executed fine. However, I can manually type the text with the Spanish
characters into the varchar field if I open the table in EnterpriseManager and
SQL Server is perfectly happy. OTOH, if I specify UTF-16, I get "Switch from
current encoding to specified encoding not supported." Next thing I tried was
cloning the sproc to
write to the table with the nvarchar column, still no joy, same error
message... but on changing the datatype of the input parameter from text to
ntext it worked fine. BUT here's the surprise (OK, to me it was a surprise):
If I again clone the sproc to point to the table with the varchar column, but
leave the input parameter as ntext and specify UTF-16 in the document header,
it works. In other words, a varchar (and presumably a char) column can
successfully accept unicode data even though char and varchar are explicitly
defined as non-unicode datatypes!
Now I have to understand whether I have a problem at work. I never
experienced this problem in real life because none of the data we currently
send using XML includes any of the Spanish characters. Is the problem only
going to occur if the XML document is constructed using the concatenated-string
method? Or would it happen any time an XML
document specified as UTF-8 included Spanish characters? (The data
sent by SalesForce to the ODS is likely to include Spanish characters, but it
probably creates the XML some other way.) Do I need to tell the consulting
outfit to specify all XML as UTF-16?
For the ETL from the ODS to the data warehouse I am not planning to use Sonic,
but rather probably to link the databases and use a bunch of stored procedures
controlled by some VB code, IOW I will not need XML because all the extract
sprocs will look like INSERT INTO....SELECT FROM.
I don't yet understand why UTF-8 can't work for the Spanish characters...
(unless it only doesn't work when the characters are manually typed into the
document). If I correctly understand the document to which you referred me, an
8-bit character can't fit in one UTF-8 byte because the first bit is reserved
for indicating which is the first byte of a UTF-8 multi-byte character. (This
was your point about not greater than 0x7F.) But why wouldn't it just make
two bytes out of
the Spanish characters then? The documentation says UTF-8 uses
multiple bytes for the characters that it can't fit into one byte.
???
On Sun, 20 Feb 2005 13:10:40 -0800, "Rich" <@> wrote in message
<4218fc91@w3.nls.net>:
> The Spanish accented characters are not part of ASCII. They are part of
Windows calls ANSI of which ASCII is the subset (0x00 to 0x7F). Any character
in the 0x80 to 0xFF range is not compatible between ANSI and UTF-8.
>
>Rich
>
> "Ellen K." <72322.1016@compuserve.com> wrote in message
news:7ouh119ivmuk26icg3mqqqk2ss1lfm5c10@4ax.com...
> Should not have any non-ASCII characters, as previously noted all the
> special Spanish characters are available in the ASCII character set.
> And since the company is built on our understanding of the Hispanic
> market, I don't see any use of, say, pictograph-based languages in the
> foreseeable future. If 10 years down the road something like that
> happens, well, by then we will no longer need compatibility with the
> current legacy system because it will long since have been replaced.
>
> On Sun, 20 Feb 2005 12:52:25 -0800, "Rich" <@> wrote in message
> <4218f849$1@w3.nls.net>:
>
> > From what you describe below, if the values you emit to XML have
non-ASCII characters I would expect you to have a problem.
> >
> >Rich
> >
> > "Ellen K." <72322.1016@compuserve.com> wrote in message
news:eanh11h4vv6b9v21fiaounii3f5dunjl3g@4ax.com...
> > On Sat, 19 Feb 2005 23:32:37 -0800, "Rich" <@> wrote in message
> > <42183ccd@w3.nls.net>:
> >
> > > The UTF in UTF-8/16/32 stands for Unicode Transformation Format. You
can find these defined in section 2.5 of
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf.
> >
> > THANK YOU SO MUCH!!! :)
> > >
> > > It's not clear to me how you are creating the XML from the templates.
If ANSI data is emitted into an XML document declared as UTF-8 then you would
have problems only for non-ASCII characters. UTF-8 and Windows-1252 are
identical for 0x00 to 0x7F which is ASCII in both.
> >
> > I don't have a copy of a template here at home, but I have them create
> > it by string concatenation because that seems to be the only way to be
> > able to have CDATA attributes, which I have to have because in the
> > legacy data numeric-appearing identifiers are actually 10-character
> > strings with leading spaces, and if these are not specified as CDATA
> > the spaces go lost even with "xml:space="preserve"" included in the
> > header. Here is a code snippet from one of my apps that creates an XML
> > document which is passed as a parameter to a SQL Server stored
> > procedure:
> >
> > > strXM = "<?xml version =" & Chr(34) & "1.0" & Chr(34) & "
encoding=" & Chr(34) & "UTF-8" & Chr(34) & "?>" & vbCrLf _
> > > & "<ROOT xml:space=" & Chr(34) & "preserve" & Chr(34) & ">" &
vbCrLf
> > >
> > > Do While Not .EOF
> > > strXM = strXM & "<M><A>" & !Ofc & "</A><B><![CDATA[" & !Contract
& "]]></B><C>" & !TCode & "</C><D>" & !Date & "</D>" _
> > > & "<E><![CDATA[" & !TransNo & "]]></E></M>" & vbCrLf
> > > .MoveNext
> > > Loop
> > >
> > > strXM = strXM & "</ROOT>"
> >
> > (The vbCrLf's are there so if there is a problem the document can be
> > printed to a text file and be easier for humans to read -- SQL Server
> > ignores them. The single-character aliases for entity and attribute
> > names are for performance -- for most of the stuff we use these for it
> > doesn't really matter because we are only sending a few rows, but the
> > first time I did it it was for something that was sending about 5000
> > rows and there it made a huge difference, so I stuck with it. We
> > comment both the front-end code and the stored procedure with the
> > mappings of these aliases.)
> >
> > > I do not know how SQL Server maps from char to nchar, specifically
what conversion is performed. Also, in some (maybe all released) versions of
SQL Server nchar and nvarchar are encoded in UCS-2. UCS-2 is a 16-bit encoding
like UTF-16. It dates back to when Unicode was defined as having 2**16
characters instead of the 2**20+ that it has now. You can not express
characters >= U+10000 in UCS-2 not that you care about these.
> >
> > Thankfully, no. :)
> > >
> > > I don't know if whether those systems you describe being written in
java make a difference. They can do what they want. The native java string is
Unicode though I don't remember if it is UCS-2 or UTF-16. My guess is that it
was once the former and is now the latter. One of the documents on this on
sun's site suggests that java used UCS-2 until the recently released 1.5 which
is the first to use UTF-16.
> >
> > The Java native string being unicode is exactly what made me start
> > worrying -- when I was learning Java a couple of years ago (because I
> > wanted to port an app to it so as to be able to run it right on the Unix
> > box where the Oracle database was) I was horrified the first time I
> > tried reading back what I had written to a text file when I saw spaces
> > between all the characters.
> > >
> > >Rich
> > >
> > > "Ellen K." <72322.1016@compuserve.com> wrote in message
news:aqag115606i9g8bmh3lst66une1f1sotth@4ax.com...
> > > UTF-8 is unicode?!? Sheesh, all this time I thought it meant 8-bit.
> > > In fact I could swear I read that somewhere.
> > >
> > > My question was coming from the database perspective, where I always
use
> > > char and varchar, as opposed to nchar and nvarchar. I give the
> > > front-end guys little templates for creating the XML documents for all
> > > my SQL Server stored procedures that take XML input, and I always
> > > specify UTF-8 in the header... and my char and varchar columns always
> > > end up normal, so since you're now telling me UTF-8 is really unicode,
I
> > > guess that would answer my question for XML data I would be getting
from
> > > the apps...? Or would the answer be different if the incoming XML
is
> > > some other encoding?
> > >
> > > To simulate getting nvarchar data from somewhere, I just tried
creating
> > > two dummy tables, one with an nvarchar column and the other with a
> > > varchar column, typed stuff into the nvarchar one, then inserted to
the
> > > varchar one select from the nvarchar one and it looks normal.
> > >
> > > If all this means I was worrying about nothing, excellent! OTOH, is
> > > there something I should be worrying about that I didn't ask?
> > >
> > > The only pieces whose names I know so far are Sonic and SalesForce,
both
> > > of which are written in Java, if that makes any difference. I know
> > > there is at least one other external piece but I think that is the
next
> > > phase.
> > >
> > > On Sat, 19 Feb 2005 21:37:15 -0800, "Rich" <@> wrote in message
> > > <421821c1$1@w3.nls.net>:
> > >
> > > > You need to be more specific than "8-bit characters". There are
many 8-bit character encodings. If you are using Windows to generate your data
you most likely are using Windows-1252 which is the default 8-bit character set
for U.S. English in Windows. Windows supports many 8-bit encodings so you
could be using something else too.
> > > >
> > > > Unicode is a character set not an encoding. There are multiple
encodings the main ones being UTF-8, UTF-16, and UTF-32. You can use any of
these for XML as well as non-Unicode encodings. For interoperability you
should use Unicode preferably UTF-8.
> > > >
> > > > What comes out when the XML is parsed depends on the XML parser.
XML is logically expressed in Unicode. The Windows XML parsers provide a
Unicode interface. Other parsers could do differently.
> > > >
> > > >Rich
> > > >
> > > >
> > > > "Ellen K." <72322.1016@compuserve.com> wrote in message
news:4o2g11pu048kafbdilg46u77vs5ls0be55@4ax.com...
> > > > Our new enterprise system is going to be built around an Enterprise
> > > > Service Bus. I don't have the full specs yet but as I understand
it the
> > > > main apps (starting with SalesForce) are going to be out on the
internet
> > > > and the Sonic ESB will be the messaging piece. There will be an
> > > > Operational Data Store in house that will get updated every night
on a
> > > > batch basis from the main apps.
> > > >
> > > > My data warehouse will continue to be the data warehouse and will
remain
> > > > in house. The dimensions will stay the same but I might have to
create
> > > > separate measures for the data from the new apps and then create
views
> > > > to keep everything transparent to the users.
> > > >
> > > > I'm thinking if we're going to have an ODS in house already, I may
as
> > > > well do the ETL from there. But I'm worrying that the new data
will
> > > > probably be unicode (because Java defaults to that and SalesForce
is
> > > > written in Java). Right now I am storing everything (except our
blobs
> > > > of course) in 8-bit characters.
> > > >
> > > > Anyone here who's up on this stuff, can the XML that goes back and
forth
> > > > convert between unicode and 8-bit characters, or am I gonna have to
> > > > redefine all my data? For example, if unicode data is put into an
XML
> > > > document that specifies UTF-8, what comes out when the document is
> > > > parsed? How about vice versa? If this is too simplistic to work,
what
> > > > is needed?
> > > >
> > > > (We actually have no substantive need for unicode -- we are
bilingual
> > > > Spanish but all the special Spanish characters exist in the ascii
> > > > character set.)
--- BBBS/NT v4.01 Flag-5
* Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)
|