Network Working Group                                         P. Hoffman
Request for Comments: 2781                      Internet Mail Consortium
Category: Informational                                       F. Yergeau
                                                       Alis Technologies
                                                           February 2000
UTF-16, an encoding of ISO 10646
Status of this Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2000). All Rights Reserved.
1. Introduction
This document describes the UTF-16 encoding of Unicode/ISO-10646,
addresses the issues of serializing UTF-16 as an octet stream for
transmission over the Internet, discusses MIME charset naming as
described in [CHARSET-REG], and contains the registration for three
MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE
(little-endian), and UTF-16.
1.1 Background and motivation
The Unicode Standard [UNICODE] and ISO/IEC 10646 [ISO-10646] jointly
define a coded character set (CCS), hereafter referred to as Unicode,
which encompasses most of the world's writing systems [WORKSHOP].
UTF-16, the object of this specification, is one of the standard ways
of encoding Unicode character data; it has the characteristics of
encoding all currently defined characters (in plane 0, the BMP) in
exactly two octets and of being able to encode all other characters
likely to be defined (the next 16 planes) in exactly four octets.

The Unicode Standard further defines additional character properties
and other application details of great interest to implementors. Up
to the present time, changes in Unicode and amendments to ISO/IEC
10646 have tracked each other, so that the character repertoires and
code point assignments have remained in sync. The relevant
standardization committees have committed to maintain this very
useful synchronism, as well as not to assign characters outside of
the 17 planes accessible to UTF-16.
Hoffman & Yergeau Informational [Page 1]
RFC 2781 UTF-16, an encoding of ISO 10646 February 2000
The IETF policy on character sets and languages [CHARPOLICY] says
that IETF protocols MUST be able to use the UTF-8 character encoding
scheme [UTF-8]. Some products and network standards already specify
UTF-16, making it an important encoding for the Internet. This
document is not an update to the [CHARPOLICY] document, only a
description of the UTF-16 encoding.
1.2 Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [MUSTSHOULD].

Throughout this document, character values are shown in hexadecimal
notation. For example, "0x013C" is the character whose value is the
character assigned the integer value 316 (decimal) in the CCS.
2. UTF-16 definition
UTF-16 is described in the Unicode Standard, version 3.0 [UNICODE].
The definitive reference is Annex Q of ISO/IEC 10646-1 [ISO-10646].
The rest of this section summarizes the definition in simple terms.

In ISO 10646, each character is assigned a number, which Unicode
calls the Unicode scalar value. This number is the same as the UCS-4
value of the character, and this document will refer to it as the
"character value" for brevity. In the UTF-16 encoding, characters are
represented using either one or two unsigned 16-bit integers,
depending on the character value. Serialization of these integers for
transmission as a byte stream is discussed in Section 3.
The rules for how characters are encoded in UTF-16 are:
- Characters with values less than 0x10000 are represented as a
  single 16-bit integer with a value equal to that of the character
  number.
- Characters with values between 0x10000 and 0x10FFFF are
  represented by a 16-bit integer with a value between 0xD800 and
  0xDBFF (within the so-called high-half zone or high surrogate
  area) followed by a 16-bit integer with a value between 0xDC00 and
  0xDFFF (within the so-called low-half zone or low surrogate area).
- Characters with values greater than 0x10FFFF cannot be encoded in
  UTF-16.
Note: Values between 0xD800 and 0xDFFF are specifically reserved for
use with UTF-16, and don't have any characters assigned to them.
2.1 Encoding UTF-16
Encoding of a single character from an ISO 10646 character value to
UTF-16 proceeds as follows. Let U be the character number, no greater
than 0x10FFFF.
1) If U < 0x10000, encode U as a 16-bit unsigned integer and
   terminate.
2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF,
   U' must be less than or equal to 0xFFFFF. That is, U' can be
   represented in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and
   0xDC00, respectively. These integers each have 10 bits free to
   encode the character value, for a total of 20 bits.
4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order
   bits of W1 and the 10 low-order bits of U' to the 10 low-order
   bits of W2. Terminate.
Graphically, steps 2 through 4 look like:
U' = yyyyyyyyyyxxxxxxxxxx
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx
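As an illustration only, the four encoding steps can be sketched in C
as follows. The function name utf16_encode and its return convention
are assumptions made for this sketch, not part of the specification.
Values in the range 0xD800 to 0xDFFF are passed through by step 1
exactly as the steps describe; a caller may prefer to reject them.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: encodes character value u into out[0..1].
 * Returns the number of 16-bit integers written (1 or 2), or 0 if
 * u is greater than 0x10FFFF and so cannot be encoded in UTF-16. */
size_t utf16_encode(uint32_t u, uint16_t out[2])
{
    if (u > 0x10FFFF)
        return 0;                       /* outside the 17 planes */
    if (u < 0x10000) {                  /* step 1 */
        out[0] = (uint16_t)u;
        return 1;
    }
    uint32_t u_prime = u - 0x10000;     /* step 2: fits in 20 bits */
    /* steps 3 and 4: W1 carries the 10 high-order bits of U',
     * W2 carries the 10 low-order bits */
    out[0] = (uint16_t)(0xD800 | (u_prime >> 10));
    out[1] = (uint16_t)(0xDC00 | (u_prime & 0x3FF));
    return 2;
}
```

For the character value 0x12345, U' is 0x02345, so W1 is 0xD808 and
W2 is 0xDF45.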
2.2 Decoding UTF-16
Decoding of a single character from UTF-16 to an ISO 10646 character
value proceeds as follows. Let W1 be the next 16-bit integer in the
sequence of integers representing the text. Let W2 be the (eventual)
next integer following W1.
1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value
   of W1. Terminate.
2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
   is in error and no valid character can be obtained using W1.
   Terminate.
3) If there is no W2 (that is, the sequence ends with W1), or if W2
   is not between 0xDC00 and 0xDFFF, the sequence is in error.
   Terminate.
4) Construct a 20-bit unsigned integer U', taking the 10 low-order
   bits of W1 as its 10 high-order bits and the 10 low-order bits of
   W2 as its 10 low-order bits.
5) Add 0x10000 to U' to obtain the character value U. Terminate.
Note that steps 2 and 3 indicate errors. Error recovery is not
specified by this document. When terminating with an error in steps 2
and 3, it may be wise to set U to the value of W1 to help the caller
diagnose the error and not lose information. Also note that a string
decoding algorithm, as opposed to the single-character decoding
described above, need not terminate upon detection of an error, if
proper error reporting and/or recovery is provided.
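A minimal C sketch of the single-character decoding steps is given
below. The name utf16_decode and the convention of returning 0 on
error are assumptions of this sketch; on error it stores W1 in the
output, following the suggestion in the note above.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: decodes one character from w[0..n-1].
 * On success stores the character value in *u and returns the number
 * of 16-bit integers consumed (1 or 2). On error returns 0 and leaves
 * W1 in *u to help the caller diagnose the problem. */
size_t utf16_decode(const uint16_t *w, size_t n, uint32_t *u)
{
    if (n == 0) {
        *u = 0;
        return 0;                           /* nothing to decode */
    }
    uint16_t w1 = w[0];
    *u = w1;
    if (w1 < 0xD800 || w1 > 0xDFFF)         /* step 1 */
        return 1;
    if (w1 > 0xDBFF)                        /* step 2: W1 is not a high half */
        return 0;
    if (n < 2 || w[1] < 0xDC00 || w[1] > 0xDFFF)
        return 0;                           /* step 3: W2 missing or invalid */
    /* step 4: combine the two 10-bit halves into the 20-bit U' */
    uint32_t u_prime = ((uint32_t)(w1 & 0x3FF) << 10) | (uint32_t)(w[1] & 0x3FF);
    *u = u_prime + 0x10000;                 /* step 5 */
    return 2;
}
```

A string decoder built on such a function could skip one integer
after an error and continue, which is one possible recovery policy
among several.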
3. Labelling UTF-16 text
Appendix A of this specification contains registrations for three
MIME charsets: "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets
represent the combination of a CCS (a coded character set) and a CES
(a character encoding scheme). Here the CCS is Unicode/ISO 10646 and
the CES is the same in all three cases, except for the serialization
order of the octets in each character, and the external determination
of which serialization is used.

This section describes which of the three labels to apply to a stream
of text. Section 4 describes how to interpret the labels on a stream
of text.
3.1 Definition of big-endian and little-endian
Historically, computer hardware has processed two-octet entities such
as 16-bit integers in one of two ways. So-called "big-endian"
hardware handles two-octet entities with the higher-order octet
first, that is at the lower address in memory; when written out to
disk or to a network interface (serializing), the high-order octet
thus appears first in the data stream. On the other hand,
"little-endian" hardware handles two-octet entities with the
lower-order octet first. Hardware of both kinds is common today.

For example, the unsigned 16-bit integer that represents the decimal
number 258 is 0x0102. The big-endian serialization of that number is
the octet 0x01 followed by the octet 0x02. The little-endian
serialization of that number is the octet 0x02 followed by the octet
0x01. The following C code fragment demonstrates a way to write
16-bit quantities to a file in big-endian order, irrespective of the
hardware's native byte order.
#include <stdio.h>

void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */
{
  putc((u >> 8) & 0xFF, f); /* output high-order byte */
  putc(u & 0xFF, f);        /* then low-order */
}
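For comparison, a little-endian counterpart can be sketched in the
same style. The name write_le is an assumption of this sketch and
does not appear in the specification; only the order of the two
putc() calls differs.

```c
#include <stdio.h>

/* Hypothetical counterpart to write_be: writes the low-order octet
 * first, irrespective of the hardware's native byte order. */
void write_le(unsigned short u, FILE *f) /* assume short is 16 bits */
{
    putc(u & 0xFF, f);         /* output low-order byte */
    putc((u >> 8) & 0xFF, f);  /* then high-order */
}
```

Calling write_le(0x0102, f) would emit the octet 0x02 followed by the
octet 0x01.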
The term "network byte order" has been used in many RFCs to indicate
big-endian serialization, although that term has yet to be formally
defined in a standards-track document. Although ISO 10646 prefers
big-endian serialization (section 6.3 of [ISO-10646]), little-endian
order is also sometimes used on the Internet.
3.2 Byte order mark (BOM)
The Unicode Standard and ISO 10646 define the character "ZERO WIDTH
NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE
ORDER MARK" (abbreviated "BOM"). The latter name hints at a second
possible usage of the character, in addition to its normal use as a
genuine "ZERO WIDTH NON-BREAKING SPACE" within text. This usage,
suggested by Unicode section 2.4 and ISO 10646 Annex F (informative),
is to prepend a 0xFEFF character to a stream of Unicode characters as
a "signature"; a receiver of such a serialized stream may then use
the initial character both as a hint that the stream consists of
Unicode characters and as a way to recognize the serialization order.
In serialized UTF-16 prepended with such a signature, the order is
big-endian if the first two octets are 0xFE followed by 0xFF; if they
are 0xFF followed by 0xFE, the order is little-endian. Note that
0xFFFE is not a Unicode character, precisely to preserve the
usefulness of 0xFEFF as a byte-order mark.
It is important to understand that the character 0xFEFF appearing at
any position other than the beginning of a stream MUST be interpreted
with the semantics for the zero-width non-breaking space, and MUST
NOT be interpreted as a byte-order mark. The contrapositive of that
statement is not always true: the character 0xFEFF in the first
position of a stream MAY be interpreted as a zero-width non-breaking
space, and is not always a byte-order mark. For example, if a process
splits a UTF-16 string into many parts, a part might begin with
0xFEFF because there was a zero-width non-breaking space at the
beginning of that substring.
The Unicode standard further suggests that an initial 0xFEFF
character may be stripped before processing the text, the rationale
being that such a character in initial position may be an artifact of
the encoding (an encoding signature), not a genuine intended "ZERO
WIDTH NON-BREAKING SPACE". Note that such stripping might affect an
external process at a different layer (such as a digital signature or
a count of the characters) that is relying on the presence of all
characters in the stream.
In particular, in UTF-16 plain text it is likely, but not certain,
that an initial 0xFEFF is a signature. When concatenating two
strings, it is important to strip out those signatures, because
otherwise the resulting string may contain an unintended "ZERO WIDTH
NON-BREAKING SPACE" at the connection point. Also, some
specifications mandate an initial 0xFEFF character in objects
labelled as UTF-16 and specify that this signature is not part of the
object.
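The concatenation concern can be illustrated with a small C sketch
operating on already de-serialized 16-bit integers. The function
name utf16_append is hypothetical, and the sketch assumes that a
leading 0xFEFF in the appended string is a signature, which the text
above notes is likely but not certain.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends src[0..src_len-1] to dst, which holds
 * dst_len integers and must have room for dst_len + src_len. One
 * leading 0xFEFF in src is dropped on the ASSUMPTION that it is a
 * signature rather than a genuine zero-width non-breaking space.
 * Returns the new length of dst. */
size_t utf16_append(uint16_t *dst, size_t dst_len,
                    const uint16_t *src, size_t src_len)
{
    if (src_len > 0 && src[0] == 0xFEFF) {
        src++;                  /* skip the presumed signature */
        src_len--;
    }
    memcpy(dst + dst_len, src, src_len * sizeof *src);
    return dst_len + src_len;
}
```

Any signature at the start of dst is left in place, since it is still
in initial position after the concatenation.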
3.3 Choosing a label for UTF-16 text
Any labelling application that uses UTF-16 character encoding, and
explicitly labels the text, and knows the serialization order of the
characters in text, SHOULD label the text as either "UTF-16BE" or
"UTF-16LE", whichever is appropriate based on the endianness of the
text. This allows applications processing the text, but unable to
look inside the text, to know the serialization definitively.

Text in the "UTF-16BE" charset MUST be serialized with the octets
which make up a single 16-bit UTF-16 value in big-endian order.
Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.

Text in the "UTF-16LE" charset MUST be serialized with the octets
which make up a single 16-bit UTF-16 value in little-endian order.
Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.

Any labelling application that uses UTF-16 character encoding, and
puts an explicit charset label on the text, and does not know the
serialization order of the characters in text, MUST label the text as
"UTF-16", and SHOULD make sure the text starts with 0xFEFF.

An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE"
would occur with document formats that mandate a BOM in UTF-16 text,
thereby requiring the use of the "UTF-16" tag only.
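The labelling rules of this section can be condensed into a short C
sketch. The enum, its constants, and the function name are all
hypothetical, introduced only to make the decision explicit.

```c
/* Hypothetical type: what the labelling application knows about the
 * serialization order of the text. */
enum utf16_order { ORDER_UNKNOWN, ORDER_BIG, ORDER_LITTLE };

/* Returns the charset label suggested by the rules above.
 * format_mandates_bom is nonzero for document formats that require
 * a BOM in UTF-16 text. */
const char *utf16_choose_label(enum utf16_order order, int format_mandates_bom)
{
    if (order == ORDER_UNKNOWN || format_mandates_bom)
        return "UTF-16";    /* the text SHOULD then start with 0xFEFF */
    /* order is known: no BOM may be prepended under these labels */
    return order == ORDER_BIG ? "UTF-16BE" : "UTF-16LE";
}
```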
4. Interpreting text labels
When a program sees text labelled as "UTF-16BE", "UTF-16LE", or
"UTF-16", it can make some assumptions, based on the labelling rules
given in the previous section. These assumptions allow the program to
then process the text.
4.1 Interpreting text labelled as UTF-16BE
Text labelled "UTF-16BE" can always be interpreted as being
big-endian. The detection of an initial BOM does not affect
de-serialization of text labelled as UTF-16BE. Finding 0xFF followed
by 0xFE is an error since there is no Unicode character 0xFFFE.
4.2 Interpreting text labelled as UTF-16LE
Text labelled "UTF-16LE" can always be interpreted as being
little-endian. The detection of an initial BOM does not affect
de-serialization of text labelled as UTF-16LE. Finding 0xFE followed
by 0xFF is an error since there is no Unicode character 0xFFFE, which
would be the interpretation of those octets under little-endian
order.
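The two preceding subsections can be sketched as a single C routine
that de-serializes octets in the order fixed by the label and reports
the error case. The function name and the -1 error code are
assumptions of this sketch, as is the choice to flag 0xFFFE at any
position rather than only at the start of the text.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: de-serializes n_units 16-bit integers from
 * octets, big-endian if big_endian is nonzero ("UTF-16BE"), otherwise
 * little-endian ("UTF-16LE"). Returns 0 on success, or -1 as soon as
 * an integer equal to 0xFFFE is found, since 0xFFFE is not a Unicode
 * character. */
int utf16_deserialize(const unsigned char *octets, size_t n_units,
                      int big_endian, uint16_t *out)
{
    for (size_t i = 0; i < n_units; i++) {
        unsigned hi = octets[2 * i + (big_endian ? 0 : 1)];
        unsigned lo = octets[2 * i + (big_endian ? 1 : 0)];
        out[i] = (uint16_t)((hi << 8) | lo);
        if (out[i] == 0xFFFE)
            return -1;      /* looks like a BOM of the opposite order */
    }
    return 0;
}
```

A 0xFEFF in initial position is returned like any other integer: the
label, not the BOM, determines the order here.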
4.3 Interpreting text labelled as UTF-16
Text labelled with the "UTF-16" charset might be serialized in either
big-endian or little-endian order. If the first two octets of the
text are 0xFE followed by 0xFF, then the text can be interpreted as
being big-endian. If the first two octets of the text are 0xFF
followed by 0xFE, then the text can be interpreted as being
little-endian. If the first two octets of the text are not 0xFE
followed by 0xFF, and are not 0xFF followed by 0xFE, then the text
SHOULD be interpreted as being big-endian.
All applications that process text with the "UTF-16" charset label
MUST be able to read at least the first two octets of the text and be
able to process those octets in order to determine the serialization
order of the text. Applications that process text with the "UTF-16"
charset label MUST NOT assume the serialization without first
checking the first two octets to see if they are a big-endian BOM, a
little-endian BOM, or not a BOM. All applications that process text
with the "UTF-16" charset label MUST be able to interpret both
big-endian and little-endian text.
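The check of the first two octets described in this subsection might
look like the following C sketch. The enum and function names are
hypothetical; the fallback to big-endian when no BOM is present
follows the SHOULD above.

```c
#include <stddef.h>

/* Hypothetical type for the outcome of the check. */
enum utf16_byte_order { UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/* Hypothetical helper for text labelled "UTF-16". Examines the first
 * two octets, returns the serialization order, and sets *bom_len to
 * 2 if a BOM was found or to 0 if not. */
enum utf16_byte_order utf16_detect_order(const unsigned char *octets,
                                         size_t len, size_t *bom_len)
{
    *bom_len = 0;
    if (len >= 2) {
        if (octets[0] == 0xFE && octets[1] == 0xFF) {
            *bom_len = 2;
            return UTF16_BIG_ENDIAN;
        }
        if (octets[0] == 0xFF && octets[1] == 0xFE) {
            *bom_len = 2;
            return UTF16_LITTLE_ENDIAN;
        }
    }
    return UTF16_BIG_ENDIAN;    /* not a BOM: interpret as big-endian */
}
```

Whether the caller then skips the bom_len octets or keeps them
depends on the considerations discussed in Section 3.2.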
5. Examples
For the sake of example, let's suppose that there is a hieroglyphic
character representing the Egyptian god Ra with character value
0x12345 (this character does not exist at present in Unicode).
The examples here all evaluate to the phrase:
*=Ra
where the "*" represents the Ra hieroglyph (0x12345).
Text labelled with UTF-16BE, without a BOM:
D8 08 DF 45 00 3D 00 52 00 61
Text labelled with UTF-16LE, without a BOM:
08 D8 45 DF 3D 00 52 00 61 00
Big-endian text labelled with UTF-16, with a BOM:
FE FF D8 08 DF 45 00 3D 00 52 00 61
Little-endian text labelled with UTF-16, with a BOM:
FF FE 08 D8 45 DF 3D 00 52 00 61 00
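The four octet sequences above can be reproduced with a short C
sketch that combines the encoding steps of Section 2.1 with an
explicit serialization order. The names put16 and utf16_serialize
are hypothetical, and the sketch assumes every character value is no
greater than 0x10FFFF.

```c
#include <stdint.h>
#include <stddef.h>

/* Writes one 16-bit integer as two octets in the requested order. */
static size_t put16(uint16_t w, int big_endian, unsigned char *out)
{
    out[0] = (unsigned char)(big_endian ? (w >> 8) : (w & 0xFF));
    out[1] = (unsigned char)(big_endian ? (w & 0xFF) : (w >> 8));
    return 2;
}

/* Hypothetical helper: serializes n character values into out,
 * optionally preceded by a BOM. out must hold at least 4 * n + 2
 * octets. Returns the number of octets written. */
size_t utf16_serialize(const uint32_t *chars, size_t n, int big_endian,
                       int with_bom, unsigned char *out)
{
    size_t len = 0;
    if (with_bom)
        len += put16(0xFEFF, big_endian, out + len);
    for (size_t i = 0; i < n; i++) {
        uint32_t u = chars[i];
        if (u < 0x10000) {
            len += put16((uint16_t)u, big_endian, out + len);
        } else {
            uint32_t u_prime = u - 0x10000;     /* 20-bit value */
            len += put16((uint16_t)(0xD800 | (u_prime >> 10)),
                         big_endian, out + len);
            len += put16((uint16_t)(0xDC00 | (u_prime & 0x3FF)),
                         big_endian, out + len);
        }
    }
    return len;
}
```

Passing the character values 0x12345, 0x3D, 0x52 and 0x61 with
big_endian set and with_bom clear yields the ten octets shown for
UTF-16BE.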
6. Versions of the standards
ISO/IEC 10646 is updated from time to time by published amendments;
similarly, different versions of the Unicode standard exist: 1.0,
1.1, 2.0, 2.1, and 3.0 as of this writing. Each new version replaces
the previous one, but implementations, and more significantly data,
are not updated instantly.

In general, the changes amount to adding new characters, which does
not pose particular problems with old data. Amendment 5 to ISO/IEC
10646, however, has moved and expanded the Korean Hangul block,
thereby making any previous data containing Hangul characters invalid
under the new version. Unicode 2.0 has the same difference from
Unicode 1.1. The official justification for allowing such an
incompatible change was that no significant implementations and data
containing Hangul existed, a statement that is likely to be true but
remains unprovable. The incident has been dubbed the "Korean mess",
and the relevant committees have pledged to never, ever again make
such an incompatible change.

New versions, and in particular any incompatible changes, have
consequences regarding MIME character encoding labels, to be
discussed in Appendix A.
7. IANA Considerations
IANA is to register the character sets found in Appendixes A.1, A.2,
and A.3 according to RFC 2278, using registration templates found in
those appendixes.
8. Security Considerations
UTF-16 is based on the ISO 10646 character set, which is frequently
being added to, as described in Section 6 and Appendix A of this
document. Processors must be able to handle characters that are not
defined at the time that the processor was created in such a way as
to not allow an attacker to harm a recipient by including unknown
characters.
Processors that handle any type of text, including text encoded as
UTF-16, must be vigilant in checking for control characters that
might reprogram a display terminal or keyboard. Similarly, processors
that interpret text entities (such as looking for embedded
programming code), must be careful not to execute the code without
first alerting the recipient.
Text in UTF-16 may contain special characters, such as the OBJECT
REPLACEMENT CHARACTER (0xFFFC), that might cause external processing,
depending on the interpretation of the processing program and the
availability of an external data stream that would be executed. This
external processing may have side-effects that allow the sender of a
message to attack the receiving system.
Implementors of UTF-16 need to consider the security aspects of how
they handle illegal UTF-16 sequences (that is, sequences involving
surrogate pairs that have illegal values or unpaired surrogates). It
is conceivable that in some circumstances an attacker would be able
to exploit an incautious UTF-16 parser by sending it an octet
sequence that is not permitted by the UTF-16 syntax, causing it to
behave in some anomalous fashion.
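One cautious approach is to validate a whole sequence before acting
on it. The following C sketch, using the hypothetical name
utf16_first_invalid, locates the first unpaired surrogate in a
sequence of 16-bit integers; what to do with an invalid sequence
remains a policy decision that this document does not specify.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: returns the index of the first unpaired
 * surrogate in w[0..n-1], or n if every surrogate is correctly
 * paired. */
size_t utf16_first_invalid(const uint16_t *w, size_t n)
{
    size_t i = 0;
    while (i < n) {
        if (w[i] >= 0xD800 && w[i] <= 0xDBFF) {
            /* high surrogate: a low surrogate must follow */
            if (i + 1 >= n || w[i + 1] < 0xDC00 || w[i + 1] > 0xDFFF)
                return i;
            i += 2;
        } else if (w[i] >= 0xDC00 && w[i] <= 0xDFFF) {
            return i;       /* low surrogate with no preceding high one */
        } else {
            i++;
        }
    }
    return n;
}
```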
9. References
[CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and
Languages", BCP 18, RFC 2277, January 1998.
[CHARSET-REG] Freed, N. and J. Postel, "IANA Charset Registration
Procedures", BCP 19, RFC 2278, January 1998.
[HTTP-1.1] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext
Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
[ISO-10646] ISO/IEC 10646-1:1993. International Standard --
Information technology -- Universal Multiple-Octet
Coded Character Set (UCS) -- Part 1: Architecture and
Basic Multilingual Plane. 22 amendments and two
technical corrigenda have been published up to now.
UTF-16 is described in Annex Q, published as Amendment
1. Many other amendments are currently at various
stages of standardization. A second edition is in
preparation, probably to be published in 2000; in this
new edition, UTF-16 will probably be described in Annex
C.
[MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[UNICODE] The Unicode Consortium, "The Unicode Standard --
Version 3.0", ISBN 0-201-61633-5. Described
standard/versions/Unicode3.0.html>.
[UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO
10646", RFC 2279, January 1998.
[WORKSHOP] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
Atkinson, R., Crispin, M. and P. Svanberg, "Report from
the IAB Character Set Workshop", RFC 2130, April 1997.
10. Acknowledgments
Deborah Goldsmith wrote a great deal of the initial wording for this
specification. Martin Duerst proposed numerous significant changes.
Other significant contributors include:
Mati
Walt
Mark
Ned
Asmus
Lloyd
Dan
Murata
Larry
Markus
Keld
Ken
Some of the text in this specification was copied from [UTF-8], and
that document was worked on by many people. Please see the
acknowledgments section in that document for more people who may have
contributed indirectly to this document.
A. Charset registrations
This memo is meant to serve as the basis for registration of three
MIME charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE",
"UTF-16LE", and "UTF-16". These strings label objects containing text
consisting of characters from the repertoire of ISO/IEC 10646
including all amendments at least up to amendment 5 (Korean block),
encoded to a sequence of octets using the encoding and serialization
schemes outlined above.
Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for
use in media types under the "text" top-level type, because they do
not encode line endings in the way required for MIME "text" media
types. An exception to this is HTTP, which uses a MIME-like
mechanism, but is exempt from the restrictions on the text top-level
type (see section 19.4.2 of HTTP 1.1 [HTTP-1.1]).

It is noteworthy that the labels described here do not contain a
version identification, referring generically to ISO/IEC 10646. This
is intentional, the rationale being as follows:
A MIME charset is designed to give just the information needed to
interpret a sequence of bytes received on the wire into a sequence of
characters, nothing more (see RFC 2045, section 2.2, in [MIME]). As
long as a character set standard does not change incompatibly,
version numbers serve no purpose, because one gains nothing by
learning from the tag that newly assigned characters may be received
that one doesn't know about. The tag itself doesn't teach anything
about the new characters, which are going to be received anyway.

Hence, as long as the standards evolve compatibly, the apparent
advantage of having labels that identify the versions is only that,
apparent. But there is a disadvantage to such version-dependent
labels: when an older application receives data accompanied by a
newer, unknown label, it may fail to recognize the label and be
completely unable to deal with the data, whereas a generic, known
label would have triggered mostly correct processing of the data,
which may well not contain any new characters.

The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible
change, in principle contradicting the appropriateness of a version
independent MIME charset as described above. But the compatibility
problem can only appear with data containing Korean Hangul characters
encoded according to Unicode 1.1 (or equivalently ISO/IEC 10646
before amendment 5), and there is arguably no such data to worry
about, this being the very reason the incompatible change was deemed
acceptable.
In practice, then, a version-independent label is warranted, provided
the label is understood to refer to all versions after Amendment 5,
and provided no incompatible change actually occurs. Should
incompatible changes occur in a later version of ISO/IEC 10646, the
MIME charsets defined here will stay aligned with the previous
version until and unless the IETF specifically decides otherwise.
A.1 Registration for UTF-16BE
To: ietf-charsets@iana.
Subject: Registration of new charset
Charset name(s): UTF-16BE
Published specification(s): This specification
Suitable for use in MIME content types under the
"text" top-level type: No
Person & email address to contact for further information:
Paul Hoffman
Francois Yergeau
A.2 Registration for UTF-16LE
To: ietf-charsets@iana.
Subject: Registration of new charset
Charset name(s): UTF-16LE
Published specification(s): This specification
Suitable for use in MIME content types under the
"text" top-level type: No
Person & email address to contact for further information:
Paul Hoffman
Francois Yergeau
A.3 Registration for UTF-16
To: ietf-charsets@iana.
Subject: Registration of new charset
Charset name(s): UTF-16
Published specification(s): This specification
Suitable for use in MIME content types under the
"text" top-level type: No
Person & email address to contact for further information:
Paul Hoffman
Francois Yergeau
Authors' Addresses
Paul Hoffman
Internet Mail Consortium
127 Segre
Santa Cruz, CA 95060
EMail: phoffman@imc.
Francois Yergeau
Alis Technologies
100, boul. Alexis-Nihon, Suite 600
Montreal QC H4M 2P2
EMail: fyergeau@alis.
Full Copyright Statement
Copyright (C) The Internet Society (2000). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Funding for the RFC Editor function is currently provided by the
Internet Society.