Network Working Group                                         F. Yergeau
Request for Comments: 2044                                          Alis
Category: Informational                                     October 1996

UTF-8, a transformation format of Unicode and ISO 10646

Status of this Memo

This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
Abstract

The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993 jointly
define a 16 bit character set which encompasses most of the world's
writing systems. 16-bit characters, however, are not compatible with
many current applications and protocols, and this has led to the
development of a few so-called UCS transformation formats (UTF), each
with different characteristics. UTF-8, the object of this memo, has
the characteristic of preserving the full US-ASCII range: US-ASCII
characters are encoded in one octet having the usual US-ASCII value,
and any octet with such a value can only be a US-ASCII character.
This provides compatibility with file systems, parsers and other
software that rely on US-ASCII values but are transparent to other
values.
1. Introduction

The Unicode Standard, version 1.1 [UNICODE], and ISO/IEC 10646-1:1993
[ISO-10646] jointly define a 16 bit character set, UCS-2, which
encompasses most of the world's writing systems. ISO 10646 further
defines a 31-bit character set, UCS-4, with currently no assignments
outside of the region corresponding to UCS-2 (the Basic Multilingual
Plane, BMP). The UCS-2 and UCS-4 encodings, however, are hard to use
in many current applications and protocols that assume 8 or even 7
bit characters. Even newer systems able to deal with 16 bit
characters cannot process UCS-4 data. This situation has led to the
development of so-called UCS transformation formats (UTF), each with
different characteristics.

UTF-1 has only historical interest, having been removed from ISO
10646. UTF-7 has the quality of encoding the full Unicode repertoire
using only octets with the high-order bit clear (7 bit US-ASCII
values, [US-ASCII]), and is thus deemed a mail-safe encoding
([RFC1642]). UTF-8, the object of this memo, uses all bits of an
octet, but has the quality of preserving the full US-ASCII range:
Yergeau Informational [Page 1]
RFC 2044 UTF-8 October 1996
US-ASCII characters are encoded in one octet having the normal
US-ASCII value, and any octet with such a value can only stand for a
US-ASCII character, and nothing else.

UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
into a pair of UCS-2 values from a reserved range. UTF-16 affects
UTF-8 in that UCS-2 values from the reserved range must be treated
specially in the UTF-8 transformation.

UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
octets, where the number of octets, and the value of each, depend on
the integer value assigned to the character in ISO 10646. This
transformation format has the following characteristics (all values
are in hexadecimal):
- Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
  correspond to octets 00 to 7F (7 bit US-ASCII values).
- US-ASCII values do not appear otherwise in a UTF-8 encoded
  character stream. This provides compatibility with file systems or
  other software (e.g. the printf() function in C libraries) that
  parse based on US-ASCII values but are transparent to other
  values.
- Round-trip conversion is easy between UTF-8 and either of UCS-4,
  UCS-2 or Unicode.
- The first octet of a multi-octet sequence indicates the number of
  octets in the sequence.
- Character boundaries are easily found from anywhere in an octet
  stream.
- The lexicographic sorting order of UCS-4 strings is preserved. Of
  course this is of limited interest since the sort order is not
  culturally valid in either case.
- The octet values FE and FF never appear.
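Two of the characteristics listed above can be illustrated with a
short sketch. The helper name char_start is hypothetical and not part
of this memo; the sketch assumes well-formed input and relies only on
the fact that octets after the first in a sequence have the form
10xxxxxx. The second half checks, on a small sample only, that sorting
the encoded octets gives the same order as sorting by character value.

```python
def char_start(data: bytes, index: int) -> int:
    """Return the index of the first octet of the character covering data[index].

    Hypothetical helper: steps left while the current octet looks like
    10xxxxxx, i.e. while it is not the first octet of a sequence.
    """
    while index > 0 and data[index] & 0xC0 == 0x80:
        index -= 1
    return index


# "A", U+2262, U+0391, "." encoded as 41 E2 89 A2 CE 91 2E
data = "A\u2262\u0391.".encode("utf-8")
print(char_start(data, 3))  # index 3 is the last octet of U+2262

# Sort order: Python compares str by code point and bytes by octet value.
words = ["z", "\u00e9", "\u65e5", "a"]
by_value = sorted(words)
by_octets = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]
print(by_value == by_octets)
```

A scan of this kind never needs to look back more than five octets,
since the longest sequence described in this memo is six octets.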
UTF-8 was originally a project of the X/Open Joint
Internationalization Group XOJIG with the objective to specify a File
System Safe UCS Transformation Format [FSS_UTF] that is compatible
with UNIX systems, supporting multilingual text in a single encoding.
The original authors were Gary Miller, Greger Leijonhufvud and
Entenmann. Later, Ken Thompson and Rob Pike did significant work on
the formal UTF-8.
A description can also be found in Unicode Technical Report #4
[UNICODE]. The definitive reference, including provisions for UTF-16
data within UTF-8, is Annex R of ISO/IEC 10646-1 [ISO-10646].
2. UTF-8 definition

In UTF-8, characters are encoded using sequences of 1 to 6 octets.
The only octet of a "sequence" of one has the higher-order bit set to
0, the remaining 7 bits being used to encode the character value. In
a sequence of n octets, n>1, the initial octet has the n higher-order
bits set to 1, followed by a bit set to 0. The remaining bit(s) of
that octet contain bits from the value of the character to be
encoded. The following octet(s) all have the higher-order bit set to
1 and the following bit set to 0, leaving 6 bits in each to contain
bits from the character to be encoded.
The table below summarizes the format of these different octet types.
The letter x indicates bits available for encoding bits of the UCS-4
character value.

UCS-4 range (hex.)    UTF-8 octet sequence (binary)
0000 0000-0000 007F   0xxxxxxx
0000 0080-0000 07FF   110xxxxx 10xxxxxx
0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx
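The two columns of the table can be read in either direction. The
sketch below is one possible way to do that; the function names
octets_needed and sequence_length are hypothetical and are not defined
by this memo.

```python
def octets_needed(value: int) -> int:
    """Number of octets for a UCS-4 value, read from the first column."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS-4 range")
    # Upper bound of each row of the table, in order.
    limits = (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF)
    return 1 + next(i for i, limit in enumerate(limits) if value <= limit)


def sequence_length(first_octet: int) -> int:
    """Length announced by the first octet, read from the second column."""
    if not 0 <= first_octet <= 0xFF:
        raise ValueError("not an octet")
    if first_octet < 0x80:
        return 1                      # 0xxxxxxx
    if first_octet < 0xC0:
        raise ValueError("10xxxxxx cannot start a sequence")
    if first_octet < 0xE0:
        return 2                      # 110xxxxx
    if first_octet < 0xF0:
        return 3                      # 1110xxxx
    if first_octet < 0xF8:
        return 4                      # 11110xxx
    if first_octet < 0xFC:
        return 5                      # 111110xx
    if first_octet < 0xFE:
        return 6                      # 1111110x
    raise ValueError("FE and FF never appear in UTF-8")
```

For instance, octets_needed(0x2262) and sequence_length(0xE2) both
give 3, matching the third row of the table.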
Encoding from UCS-4 to UTF-8 proceeds as follows:
1) Determine the number of octets required from the character value
   and the first column of the table above.
2) Prepare the high-order bits of the octets as per the second column
   of the table.
3) Fill in the bits marked x from the bits of the character value,
   starting from the lower-order bits of the character value and
   putting them first in the last octet of the sequence, then the
   next to last, etc. until all x bits are filled in.
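The three encoding steps above can be sketched as follows. This is a
minimal illustration, not a reference implementation: the name
encode_ucs4 is hypothetical, and the sketch covers the full 1 to 6
octet range described in this memo without any special handling of
the UTF-16 reserved range.

```python
def encode_ucs4(value: int) -> bytes:
    """Encode one UCS-4 value (0 to 7FFFFFFF) as 1 to 6 UTF-8 octets."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS-4 range")
    if value <= 0x7F:
        return bytes([value])         # 0xxxxxxx: the US-ASCII case

    # Step 1: number of octets, from the upper bound of each table row.
    limits = (0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF)
    n = 2 + next(i for i, limit in enumerate(limits) if value <= limit)

    # Steps 2 and 3: fill the last octet first, six bits at a time,
    # each under the 10xxxxxx marker.
    octets = [0] * n
    for i in range(n - 1, 0, -1):
        octets[i] = 0x80 | (value & 0x3F)
        value >>= 6

    # First octet: n bits set to 1, then a 0, then the remaining bits.
    marker = (0xFF << (8 - n)) & 0xFF
    octets[0] = marker | value
    return bytes(octets)


print(encode_ucs4(0x2262).hex(" ").upper())
```

Filling from the last octet backwards mirrors step 3 directly, so the
first octet receives whatever high-order bits remain.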
The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be
obtained from the above, in principle, by simply extending each
UCS-2 character with two zero-valued octets. However, UCS-2 values
between D800 and DFFF, being actually UCS-4 characters transformed
through UTF-16, need special treatment: the UTF-16 transformation
must be undone, yielding a UCS-4 character that is then
transformed as above.
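A possible sketch of this special treatment follows. The pairing
arithmetic comes from the definition of UTF-16 (a value in D800-DBFF
followed by one in DC00-DFFF), not from this memo, and the function
name ucs2_to_utf8 is hypothetical. For brevity the final step uses
Python's built-in UTF-8 codec rather than a hand-written encoder.

```python
def ucs2_to_utf8(units):
    """Encode a sequence of 16-bit values, undoing UTF-16 pairs first."""
    out = bytearray()
    it = iter(units)
    for unit in it:
        if 0xD800 <= unit <= 0xDBFF:
            # First half of a UTF-16 pair: the second half must follow.
            low = next(it, None)
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("first half of a pair without a second half")
            value = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("second half of a pair without a first half")
        else:
            value = unit              # ordinary UCS-2 value, used as is
        out += chr(value).encode("utf-8")
    return bytes(out)


print(ucs2_to_utf8([0x0041, 0xD800, 0xDC00]).hex(" ").upper())
```

Encoding the two halves of a pair separately would produce two
three-octet sequences instead of the single four-octet sequence that
the UCS-4 character requires.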
Decoding from UTF-8 to UCS-4 proceeds as follows:
1) Initialize the 4 octets of the UCS-4 character with all bits set
   to 0.
2) Determine which bits encode the character value from the number of
   octets in the sequence and the second column of the table above
   (the bits marked x).
3) Distribute the bits from the sequence to the UCS-4 character,
   first the lower-order bits from the last octet of the sequence and
   proceeding to the left until no x bits are left.

If the UTF-8 sequence is no more than three octets long, decoding
can proceed directly to UCS-2 (or equivalently Unicode).
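The decoding steps above can be sketched as follows. The name
decode_sequence is hypothetical. The sketch accumulates bits from the
first octet to the last, which yields the same value as distributing
them from the last octet leftwards, and it assumes the caller passes
exactly one sequence. It does not check whether a shorter sequence
could have encoded the same value.

```python
def decode_sequence(octets: bytes) -> int:
    """Decode exactly one UTF-8 sequence (1 to 6 octets) to a UCS-4 value."""
    if not octets:
        raise ValueError("empty sequence")
    first = octets[0]
    if first < 0x80:
        n, value = 1, first           # 0xxxxxxx
    elif first < 0xC0 or first >= 0xFE:
        raise ValueError("not a valid first octet")
    else:
        # Count the leading 1 bits of the first octet to get n.
        n = 2
        while first & (0x80 >> n):
            n += 1
        value = first & (0x7F >> n)   # keep only the bits marked x
    if len(octets) != n:
        raise ValueError("sequence length does not match first octet")
    for octet in octets[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("expected an octet of the form 10xxxxxx")
        value = (value << 6) | (octet & 0x3F)
    return value


print(f"{decode_sequence(bytes([0xE2, 0x89, 0xA2])):04X}")
```

Any result obtained from a sequence of at most three octets is no
larger than FFFF and therefore fits in a 16-bit UCS-2 value.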
A more detailed algorithm and formulae can be found in [FSS_UTF],
[UNICODE] or Annex R to [ISO-10646].
3. Examples

The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391,
002E) may be encoded as follows:

   41 E2 89 A2 CE 91 2E

The Unicode sequence "Hi Mom <WHITE SMILING FACE>!" (0048, 0069,
0020, 004D, 006F, 006D, 0020, 263A, 0021) may be encoded as follows:

   48 69 20 4D 6F 6D 20 E2 98 BA 21

The Unicode sequence representing the Han characters for the Japanese
word "nihongo" (65E5, 672C, 8A9E) may be encoded as follows:

   E6 97 A5 E6 9C AC E8 AA 9E
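The three examples can be recomputed with any UTF-8 encoder. The
sketch below uses Python's built-in codec as a stand-in; the helper
name hex_octets is hypothetical.

```python
def hex_octets(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex octets."""
    return " ".join(f"{octet:02X}" for octet in text.encode("utf-8"))


samples = [
    "A\u2262\u0391.",        # 0041, 2262, 0391, 002E
    "Hi Mom \u263A!",        # 0048 ... 263A, 0021
    "\u65E5\u672C\u8A9E",    # 65E5, 672C, 8A9E ("nihongo")
]
for sample in samples:
    print(hex_octets(sample))
```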
MIME registration

This memo is meant to serve as the basis for registration of a MIME
character encoding (charset) as per [RFC1521]. The proposed charset
parameter value is "UTF-8". This string would label media types
containing text consisting of characters from the repertoire of ISO
10646-1 encoded to a sequence of octets using the encoding scheme
outlined above.
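As an illustration of how such a label appears in practice, the
sketch below builds a text message with Python's standard email
package and asks for the UTF-8 charset. This shows one library's
behaviour, not anything specified by this memo; the library writes
the charset name in lower case, which is equivalent because MIME
charset names are not case-sensitive.

```python
from email.message import EmailMessage

# Build a text/plain body labelled with the UTF-8 charset.
msg = EmailMessage()
msg.set_content("Hi Mom \u263A!", charset="utf-8")

# The charset appears as a parameter of the Content-Type header.
print(msg["Content-Type"])
print(msg.get_content_charset())
```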
Security Considerations

Security issues are not discussed in this memo.

Acknowledgments

The following have participated in the drafting and discussion of
this memo:
James E. Agenbroad Andries
Martin J. Dürst David
Edwin F. Hart Kent
Markus Kuhn Michael
Alain LaBonte Murray
Keld Simonsen Arnold
Bibliography

[FSS_UTF]   X/Open CAE Specification C501 ISBN 1-85912-082-2 28cm.
            22p. pbk. 172g. 4/95, X/Open Company Ltd., "File
            System Safe UCS Transformation Format (FSS_UTF)", X/Open
            Preliminary Specification, Document Number P316. Also
            published in Unicode Technical Report #4.

[ISO-10646] ISO/IEC 10646-1:1993. International Standard --
            Information technology -- Universal Multiple-Octet Coded
            Character Set (UCS) -- Part 1: Architecture and Basic
            Multilingual Plane. UTF-8 is described in Annex R,
            adopted but not yet published. UTF-16 is described in
            Annex Q, adopted but not yet published.

[RFC1521]   Borenstein, N., and N. Freed, "MIME (Multipurpose
            Internet Mail Extensions) Part One: Mechanisms for
            Specifying and Describing the Format of Internet
            Message Bodies", RFC 1521, Bellcore, Innosoft,
            1993.

[RFC1641]   Goldsmith, D., and M. Davis, "Using Unicode with
            MIME", RFC 1641, Taligent, Inc., July 1994.
[RFC1642]   Goldsmith, D., and M. Davis, "UTF-7: A Mail-Safe
            Transformation Format of Unicode", RFC 1642,
            Taligent, Inc., July 1994.

[UNICODE]   The Unicode Consortium, "The Unicode Standard --
            Worldwide Character Encoding -- Version 1.0",
            Addison-Wesley, Volume 1, 1991, Volume 2, 1992. UTF-8 is
            described in Unicode Technical Report #4.

[US-ASCII]  Coded Character Set--7-bit American Standard Code for
            Information Interchange, ANSI X3.4-1986.
Author's Address

Francois Yergeau
Alis
100, boul. Alexis-
Suite 600
Montreal QC H4M 2P
Tel: +1 (514) 747-2547
Fax: +1 (514) 747-2561
EMail: fyergeau@alis.