As per Relevance of the word information, we have this rfc below:
Network Working Group C.
Request for Comments: 2130
Category: Informational C.
Preston &
K.
H.
R.
Cisco
M.
University of
P.
April 1997
The Report of the IAB Character Set Workshop
held 29 February - 1 March, 1996
Status of this Memo
This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
The authors would like to sincerely thank Information Sciences
Institute (ISI), and in particular Joyce K. Reynolds for
hosting this event; Joe Kemp and Jeanine Yamazaki of ISI made sure
the facilities met our needs. We also wish to thank the Internet
Society, which underwrote travel for participants who might not
otherwise have been able to attend. Of course, we also wish to thank
the many experts who participated in the workshop and on the mailing
list; a complete list of these people can be found in Appendix D.
Bunyip Information Systems was kind enough to provide mailing list
facilities for this work.
Table of Contents
0: Executive summary.......................................... 2
1: Introduction............................................... 3
2: Character sets on the Internet -- the problem.............. 3
2.1: Character set handling in existing protocols............... 4
3: Architectural model........................................ 6
3.1: Segments defined........................................... 7
3.2: On the wire................................................ 8
Weider, et. al. Informational [Page 1]
RFC 2130 Character Set Workshop Report April 1997
3.3: Determining which values of CCS, CES, and TES are used..... 9
3.4: Recommended Defaults....................................... 10
3.5: Guidelines for conversions between coded character sets.... 13
4: Presentation issues........................................ 14
5: Open issues................................................ 14
5.1: Language tags.............................................. 15
5.2: Public identifiers......................................... 16
5.3: Bi-directionality.......................................... 16
6: Security Considerations.................................... 16
7: Conclusions................................................ 16
8: Recommendations............................................ 17
8.1: To the IAB................................................. 17
8.2: For new Internet protocols................................. 18
8.3: For registration of new character sets..................... 18
Appendix A: List of protocols affected by character set issues... 20
Appendix B: Acronyms............................................. 23
Appendix C: Glossary............................................. 24
Appendix D: References........................................... 25
Appendix E: Recommended reading.................................. 27
Appendix F: Workshop attendee list............................... 29
Appendix G: Authors' Addresses................................... 30
This report details the conclusions of an IAB-sponsored
workshop held 29 February - 1 March, 1996, to discuss the use of
character sets on the Internet. It motivates the need to have
character set handling in Internet protocols which transmit text,
provides a conceptual framework for specifying character sets,
recommends the use of MIME tagging for transmitted text, recommends a
default character set *without* stating that there is no need for
other character sets, and makes a series of recommendations to the
IAB, IANA, and the IESG for furthering the integration of the
character set framework into text transmission protocols.
0: Executive summary
The term 'Character Set' means many things to many people. Even the
MIME registry of character sets registers items that have
differences in semantics and applicability. This workshop provides
guidance to the IAB and IETF about the use of character sets on the
Internet and provides a common framework for interoperability between
the many character sets in use there.
The framework consists of four components: an architecture model,
which specifies components necessary for on-the-wire transmission of
text; recommendations for tagging transmitted (and stored) text;
recommended defaults for each level of the model; and a set of
recommendations to the IAB, IANA, and the IESG for furthering the
integration of this framework into text transmission protocols.
The architectural model specifies 7 layers, of which only three are
required for on-the-wire transmission. The Coded Character Set is a
mapping from a set of abstract characters to a set of integers. The
Character Encoding Scheme is a mapping from a Coded Character Set (or
several) to a set of octets. The Transfer Encoding Syntax is a
transformation applied to data which has been encoded using a
Character Encoding Scheme to allow it to be transmitted. These layers
should be specified in a transmitted text stream by using the MIME
encoding mechanisms.
This report recommends the use of ISO 10646 as the default Coded
Character Set, and UTF-8 as the default Character Encoding Scheme in
the creation of new protocols or new versions of old protocols which
transmit text. These defaults do not deprecate the use of other
character sets when and where they are needed; they are simply
intended to provide guidance and a specification for
interoperability.
1: Introduction
This is the report of an IAB-sponsored invitational workshop on the
use of Character Sets on the Internet, held 29 February - 1 March
1996 at Information Sciences Institute (ISI) in Marina del Rey,
California. In addition, this report covers the discussion on the
mailing list up to and slightly beyond the workshop itself. The
goals of this workshop were to provide guidance to the IAB and the
IETF about the use of character sets on the Internet, and if possible
a common framework for interoperability between the many character
sets in use there. Both goals were achieved.
2: Character sets on the Internet - the problem
The term 'character set' is typically applied to the contents of a
wide variety of text transmission and display protocols used on the
Internet. Because the term is used to mean different things,
confusion has arisen. For example, the MIME registry of character
sets [MIME] contains items that may differ greatly in their
applicability and semantics in various Internet protocols.
In addition, there is a vast profusion of different text encoding
schemes in use on the Internet. This per se is not a problem; each
scheme has evolved to meet real needs. However, Internet
applications such as mail, directories, and the World Wide Web have
each developed different techniques for dealing with the
number of schemes. A robust information architecture for the
Internet requires as much interoperability between these techniques
as possible.
2.1: Related topics deemed out of scope for this workshop
Successful display of plain text transmitted over the Internet
requires a lot of information about the text itself, such as the
underlying character set, language, and so forth. An additional set
of formatting information is needed if the receiving application
wishes to use local (cultural) conventions when it presents the text
to the user. This formatting information provides the
data necessary to format certain types of textual data (dates and
times, numbers and monetary notation) into a form which is familiar
to the user. The POSIX [POSIX] notion of locale includes
language, coded character set and cultural conventions.
To avoid unfruitful discussion, and to make the best use of the time
available for the workshop, we declared the following issues out of
scope for the purposes of this workshop:
-
-
- culture (e.g. do we present the American or British spelling?)
- user interface
- internal representation of textual data
- included characters (why aren't certain characters available in
any character set?)
- locale (in the POSIX sense)
- font
-
- user input/output
- Han unification
There are some related issues which were included for discussion,
most importantly the 'locale' components necessary for transport and
identification of multilingual texts.
2.2: Character Set handling in existing protocols
One of the group's overriding concerns was that the framework
developed for character set handling not break existing protocols.
With that in mind, the way character sets are being used in existing
protocols was examined. See Appendix A for a list of those protocols
and some recommendations for change.
2.2.1: General
The problem areas here fall into three main categories: protocols,
identifiers, and data.
2.2.1.1: Protocols
The protocol machinery SHOULD NOT be changed; allowing, for instance,
SMTP [SMTP] to use both MAIL FROM and POST FRA is dangerous to the
protocols' stability. However, many protocols carry error messages
and other information that is intended for human consumption; it
MIGHT be an advantage to allow these to be localized into a different
language and character set, rather than staying in English and
US-ASCII [ASCII]. If this is done, new extensions should follow the
framework outlined below.
2.2.1.2: Identifiers
There is a strong statement of direction from the IAB, RFC 1958
[RFC-1958], which states:
4.3 Public (i.e. widely visible) names should be in case
independent ASCII. Specifically, this refers to DNS names,
and to protocol elements that are transmitted in text format.
...
5.4 Designs should be fully international, with support for
localization (adaptation to local character sets). In
particular, there should be a uniform approach to character
set tagging for information content.
In protocols that up to now have used US-ASCII only, UTF-8 [UTF-8]
forms a simple upgrade path; however, its use should be negotiated,
either by negotiating a protocol version or by negotiating charset
usage, and a fallback to a US-ASCII compatible representation such as
UTF-7 [UTF-7] MUST be available.
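As a rough illustration of the upgrade path and the 7-bit fallback,
the following Python sketch encodes one string in UTF-8 and in UTF-7
using Python's standard codecs. The helper name is_seven_bit and the
sample string are assumptions made for this example; they do not come
from the workshop report.

```python
def is_seven_bit(octets: bytes) -> bool:
    """Return True when every octet fits in 7 bits (the US-ASCII range)."""
    return all(octet < 0x80 for octet in octets)


# Hypothetical sample text containing two non-ASCII letters.
sample = "Gr\u00fc\u00dfe"

as_utf8 = sample.encode("utf-8")  # uses octets above 0x7F for non-ASCII
as_utf7 = sample.encode("utf-7")  # stays within US-ASCII compatible octets

# Pure US-ASCII text is unchanged by UTF-8, which is what makes it a
# simple upgrade path for protocols that were US-ASCII only.
ascii_only = "MAIL FROM".encode("utf-8")
```

In this sketch the UTF-8 form needs an 8-bit clean channel, while the
UTF-7 form can pass through a 7-bit one and still decodes back to the
same text.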
The need for passing application data such as language on individual
identifiers varies between applications; protocols SHOULD attempt to
evaluate this need when designing mechanisms. Applying the same
requirement for identifiers that are only used in a local context
(such as private mailbox folder names) is both unrealistic and
unreasonable; in such cases, methods for consistency in the use
of character set should be considered.
2.2.1.3: Data
Data that require character set handling include text, databases,
and HTML [HTML] pages, for example. In these the support for
multiple character sets and proper application information is
absolutely vital, and MUST be supported.
2.3: Architectural requirements
To address the issues enumerated for this work, first an
architectural model was created which establishes the components that
are required to fully specify the transmission of textual data. Many
of these components are already familiar to the users of Internet
protocols such as MIME. Not all of these are discussed in detail in
this report; we restrict ourselves primarily to those components
which are required to specify the 'on-the-wire' phase of text
transmission.
Mandating a single, all-encompassing character set would not fit well
with the IETF philosophy of planning for architectural diversity.
So, the best that can be done is to provide a common *framework* for
identifying and using the multitude of character sets available on
the Internet. It would be an advantage if the total number of Coded
Character Sets could be kept to a minimum. This framework should
meet the following requirements:
- it should not break existing protocols (because then the likelihood
of deployment is very small),
- it should allow the use of character sets currently used on the
Internet,
- it should be relatively easy to build into new protocols.
3: Architectural model
The basic architectural model which guided our discussions is shown
below. A distinction was made between those segments which were
necessary to successfully transmit character set data on-the-wire and
those needed to present that data to a user in a comprehensible
manner. The discussions were primarily restricted to those segments
of the model which specify the 'on-the-wire' transmission of textual
data.
User interface issues: these are briefly discussed in Section 3.1.1.
On-the-wire: see section 3.2 for detailed discussion.
Transfer Encoding Syntax
Character Encoding Scheme
Coded Character Set
3.1: Segments defined
3.1.1: User interface issues
3.1.1.1: Layout
Layout includes the elements needed for displaying text to the user,
such as font selection, word-wrapping, etc. It is similar to the
'presentation' layer in the 7-layer ISO telecommunications model
[ISO-7498].
3.1.1.2: Culture
Culture includes information about cultural preferences, which affect
spelling, word choice, and so forth.
3.1.1.3: Locale
The locale component includes the information necessary to make
choices about text manipulation which will present the text to the
user in an expected format. This information may include the display
of date, time and monetary symbol preferences. Notice that locale
modifications are typically applied to a text stream before it is
presented to the user, although they also are used to specify input
formats.
3.1.1.4: Language
This component specifies the language of the transmitted text. At
times and in specific cases, language information may be required to
achieve a particular level of quality for the purpose of displaying a
text stream. For example, UTF-8 encoded Han may require transmission
of a language tag to select the specific glyphs to be displayed at a
particular level of quality.
Note that information other than language may be used to achieve the
required level of quality in a display process. In particular, a
font tag is sufficient to produce identical results. However, the
association of a language with a specific block of text has
usefulness far beyond its use in display. In particular, as the
amount of information available in multiple languages on the World
Wide Web grows, it becomes critical to specify which language is in
use in particular documents, to assist automatic indexing and
retrieval of relevant documents.
The term 'language tag' should be reserved for the short identifier
of RFC 1766 [RFC-1766] that only serves to identify the language.
While there may be other text attributes intimately associated with
the language of the document, such as desired font or text direction,
these should be specified with other identifiers rather than
overloading the language tag.
3.2: On the wire
There are three segments of the model which are required for
completely specifying the content of a transmitted text stream (with
the occasional exception of the Language component, mentioned above).
These components are:
1) Coded Character Set,
2) Character Encoding Scheme,
3) Transfer Encoding Syntax.
Each of these abstract components must be explicitly specified by the
transmitter when the data is sent. There may be instances of
implicit specification due to the protocol/standard being used (e.g.
ANSI/NISO Z39.50). Also, in MIME, the Coded Character Set and
Character Encoding Scheme are specified by the Charset parameter to
the Content-Type header field, and Transfer Encoding Syntax is
specified by the Content-Transfer-Encoding header field.
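To make the MIME labelling concrete, here is a minimal Python sketch
using the standard library's email package. It builds a text part and
reads back the two labels described above. The function name
build_tagged_message and the sample text are assumptions for
illustration only.

```python
from email.message import EmailMessage


def build_tagged_message(text: str) -> EmailMessage:
    """Create a text/plain part labelled with its charset and its
    transfer encoding (hypothetical helper for this example)."""
    msg = EmailMessage()
    # charset labels the CCS and CES together; cte labels the TES.
    msg.set_content(text, charset="utf-8", cte="base64")
    return msg


msg = build_tagged_message("Gr\u00fc\u00dfe")

# Charset parameter of the Content-Type header field.
charset_label = msg.get_content_charset()
# Content-Transfer-Encoding header field.
transfer_label = msg["Content-Transfer-Encoding"]
```

A receiver that reads these two labels has what it needs to undo the
transfer encoding and then decode the octets, without guessing.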
3.2.1: Coded Character Set
A Coded Character Set (CCS) is a mapping from a set of abstract
characters to a set of integers. Examples of coded character sets
are ISO 10646 [ISO-10646], US-ASCII [ASCII], and ISO-8859
[ISO-8859].
3.2.2: Character Encoding Scheme
A Character Encoding Scheme (CES) is a mapping from a Coded Character
Set or several coded character sets to a set of octets. Examples of
Character Encoding Schemes are ISO 2022 [ISO-2022] and UTF-8 [UTF-8].
A given CES is typically associated with a single CCS; for example,
UTF-8 applies only to ISO 10646.
3.2.3: Transfer Encoding Syntax
It is frequently necessary to transform encoded text into a format
which is transmissible by specific protocols. The Transfer Encoding
Syntax (TES) is a transformation applied to character data encoded
using a CCS and possibly a CES to allow it to be transmitted.
Examples of Transfer Encoding Syntaxes are Base64 Encoding [Base64],
gzip encoding, and so forth.
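The three on-the-wire layers can be sketched in a few lines of Python.
This is only an illustration under assumptions: it uses Python's
Unicode code points to stand in for the ISO 10646 CCS, UTF-8 as the
CES, and Base64 as the TES. The function name encode_layers is
invented for the example.

```python
import base64


def encode_layers(text: str):
    """Apply the three layers in order and return each intermediate form."""
    # CCS: abstract characters -> integers (code points).
    code_points = [ord(ch) for ch in text]
    # CES: coded characters -> octets.
    octets = text.encode("utf-8")
    # TES: octets -> a form that a restricted transport can carry.
    transfer_form = base64.b64encode(octets)
    return code_points, octets, transfer_form


code_points, octets, transfer_form = encode_layers("\u00e9")
# code_points   -> [233]
# octets        -> b'\xc3\xa9'
# transfer_form -> b'w6k='
```

Decoding reverses the order: undo the TES first, then the CES, and
only then interpret the integers against the CCS.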
3.3: Determining which values of CCS, CES, and TES are used
To completely specify which CCS, CES, and TES are used in a specific
text transmission, there needs to be a consistent set of labels for
specifying which CCS, CES, and TES are used. Once the appropriate
mechanisms have been selected, there are six techniques for attaching
these labels to the data.
The labels themselves are named and registered, either with IANA
[IANA] or with some other registry. Ideally, their definitions are
retrievable from some registration authority.
Labels may be determined in one of the following ways:
- Determined by guessing, where the receiver of the text has to
guess the values of the CCS, CES, and TES. For example: "I got
this from Sweden so it's probably ISO-8859-1." This is
obviously not a very foolproof way to decode text.
- Determined by the standard, where the protocol used to transmit
the data has made documented choices of CCS, CES, and TES in the
standard. Thus, the encodings used are known through the
access protocol, for example HTTP [HTTP] uses (but is not
limited to) ISO-8859-1, SMTP uses US-ASCII.
- Attached to the transfer envelope, where the descriptive labels are
attached to the wrapper placed around the text for transport.
MIME headers are a good example of this technique.
- Included in the data stream, where the data stream itself has
been encoded in such a way as to signal the character set used.
For example, ISO-2022 encodes the data with escape sequences to
provide information on the character subset currently being used.
- Agreed by prior bilateral agreement, where some out-of-band
negotiation has allowed the text transmitter and receiver to
determine the CCS, CES, and TES for the transmitted text.
- Agreed to by negotiation during some phase, typically
initialization, of the protocol.
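Two of the techniques above, a label attached to the transfer envelope
and a default fixed by the protocol standard, can be combined as in
the following Python sketch. It uses the standard library's email
package to parse a Content-Type value. The function name
label_from_envelope and the choice of US-ASCII as the fallback are
assumptions for illustration.

```python
from email.message import Message


def label_from_envelope(content_type_value: str,
                        protocol_default: str = "us-ascii") -> str:
    """Return the charset label carried in a Content-Type value, or the
    protocol's documented default when no label is attached."""
    envelope = Message()
    envelope["Content-Type"] = content_type_value
    # get_content_charset lower-cases the label and returns failobj
    # when the charset parameter is absent.
    return envelope.get_content_charset(failobj=protocol_default)


labelled = label_from_envelope('text/plain; charset="ISO-8859-1"')
unlabelled = label_from_envelope("text/plain")
```

Note that the fallback here is a documented default, not a guess based
on where the text came from.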
3.3.1: Recommendations for value specification
While each of these techniques (with the exception of guessing) is
useful in particular situations, interoperability requires a
consistent set of techniques. Thus, we recommend that MIME
registered values be used for all tagging of character sets and
languages UNLESS there is an existing mechanism for determining the
required information using one of the other techniques (except
guessing). This recommendation will require a fair bit of work on
the part of protocol designers, implementors, the IETF, the IESG, and
the IAB.
However, it is important to point out that the MIME concept of
'charset' in some cases cuts across several layers of components of
our model. While this can be accepted in existing registrations, we
also recommend that the MIME registration procedure for character
sets be modified to show how a proposed character set deals with the
CCS and the CES. Most 'charsets' have a well defined CCS and CES;
they should merely be teased apart for the registration.
There are a number of other recommendations, but these will be
covered in the next sections.
3.4: Recommended Defaults
For a number of reasons, one cannot define a mandatory set of
defaults for all Internet protocols. There is a mass of existing
practice, future protocols are likely to have different purposes,
which may determine their handling of text, and protocols may need
specific variation support. For example, in mail, text is a
predominant data type and coded character sets then become an
issue for the protocol. Also, since e-mail is ubiquitous and users
expect to be able to send it to everyone, the mail protocols need to
be quite adept at handling different character set encodings. On the
other hand, if strings are seldom used in a given protocol, there is
no need to weigh the protocol down with a sophisticated apparatus for
handling multiple character sets, assuming that the default
character set can handle all the protocol's needs. This consideration
also applies to the specification techniques for character set
parameters. If only one character set encoding is needed, it can be
made explicit in the protocol specification. Protocols with a
greater need for character set support will need a more sophisticated
specification technique.
3.4.1: Clarity of specification
We recommend that each protocol clearly specify what it is using for
each of the layers of the transmission model. Users (or clients)
should never have to guess what the parameter is for a given layer.
3.4.2: Default Coded Character Set
The default Coded Character Set is the repertoire of ISO-10646.
3.4.3: Default Character Encoding Scheme
For text-oriented protocols, new protocols should use UTF-8, and
protocols that have a backwards compatibility requirement should use
the default of the existing protocol, e.g. US-ASCII for mail, and
ISO-8859-1 for HTTP. The recommended specification scheme is the
MIME "charset" specification, using the IANA "charset"
specifications. The MIME specifications will need to be clarified to
meet this model in the future.
For other protocols, the default should be UTF-8 as this encoding
allows US-ASCII to be entered as-is, and enables the full repertoire
of ISO 10646.
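The two properties claimed for UTF-8 here can be checked with a short
Python sketch. The helper name utf8_octets and the sample strings are
assumptions chosen for the example.

```python
def utf8_octets(text: str) -> bytes:
    """Encode text with the UTF-8 character encoding scheme."""
    return text.encode("utf-8")


ascii_command = "MAIL FROM"  # pure US-ASCII protocol text
euro_sign = "\u20ac"         # a character outside US-ASCII and ISO-8859-1

same_as_ascii = utf8_octets(ascii_command) == ascii_command.encode("ascii")
euro_octets = utf8_octets(euro_sign)  # multi-octet sequence b'\xe2\x82\xac'
```

US-ASCII text produces exactly the same octets under UTF-8, while
characters outside US-ASCII become multi-octet sequences whose octets
all have the high bit set.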
Some protocols, such as those descended from SGML [SGML], have other
natural notations for characters outside their "natural" repertoire;
for instance, HTML [HTML] allows the use of &#nnnn; to refer to any
ISO 10646 character. Note that this, like all other encodings that
depend on "escape characters", redefines at least one character from
the base character set for use as an indicator of "foreign"
characters. Use of this approach must be weighed very carefully.
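The cost of an escape-character notation can be seen with HTML numeric
character references (written &#nnnn;), using Python's standard html
module. The function name expand_references is an assumption for this
sketch.

```python
import html


def expand_references(markup: str) -> str:
    """Replace HTML character references with the characters they name."""
    return html.unescape(markup)


# A numeric reference reaches a character outside US-ASCII.
expanded = expand_references("caf&#233;")

# The price: "&" no longer stands for itself and must be escaped too.
escaped_ampersand = html.escape("R&D", quote=False)
restored = expand_references(escaped_ampersand)
```

Here the ampersand has been redefined as the indicator of a "foreign"
character, so literal ampersands must themselves be written as
references.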
3.4.4: Default Transport Encoding
There is no recommended default for this level. For plain text
oriented protocols, the bytestream transport format should be 8-bit
clean, possibly with normalization of end-of-line indicators. Some
special cases could be made for protocols that are not 8-bit clean,
such as encoding it for transport over 7-bit connections. For other
protocols, the same recommendation holds as above. The specification
technique should either be defined in the protocol, if only one way
is permitted, or by use of MIME content-transfer-encoding (CTE)
techniques, using IANA registered values.
3.4.5: Default Language
There is no recommended default for the language level. For human
readable text, there should always be a way to specify the natural
language. The specification technique should be a MIME header
with IANA registered values for languages. If headers are used, the
header should be 'Content-Language'.
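A minimal sketch of header-based language tagging, again using
Python's standard email package. The function name
tag_with_language and the language tag "sv" are assumptions for
illustration.

```python
from email.message import EmailMessage


def tag_with_language(text: str, language_tag: str) -> EmailMessage:
    """Build a text part and label it with a Content-Language header."""
    msg = EmailMessage()
    msg.set_content(text, charset="utf-8")
    # Set the language header after set_content, because set_content
    # clears any existing Content-* headers before writing its own.
    msg["Content-Language"] = language_tag
    return msg


swedish = tag_with_language("Hej", "sv")
language_label = swedish["Content-Language"]
```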
3.4.6: Default Locale
The default should be the POSIX locale. The specification technique
should use the Cultural register of CEN ENV 12005 [CEN] for its
values. If headers are used, the header should be 'Content-Locale'.
3.4.7: Default Culture
There is no recommended default for the Culture level. The
specification technique should be a MIME or MIME-like header
(e.g. Content-Culture) and should use the Cultural register of CEN
ENV 12005 for its values.
3.4.8: Default Presentation
There is no recommended default for the Presentation level. The
specification technique should be a MIME or MIME-like header
(e.g. Content-Layout) and use the glyph register of ISO 10036 or
other registers for its values.
3.4.9:
In some cases, text transmission may require the use of a number of
different values for a given parameter; for example,
annotation of Japanese text might well require shifting the
Content-Language parameter. The way to switch the value of parameters
within a single body of text depends on the application. For
instance, the HTML I18N [I18N] work defines a language attribute on
most of the
elements, including , , and , for the purpose
of switching between different languages. When only one value is
needed, this value should be as general as possible, and specified in
the protocol standard with reference to the IANA or other registered
value. All levels should be specified explicitly.
3.4.10:
Because stored text may very well be stored without any of the
additional information necessary for decoding, stored text SHOULD be
tagged in a MIME compliant fashion. This alleviates the problem of
being unable to interpret text which has been stored for a long time,
or text whose provenance is not available.
3.5: Guidelines for conversions between coded character sets
This section covers various algorithms to convert a source text S,
encoded in the coded character set CCS(S), to a target text T,
encoded in the coded character set CCS(T).
Rep(X) is the character repertoire of coded character set X, i.e. the
set of characters which can be represented with X.
3.5.1: Exact conversion
When Rep(CCS(S)) and Rep(CCS(T)) are equal or Rep(CCS(S)) is a subset
of Rep(CCS(T)), exact conversion is possible; i.e. T is equal to S.
The octets just need to be remapped. The algorithm for performing
this remapping is simple, if the IANA-registered definition tables
for CCS(S) and CCS(T) are available.
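A minimal sketch of exact conversion in Python follows. It relies on
Python's built-in codecs instead of IANA definition tables, and those
codecs combine the CCS and CES layers under a single charset name;
both points are assumptions of this example. The function name
convert_exact is invented for illustration.

```python
def convert_exact(octets: bytes, source: str, target: str) -> bytes:
    """Remap octets from the source charset to the target charset.

    Raises UnicodeEncodeError when a character of S is not in the
    repertoire of the target, i.e. when exact conversion is impossible.
    """
    text = octets.decode(source)  # octets -> characters
    return text.encode(target)    # characters -> octets, strict by default


# The ISO-8859-1 repertoire is a subset of ISO 10646, so this is exact.
latin1_octets = b"caf\xe9"
utf8_octets_out = convert_exact(latin1_octets, "iso-8859-1", "utf-8")

# The reverse direction fails for characters outside ISO-8859-1.
try:
    convert_exact("\u20ac".encode("utf-8"), "utf-8", "iso-8859-1")
    reverse_failed = False
except UnicodeEncodeError:
    reverse_failed = True
```

The strict error handling is deliberate: a failure signals that one of
the approximate methods is needed instead.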
3.5.2: Approximate conversion
In all other cases, any conversion creates a text T which differs
from S. There are different principles for how this
difference should be handled. A choice between them should be made
depending on the purpose and requirements of the conversion. Where
possible, the client application should be given mechanisms to
determine what has been done to the text.
3.5.2.1: Length-modifying conversion for human display
When the length of the target text T is allowed to differ from the
length of the source text S, one should use a conversion method in
which each source character is converted to one or several target
character(s), using a best resemblance criterion in the choice of the
target character(s).
Examples
LATIN CAPITAL LETTER [*] ->
COPYRIGHT SIGN [*] -> (c)
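A length-modifying conversion to US-ASCII might be sketched as below.
The table EXPANSIONS, the choice of LATIN CAPITAL LETTER AE as the
letter, and the "?" placeholder for unmapped characters are all
assumptions for this example; a real converter would need a far larger
reviewed table.

```python
# Hypothetical best-resemblance table for a US-ASCII target.
EXPANSIONS = {
    "\u00c6": "AE",   # LATIN CAPITAL LETTER AE (assumed example letter)
    "\u00a9": "(c)",  # COPYRIGHT SIGN
}


def convert_length_modifying(text: str, unknown: str = "?") -> str:
    """Replace each non-ASCII character with one or several ASCII
    characters; the result may be longer than the source."""
    pieces = []
    for ch in text:
        if ord(ch) < 0x80:
            pieces.append(ch)  # already in the target repertoire
        else:
            pieces.append(EXPANSIONS.get(ch, unknown))
    return "".join(pieces)


converted = convert_length_modifying("\u00a9 1996 \u00c6sop")
```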
3.5.2.2: Length-preserving conversion for human display
Where the text T must be presented and the length of T cannot differ
from the length of S, one should use a conversion method where each
source character is converted to one target character, using some
kind of best resemblance criterion in the choice of target character.
Examples
LATIN CAPITAL LETTER [*] ->
COPYRIGHT SIGN [*] ->
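The length-preserving variant maps every source character to exactly
one target character. In this Python sketch the table ONE_TO_ONE, the
chosen substitutes, and the "?" placeholder are assumptions for
illustration.

```python
# Hypothetical one-to-one best-resemblance table for a US-ASCII target.
ONE_TO_ONE = {
    "\u00c6": "A",  # LATIN CAPITAL LETTER AE (assumed example letter)
    "\u00a9": "c",  # COPYRIGHT SIGN
}


def convert_length_preserving(text: str, unknown: str = "?") -> str:
    """Replace each non-ASCII character with a single ASCII character,
    so the result has the same number of characters as the source."""
    return "".join(
        ch if ord(ch) < 0x80 else ONE_TO_ONE.get(ch, unknown)
        for ch in text
    )


source_text = "\u00a9 1996 \u00c6sop"
same_length = convert_length_preserving(source_text)
```

This suits fixed-width fields and column layouts, at the cost of a
cruder approximation than the length-modifying method.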
3.5.2.3: Conversion without data loss
Where the conversion of the text S into T must be completely
reversible, apply a Character Encoding Syntax or other
transformation method. This case is most frequently met in data
storage requirements.
Examples
LATIN CAPITAL LETTER [*] -> &
COPYRIGHT SIGN [*] -> &(
An alternate method can be used if the size of Rep(CCS(T)) >=
Rep(CCS(S)): for each character in Rep(CCS(S)) which is not
present in Rep(CCS(T)), define a mapping into a character in
Rep(CCS(T)) which is not present in Rep(CCS(S)).
Examples
LATIN CAPITAL LETTER [*] -> CYRILLIC CAPITAL LETTER [*]
COPYRIGHT SIGN [*] -> PARTIAL DIFFERENTIAL SIGN [*]
Note that conversion without data loss requires redefining one
member of T to indicate "the introduction of character data outside
T". This effectively adds another level of CES on top of CES(T).
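A reversible conversion into US-ASCII can be sketched as follows. The
escape notation used here ("&amp;" for the ampersand, "&#N;" with a
decimal code point for anything outside US-ASCII) and the function
names are assumptions chosen for the example, not a notation defined
by this report.

```python
import re


def encode_lossless(text: str) -> bytes:
    """Convert text to US-ASCII octets without losing information."""
    pieces = []
    for ch in text:
        if ch == "&":
            pieces.append("&amp;")  # the redefined member of T
        elif ord(ch) < 0x80:
            pieces.append(ch)
        else:
            pieces.append("&#%d;" % ord(ch))  # character outside T
    return "".join(pieces).encode("ascii")


_REFERENCE = re.compile(r"&(amp|#([0-9]+));")


def decode_lossless(octets: bytes) -> str:
    """Invert encode_lossless."""
    def expand(match):
        if match.group(1) == "amp":
            return "&"
        return chr(int(match.group(2)))
    return _REFERENCE.sub(expand, octets.decode("ascii"))


original = "R&D \u00a9 \u00c6 &#169;"
stored = encode_lossless(original)
recovered = decode_lossless(stored)
```

The ampersand is the one member of the target set that gives up its
plain meaning, which is the extra layer of encoding the note above
describes.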
4: Presentation issues
There are a number of considerations to make in selecting the
character set. One such consideration is the protocol's accessibility
to users with limited equipment (for example only ISO 8859-1 or a
keyboard without the ability to enter all the characters in ISO
10646). Alternative representation should be considered for such
users, both for input and output. Possible options for the
representation of characters that can not be displayed are
transliteration (a la CEN/TC304 or ISO TC46/SC2), RFC 1345
[RFC-1345] representative icons, or the WG2 short name (u+xxxx).
5: Open issues
In addition to the issues declared out of scope and enumerated in
section 2.1, the following issues are still open and will need to be
addressed in other forums. These issues (language tags, public
identifiers such as URL names, and bi-directionality) are briefly
discussed below as they repeatedly encroached on the discussion.
5.1: Language tags
Although the workshop decided not to explicitly address the so-called
"CJK issue", a few members felt it was necessary to have some
mechanism to address the problem of correct Han character display in
the ISO-10646 issue, and that saying that it was a "font issue" did
not suffice.
The "CJK issue" refers to the extended discussion about "Han
unification", the use of a single ISO-10646 codepoint to represent
multiple national variants of a Chinese (Han) character. ISO-10646
can map uniquely to any single CJK national character set, but in the
absence of additional information an application can not display
ISO-10646 text using the proper national variants for that text.
It was agreed that language tags would be sufficient to disambiguate
unified characters. There was not, in our opinion, a
technical difference between the use of different coded character
sets with overlapping codepoints, and a single coded character set
with language tags. Either way, the application has the
information to display the text properly.
It was observed that in contemporary usage of MIME charsets, a
language is implied as well as the coded character set and the
character encoding syntax. We agreed that this is an
overloading of MIME charsets.
To specify the language used in a particular block of text, we
recommend that the MIME tag "Content-Language" be used. There are a
number of questions about this approach that need to be worked out,
however:
- Is Content-Language: actually suitable?
- Is there an overload between this function and the other
intended functions of Content-Language: as described in RFC
1766?
- What, precisely, does "Content-Language: zh-tw, ja, ko, zh-cn"
mean in this context? We believe it means that, in drawing a
Han character, the Taiwanese variant (presumably traditional
Han) is preferred, followed by the Japanese, Korean, and
mainland Chinese (presumably simplified Han) variants. It does
*NOT* mean "mixed text containing Taiwanese, Japanese, Korean,
and mainland Chinese text with all the national variants for
each of these".
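Under the interpretation proposed above, the header value is an
ordered preference list. A small Python sketch of that reading
follows; the function names and the set of available glyph variants
are assumptions for illustration.

```python
def language_preferences(header_value: str) -> list:
    """Split a Content-Language value into an ordered list of tags,
    most preferred first."""
    return [tag.strip().lower()
            for tag in header_value.split(",")
            if tag.strip()]


def pick_variant(header_value: str, available: set):
    """Return the first preferred tag for which a glyph variant is
    available, or None when there is no match."""
    for tag in language_preferences(header_value):
        if tag in available:
            return tag
    return None


preferences = language_preferences("zh-tw, ja, ko, zh-cn")
# A hypothetical display system that only has these two variants.
chosen = pick_variant("zh-tw, ja, ko, zh-cn", {"ko", "ja"})
```

With only Japanese and Korean variants installed, the Japanese one is
chosen because it appears earlier in the list.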
Mixed CJK text, that simultaneously displays different glyphs
occupying the same codepoint, requires language tags embedded in the
data. Ohta and Handa propose in RFC 1554 [RFC-1554] a MIME charset
using ISO-2022 shifts between multiple coded character sets; in
effect this is an encoding that uses coded character sets for
displaying the appropriate glyphs.
There is some speculation that mixed CJK text is
relatively infrequent, and that therefore it is acceptable to require
that such text be represented using a rich text format that can
support language tags. In other words, that a simplifying assumption
can be made for TEXT/PLAIN in email using ISO-10646 that will not
require multiple display representations for the same codepoint. A
mechanism such as RFC 1554 could address this need if it is
important; although arguably RFC 1554 should really be identified as
TEXT/ISO-2022.
Note again that we recommend that support for language tagging
be built into new protocols, as this will become a critical part
of the automated indexing and retrieval in information systems
of the future.
5.2: Public identifiers
There is a considerable demand from the user community for the
ability to use non-ASCII characters in URL names, IMAP mailbox names,
file names, and other public identifiers. This is still an open
problem.
5.3: Bi-directionality
It was realized that a consistent framework for bi-directional text
was needed but there was no attempt to work on it in this workshop.
6: Security Considerations
There are no security considerations associated with character sets.
7: Conclusions
This paper provides a conceptual framework and a set of
recommendations which, if adopted, should provide a solid basis
for interoperability on the Internet. There are, however, a number of
open issues which will need to be addressed to provide ever better
use of text on the Internet.
8: Recommendations
8.1: To the IAB
There were a number of recommendations to the IAB about making the
standards process more aware of the need for character set
interoperability, and about the framework itself.
A: The IAB should trigger the examination of all RFCs to
the way they handle character sets, and obsolete or annotate
RFCs where necessary
B: The IESG should trigger the recommendation of procedures to
RFC editor to encourage RFCs to specify character set handling
they specify the transmission of text
C: The IAB should trigger the production of a perspectives
on the character set work that has gone on in the past and relate
to the current framework
D: Full ISO 10646 has a sufficiently broad repertoire, and scope for
further extension, that it is sufficient for use in Internet
Protocols (without excluding the use of existing alternatives).
There is no need for specific development of character set standards
for the Internet.
E: The IAB should encourage the IRTF to create a research group to
explore the open issues of character sets on the Internet. This group
should set its sights much higher than this workshop did.
F: The IANA (perhaps with the help of an IETF or IRTF group) should
develop procedures for the registration of new character sets for
use in the Internet.
G: Register UTF-8 as a Character Encoding Scheme for MIME.
H: The current use of the "x-*" format for distinguishing
experimental tags should be continued for private use among
consenting parties. All other namespaces should be allocated by IANA.
I: Application protocol RFCs SHOULD include a section called
"Multilingual Considerations".
J: Application Protocol RFCs SHOULD indicate how to transfer 'on the
wire' all characters in the character sets they use. They SHOULD also
specify how to transfer other information that applications may need
to know about the data.
K: The IESG should trigger a set of extensions to RFC 1522 to allow
language tagging of the free text parts of message headers.
8.2: For new Internet protocols
New protocols do not suffer from the need to be compatible with old
7-bit pipes. New protocol specifications SHOULD use ISO 10646 as the
base charset unless there is an overriding need to use a different
base character set.
New protocols SHOULD use values from the IANA registries when
referring to parameter values. The way these values are carried in
the protocols is protocol dependent; if the protocol uses
RFC-822-like headers, the header names already in use SHOULD be used.
For protocols with only a single choice for each component, the
protocol should use the most general specification and should be
specified with reference to the registered value in the protocol
standard.
Protocols SHOULD tag text streams with the language of the text.
8.3: For the registration of new character sets
Ned Freed will be releasing a new MIME registration document in
conjunction with this paper.
8.3.1: A definition table for a coded character set
A definition table for a coded character set A must, for each
character C that is in the repertoire of A, give:
a) if C is present in ISO 10646, the code value (in hexadecimal form)
for that character.
b) if C is not present in ISO 10646, but may be constructed using ISO
10646 combining characters, the series of code values (in
hexadecimal form) used to construct that character.
c) if C is not present in ISO 10646, a textual description of the
character, and a reference to its origin.
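As a rough illustration of cases a) to c), one row of such a
definition table could be modelled as below. This is only a sketch:
the names TableEntry and describe, the output layout, and the sample
code values are all hypothetical and are not part of any registration
procedure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TableEntry:
    """One row of a hypothetical definition table for a coded character set."""
    code: int                            # code value in the CCS being defined
    iso10646: Tuple[int, ...] = ()       # case a): one value; case b): a sequence
    description: Optional[str] = None    # case c): textual description
    origin: Optional[str] = None         # case c): reference to the origin


def describe(entry: TableEntry) -> str:
    """Render an entry, giving ISO 10646 code values in hexadecimal form."""
    if len(entry.iso10646) == 1:
        return "0x%02X -> %04X" % (entry.code, entry.iso10646[0])
    if len(entry.iso10646) > 1:
        seq = " ".join("%04X" % value for value in entry.iso10646)
        return "0x%02X -> %s (combining sequence)" % (entry.code, seq)
    return "0x%02X -> not in ISO 10646: %s [%s]" % (
        entry.code, entry.description, entry.origin)


# Sample rows; the code values on the left are made up for illustration.
direct = TableEntry(code=0xE9, iso10646=(0x00E9,))            # e with acute
composed = TableEntry(code=0xF0, iso10646=(0x0071, 0x0301))   # q + combining acute
private = TableEntry(code=0xFF, description="vendor logo",
                     origin="hypothetical vendor manual")
```

Keeping the ISO 10646 values as a tuple lets cases a) and b) share one
field, while case c) is recognisable by the tuple being empty.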
8.3.2: A definition of a character encoding scheme
A definition of a character encoding scheme consists of:
- A description of an algorithm which transforms every possible
sequence of octets to either a sequence of pairs <CCS, code
value> or to the error state "illegal octet sequence".
- Specifications, either by reference to CCS's registered by IANA or
in text, of each CCS upon which this CES is based.
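The shape of such an algorithm can be sketched for UTF-8, taken here
as a CES based on the single CCS ISO 10646. This is an illustration
only: the names CCS_NAME, IllegalOctetSequence and decode_utf8 are
assumptions, and the sketch leans on Python's built-in strict UTF-8
decoder, which follows the current definition of UTF-8 rather than
the 1993 amendment.

```python
from typing import List, Tuple

CCS_NAME = "ISO-10646"  # illustrative label for the underlying CCS


class IllegalOctetSequence(ValueError):
    """The error state for octets that the CES cannot interpret."""


def decode_utf8(octets: bytes) -> List[Tuple[str, int]]:
    """Map any octet sequence to <CCS, code value> pairs or to the error state."""
    try:
        text = octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise IllegalOctetSequence(str(exc)) from exc
    return [(CCS_NAME, ord(ch)) for ch in text]
```

Every input either yields a list of pairs or raises the error, so the
function is defined for every possible octet sequence.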
Appendix A: A-1: IETF Protocols
The following list describes how various existing protocols handle
multiple character set information.
Email
ESMTP
See 8.2. ESMTP makes it easy to negotiate the use of an alternate
language and encoding if it is needed.
Headers
RFC 1522 forms an adequate framework for supporting text; UTF-8
alone is not a possible solution, because the mail pathways are
assumed to be 7-bit 'forever'. However, RFC 1522 should be
extended to allow language tagging of the free text parts of
message headers.
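To make the header mechanism concrete, the sketch below encodes a
header value containing non-ASCII letters as an encoded-word, using
Python's standard email package. The sample text and variable names
are illustrative assumptions. Note that the encoded form names a
charset and an encoding but has no place for a language tag, which is
the gap described above.

```python
from email.header import Header, decode_header, make_header

# Sample header text with three non-ASCII Latin-1 letters (illustrative).
subject = "Bl\u00e5b\u00e6rsyltet\u00f8y"

# Encode as an encoded-word; the result contains only 7-bit characters.
encoded = Header(subject, charset="iso-8859-1").encode()

# A receiver reverses the process using the charset named in the header.
decoded = str(make_header(decode_header(encoded)))
```

Because the encoded value is pure US-ASCII, it can cross mail
pathways that are assumed to be 7-bit.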
Bodies
Selection of charset parameters for Email text bodies is
reasonably well covered by the charset= parameter on Text/* MIME
types. Language is defined by the Content-language header of
RFC 1766. Other information will have to be added using body
part headers; due to the way MIME differentiates between body
part headers and message headers, these will all have to have
names starting with Content- .
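A minimal sketch of a text body labelled in this way, using Python's
standard email package, is shown below. The subject, body text and
the language tag "no" are sample values chosen for illustration.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Test"

# set_content() adds a Content-Type header carrying the charset= parameter.
msg.set_content("Bl\u00e5b\u00e6rsyltet\u00f8y\n", charset="utf-8")

# The language tag travels in a separate header whose name starts with
# "Content-". It is added after set_content(), which resets Content-* headers.
msg["Content-Language"] = "no"
```

A receiver can then read the charset from the Content-Type header and
the language from Content-Language, independently of each other.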
POP
See 8.2. No strong tradition for negotiation of encoding in POP
exists.
NetNews
NNTP
These should be able to leverage off the mechanisms defined for
Email. One difference is that nearly all NNTP channels are 8-bit
clean; some NNTP newsgroups have a tradition of using 8-bit
charsets in both headers and bodies. Defining a character set
default on a per newsgroup basis might be a suitable approach.
RTCP
The identifiers carried as information about parties are already
defined to be in UTF-8.
FTP
See 8.2. The common use of welcome banners in the login response
means that there might be strong reason here to allow client and
server to negotiate a language different from the default for
greetings and error messages. This should be a simple protocol
extension.
Many fileservers now have the capability of using non-ASCII
characters in filenames, while the "dir" and "get" commands of FTP
are defined in terms of US-ASCII only. One possible solution
would be to define a "UTF-8" mode for the transfer of filenames
and directory information; this would need to be a negotiated
facility, with fallback to US-ASCII if not negotiated. The
important point here is consistency between all implementations;
a single charset is better here than the ability to handle
multiple charsets.
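The fallback rule can be sketched as follows. This is not an existing
FTP extension: the function name encode_filename and the flag
utf8_negotiated are hypothetical, and how the mode would actually be
negotiated is left open.

```python
def encode_filename(name: str, utf8_negotiated: bool) -> bytes:
    """Encode a filename for transfer under a hypothetical "UTF-8" mode."""
    if utf8_negotiated:
        # Both sides agreed on the single charset, so any name can be sent.
        return name.encode("utf-8")
    # Fallback: US-ASCII only. Refuse rather than silently mangle the name.
    try:
        return name.encode("ascii")
    except UnicodeEncodeError as exc:
        raise ValueError("filename needs the UTF-8 mode: %r" % name) from exc
```

Rejecting a non-ASCII name in the fallback case keeps the behaviour
consistent between implementations, instead of each one substituting
characters in its own way.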
World Wide Web
HTTP
See 8.2. The single-shot style of HTTP makes negotiation more
complex than it would otherwise be.
HTML
Internationalization of HTML [I18N] seems fairly well covered in
the current "I18N" document. It needs review to see if it needs
more specific details in order to carry application information
apart from the language.
URLs
URLs are "input identifiers", and powerful arguments should be
made if they are ever to be anything but US-ASCII.
IMAP
IMAP's information objects are MIME Email objects, and therefore
are able to use that standard's methods. However, IMAP mailbox
names are local identifiers; there is strong reason to allow
non-ASCII characters in these. A UTF-8 negotiation might be the
most appropriate thing; however, UTF-8 is awkward to use.
Unfortunately, UTF-7 isn't suitable because it conflicts with
popular hierarchy delimiters. The most recent IMAP work in
progress specification describes a modified UTF-7 which avoids
this problem.
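The delimiter conflict arises because the Base64 alphabet used inside
UTF-7 includes "/", a common hierarchy delimiter. The sketch below
shows an encoder for the modified form as it was later published for
IMAP4rev1 mailbox names: "&" opens a shifted run, "," replaces "/" in
the Base64 alphabet, padding is dropped, and a literal "&" is written
as "&-". The function name is an assumption, and the sketch does not
cover decoding.

```python
import base64


def encode_modified_utf7(name: str) -> str:
    """Encode a mailbox name in the modified UTF-7 used by IMAP4rev1."""
    out = []
    pending = []  # current run of characters outside printable US-ASCII

    def flush() -> None:
        # Emit the pending run as "&" + modified Base64 of UTF-16BE + "-".
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
            out.append("&" + b64.replace("/", ",") + "-")
            pending.clear()

    for ch in name:
        if " " <= ch <= "~":          # printable US-ASCII represents itself
            flush()
            out.append("&-" if ch == "&" else ch)
        else:
            pending.append(ch)
    flush()
    return "".join(out)
```

Since "/" never appears inside an encoded run, a server can still
split a name on the hierarchy delimiter without decoding it first.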
DNS
DNS names are the prime example of identifiers that need to be
in US-ASCII for global interoperability. However, some DNS
information, in particular TXT records, may contain
information (such as names) that is outside the ASCII range. A
single solution is the best; problems resulting from UTF-8 use
should be investigated.
WHOIS++
WHOIS++ version 1 is defined to use ISO 8859-1. The next version
will use UTF-8. The currently designed changes will also allow
the specification of individual attributes on attribute names;
these will make the passing of application information about the
values (such as language) easier. No immediate action seems
necessary.
WHOIS
This has been a stable protocol for so many years now that it
seems unwise to suggest that it be modified. Furthermore,
compatible extensions exist in RWHOIS and WHOIS++; modifications
should rather be made to these protocols than to the WHOIS
protocol itself.
Telnet
This is a prime example of a protocol where character set support
is necessary and nonexistent. The current work in progress on
character set negotiation in Telnet seems adequate to the task;
the question of passing other application data that might be
useful is still open.
A-2: Non-IETF protocols
For these protocols, the IETF does not have any power to change them.
However, the guidelines developed by the workshop may still be useful
as input to the further development of the protocols.
Gopher: Gopher, Gopher+
Prospero (Archie)
NFS:
CORBA, Finger, GEDI, IRC, ISO 10160/1, Kerberos, LPR, RSTAT, RWhois,
SGML, TFTP, X11, X.500, Z39.50
Appendix B: Acronyms
ASCII     American Standard Code for Information Interchange
CCS       Coded Character Set
CEN ENV   European Committee for Standardisation (CEN) European
          pre-standard (ENV)
CES       Character Encoding Scheme
CJK       Chinese Japanese Korean
CORBA     Common Object Request Broker Architecture
CTE       Content Transfer Encoding
DNS       Domain Name System
ESMTP     Extended SMTP
FTP       File Transfer Protocol
HTML      Hypertext Markup Language
I18N      Internationalization (or 18 characters between the first
          (I) and last (n) character)
IAB       Internet Architecture Board
IANA      Internet Assigned Numbers Authority
IESG      Internet Engineering Steering Group
IETF      Internet Engineering Task Force
IMAP      Internet Message Access Protocol
IRC       Internet Relay Chat
IRTF      Internet Research Task Force
ISI       Information Sciences Institute
ISO       International Organization for Standardization
MIME      Multipurpose Internet Mail Extensions
NFS       Network File System
NNTP      Network News Transfer Protocol
POSIX     Portable Operating System Interface
RFC       Request for Comments (Internet standards documents)
RPC       Remote Procedure Call
RSTAT     Remote Statistics
RTCP      Real-Time Transport Control Protocol
Rwhois    Referral Whois
SGML      Standard Generalized Markup Language
SMTP      Simple Mail Transfer Protocol
TES       Transfer Encoding Syntax
TFTP      Trivial File Transfer Protocol
URL       Uniform Resource Locator
UTF       UCS Transformation Format
Appendix C: Glossary
Bi-directionality - A property of some text where text written
right-to-left (Arabic or Hebrew) and text written left-to-right
(e.g. Latin) are intermixed in one and the same line.
Character - A single graphic symbol represented by a sequence of one
or more bytes.
Character Encoding Scheme - The mapping from a coded character set to
an encoding which may be more suitable for a specific purpose. For
example, UTF-8 is a character encoding scheme for ISO 10646.
Character Set - An enumerated group of symbols (e.g., letters,
numbers or glyphs).
Coded Character Set - The mapping from a set of integers to the
characters of a character set.
Culture - Preferences in the display of text based on cultural norms,
such as spelling and word choice.
Language - The words and combinations of words that constitute a
system of expression and communication among people with a shared
history or set of traditions.
Layout - Information needed to display text to the user, similar to
the presentation layer in the ISO telecommunications model.
Locale - The attributes of communication, such as language, character
set and cultural conventions.
On-the-wire - The data that actually gets put into packets for
transmission to other computers.
Transfer Encoding Syntax - The mapping from a coded character set
which has been encoded in a Character Encoding Scheme to an
encoding which may be more suitable for transmission using
specific protocols. For example, Base64 is a transfer encoding
syntax.
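The three layers defined in this glossary can be seen side by side in
a few lines. The sketch assumes ISO 10646 as the coded character set,
UTF-8 as the character encoding scheme and Base64 as the transfer
encoding syntax; the variable names and the sample character are
illustrative.

```python
import base64

text = "\u00e9"                              # LATIN SMALL LETTER E WITH ACUTE

# Coded Character Set: each character maps to an integer code value.
code_values = [ord(ch) for ch in text]       # [0xE9]

# Character Encoding Scheme: code values become octets (here UTF-8).
ces_octets = text.encode("utf-8")            # b"\xc3\xa9"

# Transfer Encoding Syntax: octets are made safe for a 7-bit channel.
tes_octets = base64.b64encode(ces_octets)    # b"w6k="
```

A receiver undoes the layers in the opposite order: Base64 first,
then UTF-8, which yields the code values again.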
Appendix D: References
[*] Non-ASCII character
[ASCII] ANSI X3.4:1986 "Coded Character Sets - 7 Bit American
National Standard Code for Information Interchange (7-bit ASCII)"
[Base64] Freed, N., and N. Borenstein, "Multipurpose Internet
Mail Extensions (MIME) Part One: Format of Internet Message
Bodies", RFC 2045, November 1996.
[CEN] see http://tobbi.iti.is/TC304/welcome.html for current status.
[HTML] Berners-Lee, T., and D. Connolly, "Hypertext Markup Language -
2.0", RFC 1866, November 1995.
[HTTP] Berners-Lee, T., Fielding, R., and H. Nielsen, "Hypertext
Transfer Protocol -- HTTP/1.0", RFC 1945, May 1996.
[I18N] Yergeau, F., et.al., "Internationalization of the Hypertext
Markup Language", RFC 2070, January 1997.
[IANA] Reynolds, J., and J. Postel, "Assigned Numbers", STD 2, RFC
1700, ISI, October 1994.
[ISO-2022] ISO/IEC 2022:1994, "Information technology -- Character
Code Structure and Extension Techniques", JTC1/SC2.
[ISO-7498] ISO/IEC 7498-1:1994, "Information technology - Open Systems
Interconnection - Basic Reference Model: The Basic Model".
[ISO-8859] Information Processing -- 8-bit Single-Byte Coded
Character Sets -- Part 1: Latin Alphabet no. 1,
ISO 8859-1:1987(E). Part 2: Latin Alphabet no. 2, ISO 8859-2:
1987(E). Part 3: Latin Alphabet no. 3, ISO 8859-3:1988(E).
Part 4: Latin Alphabet no. 4, ISO 8859-4, 1988(E). Part 5:
Latin/Cyrillic Alphabet ISO 8859-5, 1988(E). Part 6:
Latin/Arabic Alphabet, ISO 8859-6, 1987(E). Part 7: Latin/Greek
Alphabet, ISO 8859-7, 1987(E). Part 8: Latin/Hebrew Alphabet, ISO
8859-8-1988(E). Part 9: Latin Alphabet no. 5, ISO 8859-9, 1990(E).
Part 10: Latin Alphabet no. 6, ISO 8859-10:1992(E).
[ISO-10646] ISO/IEC 10646-1:1993(E), "Information technology --
Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
Architecture and Basic Multilingual Plane". JTC1/SC2, 1993.
[MIME] See [Base64]
[POSIX] Institute of Electrical and Electronics Engineers. "IEEE
standard interpretations for IEEE standard portable operating
systems interface for computer environments". IEEE Std 1003.1
-1988/Int, 1992 edition. Sponsor, Technical Committee on Operating
Systems of the IEEE Computer Society. New York, NY: Institute of
Electrical and Electronic Engineers, 1992.
RFC 1340 See [IANA]
[RFC-1345] Simonsen, K., "Character Mnemonics & Character Sets",
RFC 1345, Rationel Almen Planlaegning, June 1992.
[RFC-1554] Ohta, M., and K. Handa, "ISO-2022-JP-2: Multilingual
Extension of ISO-2022-JP", Tokyo Institute of Technology, ETL,
December 1993.
RFC 1642 See [UTF-7]
[RFC-1766] Alvestrand, H., "Tags for the Identification of Languages",
RFC 1766, UNINETT, March 1995.
[RFC 1958] Carpenter, B. (ed.) "Architectural Principles of the
Internet", RFC 1958, IAB, June 1996.
[SGML] ISO 8879:1986 "Information Processing - Text and Office Systems
- Standard Generalized Markup Language (SGML)"
[SMTP] Postel, J., "Simple Mail Transfer Protocol", STD 10, RFC 821,
August, 1982.
[Unicode] "The Unicode standard, version 2.0". Unicode Consortium.
Reading, Mass.: Addison-Wesley Developers Press, 1996.
[UTF-7] Goldsmith, D., and M. Davis, "UTF-7: A Mail Safe
Transformation Format of Unicode", RFC 1642, Taligent, Inc., July
1994.
[UTF-8] International Standards Organization, Joint Technical
Committee 1 (ISO/JTC1), "Amendment 2:1993, UCS Transformation
Format 8 (UTF-8)", in ISO/IEC 10646-1:1993 Information technology
- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
Architecture and Basic Multilingual Plane. JTC1/SC2, 1993.
Appendix E: Recommended reading
Alvestrand, H., "Tags for the Identification of Languages", RFC 1766,
UNINETT, March 1995.
Alvestrand, H., "X.400 Use of Extended Character Sets", RFC 1502,
SINTEF DELAB, August 1993.
Borenstein, N., "Implications of MIME for Internet Mail Gateways",
RFC 1344, Bellcore, June 1992.
Freed, N., and N. Borenstein, "Multipurpose Internet
Mail Extensions (MIME) Part One: Format of Internet Message
Bodies", RFC 2045, November 1996.
Chernov, A., "Registration of a Cyrillic Character Set", RFC 1489,
RELCOM Development Team, July 1993.
Choi, U., and K. Chon, "Korean Character Encoding for Internet
Messages", RFC 1557, KAIST, December 1993.
Freed, N., and N. Borenstein, "Multipurpose Internet Mail Extensions
(MIME) Part Two: Media Types", RFC 2046, November 1996.
Goldsmith, D., and M. Davis, "Transformation Format for Unicode",
RFC 1642, Taligent, Inc., July 1994.
Goldsmith, D., and M. Davis, "Using Unicode with MIME", RFC 1641,
Taligent, Inc., July 1994.
Jerman-Blazic, B. "Character handling in computer communication"
"user needs in information technology standards", Computer
Professional service, eds. C.D. Evans, B.L. Meed & R.S. Walker
P.C. Butterworth Heineman, 1993, Oxford, Boston, p. 102-129.
Jerman-Blazic, B. "Tool supporting the internationalization of
generic network services", Computer Networks and ISDN Systems
No. 27 (1994), p. 429-435.
Jerman-Blazic, B., A. Gogala and D. Gabrijelcic, "Transparent
processing: A solution for internationalization of
services", The LISA Forum Newsletter, 5 (1996) p. 12-21
Lee, F., "HZ - A Data Format for Exchanging Files of Arbitrarily Mixed
Chinese and ASCII Characters", RFC 1843, Stanford University,
August 1995.
McCarthy, J., "Arbitrary Character Sets", RFC 373, Stanford
University, July 1972.
Moore, K., "MIME (Multipurpose Internet Mail Extensions) Part Two:
Message Header Extensions for Non-ASCII Text", RFC 1522,
September 1993. (Obsoleted by RFC 2047.)
Moore, K., "MIME (Multipurpose Internet Mail Extensions) Part Three:
Message Header Extensions for Non-ASCII Text", RFC 2047,
University of Tennessee, November 1996.
Murai, J., Crispin, M., and E. van der Poel, "Japanese Character
Encoding for Internet Messages", RFC 1468, Keio University &
Panda Programming, June 1993.
Nussbacher, H., "Handling of Bi-directional Texts in MIME", RFC 1556,
Israeli Inter-University, December 1993.
Nussbacher, H., and Y. Bourvine, "Hebrew Character Encoding for
Internet Messages", RFC 1555, Israeli Inter-University
Hebrew University, December 1993.
Ohta, M., "Character Sets ISO-10646 and ISO-10646-J-1", RFC 1815,
Tokyo Institute of Technology, July 1995.
Postel, J., and J. Reynolds, "File Transfer Protocol (FTP)", STD 9,
RFC 959, ISI, October 1985.
Postel, J., and J. Reynolds, "Telnet Protocol Specification", STD 8,
RFC 854, ISI, May 1983.
Reynolds, J., and J. Postel, "Assigned Numbers", STD 2, RFC 1700,
ISI, October 1994. p.100-117.
Rose, M., "The Internet Message", Prentice Hall, 1992.
Simonsen, K., "Character Mnemonics & Character Sets", RFC 1345,
Rationel Almen Planlaegning, June 1992.