As per Relevance of the word extended, we have this rfc below:
Network Working Group H.
Request for Comments: 1502 SINTEF
August 1993
X.400 Use of Extended Character
Status of this
This RFC specifies an IAB standards track protocol for the
community, and requests discussion and suggestions for improvements
Please refer to the current edition of the "IAB Official
Standards" for the standardization state and status of this protocol
Distribution of this memo is unlimited
1.
Since 1988, X.400 has had the capacity for carrying a large number
different character sets in a message by using the body
"GeneralText" defined by ISO/IEC 10021-7.
Since 1992, the Internet also has the means of passing
messages containing multiple character sets, by using the
defined in RFC-MIME
This RFC defines a suggested method of using "GeneralText" in
to harmonize as much as possible the usage of this body part
2. General
2.1.
The target of this memo is to define a way of using
standards to achieve
(1) in the short term, a standard for sending E-mail in
European languages (Latin letters with European accents
Greek and Cyrillic
(2) in the medium term, extending this to cover the Hebrew
Arabic character
(3) in the long term, opening up true international E-mail
allowing the full character set specified in ISO-10646 to
used
Alvestrand [Page 1]
RFC 1502 X.400 Use of Extended Character Sets August 1993
The author believes that this document gives a specification that
easily accomodate the use of any character set in the ISO registry
and, by giving guidance rules for choosing character sets, will
interworking
2.2. Families of character
2.2.1. ISO 6937/T.61
ISO 6937 is a code technique used and recommended in T.51 and T.101
(Teletex and Videotex service) and in X.500, providing a
of 333 characters from the Latin script by use of non-
diacritical marks. It corresponds closely to CCITT
T.61.
The problem with that technique is that the character stream comes
two modes, i.e., some characters are coded with one byte and
with two (composite characters). This makes information
systems such as an E-mail UA or GW more complex
It is also not extensible to other languages like Korean or Chinese
or even Greek, without invoking the character set
techniques of ISO 2022.
2.2.2. ISO 8859
ISO 8859 defines a set of character sets, each suitable for use
some group of languages. Each character in ISO 8859 is coded in
single byte
There are currently 11 parts of ISO 8859, plus a "supplementary" set
registered as ISO IR 154. Most languages using single-byte
can be written in one or another of the ISO 8859 sets. There
sets covering Greek, Hebrew and Arabic, but there is
controversy over the problem of the rendering direction for
and Arabic
All the ISO 8859 sets include US-ASCII as a subset. All use 8 bits
ISO 8859 is regarded by many as a solution; for instance, the
windows system now comes with ISO-8859-1 as the "standard"
set, with the possibility of specifying others. But since the
applications often do not support character set switching
text, it is problematic to use these in a truly
environment. (Also, most fonts claiming to be "ISO- 8859-1" in X11R
are actually 7-bit fonts. The implied lie is very unfortunate.)
Alvestrand [Page 2]
RFC 1502 X.400 Use of Extended Character Sets August 1993
It turns out to work fine, however, if the second language
English, since this can be written in all ISO 8859 sets
The parts 3 and 4 have not seen wide acceptance, and it is
that they will be discarded. They should therefore not be used
Note that an ISO 8859 set is actually 2 sets in the ISO sense: US
ASCII in the G0 set and another character set in the G1 set.
overloading of the word "character set" is unfortunate,
traditional
2.2.3. ISO 10646
At the moment of writing, ISO 10646 has just been accepted as
International Standard. It is basically a 32-bit character set,
all of the currently used characters being numbered by the first 16
bits, leaving some room for expansion
It is not possible to use ISO 10646 as a normal character set
because it does not conform to the rules for usage of byte values
down in ISO 2022 and other places; it uses the "control space"
(parts of) graphic character codes
There are a number of ways to encode ISO 10646 characters "on
wire". There are methods within the ISO 2022 standard to switch
these, either as "other coding system without return" or as "
coding system with return" (that is, you can go back from it to
one you came from using an ISO 2022 escape sequence).
The following registrations have been made
ISO 10646 UCS-2 Level 1 has been registered with ESC 2/5 2/15 4/0,
ISO 10646 UCS-4 Level 1 has been registered with ESC 2/5 2/15 4/1,
The following are applied for
Reg# Escape sequence Standard/Sponsor
174 ESC 2/5 2/15 F ISO/IEC 10646 UCS-2, Level 2
175 ESC 2/5 2/15 F ISO/IEC 10646 UCS-4, Level 2
176 ESC 2/5 2/15 F ISO/IEC 10646 UCS-2, Level 3
177 ESC 2/5 2/15 F ISO/IEC 10646 UCS-4, Level 3
178 ESC 2/5 F ISO/IEC 10646 UTF-1
<< NOTE: The registration numbers for UCS-2 level 1 and UCS-4
level 1 are not known. Neither are the assigned final
for the other sets. Information requested!>>
Alvestrand [Page 3]
RFC 1502 X.400 Use of Extended Character Sets August 1993
This character set will become very important in the future, but
the moment, few systems are able to support this directly
The GeneralText body part can be used for carrying any of
character sets
2.3. Body parts that can be used in X.400
At the moment, no established way of transferring a full set
characters in X.400-based E-mail exists. In the future, it is
that a new body part, based in ISO 10646, will be available,
GeneralText may be able to use ISO 10646, but this matter has not
been clarified
In the short term, the deployed and available body parts are
(1) IA5
(2) For X.400/84: ISO6937Text and
(3) For X.400/88:
IA5Text is the method of choice for E-mail that contains
characters from IA5 (equivalent to US-ASCII).
The ISO6937Text body part is defined in the ISO DIS
corresponding to X.400(84) [10]; these never became a standard,
they are now quite difficult to find. It is in principle limited
using text that can be presented in ISO 6937, but since ISO 6937
refers to the ISO 2022 method of changing character sets, it
theoretically possible to use any ISO registered character set,
there is no facility for announcing the character sets used.
makes interworking with equipment that does not support the
character sets complex
It is still, however, the only body part suitable for carrying non
paginated text in non-basic character sets in X.400(84).
Teletex, which is identical in all versions of the X.400 standard
has the same problem of implicit ISO6937, but has the added
that it also specifies a page format, with, for instance, a
margin of 5 character positions. This is often not desirable
The details of Teletex are specified in recommendation T.51 and
relatives
GeneralText is defined in ISO 10021-8, the part of [9]
corresponds to CCITT recommendation [11]. It is an Extended
Alvestrand [Page 4]
RFC 1502 X.400 Use of Extended Character Sets August 1993
part, so no modification to CCITT implementations is needed to
it
GeneralText is suitable for interchange, since it has got
announcement facilities. It can use any number of character sets,
announces them both in the Encoded Information Types of the X.400
envelope and the parameters of the body part
We recommend this body part for carrying unformatted text
X.400/88.
3. GUIDELINES FOR THE GENERATION OF
3.1. Formal definition of
A GeneralText message is a byte stream that contains characters
character switching sequences according to [12].
The X.400 ASN.1 definition of the GeneralText body part is
general-text-body-part EXTENDED-BODY-PART-
PARAMETERS GeneralTextParameters IDENTIFIED BY id-ep-general-
DATA
::= id-et-general-
GeneralTextParameters ::= SET OF
CharacterSetRegistration ::= INTEGER (1..32767)
GeneralTextData ::=
The definition is from ISO/IEC 10021-7 [9], Annex I,
modifications made in the MHS Implementor' Guide, version 8,
3.6.3, bullet F130. It does not appear in the CCITT version of
standards
3.2. Brief description of ISO 2022 character set
There are 4 graphic character sets active at any time in
GeneralText message, called G0, G1, G2 and G3. In addition, there
2 control character sets, called C0 and C1.
At any moment, one of the sets G0-G3 is active in code positions 2/1
to 7/14, and another is active in code positions 10/0 to 15/15.
setting is achieved by so-called "locking shift" sequences
(Formally, code positions 2/0 and 7/15 are reserved for "space"
"DEL" respectively, and only 94-character character sets can be
Alvestrand [Page 5]
RFC 1502 X.400 Use of Extended Character Sets August 1993
in G0. In practice, this restriction is sometimes ignored
Single characters from the non-active sets may be invoked by the
of "single shift" sequences
The control character sets always occupy the code positions 0/0
1/15 (C0) and 8/0 to 9/15 (C1).
The character sets currently active as G0-G3 and C0-C1 may be
using "character set designating sequences".
At the beginning of a GeneralText message, one must always
that set 2 (IA5) is active as G0, shifted into the lower half,
set 1 (standard) is active as C0, and that no G1-G3 or C1 set
invoked. This is specified in the definition of "GeneralString"
[5], the definition of ASN.1 encoding (section 23.5.2).
If this is not a suitable initial state, a message must always
with the necessary announcers and escape sequences to designate
invoke the character sets that are actually used. The character
in use may be changed later in the message by use of
sequences
The parameters of a GeneralText message always list all the
sets used, by quoting their ISO reference numbers
It is impossible to use a character set not registered with ISO in
GeneralText message
It is also impossible to decide on the true meaning of a byte in
GeneralText message without scanning the whole message for shift
escape sequences
3.3. How to use the character
RECOMMENDATION
When the text to be rendered is representable in one of the
sets of ISO-8859, the G0 set should be set to ISO 646
Reference Version (1991), also called US-ASCII, ISO-IR-6.
The older character set ISO-IR-2, ISO 646 IRV(1983), should NOT
used. This means that the escape sequence ESC 2/8 4/2 (
US-ASCII as G0) should always occur at the beginning of the message
The G1 set should be set to the character set identified by
relevant ISO-8859 part. G2 and G3 are not used
Alvestrand [Page 6]
RFC 1502 X.400 Use of Extended Character Sets August 1993
This corresponds to the first level of ISO 4873 usage
For the currently defined parts of ISO 8859, the character
designations for the G1 set are (relative to ISO 8859:1987):
Part ISO IR name Escape sequence
for G1
1 ISO-IR-100 Esc 2D 41 West Europe (Latin-1)
2 ISO-IR-101 Esc 2D 42 East European (Latin-2)
3 ISO-IR-109 Esc 2D 43 (Latin-3)
4 ISO-IR-110 Esc 2D 44 (Latin-4)
5 ISO-IR-144 Esc 2D 4C
6 ISO-IR-127 Esc 2D 47
7 ISO-IR-126 Esc 2D 46
8 ISO-IR-138 Esc 2D 48
9 ISO-IR-148 Esc 2D 4D Turkish (Latin-5)
10 ISO-IR-157 Not listed Sami (Latin-6)
The escape sequence for 8859-10 (Latin-6) is not listed in RFC 1345.
NOTE: The use of ISO 8859-3 and ISO 8859-4 is NOT recommended
other possibilities exist
NOTE: There is a debate about the Arabic and Hebrew character sets
These languages are normally read right to left, but encodings
been done in both "visual" (left to right) and "phonetic" (right
left) ordering, there is significant disagreement about what
"right" way to do it is, and the character sets mentioned do
specify it. So, one should be careful not to use these character
until a standard is agreed upon, or the result will probably
unreadable (siht ekil).
(Note that there is some confusion as to what parts are
standardized; the Norwegian standards institute reports that
part 1, 2, 3, 4, 6, 7 and 8 are currently standards. Other
claim that both 8859-10 and 8859-11 are standards, and I
think that 8859-9 is.)
NOTE: ISO has not ruled out the possibility of changing the ISO 8859
standard. This would involve changing the registry information
this table, so this should be assumed valid for ISO 8859
that are current in 1993.
The G1 set should be permanently shifted into the upper half of
code page
Alvestrand [Page 7]
RFC 1502 X.400 Use of Extended Character Sets August 1993
When the text is not representable in one of the ISO-8859
sets, the following rules may be applied
(1) If any Latin characters are used, keep IA5 as the G0 set
(2) If a mainstream character set is used (Greek, Cyrillic
Hebrew, Arabic), designate this as the G1 character set
and permanently shift this into the upper half of the
page (LS1R).
EXCEPTION: The Japanese community has a long tradition
switching between the Japanese 16-bit character
ISO-IR-87 and US-ASCII as the G0 set. See [7]
for details. If ISO-IR-87 is used, that technique should
used instead of the one recommended here
(3) If occasional extensions to a character set that
basically Latin occur (like accents, national
and so on), and these are available in a single
set, designate the relevant set as G2 and use
shift (SS2) to invoke characters from this character set
The ISO 8859 supplementary set, ISO-IR-154, is
for this purpose
This corresponds to the ISO 4873 "second level" application
(4) If two non-Latin character sets are used, the second
be designated as G3, and shifted into the upper half of
code page by the use of Locking Shift 3 Right (LS3R).
This corresponds to the ISO 4873 "third level" application
(5) If avoidable, use of character sets with floating accents
like ISO 6937, should be avoided
(6) The shifts changing the lower half of the code table (SI/SO
LS2 and LS3) should NOT be used
RATIONALE: Keeping the G0 set reserved for US-ASCII will ensure
text in US-ASCII has the same bit representation always
The use of the upper code page for other scripts ensures that
text in these languages and text of this type mixed with English
be represented without the use of shift sequences
If the language and/or content of a text is completely unknown
chapter 5 gives an algorithm that may be used to decide upon
character sets. This might, for instance, be suitable for use
Alvestrand [Page 8]
RFC 1502 X.400 Use of Extended Character Sets August 1993
automatic mail gateways
NOTE: At the time of this writing, few applications that use ISO 4873
level 2 and level 3 encoding exist. It has been estimated
implementing them in an application that already uses a
repertoire of characters is a matter of programmer-days,
programmer-months, but this has not been proven
4. GUIDELINES FOR THE RENDERING OF
As a basic rule, one should NOT assume that any of the rules
are followed
An user agent capable of rendering GeneralText should
(1) ALWAYS be able to identify and render characters in IA5,
matter how they are designated and invoked
(2) ALWAYS be able to identify and render characters in
"native" character sets, no matter how they are
and invoked
(3) ALWAYS indicate the presence of characters that cannot
adequately represented on the current output device
(4) NEVER render a character in an unknown or
character set by displaying the character in the same
position in the native character set
(5) PREFERABLY be able to identify and render characters that
the same as characters in the "native" character sets,
though they are designated and invoked as part of
character sets. This applies in particular to
"invariant" part of ISO 8859, parts 1 through 6.
(6) PREFERABLY be able to combine the floating accents of
6937 with their base characters for suitable rendering
the capabilities of the current output device
(7) PREFERABLY be able to display text both in a mode
fallbacks for nonrenderable characters and in a
designating nonrenderable characters as such
(8) PREFERABLY be able to save the content of a
message to a file or other suitable media, saving
character set information, for later processing by
means. It is not illegal to render the character
information into a different format; however, it should
Alvestrand [Page 9]
RFC 1502 X.400 Use of Extended Character Sets August 1993
noted that it is easy to lose vital information if the
chosen for representing character sets does not offer
possibility of referencing all character sets in the
registry of character sets
These requirements also apply to gateways that transform the
into some other format, for example a gateway that transforms
message into MIME using [7] for the purpose
5. RECOMMENDATION FOR SELECTION OF CHARACTER
5.1. Algorithm for selection of character
When one has text in which characters from several character
occurs, and wants to process this into a GeneralText document, it
often hard to guess right at the character sets to select
The following paragraphs give an algorithm that can be started at
beginning of a message, and at the end of it, return a set
character sets that can be used as G0..G3 character sets, OR
indication that the task is impossible
VARIABLES
The set of character sets that MUST be used for this
The set of character sets that MAY be used for this message
Each set also contains a counter for each character position
The set of all the character sets known to be usable in
destination format
ALGORITHM
1) Add IA5 (ISO-IR-6) to the UsedSets (as G0).
2) Get the next character of the text. If the text
completely analyzed, go to
3) If it is in the UsedSets, go to 2).
4) Find the set of character sets from PossibleSets in which
character occurs. If it does not occur in any,
failure
Alvestrand [Page 10]
RFC 1502 X.400 Use of Extended Character Sets August 1993
5) If it is in a single character set in PossibleSets only,
this set to UsedSets, and go to 2).
6) If it is in more than one character set, add these
PossibleSets (if not already present), and increment
counter for that character in all the sets. Go to 2).
FINISHED
1) (FINAL SELECTION) Remove any character set in UsedSets
PossibleSets
Zero the counters for any character in PossibleSets that
occurs in UsedSets
WHILE (more characters left
Select one character set and move it from PossibleSets
UsedSqets
Zero the counters for all characters in this set in the
PossibleSets
END
This step can be "tuned" any way you want, for instance
choosing the character sets most likely to be understood
the destination first, choosing the character sets
the most characters first, avoiding multi-byte character
as long as possible, or any other scheme suitable for
application
5.2. WHAT TO DO ON
Failure will occur in this schema if a character is found that is
in the PossibleSets. It may then be handled in one of the
ways
(1) Replace the character with the SUB control
(2) Replace the character with Keld Simonsen Mnemonics [8].
This is a reversible transformation as long as
recipient is aware that it has been used, but
passing out-of-band information to indicate this
(3) Replace the lost characters with any suitable fallback
mnemonic scheme intended for human
(4) Bounce the message/refuse the conversion/give up
The action to be taken may be different based on the percentage
"lost" characters
Alvestrand [Page 11]
RFC 1502 X.400 Use of Extended Character Sets August 1993
If the message has "controls" like "conversion with loss prohibited",
only the last possibility may be used
5.3. RECOMMENDED CHARACTER
There are 2 steps in the algorithm above that are left for
judgement
(1) Selection of the sets to appear in PossibleSets
(2) The algorithm for deciding which character set to select
step 9.
In the context of generating X.400 GeneralText messages,
following is recommended
Sets in PossibleSets
ISO-IR-6 Esc 28 42 (G0) US-ASCII, IA5, ISO646
ISO-IR-100 Esc 2D 41 (G1) ISO-8859-1 West
ISO-IR-101 Esc 2D 42 (G1) ISO-8859-2 Central/Eastern
ISO-IR-144 Esc 2D 4C (G1) ISO-8859-5
ISO-IR-127 Esc 2D 47 (G1) ISO-8859-6
ISO-IR-126 Esc 2D 46 (G1) ISO-8859-7
ISO-IR-138 Esc 2D 48 (G1) ISO-8859-8
ISO-IR-148 Esc 2D 4D (G1) ISO-8859-9
The following multi-byte character sets are recommended
ISO-IR-87 (Japanese JIS C6226-1983) Esc 24 29 42 (G1)
ISO-IR-149 (Korean KS C 5601-1989) Esc 24 29 43 (G1)
ISO-IR-58 (Chinese GB 2312-80) Esc 24 29 41 (G1)
It is a STRONG recommendation that character sets not listed above
which do not add any new characters to the total set of
given by the character sets above, should NOT be used in X.400
interchange
ISO-IR-87 is the Japanese character set that is allowed in a
string, such as the subject field
NOTE: ISO-IR-87 has been "superseded" by ISO-IR-168, which allows
extra Kanji characters. Any application that handles ISO-IR-87
also be able to handle ISO-IR-168.
Algorithm for selecting character sets
Start at the top of the list above, and add each set only if it
needed
Alvestrand [Page 12]
RFC 1502 X.400 Use of Extended Character Sets August 1993
6.
[1] Information technology - ISO 8-bit code for
interchange - Structure and rules for implementation,
edition, 1991-12-15.
[2] Information technology - 8-bit single-byte coded
character sets (parts 1-11; the parts have different dates,
ones referenced here are from RFC 1345).
[3] Information technology - Coded graphic character set for
communication (parts 1 and 2; part 2 dated 1983-12-15).
[4] Code for the representation of names of languages. 1988 version
[5] CCITT Recommendation X.209(1988): Specification of
Encoding Rules for Abstract Syntax Notation One (ASN.1).
Technically aligned with ISO 8825 and ISO 8825/AD 1.
[6] Information Technology - Universal Multiple-Octet
Character Set (UCS) - ISO 10646.
[7] Murai, J., Crispin M., and E. van der Poel, "Japanese
Encoding for Internet Message Bodies", RFC 1468,
University, Panda Programming, June 1993.
[8] Simonsen, K., "Character Mnemonics & Character Sets", RFC 1345,
Rationel Almen Planlaegning, June 1992.
[9] Information Technology - Text communication - Message-
Text Interchange Systems (MOTIS) - ISO 10021 - October 1988.
[10] ISO DIS documents describing X.400/84 with slight extensions
Now very hard to get copies of, since they failed to
ISes
[11] CCITT Recommendation X.420 (1988), Interpersonal
System
[12] International Standard--Information Processing-- ISO 7-bit
8-bit coded character sets--Code extension techniques,
2022:1986.
7. Security
Security issues are not discussed in this memo
Alvestrand [Page 13]
RFC 1502 X.400 Use of Extended Character Sets August 1993
8. Author's
Harald Tveit
SINTEF
N-7034
EMail: Harald.Alvestrand@delab.sintef.
Alvestrand [Page 14]
if you see any problems within the linking, don't worry be happy,
this is version 0.1 of the Relevance System and you gotta expect some crappy subroutines sometimes,
just be content we did not write this in Java, which would have made this "bigger and better" HAHAHHA.
RFC documents can be found at I.E.T.F.
Relevance System Copyright © 2002 Spectrum WorldResearch
other technical nosh by ServerMasters Corporation
collaboration of BobX