Network Working Group K. Whistler
Request for Comments: 2482
Category: Informational G. Adams
January 1999
Language Tagging in Unicode Plain Text
Status of this Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (1999). All Rights Reserved.
IESG Note
This document has been accepted by ISO/IEC JTC1/SC2/WG2 in meeting
#34 to be submitted as a recommendation from WG2 for inclusion in
Plane 14 in part 2 of ISO/IEC 10646.
1. Abstract
This document proposes a mechanism for language tagging in [UNICODE]
plain text. A set of special-use tag characters on Plane 14 of
[ISO10646] (accessible through UTF-8, UTF-16, and UCS-4 encoding
forms) is proposed for encoding to enable the spelling out of
ASCII-based string tags using characters which can be strictly
separated from ordinary text content characters in ISO10646 (or
UNICODE).
One tag identification character and one cancel tag character are
also proposed. In particular, a language tag identification character
is proposed to identify a language tag string specifically; the
language tag itself makes use of [RFC1766] language tag strings
spelled out using the Plane 14 tag characters. Provision of a
specific, low-overhead mechanism for embedding language tags in plain
text is aimed at meeting the need of Internet Protocols such as ACAP,
which require a standard mechanism for marking language in UTF-8
strings.
The tagging mechanism as well as the characters proposed in this
document have been approved by the Unicode Consortium for inclusion
in The Unicode Standard. However, implementation of this decision
Whistler & Adams Informational [Page 1]
RFC 2482 Language Tagging in Unicode Plain Text January 1999
awaits formal acceptance by ISO JTC1/SC2/WG2, the working group
responsible for ISO10646. Potential implementers should be aware that
until this formal acceptance occurs, any usage of the characters
proposed herein is strictly experimental and not sanctioned for
standardized character data interchange.
2. Definitions and Notation
No attempt is made to define all terms used in this document. In
particular, the terminology pertaining to the subject of coded
character systems is not explicitly specified. See [UNICODE],
[ISO10646], and [RFC2130] for additional definitions in this area.
2.1 Requirements Notation
This document occasionally uses terms that appear in capital letters.
When the terms "MUST", "SHOULD", "MUST NOT", "SHOULD NOT", and "MAY"
appear capitalized, they are being used to indicate particular
requirements of this specification. A discussion of the meanings of
these terms appears in [RFC2119].
2.2 Definitions
The terms defined below are used in special senses and thus warrant
some clarification.
2.2.1 Tagging
The association of attributes of text with a point or range of the
primary text. (The value of a particular tag is not generally
considered to be a part of the "content" of the text. A typical
example of tagging is marking the language or font of a portion of
text.)
2.2.2 Annotation
The association of secondary textual content with a point or range of
the primary text. (The value of a particular annotation *is*
considered to be a part of the "content" of the text. Typical
examples include glossing, citations, exemplification, Japanese yomi,
etc.)
2.2.3 Out-of-band
An out-of-band channel conveys a tag in such a way that the textual
content, as encoded, is completely untouched and unmodified. This is
typically done by metadata or hyperstructure of some sort.
2.2.4 In-band
An in-band channel conveys a tag along with the textual content,
using the same basic encoding mechanism as the text itself. This is
done by various means, but an obvious example is SGML markup, where
the tags are encoded in the same character set as the text and are
interspersed with and carried along with the text data.
3.0 Background
There has been much discussion over the last 8 years of language
tagging and of other kinds of tagging of Unicode plain text. It is
fair to say that there is more-or-less universal agreement that
language tagging of Unicode plain text is required for certain
textual processes. For example, language "hinting" of multilingual
text is necessary for multilingual spell-checking based on multiple
dictionaries to work well. Language tagging provides a minimum level
of required information for text-to-speech processes to work
correctly. Language tagging is regularly done on web pages, to
enable selection of alternate content, for example.
However, there has been a great deal of controversy regarding the
appropriate placement of language tags. Some have held that the only
appropriate placement of language tags (or other kinds of tags) is
out-of-band, making use of attributed text structures or metadata.
Others have argued that there are requirements for lower-complexity
in-band mechanisms for language tags (or other tags) in plain text.
The controversy has been muddied by the existence and widespread use
of a number of in-band text markup mechanisms (HTML, text/enriched,
etc.) which enable language tagging, but which imply the use of
general parsing mechanisms which are deemed too "heavyweight" for
protocol developers and a number of other applications. The
difficulty of using general in-band text markup for simple protocols
derives from the fact that some characters are used both for textual
content and for the text markup; this makes it more difficult to
write simple, fast algorithms to find only the textual content and
ignore the tags, or vice versa. (Think of this as the algorithmic
equivalent of the difficulty the human reader has attempting to read
just the content of raw HTML source text without a browser
interpreting all the markup tags.)
The Plane 14 proposal addresses the recurrent and persistent call for
a lighter-weight mechanism for text tagging than typical text markup
mechanisms in Unicode. It proposes a special set of characters used
*only* for tagging. These tag characters can be embedded into plain
text and can be identified and/or ignored with trivial algorithms,
since there is no overloading of usage for these tag characters--they
can only express tag values and never textual content itself.
The Plane 14 proposal is not intended for general annotation of text,
such as textual citations, phonetic readings (e.g. Japanese Yomi),
etc. In its present form, its use is intended to be restricted
to specifying in-line language tags. Future extensions may widen
this scope of intended usage.
4.0 Proposal
This proposal suggests the use of 97 dedicated tag characters encoded
at the start of Plane 14 of ISO/IEC 10646, consisting of a clone of
the 94 printable 7-bit ASCII graphic characters and ASCII SPACE, as
well as a tag identification character and a tag cancel character.
These tag characters are to be used to spell out any ASCII-based
tagging scheme which needs to be embedded in Unicode plain text. In
particular, they can be used to spell out language tags in order to
meet the expressed requirements of the ACAP protocol and the likely
requirements of other new protocols following the guidelines of the
IAB character workshop (RFC 2130).
The suggested range in Plane 14 for the block reserved for tag
characters is as follows, expressed in each of the three most
generally used encoding schemes for ISO/IEC 10646:
UCS-4
U-000E0000 .. U-000E007F
UTF-16
U+DB40 U+DC00 .. U+DB40 U+DC7F
UTF-8
0xF3 0xA0 0x80 0x80 .. 0xF3 0xA0 0x81 0xBF
Of this range, U-000E0020 .. U-000E007E is the suggested range for
the ASCII clone tag characters themselves.
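To illustrate how the three forms of the range above relate to one
another, the following C sketch derives the UTF-16 and UTF-8 forms of
a Plane 14 code point from its UCS-4 value. The function names are
hypothetical and are not part of this proposal; the sketch assumes
the input lies in a supplementary plane (0x10000 .. 0x10FFFF).

```c
#include <stdint.h>

/* Encode a supplementary-plane code point as a UTF-16 surrogate pair.
   Hypothetical helper; assumes 0x10000 <= c <= 0x10FFFF. */
static void ucs4_to_utf16(uint32_t c, uint16_t out[2])
{
    uint32_t v = c - 0x10000;
    out[0] = (uint16_t)(0xD800 + (v >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (v & 0x3FF));  /* low surrogate  */
}

/* Encode a supplementary-plane code point as four UTF-8 octets.
   Hypothetical helper; assumes 0x10000 <= c <= 0x10FFFF. */
static void ucs4_to_utf8(uint32_t c, uint8_t out[4])
{
    out[0] = (uint8_t)(0xF0 | (c >> 18));
    out[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (c & 0x3F));
}
```

Applied to U-000E0000 and U-000E007F, these helpers yield the
endpoints of the UTF-16 and UTF-8 ranges listed above.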
4.1 Names for the Tag Characters
The names for the ASCII clone tag characters should be exactly the
ISO 10646 names for 7-bit ASCII, prefixed with the word "TAG".
In addition, there is one tag identification character and a CANCEL
TAG character. The use and syntax of these characters are described in
detail below.
The entire encoding for the proposed Plane 14 tag characters and the
names of those characters can be derived from the following list.
(The encoded values here and throughout this proposal are listed in
UCS-4 form, which is easiest to interpret. It is assumed that most
Unicode applications will, however, be making use either of UTF-16 or
UTF-8 encoding forms for actual implementation.)
U-000E0000 <reserved>
U-000E0001 LANGUAGE TAG
U-000E0002 <reserved>
....
U-000E001F <reserved>
U-000E0020 TAG SPACE
U-000E0021 TAG EXCLAMATION MARK
....
U-000E0041 TAG LATIN CAPITAL LETTER A
....
U-000E007A TAG LATIN SMALL LETTER Z
....
U-000E007E TAG TILDE
U-000E007F CANCEL TAG
4.2 Range Checking for Tag Characters
The range checks required for code testing for tag characters would
be as follows. The same range check is expressed here in C for each
of the three significant encoding forms for 10646.
Range check expressed in UCS-4:
if ( ( *s >= 0xE0000 ) && ( *s <= 0xE007F ) )
Range check expressed in UTF-16 (Unicode):
if ( ( *s == 0xDB40 ) && ( *(s+1) >= 0xDC00 ) && ( *(s+1) <= 0xDC7F ) )
Expressed in UTF-8:
if ( ( *s == 0xF3 ) && ( *(s+1) == 0xA0 ) && ( ( *(s+2) & 0xFE ) == 0x80 ) )
Because of the choice of the range for the tag characters, it would
also be possible to express the range check for UCS-4 or UTF-16 in
terms of bitmask operations as well.
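As a minimal sketch of the bitmask alternative mentioned above: the
block is 128 code points long and starts on a 128-aligned boundary,
so masking off the low seven bits leaves a single constant to compare
against. The function names are hypothetical.

```c
#include <stdint.h>

/* UCS-4: clearing the low seven bits of any code point in
   U-000E0000 .. U-000E007F leaves exactly 0xE0000. */
static int is_tag_ucs4(uint32_t c)
{
    return (c & ~(uint32_t)0x7F) == 0xE0000;
}

/* UTF-16: the high surrogate is fixed at 0xDB40; the low surrogate
   lies in 0xDC00 .. 0xDC7F, which the mask 0xFF80 reduces to 0xDC00.
   s must point at two readable 16-bit units. */
static int is_tag_utf16(const uint16_t *s)
{
    return s[0] == 0xDB40 && (s[1] & 0xFF80) == 0xDC00;
}
```

Each check costs one mask and one comparison per code unit tested,
in place of the two comparisons used by the explicit range form.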
4.3 Syntax for Embedding Tags
The use of the Plane 14 tag characters is very simple. In order to
embed any ASCII-derived tag in Unicode plain text, the tag is simply
spelled out with the tag characters instead, prefixed with the
relevant tag identification character. The resultant string is
embedded directly in the text.
The tag identification character is used as a mechanism for
identifying tags of different types. This enables multiple types of
tags to coexist amicably embedded in plain text and solves the
problem of delimitation if a tag is concatenated directly onto
another tag. Although only one type of tag is currently specified,
namely the language tag, the encoding of other tag identification
characters in the future would allow for distinct tag types to be
used.
No termination character is required for a tag. A tag terminates
either when the first non Plane 14 Tag Character (i.e. any other
normal Unicode value) is encountered, or when the next tag
identification character is encountered.
All tag arguments must be encoded only with the tag characters
U-000E0020 .. U-000E007E. No other characters are valid for expressing
the tag argument.
A detailed BNF syntax for tags is listed below.
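The termination rule described in this section can be sketched in C
as follows, working on UCS-4 values for readability. This is an
illustration under stated assumptions, not a reference implementation:
the function name, the buffer interface, and the macro names are all
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LANGUAGE_TAG 0xE0001u   /* tag identification character */
#define TAG_FIRST    0xE0020u   /* TAG SPACE */
#define TAG_LAST     0xE007Eu   /* TAG TILDE */

/* Read the argument of the language tag whose identification
   character is at text[0]. The ASCII equivalents of the tag
   characters are copied into buf (NUL-terminated, truncated to
   cap-1 bytes). Returns the number of code points consumed,
   including the identification character, or 0 if text[0] is not
   LANGUAGE TAG. Hypothetical helper. */
static size_t read_language_tag(const uint32_t *text, size_t len,
                                char *buf, size_t cap)
{
    size_t i = 1, n = 0;
    if (len == 0 || cap == 0 || text[0] != LANGUAGE_TAG)
        return 0;
    /* The argument ends at the first code point outside
       U-000E0020 .. U-000E007E: ordinary text, another tag
       identification character, or CANCEL TAG. */
    while (i < len && text[i] >= TAG_FIRST && text[i] <= TAG_LAST) {
        if (n + 1 < cap)
            buf[n++] = (char)(text[i] - 0xE0000u);
        i++;
    }
    buf[n] = '\0';
    return i;
}
```

A caller would typically invoke this when it meets U-000E0001 while
scanning, then resume scanning at the returned offset. An empty
argument (return value 1) leaves it to the caller to check whether
CANCEL TAG follows.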
4.4 Tag Scope and Nesting
The value of an established tag continues from the point the tag is
embedded in text until either:
A. The text itself goes out of scope, as defined by the
application. (E.g. for line-oriented protocols, when reaching
the end-of-line or end-of-string; for text streams, when
reaching the end-of-stream; etc.)
or
B. The tag is explicitly cancelled by the CANCEL TAG character.
Tags of the same type cannot be nested in any way. The appearance of
a new embedded language tag, for example, after text which was
already language tagged, simply changes the tagged value for
subsequent text to that specified in the new tag.
Tags of different types can have interdigitating scope, but not
hierarchical scope. In effect, tags of different types completely
ignore each other, so that the use of language tags can be completely
asynchronous with the use of character set source tags (or any other
tag type) in the same text in the future.
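The scope rules above amount to a single piece of state per tag type
that is overwritten, never stacked. A minimal C sketch, assuming the
whole UCS-4 buffer is the application-defined scope and that only the
language tag type exists; the function and macro names are
hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

#define LANGUAGE_TAG 0xE0001u
#define CANCEL_TAG   0xE007Fu
#define TAG_FIRST    0xE0020u
#define TAG_LAST     0xE007Eu

/* Write into lang (NUL-terminated, at most cap-1 bytes) the language
   tag value in effect for the code point at index pos. An empty
   string means "no tag value specified". Hypothetical helper. */
static void language_at(const uint32_t *text, size_t len, size_t pos,
                        char *lang, size_t cap)
{
    size_t i = 0, n;
    if (cap == 0)
        return;
    lang[0] = '\0';
    while (i < len && i <= pos) {
        if (text[i] == LANGUAGE_TAG) {
            /* A new language tag replaces the old value: no nesting. */
            n = 0;
            i++;
            if (i < len && text[i] == CANCEL_TAG) {
                i++;                /* LANGUAGE TAG + CANCEL TAG */
            } else {
                while (i < len && text[i] >= TAG_FIRST &&
                       text[i] <= TAG_LAST) {
                    if (n + 1 < cap)
                        lang[n++] = (char)(text[i] - 0xE0000u);
                    i++;
                }
            }
            lang[n] = '\0';
        } else if (text[i] == CANCEL_TAG) {
            lang[0] = '\0';         /* bare CANCEL TAG cancels all */
            i++;
        } else {
            i++;                    /* ordinary text */
        }
    }
}
```

Because the state is a single overwritten value, the scan needs no
stack, which is the practical consequence of forbidding nesting.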
4.5 Cancelling Tag Values
U-000E007F CANCEL TAG is provided to allow the specific cancelling of
a tag value. The use of CANCEL TAG has the following syntax. To
cancel a tag value of a particular type, prefix the CANCEL TAG
character with the tag identification character of the appropriate
type. For example, the complete string to cancel a language tag is:
U-000E0001 U-000E007F
The value of the relevant tag type returns to the default state for
that tag type, namely: no tag value specified, the same as untagged
text.
The use of CANCEL TAG without a prefixed tag identification character
cancels *any* Plane 14 tag values which may be defined. Since only
language tags are currently provided with an explicit tag
identification character, only language tags are currently affected.
The main function of CANCEL TAG is to make possible such operations
as blind concatenation of strings in a tagged context without the
propagation of inappropriate tag values across the string boundaries.
For example, a string tagged with a Japanese language tag can have
its tag value "sealed off" with a terminating CANCEL TAG before
another string of unknown language value is concatenated to it. This
would prevent the string of unknown language from being erroneously
marked as being Japanese simply because of a concatenation to a
Japanese string.
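The "sealing off" operation described above might look like the
following C sketch, which inserts the language-tag cancel sequence
U-000E0001 U-000E007F between two UCS-4 strings. The function name
and buffer interface are hypothetical; a bare CANCEL TAG would serve
equally well if all tag types are to be cancelled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANGUAGE_TAG 0xE0001u
#define CANCEL_TAG   0xE007Fu

/* Concatenate a and b into out, inserting LANGUAGE TAG + CANCEL TAG
   between them so that a language tag in a cannot leak into b.
   Returns the number of code points written, or 0 if out (capacity
   cap code points) is too small. Hypothetical helper. */
static size_t concat_sealed(const uint32_t *a, size_t alen,
                            const uint32_t *b, size_t blen,
                            uint32_t *out, size_t cap)
{
    size_t total = alen + 2 + blen;
    if (total > cap)
        return 0;
    memcpy(out, a, alen * sizeof *a);
    out[alen]     = LANGUAGE_TAG;
    out[alen + 1] = CANCEL_TAG;
    memcpy(out + alen + 2, b, blen * sizeof *b);
    return total;
}
```

The concatenating code does not need to inspect either string, which
is what makes the concatenation "blind".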
4.6 Tag Syntax Description
An extended BNF (Backus-Naur Form) description of the tags specified
in this proposal is found below. Note the following BNF extensions
used in this formalism:
1. Semantic constraints are specified by rules in the form of an
assertion specified between double braces; the variable $$ denotes
the string consisting of all terminal symbols matched by this
non-terminal.
Example: {{ Assert ( $$[0] == '?' ); }}
Meaning: The first character of the string matched by this
non-terminal must be '?'.
2. A number of predicate functions are employed in semantic
constraint rules which are not otherwise defined; their name is
sufficient for determining their predication.
Example: IsRFC1766LanguageIdentifier ( tag-argument )
Meaning: tag-argument is a valid RFC1766 language identifier
3. A lexical expander function, TAG, is employed to denote the tag
form of an ASCII character; the argument to this function is
either a character or a character set specified by a range or
enumeration expression.
Example: TAG('-')
Meaning: TAG HYPHEN-MINUS
Example: TAG([A-Z])
Meaning: TAG LATIN CAPITAL LETTER A ...
TAG LATIN CAPITAL LETTER Z
4. A macro is employed to denote terminal symbols that are character
literals which can't be directly represented in ASCII. The
argument to the macro is the UNICODE (ISO/IEC 10646) character
name.
Example: '${TAG CANCEL}'
Meaning: character literal whose code value is U-000E007F
5. Occurrence indicators used are '+' (one or more) and '*' (zero or
more); optional occurrence is indicated by enclosure in '[' and
']'.
4.6.1 Formal Tag Syntax
tag                     : language-tag
                        | cancel-all-tag
                        ;
language-tag            : language-tag-introducer language-tag-argument
                        ;
language-tag-argument   : tag-argument
   {{ Assert ( IsRFC1766LanguageIdentifier ( $$ ) ); }}
                        | tag-cancel
                        ;
cancel-all-tag          : tag-cancel
                        ;
tag-argument            : tag-character+
                        ;
tag-character           : { c : c in
   TAG( { a : a in printable ASCII characters or SPACE } ) }
                        ;
language-tag-introducer : '${TAG LANGUAGE}'
                        ;
tag-cancel              : '${TAG CANCEL}'
                        ;
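As an illustration only, the productions above can be turned into a
small recognizer over UCS-4 values. The sketch below is structural:
it does not implement the semantic constraint
IsRFC1766LanguageIdentifier, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LANGUAGE_TAG 0xE0001u   /* language-tag-introducer */
#define CANCEL_TAG   0xE007Fu   /* tag-cancel */

/* tag-character: the tag form of a printable ASCII character or
   SPACE, i.e. U-000E0020 .. U-000E007E. */
static int is_tag_character(uint32_t c)
{
    return c >= 0xE0020u && c <= 0xE007Eu;
}

/* Return the length in code points of the tag that starts at s[0],
   or 0 if no production of the grammar matches there. */
static size_t match_tag(const uint32_t *s, size_t len)
{
    size_t i;
    if (len == 0)
        return 0;
    if (s[0] == CANCEL_TAG)               /* cancel-all-tag */
        return 1;
    if (s[0] != LANGUAGE_TAG || len < 2)  /* needs an argument */
        return 0;
    if (s[1] == CANCEL_TAG)               /* argument is tag-cancel */
        return 2;
    for (i = 1; i < len && is_tag_character(s[i]); i++)
        ;                                 /* tag-argument: tag-character+ */
    return i > 1 ? i : 0;
}
```

A fuller implementation would additionally validate the matched
argument against RFC 1766 before accepting it as a language tag.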
5.0 Tag Types
5.1 Language Tags
Language tags are of general interest and should have a high degree
of interoperability for protocol usage. To this end, a specific
LANGUAGE TAG tag identification character is provided. A Plane 14
tag string prefixed by U-000E0001 LANGUAGE TAG is specified to
constitute a language tag. Furthermore, the tag values for the
language tag are to be spelled out as specified in RFC 1766, making
use only of registered tag values or of user-defined language tags
starting with the characters "x-".
For example, to embed a language tag for Japanese, the Plane 14 tag
characters would be used as follows. The Japanese tag from RFC 1766
is "ja" (composed of ISO 639 language id) or, alternatively, "ja-JP"
(composed of ISO 639 language id plus ISO 3166 country id). Since
RFC 1766 specifies that language tags are not case significant, it is
recommended that for language tags, the entire tag be lowercased
before conversion to Plane 14 tag characters. (This would not be
required for Unicode conformance, but should be followed as general
practice by protocols making use of RFC 1766 language tags, to
simplify and speed up the processing for operations which need to
identify or ignore language tags embedded in text.) Lowercasing,
rather than uppercasing, is recommended because it follows the
majority practice of expressing language tag values in lowercase
letters.
Thus the entire language tag (in its longer form) would be converted
to Plane 14 tag characters as follows:
U-000E0001 U-000E006A U-000E0061 U-000E002D U-000E006A U-000E0070
The language tag (in its shorter, "ja" form) could be expressed as
follows:
U-000E0001 U-000E006A U-000E0061
The value of this string is then expressed in whichever encoding form
(UCS-4, UTF-16, UTF-8) is required and embedded in text at the
relevant point.
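The conversion described in this section (lowercase the RFC 1766 tag,
add 0xE0000 to each ASCII value, and prefix LANGUAGE TAG) can be
sketched in C as follows. The function name and buffer interface are
hypothetical, and the sketch assumes the "C" locale for tolower().

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Spell out an ASCII language tag such as "ja-JP" as Plane 14 tag
   characters in UCS-4, lowercased and prefixed with U-000E0001
   LANGUAGE TAG. Returns the number of code points written, or 0 if
   the input contains a byte outside 0x20 .. 0x7E or out (capacity
   cap code points) is too small. Hypothetical helper. */
static size_t encode_language_tag(const char *tag, uint32_t *out,
                                  size_t cap)
{
    size_t n = 0;
    if (cap == 0)
        return 0;
    out[n++] = 0xE0001u;
    for (; *tag != '\0'; tag++) {
        unsigned char c = (unsigned char)*tag;
        if (c < 0x20 || c > 0x7E || n == cap)
            return 0;
        /* Each tag character is its ASCII value plus 0xE0000. */
        out[n++] = 0xE0000u + (uint32_t)tolower(c);
    }
    return n;
}
```

For the input "ja-JP" this produces the six code points listed above
for the longer form of the Japanese tag.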
5.2 Additional Tags
Additional tag identification characters might be defined in the
future. An example would be a CHARACTER SET SOURCE TAG, or a GENERIC
TAG for private definition of tags.
In each case, when a specific tag identification character is
encoded, a corresponding reference standard for the values of the
tags associated with the identifier should be designated, so that
interoperating parties which make use of the tags will know how to
interpret the values the tags may take.
6.0 Display Issues
All characters in the tag character block are considered to have no
visible rendering in normal text. A process which interprets tags may
choose to modify the rendering of text based on the tag values (for
example, changing the font to the preferred style for rendering
Chinese versus Japanese). The tag characters themselves have no
display; they may be considered similar to a U+200B ZERO WIDTH SPACE
in that regard. The tag characters also do not affect breaking,
joining, or any other format or layout properties, except insofar as
the process interpreting the tag chooses to impose such behavior based
on the tag value.
For debugging or other operations which must render the tags
themselves visible, it is advisable that the tag characters be
rendered using the corresponding ASCII character glyphs (perhaps
modified systematically to differentiate them from normal ASCII
characters). But, as noted below, the tag character values are chosen
so that even without display support, the tag characters will be
interpretable in most debuggers.
7.0 Unicode Conformance Issues
The basic rules for Unicode conformance for the tag characters are
exactly the same as for any other Unicode characters. A conformant
process is not required to interpret the tag characters. If it does
not interpret tag characters, it should leave their values
undisturbed and do whatever it does with any other uninterpreted
characters. If it does interpret them, it should interpret them
according to the standard, i.e. as spelled-out tags.
So for a non-TagAware Unicode application, any language tag
characters (or any other kind of tag expressed with Plane 14 tag
characters) encountered would be handled exactly as for uninterpreted
Tibetan from the BMP, uninterpreted Linear B from Plane 1, or
uninterpreted Egyptian hieroglyphics from private use space in Plane
15.
A TagAware but TagPhobic Unicode application can recognize the tag
character range in Plane 14 and choose to deliberately strip them out
completely to produce plain text with no tags.
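A minimal sketch of the stripping behavior such a TagPhobic
application might implement, operating in place on UCS-4 values. The
function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Remove every code point in U-000E0000 .. U-000E007F from text, in
   place, and return the new length in code points. Hypothetical
   helper. */
static size_t strip_tags(uint32_t *text, size_t len)
{
    size_t r, w = 0;
    for (r = 0; r < len; r++) {
        if (text[r] < 0xE0000u || text[r] > 0xE007Fu)
            text[w++] = text[r];    /* keep ordinary text */
    }
    return w;
}
```

Because tag characters never carry textual content, a single range
test per code point is sufficient; no parsing of the tags is needed.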
The presence of a correctly formed tag cannot be taken as a guarantee
that the data so tagged is correctly tagged. For example, nothing
prevents an application from erroneously labelling French data as
Spanish, or from labelling JIS-derived data as Japanese, even if it
contains Greek or Cyrillic characters.
7.1 Note on Encoding Language Tags
The fact that this proposal for encoding tag characters in Unicode
includes a mechanism for specifying language tag values does not mean
that Unicode is departing from one of its basic encoding principles:
Unicode encodes scripts, not languages.
This is still true of the Unicode encoding (and ISO/IEC 10646), even
in the presence of a mechanism for specifying language tags in plain
text. There is nothing obligatory about the use of Plane 14 tags,
whether for language tags or any other kind of tags.
Language tagging in no way impacts current encoded characters or the
encoding of future scripts.
It is fully anticipated that implementations of Unicode which already
make use of out-of-band mechanisms for language tagging or
"heavyweight" in-band mechanisms such as HTML will continue to do
exactly what they are doing and will ignore Plane 14 tag characters
completely.
8.0 Security Considerations
There are no known security issues raised by this document.
References
[ISO10646] ISO/IEC 10646-1:1993 International Organization for
Standardization. "Information Technology -- Universal
Multiple-Octet Coded Character Set (UCS) -- Part 1:
Architecture and Basic Multilingual Plane", Geneva, 1993.
[RFC1766] Alvestrand, H., "Tags for the Identification of
Languages", RFC 1766, March 1995.
[RFC2070] Yergeau, F., Nicol, G., Adams, G. and M. Duerst,
"Internationalization of the Hypertext Markup Language",
RFC 2070, January 1997.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
Atkinson, R., Crispin, M. and P. Svanberg, "The Report of
the IAB Character Set Workshop held 29 February - 1 March,
1996", RFC 2130, April 1997.
[UNICODE] The Unicode Standard, Version 2.0, The Unicode Consortium,
Addison-Wesley, July 1996.
Acknowledgements
The following people also contributed to this document, directly or
indirectly: Chris Newman, Mark Crispin, Rick McGowan, Joe Becker,
John Jenkins, and Asmus Freytag. This document also was reviewed by
the Unicode Technical Committee, and the authors wish to thank all of
the UTC representatives for their input. The authors are, of course,
responsible for any errors or omissions which may remain in the text.
Authors' Addresses
Ken Whistler
Sybase, Inc.
6475 Christie Ave.
Emeryville, CA 94608-1050
Phone: +1 510 922 3611
EMail: kenw@sybase.
Glenn Adams
Spyglass, Inc.
One Cambridge
Cambridge, MA 02142
Phone: +1 617 679 4652
EMail: glenn@spyglass.
Full Copyright Statement
Copyright (C) The Internet Society (1999). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.