Network Working Group F. Lee
Request for Comments: 1843 Stanford University
Category: Informational August 1995
HZ - A Data Format for Exchanging Files of
Arbitrarily Mixed Chinese and ASCII Characters
Status of this Memo
This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
Abstract
The content of this memo is identical to an article of the same title
written by the author on September 4, 1989. In this memo, GB stands
for GB2312-80. Note that the title is kept only for historical
reasons. HZ has been widely used for purposes other than "file
exchange".
1. Introduction
Most existing computer systems which can handle a text file of
arbitrarily mixed Chinese and ASCII characters use 8-bit codes. To
exchange such text files through electronic mail on ASCII computer
systems, it is necessary to encode them in a 7-bit format. A generic
binary to ASCII encoder is not sufficient, because there is currently
no universal standard for such 8-bit codes. For example, CCDOS and
Macintosh's Chinese OS use different internal codes. Fortunately,
there is a PRC national standard, GuoBiao (GB), for the encoding of
Chinese characters, and Chinese characters encoded in the above
systems can be easily converted to GB by a simple formula. (* The ROC
standard BIG-5 is outside the scope of this article.)
HZ is a 7-bit data format proposed for arbitrarily mixed GB and ASCII
text file exchange. HZ is also intended for the design of terminal
emulators that display and edit mixed Chinese and ASCII text files in
real time.
Lee Informational [Page 1]
RFC 1843 HZ - A Data Format for Exchanging Files August 1995
2. Specification
The format of HZ is described in the following.
Without loss of generality, we assume that all Chinese characters
(HanZi) have already been encoded in GB. A GB (GB1 and GB2) code is
a two byte code, where the first byte is in the range $21-$77
(hexadecimal), and the second byte is in the range $21-$7E.
A graphical ASCII character is a byte in the range $21-$7E. A
non-graphical ASCII character is a byte in the range $0-$20 or of the
value $7F.
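The byte ranges above can be written down as simple predicates. This is a minimal sketch; the function names are illustrative and are not part of HZ.

```python
def is_gb_lead(b: int) -> bool:
    """First byte of a GB code: $21-$77."""
    return 0x21 <= b <= 0x77


def is_gb_trail(b: int) -> bool:
    """Second byte of a GB code: $21-$7E."""
    return 0x21 <= b <= 0x7E


def is_graphical_ascii(b: int) -> bool:
    """Graphical ASCII character: $21-$7E."""
    return 0x21 <= b <= 0x7E


def is_non_graphical_ascii(b: int) -> bool:
    """Non-graphical ASCII character: $0-$20, or the value $7F."""
    return 0x00 <= b <= 0x20 or b == 0x7F
```

Note that '~' ($7E) satisfies is_graphical_ascii and is_gb_trail but not is_gb_lead.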
Since the range of a graphical ASCII character overlaps that of a GB
byte, a byte in the range $21-$7E is interpreted according to the
mode it is in. There are two modes, namely ASCII mode and GB mode.
By convention, a non-graphical ASCII character should only appear in
ASCII mode.
The default mode is ASCII mode.
In ASCII mode, a byte is interpreted as an ASCII character, unless a
'~' is encountered. The character '~' is an escape character. By
convention, it must be immediately followed ONLY by '~', '{' or '\n'
(<LF>), with the following special meaning.
o The escape sequence '~~' is interpreted as a '~'.
o The escape-to-GB sequence '~{' switches the mode from ASCII to
GB.
o The escape sequence '~\n' is a line-continuation marker to be
consumed with no output produced.
In GB mode, characters are interpreted two bytes at a time as (pure)
GB codes until the escape-from-GB code '~}' is read. This code
switches the mode from GB back to ASCII. (Note that the
escape-from-GB code '~}' ($7E7D) is outside the defined GB range.)
The decoding process is clear from the above description.
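The decoding rules above can be sketched as a two-mode loop. This is an illustrative sketch, not a reference implementation. It assumes the decoded GB codes should become Python text, which it obtains by setting the high bit of each GB byte and using Python's "gb2312" codec. The name hz_decode and the choice to raise ValueError on invalid input are assumptions of this sketch.

```python
def hz_decode(data: bytes) -> str:
    """Decode HZ bytes to text (sketch; strict about invalid sequences)."""
    out = bytearray()   # ASCII bytes, plus GB bytes with the high bit set
    gb_mode = False     # the default mode is ASCII mode
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if not gb_mode:
            if b != 0x7E:                 # ordinary ASCII byte
                out.append(b)
                i += 1
                continue
            if i + 1 >= n:
                raise ValueError("dangling '~' at end of input")
            nxt = data[i + 1]
            if nxt == 0x7E:               # '~~' is a literal '~'
                out.append(0x7E)
            elif nxt == 0x7B:             # '~{' switches to GB mode
                gb_mode = True
            elif nxt == 0x0A:             # '~\n' is consumed, no output
                pass
            else:
                raise ValueError(f"invalid escape sequence at offset {i}")
            i += 2
        else:
            if i + 1 >= n:
                raise ValueError("truncated two-byte code in GB mode")
            b2 = data[i + 1]
            if b == 0x7E and b2 == 0x7D:  # '~}' switches back to ASCII
                gb_mode = False
            elif 0x21 <= b <= 0x77 and 0x21 <= b2 <= 0x7E:
                out.append(b | 0x80)      # assumption: emit high-bit-set
                out.append(b2 | 0x80)     # bytes for the gb2312 codec
            else:
                raise ValueError(f"invalid GB code at offset {i}")
            i += 2
    return out.decode("gb2312")
```

The loop looks ahead one byte at most, and it never treats '\n' specially outside the '~\n' escape.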
The encoding process is straightforward. Note that an (ASCII) '~' is
always encoded as '~~'. A sequence of GB codes is enclosed in '~{'
and '~}'.
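A minimal encoder following these two rules might look like the sketch below. It assumes the input is Python text whose non-ASCII characters all exist in GB2312, and it uses Python's "gb2312" codec to obtain the two GB bytes (with the high bit set) before clearing that bit. The name hz_encode is illustrative.

```python
def hz_encode(text: str) -> bytes:
    """Encode text as HZ with no line-size limit (sketch)."""
    out = bytearray()
    gb_mode = False
    for ch in text:
        if ord(ch) < 0x80:
            if gb_mode:                  # close the GB run first
                out += b"~}"
                gb_mode = False
            # An ASCII '~' is always encoded as '~~'.
            out += b"~~" if ch == "~" else ch.encode("ascii")
        else:
            # Raises UnicodeEncodeError if ch is not in GB2312.
            euc = ch.encode("gb2312")
            if not gb_mode:              # open a GB run
                out += b"~{"
                gb_mode = True
            out.append(euc[0] & 0x7F)    # clear the high bit to get
            out.append(euc[1] & 0x7F)    # the 7-bit GB code
    if gb_mode:
        out += b"~}"
    return bytes(out)
```

Consecutive Chinese characters share one '~{' ... '~}' pair, so the overhead is four bytes per run rather than per character.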
3. Remarks & Recommendations
We choose to encode any ASCII character except '~' as it is, rather
than as a two byte code, and we choose ASCII as the default mode for
the following reasons. The computer systems we use are ASCII based. A
HZ file containing pure ASCII characters (i.e. no Chinese characters)
except '~' is precisely a pure ASCII file. In general, the English
(ASCII) portion of a HZ file is directly readable.
The escape character '~' is chosen not only because it is rarely
used in the ASCII world, but also because '~' ($7E) is outside the
defined range ($21-$77) of the first byte of a GB code.
In ASCII mode, other potential escape sequences, i.e., two byte
sequences beginning with '~' (other than '~~', '~{', '~\n') are
currently invalid HZ sequences. Hence, they can be used for future
extension of HZ with total upward compatibility.
The line-continuation marker '~\n' is useful if one wants to encode
long lines in the original text into short lines in this data format
without introducing extra newline characters in the decoding process.
There is no limit on the length of a line. In fact, the whole file
could be one long line or even contain no newline characters. Any
DECODER of this HZ data format should not and has no need to operate
on the concept of a line.
It is easy to write encoders and decoders for HZ. An encoder or
decoder needs to look ahead at most one character in the input data
stream.
Given the current mode, it is also possible and easy to decode a HZ
data stream by scanning backward. One of the implications is that
"backspaces" can be handled correctly by a terminal emulator.
To facilitate the effective use of programs supporting line/page
skips such as "more" on UNIX with a terminal emulator understanding
the HZ format, it is RECOMMENDED that the ENCODER (which outputs in
HZ) sets a maximum line size of less than 80 characters. Since '\n'
is an ASCII character, the syntax of HZ then automatically implies
that GB codes appearing at the end of a line must be terminated with
the escape-from-GB code '~}', and the line-continuation marker '~\n'
should be inserted appropriately. The price to be paid is that the
encoded file size is slightly larger.
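One way to follow this recommendation is a greedy encoder that breaks a line just before it would overflow. The sketch below is only one possible "style". The name hz_encode_wrapped, the default of 79, and the policy of always reserving room for the closing '~}' and the trailing '~' are assumptions of this sketch. It relies on Python's "gb2312" codec to obtain the GB bytes.

```python
def hz_encode_wrapped(text: str, max_line: int = 79) -> bytes:
    """Encode text as HZ, keeping every output line <= max_line bytes."""
    if max_line < 7:  # '~{' + one GB code + '~}' + '~' needs 7 bytes
        raise ValueError("max_line is too small")
    out = bytearray()
    col = 0           # bytes already written on the current output line
    gb_mode = False

    def emit(chunk: bytes) -> None:
        nonlocal col
        out.extend(chunk)
        col += len(chunk)

    def break_line() -> None:
        # Close GB mode if needed, then add the continuation marker.
        nonlocal col, gb_mode
        if gb_mode:
            out.extend(b"~}")
            gb_mode = False
        out.extend(b"~\n")
        col = 0

    for ch in text:
        if ch == "\n":                    # a real newline in the text
            if gb_mode:
                out.extend(b"~}")
                gb_mode = False
            out.extend(b"\n")
            col = 0
        elif ord(ch) < 0x80:
            unit = b"~~" if ch == "~" else ch.encode("ascii")
            lead = b"~}" if gb_mode else b""
            # Keep one byte free for the '~' of a later '~\n'.
            if col + len(lead) + len(unit) + 1 > max_line:
                break_line()
                lead = b""
            if lead:
                emit(lead)
                gb_mode = False
            emit(unit)
        else:
            unit = bytes(b & 0x7F for b in ch.encode("gb2312"))
            lead = b"" if gb_mode else b"~{"
            # Keep three bytes free for '~}' plus the '~' of '~\n'.
            if col + len(lead) + len(unit) + 3 > max_line:
                break_line()
                lead = b"~{"
            if lead:
                emit(lead)
                gb_mode = True
            emit(unit)
    if gb_mode:
        out.extend(b"~}")
    return bytes(out)
```

A decoder needs no knowledge of this policy: the extra '~}', '~\n' and '~{' sequences cancel out during ordinary decoding.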
It is important to understand the following distinction. Note that
the above recommendation does NOT change the HZ format. It is simply
an encoding "style" which follows the syntax of HZ. Note that this
"style" is not built into HZ. It is an additional convention built
"on top of" HZ. Other applications may require different "styles",
but the same basic HZ DECODER will always work. The essence of HZ is
to provide such a flexible basic data format for files of arbitrarily
mixed Chinese and ASCII characters.
4. Examples
To illustrate the "stylistic" issue of HZ encoding, we give the
following three examples of encoded text, which should produce the
same decoded output. (The recommendation in the last section refers
to Example 2.)
Example 1: (Suppose there is no line size limit.)
This sentence is in ASCII.
The next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.
Example 2: (Suppose the maximum line size is 42.)
This sentence is in ASCII.
The next sentence is in GB.~{<:Ky2;S{#,~}~
~{NpJ)l6HK!#~}Bye.
Example 3: (Suppose a new line is started whenever there is a mode
switch.)
This sentence is in ASCII.
The next sentence is in GB.~
~{<:Ky2;S{#,NpJ)l6HK!#~}~
Bye.
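The claim that the examples decode identically can be checked mechanically. The sketch below transcribes the three encodings as byte strings (with sentence-final periods and a newline after each text line, which are assumptions of this transcription) and decodes them with the "hz" codec from Python's standard library. Any decoder that follows Section 2 should behave the same way.

```python
# The three encoded examples, transcribed as byte strings.
EXAMPLE_1 = (b"This sentence is in ASCII.\n"
             b"The next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.\n")
EXAMPLE_2 = (b"This sentence is in ASCII.\n"
             b"The next sentence is in GB.~{<:Ky2;S{#,~}~\n"
             b"~{NpJ)l6HK!#~}Bye.\n")
EXAMPLE_3 = (b"This sentence is in ASCII.\n"
             b"The next sentence is in GB.~\n"
             b"~{<:Ky2;S{#,NpJ)l6HK!#~}~\n"
             b"Bye.\n")


def decode_all(examples):
    """Decode each HZ byte string with Python's built-in 'hz' codec."""
    return [data.decode("hz") for data in examples]


decoded = decode_all([EXAMPLE_1, EXAMPLE_2, EXAMPLE_3])
same_output = decoded[0] == decoded[1] == decoded[2]
```

Only the placement of '~}', '~\n' and '~{' differs between the inputs, so same_output is expected to be true.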
Acknowledgement
Edmund Lai was the first one who brought my attention to this topic.
Discussions with Ed, Tin-Fook Ngai, Yagui Wei and Ricky Yeung were
very helpful in shaping the ideas in this article. Thanks to Tin-Fook
for his careful review of the draft and numerous
suggestions.
References
[1] Fung Fung Lee, "HZ - A Data Format for Exchanging Files of
Arbitrarily Mixed Chinese and ASCII Characters," September 4,
1989.
As part of //ftp.ifcss.org/software/unix/convert/HZ-2.0.tar.
Security Considerations
Security issues are not addressed in this memo.
Author's Address
Fung Fung Lee
Computer Systems Laboratory
Stanford University
Stanford, CA 94309
Phone: +1 415 723 1450
EMail: lee@csl.stanford.