As per Relevance of the word represent, we have this rfc below:
Network Working Group P.
Request for Comments: 1951 Aladdin
Category: Informational May 1996
DEFLATE Compressed Data Format Specification version 1.3
Status of This
This memo provides information for the Internet community. This
does not specify an Internet standard of any kind. Distribution
this memo is unlimited
IESG Note
The IESG takes no position on the validity of any
Property Rights statements contained in this document
Copyright (c) 1996 L. Peter
Permission is granted to copy and distribute this document for
purpose and without charge, including translations into
languages and incorporation into compilations, provided that
copyright notice and this notice are preserved, and that
substantive changes or deletions from the original are
marked
A pointer to the latest version of this and related documentation
HTML format can be found at the
graphics/png/documents/zlib/zdoc-index.html>.
This specification defines a lossless compressed data format
compresses data using a combination of the LZ77 algorithm and
coding, with efficiency comparable to the best currently
general-purpose compression methods. The data can be produced
consumed, even for an arbitrarily long sequentially presented
data stream, using only an a priori bounded amount of
storage. The format can be implemented readily in a manner
covered by patents
Deutsch Informational [Page 1]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
Table of
1. Introduction ................................................... 2
1.1. Purpose ................................................... 2
1.2. Intended audience ......................................... 3
1.3. Scope ..................................................... 3
1.4. Compliance ................................................ 3
1.5. Definitions of terms and conventions used ................ 3
1.6. Changes from previous versions ............................ 4
2. Compressed representation overview ............................. 4
3. Detailed specification ......................................... 5
3.1. Overall conventions ....................................... 5
3.1.1. Packing into bytes .................................. 5
3.2. Compressed block format ................................... 6
3.2.1. Synopsis of prefix and Huffman coding ............... 6
3.2.2. Use of Huffman coding in the "deflate" format ....... 7
3.2.3. Details of block format ............................. 9
3.2.4. Non-compressed blocks (BTYPE=00) ................... 11
3.2.5. Compressed blocks (length and distance codes) ...... 11
3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12
3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13
3.3. Compliance ............................................... 14
4. Compression algorithm details ................................. 14
5. References .................................................... 16
6. Security Considerations ....................................... 16
7. Source code ................................................... 16
8. Acknowledgements .............................................. 16
9. Author's Address .............................................. 17
1.
1.1.
The purpose of this specification is to define a
compressed data format that
* Is independent of CPU type, operating system, file system
and character set, and hence can be used for interchange
* Can be produced or consumed, even for an arbitrarily
sequentially presented input data stream, using only an
priori bounded amount of intermediate storage, and
can be used in data communications or similar
such as Unix filters
* Compresses data with efficiency comparable to the
currently available general-purpose compression methods
and in particular considerably better than the "compress
program
* Can be implemented readily in a manner not covered
patents, and hence can be practiced freely
Deutsch Informational [Page 2]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
* Is compatible with the file format produced by the
widely used gzip utility, in that conforming
will be able to read data produced by the existing
compressor
The data format defined by this specification does not attempt to
* Allow random access to compressed data
* Compress specialized data (e.g., raster graphics) as
as the best currently available specialized algorithms
A simple counting argument shows that no lossless
algorithm can compress every possible input data set. For
format defined here, the worst case expansion is 5 bytes per 32K
byte block, i.e., a size increase of 0.015% for large data sets
English text usually compresses by a factor of 2.5 to 3;
executable files usually compress somewhat less; graphical
such as raster images may compress much more
1.2. Intended
This specification is intended for use by implementors of
to compress data into "deflate" format and/or decompress data
"deflate" format
The text of the specification assumes a basic background
programming at the level of bits and other primitive
representations. Familiarity with the technique of Huffman
is helpful but not required
1.3.
The specification specifies a method for representing a
of bytes as a (usually shorter) sequence of bits, and a method
packing the latter bit sequence into bytes
1.4.
Unless otherwise indicated below, a compliant decompressor must
able to accept and decompress any data set that conforms to
the specifications presented here; a compliant compressor
produce data sets that conform to all the specifications
here
1.5. Definitions of terms and conventions
Byte: 8 bits stored or transmitted as a unit (same as an octet).
For this specification, a byte is exactly 8 bits, even on
Deutsch Informational [Page 3]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
which store a character on a number of bits different from eight
See below, for the numbering of bits within a byte
String: a sequence of arbitrary bytes
1.6. Changes from previous
There have been no technical changes to the deflate format
version 1.1 of this specification. In version 1.2,
terminology was changed. Version 1.3 is a conversion of
specification to RFC style
2. Compressed representation
A compressed data set consists of a series of blocks,
to successive blocks of input data. The block sizes are arbitrary
except that non-compressible blocks are limited to 65,535 bytes
Each block is compressed using a combination of the LZ77
and Huffman coding. The Huffman trees for each block are
of those for previous or subsequent blocks; the LZ77 algorithm
use a reference to a duplicated string occurring in a previous block
up to 32K input bytes before
Each block consists of two parts: a pair of Huffman code trees
describe the representation of the compressed data part, and
compressed data part. (The Huffman trees themselves are
using Huffman encoding.) The compressed data consists of a series
elements of two types: literal bytes (of strings that have not
detected as duplicated within the previous 32K input bytes),
pointers to duplicated strings, where a pointer is represented as
pair backward distance>. The representation used in
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except
uncompressible blocks, which are limited as noted above
Each type of value (literals, distances, and lengths) in
compressed data is represented using a Huffman code, using one
tree for literals and lengths and a separate code tree for distances
The code trees for each block appear in a compact form just
the compressed data for that block
Deutsch Informational [Page 4]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
3. Detailed
3.1. Overall conventions In the diagrams below, a box like this
+---+
| | <-- the vertical bars might be
+---+
represents one byte; a box like this
+==============+
| |
+==============+
represents a variable number of bytes
Bytes stored within a computer do not have a "bit order",
they are always treated as a unit. However, a byte considered
an integer between 0 and 255 does have a most- and least
significant bit, and since we write numbers with the most
significant digit on the left, we also write bytes with the most
significant bit on the left. In the diagrams below, we number
bits of a byte so that bit 0 is the least-significant bit, i.e.,
the bits are numbered
+--------+
|76543210|
+--------+
Within a computer, a number may occupy multiple bytes.
multi-byte numbers in the format described here are stored
the least-significant byte first (at the lower memory address).
For example, the decimal number 520 is stored as
0 1
+--------+--------+
|00001000|00000010|
+--------+--------+
^ ^
| |
| + more significant byte = 2 x 256
+ less significant byte = 8
3.1.1. Packing into
This document does not address the issue of the order in
bits of a byte are transmitted on a bit-sequential medium
since the final data format described here is byte- rather
Deutsch Informational [Page 5]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
bit-oriented. However, we describe the compressed block
in below, as a sequence of data elements of various
lengths, not a sequence of bytes. We must therefore
how to pack these data elements into bytes to form the
compressed byte sequence
* Data elements are packed into bytes in order
increasing bit number within the byte, i.e.,
with the least-significant bit of the byte
* Data elements other than Huffman codes are
starting with the least-significant bit of the
element
* Huffman codes are packed starting with the most
significant bit of the code
In other words, if one were to print out the compressed data
a sequence of bytes, starting with the first byte at
*right* margin and proceeding to the *left*, with the most
significant bit of each byte on the left as usual, one would
able to parse the result from right to left, with fixed-
elements in the correct MSB-to-LSB order and Huffman codes
bit-reversed order (i.e., with the first bit of the code in
relative LSB position).
3.2. Compressed block
3.2.1. Synopsis of prefix and Huffman
Prefix coding represents symbols from an a priori
alphabet by bit sequences (codes), one code for each symbol,
a manner such that different symbols may be represented by
sequences of different lengths, but a parser can always
an encoded string unambiguously symbol-by-symbol
We define a prefix code in terms of a binary tree in which
two edges descending from each non-leaf node are labeled 0
1 and in which the leaf nodes correspond one-for-one with (
labeled with) the symbols of the alphabet; then the code for
symbol is the sequence of 0's and 1's on the edges leading
the root to the leaf labeled with that symbol. For example
Deutsch Informational [Page 6]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
/\ Symbol
0 1 ------ ----
/ \ A 00
/\ B B 1
0 1 C 011
/ \ D 010
A /\
0 1
/ \
D
A parser can decode the next symbol from an encoded
stream by walking down the tree from the root, at each
choosing the edge corresponding to the next input bit
Given an alphabet with known symbol frequencies, the
algorithm allows the construction of an optimal prefix
(one which represents strings with those symbol
using the fewest bits of any possible prefix codes for
alphabet). Such a code is called a Huffman code. (
reference [1] in Chapter 5, references for
information on Huffman codes.)
Note that in the "deflate" format, the Huffman codes for
various alphabets must not exceed certain maximum code lengths
This constraint complicates the algorithm for computing
lengths from symbol frequencies. Again, see Chapter 5,
references for details
3.2.2. Use of Huffman coding in the "deflate"
The Huffman codes used for each alphabet in the "deflate
format have two additional rules
* All codes of a given bit length have
consecutive values, in the same order as the
they represent
* Shorter codes lexicographically precede longer codes
Deutsch Informational [Page 7]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
We could recode the example above to follow this rule
follows, assuming that the order of the alphabet is ABCD
Symbol
------ ----
A 10
B 0
C 110
D 111
I.e., 0 precedes 10 which precedes 11x, and 110 and 111
lexicographically consecutive
Given this rule, we can define the Huffman code for an
just by giving the bit lengths of the codes for each symbol
the alphabet in order; this is sufficient to determine
actual codes. In our example, the code is completely
by the sequence of bit lengths (2, 1, 3, 3). The
algorithm generates the codes as integers, intended to be
from most- to least-significant bit. The code lengths
initially in tree[I].Len; the codes are produced
tree[I].Code
1) Count the number of codes for each code length.
bl_count[N] be the number of codes of length N, N >= 1.
2) Find the numerical value of the smallest code for
code length
code = 0;
bl_count[0] = 0;
for (bits = 1; bits <= MAX_BITS; bits++) {
code = (code + bl_count[bits-1]) << 1;
next_code[bits] = code
}
3) Assign numerical values to all codes, using
values for all codes of the same length with the
values determined at step 2. Codes that are never
(which have a bit length of zero) must not be assigned
value
for (n = 0; n <= max_code; n++) {
len = tree[n].Len
if (len != 0) {
tree[n].Code = next_code[len];
next_code[len]++;
}
Deutsch Informational [Page 8]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
}
Example
Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
3, 2, 4, 4). After step 1, we have
N bl_count[N
- -----------
2 1
3 5
4 2
Step 2 computes the following next_code values
N next_code[N
- ------------
1 0
2 0
3 2
4 14
Step 3 produces the following code values
Symbol Length
------ ------ ----
A 3 010
B 3 011
C 3 100
D 3 101
E 3 110
F 2 00
G 4 1110
H 4 1111
3.2.3. Details of block
Each block of compressed data begins with 3 header
containing the following data
first bit
next 2 bits
Note that the header bits do not necessarily begin on a
boundary, since a block does not necessarily occupy an
number of bytes
Deutsch Informational [Page 9]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
BFINAL is set if and only if this is the last block of the
set
BTYPE specifies how the data are compressed, as follows
00 - no
01 - compressed with fixed Huffman
10 - compressed with dynamic Huffman
11 - reserved (error
The only difference between the two compressed cases is how
Huffman codes for the literal/length and distance alphabets
defined
In all cases, the decoding algorithm for the actual data is
follows
read block header from input stream
if stored with no
skip any remaining bits in current
processed
read LEN and NLEN (see next section
copy LEN bytes of data to
if compressed with dynamic Huffman
read representation of code trees (
subsection below
loop (until end of block code recognized
decode literal/length value from input
if value < 256
copy value (literal byte) to output
if value = end of block (256)
break from
otherwise (value = 257..285)
decode distance from input
move backwards distance bytes in the
stream, and copy length bytes from
position to the output stream
end
while not last
Note that a duplicated string reference may refer to a
in a previous block; i.e., the backward distance may cross
or more block boundaries. However a distance cannot refer
the beginning of the output stream. (An application using
Deutsch Informational [Page 10]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
preset dictionary might discard part of the output stream;
distance can refer to that part of the output stream anyway
Note also that the referenced string may overlap the
position; for example, if the last 2 bytes decoded have
X and Y, a string reference with distance = 2>
adds X,Y,X,Y,X to the output stream
We now specify each compression method in turn
3.2.4. Non-compressed blocks (BTYPE=00)
Any bits of input up to the next byte boundary are ignored
The rest of the block consists of the following information
0 1 2 3 4...
+---+---+---+---+================================+
| LEN | NLEN |... LEN bytes of literal data...|
+---+---+---+---+================================+
LEN is the number of data bytes in the block. NLEN is
one's complement of LEN
3.2.5. Compressed blocks (length and distance codes
As noted above, encoded data blocks in the "deflate"
consist of sequences of symbols drawn from three
distinct alphabets: either literal bytes, from the alphabet
byte values (0..255), or backward distance> pairs
where the length is drawn from (3..258) and the distance
drawn from (1..32,768). In fact, the literal and
alphabets are merged into a single alphabet (0..285),
values 0..255 represent literal bytes, the value 256
end-of-block, and values 257..285 represent length
(possibly in conjunction with extra bits following the
code) as follows
Deutsch Informational [Page 11]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
Extra Extra
Code Bits Length(s) Code Bits Lengths Code Bits Length(s
---- ---- ------ ---- ---- ------- ---- ---- -------
257 0 3 267 1 15,16 277 4 67-82
258 0 4 268 1 17,18 278 4 83-98
259 0 5 269 2 19-22 279 4 99-114
260 0 6 270 2 23-26 280 4 115-130
261 0 7 271 2 27-30 281 5 131-162
262 0 8 272 2 31-34 282 5 163-194
263 0 9 273 3 35-42 283 5 195-226
264 0 10 274 3 43-50 284 5 227-257
265 1 11,12 275 3 51-58 285 0 258
266 1 13,14 276 3 59-66
The extra bits should be interpreted as a machine
stored with the most-significant bit first, e.g., bits 1110
represent the value 14.
Extra Extra
Code Bits Dist Code Bits Dist Code Bits
---- ---- ---- ---- ---- ------ ---- ---- --------
0 0 1 10 4 33-48 20 9 1025-1536
1 0 2 11 4 49-64 21 9 1537-2048
2 0 3 12 5 65-96 22 10 2049-3072
3 0 4 13 5 97-128 23 10 3073-4096
4 1 5,6 14 6 129-192 24 11 4097-6144
5 1 7,8 15 6 193-256 25 11 6145-8192
6 2 9-12 16 7 257-384 26 12 8193-12288
7 2 13-16 17 7 385-512 27 12 12289-16384
8 3 17-24 18 8 513-768 28 13 16385-24576
9 3 25-32 19 8 769-1024 29 13 24577-32768
3.2.6. Compression with fixed Huffman codes (BTYPE=01)
The Huffman codes for the two alphabets are fixed, and are
represented explicitly in the data. The Huffman code
for the literal/length alphabet are
Lit Value Bits
--------- ---- -----
0 - 143 8 00110000
10111111
144 - 255 9 110010000
111111111
256 - 279 7 0000000
0010111
280 - 287 8 11000000
11000111
Deutsch Informational [Page 12]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
The code lengths are sufficient to generate the actual codes
as described above; we show the codes in the table for
clarity. Literal/length values 286-287 will never
occur in the compressed data, but participate in the
construction
Distance codes 0-31 are represented by (fixed-length) 5-
codes, with possible additional bits as shown in the
shown in Paragraph 3.2.5, above. Note that distance codes 30-
31 will never actually occur in the compressed data
3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
The Huffman codes for the two alphabets appear in the
immediately after the header bits and before the
compressed data, first the literal/length code and then
distance code. Each code is defined by a sequence of
lengths, as discussed in Paragraph 3.2.2, above. For
greater compactness, the code length sequences themselves
compressed using a Huffman code. The alphabet for code
is as follows
0 - 15: Represent code lengths of 0 - 15
16: Copy the previous code length 3 - 6 times
The next 2 bits indicate repeat
(0 = 3, ... , 3 = 6)
Example: Codes 8, 16 (+2 bits 11),
16 (+2 bits 10) will expand
12 code lengths of 8 (1 + 6 + 5)
17: Repeat a code length of 0 for 3 - 10 times
(3 bits of length
18: Repeat a code length of 0 for 11 - 138
(7 bits of length
A code length of 0 indicates that the corresponding symbol
the literal/length or distance alphabet will not occur in
block, and should not participate in the Huffman
construction algorithm given earlier. If only one
code is used, it is encoded using one bit, not zero bits;
this case there is a single code length of one, with one
code. One distance code of zero bits means that there are
distance codes used at all (the data is all literals).
We can now define the format of the block
5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
5 Bits: HDIST, # of Distance codes - 1 (1 - 32)
4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)
Deutsch Informational [Page 13]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
(HCLEN + 4) x 3 bits: code lengths for the code
alphabet given just above, in the order: 16, 17, 18,
0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
These code lengths are interpreted as 3-bit
(0-7); as above, a code length of 0 means
corresponding symbol (literal/length or distance
length) is not used
HLIT + 257 code lengths for the literal/length alphabet
encoded using the code length Huffman
HDIST + 1 code lengths for the distance alphabet
encoded using the code length Huffman
The actual compressed data of the block
encoded using the literal/length and distance
The literal/length symbol 256 (end of data),
encoded using the literal/length Huffman
The code length repeat codes can cross from HLIT + 257 to
HDIST + 1 code lengths. In other words, all code lengths
a single sequence of HLIT + HDIST + 258 values
3.3.
A compressor may limit further the ranges of values specified
the previous section and still be compliant; for example, it
limit the range of backward pointers to some value smaller
32K. Similarly, a compressor may limit the size of blocks so
a compressible block fits in memory
A compliant decompressor must accept the full range of
values defined in the previous section, and must accept blocks
arbitrary size
4. Compression algorithm
While it is the intent of this document to define the "deflate
compressed data format without reference to any
compression algorithm, the format is related to the
formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below);
since many variations of LZ77 are patented, it is
recommended that the implementor of a compressor follow the
algorithm presented here, which is known not to be patented per se
The material in this section is not part of the definition of
Deutsch Informational [Page 14]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
specification per se, and a compressor need not follow it in order
be compliant
The compressor terminates a block when it determines that starting
new block with fresh trees would be useful, or when the block
fills up the compressor's block buffer
The compressor uses a chained hash table to find duplicated strings
using a hash function that operates on 3-byte sequences. At
given point during compression, let XYZ be the next 3 input bytes
be examined (not necessarily all different, of course). First,
compressor examines the hash chain for XYZ. If the chain is empty
the compressor simply writes out X as a literal byte and advances
byte in the input. If the hash chain is not empty, indicating
the sequence XYZ (or, if we are unlucky, some other 3 bytes with
same hash function value) has occurred recently, the
compares all strings on the XYZ hash chain with the actual input
sequence starting at the current point, and selects the
match
The compressor searches the hash chains starting with the most
strings, to favor small distances and thus take advantage of
Huffman encoding. The hash chains are singly linked. There are
deletions from the hash chains; the algorithm simply discards
that are too old. To avoid a worst-case situation, very long
chains are arbitrarily truncated at a certain length, determined by
run-time parameter
To improve overall compression, the compressor optionally defers
selection of matches ("lazy matching"): after a match of length N
been found, the compressor searches for a longer match starting
the next input byte. If it finds a longer match, it truncates
previous match to a length of one (thus producing a single
byte) and then emits the longer match. Otherwise, it emits
original match, and, as described above, advances N bytes
continuing
Run-time parameters also control this "lazy match" procedure.
compression ratio is most important, the compressor attempts
complete second search regardless of the length of the first match
In the normal case, if the current match is "long enough",
compressor reduces the search for a longer match, thus speeding
the process. If speed is most important, the compressor inserts
strings in the hash table only when no match was found, or when
match is not "too long". This degrades the compression ratio
saves time since there are both fewer insertions and fewer searches
Deutsch Informational [Page 15]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
5.
[1] Huffman, D. A., "A Method for the Construction of
Redundancy Codes", Proceedings of the Institute of
Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
[2] Ziv J., Lempel A., "A Universal Algorithm for Sequential
Compression", IEEE Transactions on Information Theory, Vol. 23,
No. 3, pp. 337-343.
[3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources
available in ftp://ftp.uu.net/pub/archiving/zip/doc
[4] Gailly, J.-L., and Adler, M., GZIP documentation and sources
available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu
[5] Schwartz, E. S., and Kallick, B. "Generating a canonical
encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.
[6] Hirschberg and Lelewer, "Efficient decoding of prefix codes,"
Comm. ACM, 33,4, April 1990, pp. 449-459.
6. Security
Any data compression method involves the reduction of redundancy
the data. Consequently, any corruption of the data is likely to
severe effects and be difficult to correct. Uncompressed text,
the other hand, will probably still be readable despite the
of some corrupted bytes
It is recommended that systems using this data format provide
means of validating the integrity of the compressed data.
reference [3], for example
7. Source
Source code for a C language implementation of a "deflate"
compressor and decompressor is available within the zlib package
ftp://ftp.uu.net/pub/archiving/zip/zlib/.
8.
Trademarks cited in this document are the property of
respective owners
Phil Katz designed the deflate format. Jean-Loup Gailly and
Adler wrote the related software described in this specification
Glenn Randers-Pehrson converted this document to RFC and HTML format
Deutsch Informational [Page 16]
RFC 1951 DEFLATE Compressed Data Format Specification May 1996
9. Author's
L. Peter
Aladdin
203 Santa Margarita Ave
Menlo Park, CA 94025
Phone: (415) 322-0103 (AM only
FAX: (415) 322-1734
EMail:
Questions about the technical content of this specification can
sent by email to
Jean-Loup Gailly
Mark Adler
Editorial comments on this specification can be sent by email to
L. Peter Deutsch
Glenn Randers-Pehrson
Deutsch Informational [Page 17]
if you see any problems within the linking, don't worry be happy,
this is version 0.1 of the Relevance System and you gotta expect some crappy subroutines sometimes,
just be content we did not write this in Java, which would have made this "bigger and better" HAHAHHA.
RFC documents can be found at I.E.T.F.
Relevance System Copyright © 2002 Spectrum WorldResearch
other technical nosh by ServerMasters Corporation
collaboration of BobX