As per Relevance of the word represent, we have this rfc below:











Network Working Group P.
Request for Comments: 1951 Aladdin
Category: Informational May 1996


DEFLATE Compressed Data Format Specification version 1.3

Status of This

This memo provides information for the Internet community. This
does not specify an Internet standard of any kind. Distribution
this memo is unlimited

IESG Note

The IESG takes no position on the validity of any
Property Rights statements contained in this document



Copyright (c) 1996 L. Peter

Permission is granted to copy and distribute this document for
purpose and without charge, including translations into
languages and incorporation into compilations, provided that
copyright notice and this notice are preserved, and that
substantive changes or deletions from the original are
marked

A pointer to the latest version of this and related documentation
HTML format can be found at the
graphics/png/documents/zlib/zdoc-index.html>.



This specification defines a lossless compressed data format
compresses data using a combination of the LZ77 algorithm and
coding, with efficiency comparable to the best currently
general-purpose compression methods. The data can be produced
consumed, even for an arbitrarily long sequentially presented
data stream, using only an a priori bounded amount of
storage. The format can be implemented readily in a manner
covered by patents








Deutsch Informational [Page 1]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


Table of

1. Introduction ................................................... 2
1.1. Purpose ................................................... 2
1.2. Intended audience ......................................... 3
1.3. Scope ..................................................... 3
1.4. Compliance ................................................ 3
1.5. Definitions of terms and conventions used ................ 3
1.6. Changes from previous versions ............................ 4
2. Compressed representation overview ............................. 4
3. Detailed specification ......................................... 5
3.1. Overall conventions ....................................... 5
3.1.1. Packing into bytes .................................. 5
3.2. Compressed block format ................................... 6
3.2.1. Synopsis of prefix and Huffman coding ............... 6
3.2.2. Use of Huffman coding in the "deflate" format ....... 7
3.2.3. Details of block format ............................. 9
3.2.4. Non-compressed blocks (BTYPE=00) ................... 11
3.2.5. Compressed blocks (length and distance codes) ...... 11
3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12
3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13
3.3. Compliance ............................................... 14
4. Compression algorithm details ................................. 14
5. References .................................................... 16
6. Security Considerations ....................................... 16
7. Source code ................................................... 16
8. Acknowledgements .............................................. 16
9. Author's Address .............................................. 17

1.

1.1.

The purpose of this specification is to define a
compressed data format that
* Is independent of CPU type, operating system, file system
and character set, and hence can be used for interchange
* Can be produced or consumed, even for an arbitrarily
sequentially presented input data stream, using only an
priori bounded amount of intermediate storage, and
can be used in data communications or similar
such as Unix filters
* Compresses data with efficiency comparable to the
currently available general-purpose compression methods
and in particular considerably better than the "compress
program
* Can be implemented readily in a manner not covered
patents, and hence can be practiced freely



Deutsch Informational [Page 2]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


* Is compatible with the file format produced by the
widely used gzip utility, in that conforming
will be able to read data produced by the existing
compressor

The data format defined by this specification does not attempt to

* Allow random access to compressed data
* Compress specialized data (e.g., raster graphics) as
as the best currently available specialized algorithms

A simple counting argument shows that no lossless
algorithm can compress every possible input data set. For
format defined here, the worst case expansion is 5 bytes per 32K
byte block, i.e., a size increase of 0.015% for large data sets
English text usually compresses by a factor of 2.5 to 3;
executable files usually compress somewhat less; graphical
such as raster images may compress much more

1.2. Intended

This specification is intended for use by implementors of
to compress data into "deflate" format and/or decompress data
"deflate" format

The text of the specification assumes a basic background
programming at the level of bits and other primitive
representations. Familiarity with the technique of Huffman
is helpful but not required

1.3.

The specification specifies a method for representing a
of bytes as a (usually shorter) sequence of bits, and a method
packing the latter bit sequence into bytes

1.4.

Unless otherwise indicated below, a compliant decompressor must
able to accept and decompress any data set that conforms to
the specifications presented here; a compliant compressor
produce data sets that conform to all the specifications
here

1.5. Definitions of terms and conventions

Byte: 8 bits stored or transmitted as a unit (same as an octet).
For this specification, a byte is exactly 8 bits, even on



Deutsch Informational [Page 3]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


which store a character on a number of bits different from eight
See below, for the numbering of bits within a byte

String: a sequence of arbitrary bytes

1.6. Changes from previous

There have been no technical changes to the deflate format
version 1.1 of this specification. In version 1.2,
terminology was changed. Version 1.3 is a conversion of
specification to RFC style

2. Compressed representation

A compressed data set consists of a series of blocks,
to successive blocks of input data. The block sizes are arbitrary
except that non-compressible blocks are limited to 65,535 bytes

Each block is compressed using a combination of the LZ77
and Huffman coding. The Huffman trees for each block are
of those for previous or subsequent blocks; the LZ77 algorithm
use a reference to a duplicated string occurring in a previous block
up to 32K input bytes before

Each block consists of two parts: a pair of Huffman code trees
describe the representation of the compressed data part, and
compressed data part. (The Huffman trees themselves are
using Huffman encoding.) The compressed data consists of a series
elements of two types: literal bytes (of strings that have not
detected as duplicated within the previous 32K input bytes),
pointers to duplicated strings, where a pointer is represented as
pair backward distance>. The representation used in
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except
uncompressible blocks, which are limited as noted above

Each type of value (literals, distances, and lengths) in
compressed data is represented using a Huffman code, using one
tree for literals and lengths and a separate code tree for distances
The code trees for each block appear in a compact form just
the compressed data for that block










Deutsch Informational [Page 4]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


3. Detailed

3.1. Overall conventions In the diagrams below, a box like this

+---+
| | <-- the vertical bars might be
+---+

represents one byte; a box like this

+==============+
| |
+==============+

represents a variable number of bytes

Bytes stored within a computer do not have a "bit order",
they are always treated as a unit. However, a byte considered
an integer between 0 and 255 does have a most- and least
significant bit, and since we write numbers with the most
significant digit on the left, we also write bytes with the most
significant bit on the left. In the diagrams below, we number
bits of a byte so that bit 0 is the least-significant bit, i.e.,
the bits are numbered

+--------+
|76543210|
+--------+

Within a computer, a number may occupy multiple bytes.
multi-byte numbers in the format described here are stored
the least-significant byte first (at the lower memory address).
For example, the decimal number 520 is stored as

0 1
+--------+--------+
|00001000|00000010|
+--------+--------+
^ ^
| |
| + more significant byte = 2 x 256
+ less significant byte = 8

3.1.1. Packing into

This document does not address the issue of the order in
bits of a byte are transmitted on a bit-sequential medium
since the final data format described here is byte- rather



Deutsch Informational [Page 5]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


bit-oriented. However, we describe the compressed block
in below, as a sequence of data elements of various
lengths, not a sequence of bytes. We must therefore
how to pack these data elements into bytes to form the
compressed byte sequence

* Data elements are packed into bytes in order
increasing bit number within the byte, i.e.,
with the least-significant bit of the byte
* Data elements other than Huffman codes are
starting with the least-significant bit of the
element
* Huffman codes are packed starting with the most
significant bit of the code

In other words, if one were to print out the compressed data
a sequence of bytes, starting with the first byte at
*right* margin and proceeding to the *left*, with the most
significant bit of each byte on the left as usual, one would
able to parse the result from right to left, with fixed-
elements in the correct MSB-to-LSB order and Huffman codes
bit-reversed order (i.e., with the first bit of the code in
relative LSB position).

3.2. Compressed block

3.2.1. Synopsis of prefix and Huffman

Prefix coding represents symbols from an a priori
alphabet by bit sequences (codes), one code for each symbol,
a manner such that different symbols may be represented by
sequences of different lengths, but a parser can always
an encoded string unambiguously symbol-by-symbol

We define a prefix code in terms of a binary tree in which
two edges descending from each non-leaf node are labeled 0
1 and in which the leaf nodes correspond one-for-one with (
labeled with) the symbols of the alphabet; then the code for
symbol is the sequence of 0's and 1's on the edges leading
the root to the leaf labeled with that symbol. For example











Deutsch Informational [Page 6]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


/\ Symbol
0 1 ------ ----
/ \ A 00
/\ B B 1
0 1 C 011
/ \ D 010
A /\
0 1
/ \
D

A parser can decode the next symbol from an encoded
stream by walking down the tree from the root, at each
choosing the edge corresponding to the next input bit

Given an alphabet with known symbol frequencies, the
algorithm allows the construction of an optimal prefix
(one which represents strings with those symbol
using the fewest bits of any possible prefix codes for
alphabet). Such a code is called a Huffman code. (
reference [1] in Chapter 5, references for
information on Huffman codes.)

Note that in the "deflate" format, the Huffman codes for
various alphabets must not exceed certain maximum code lengths
This constraint complicates the algorithm for computing
lengths from symbol frequencies. Again, see Chapter 5,
references for details

3.2.2. Use of Huffman coding in the "deflate"

The Huffman codes used for each alphabet in the "deflate
format have two additional rules

* All codes of a given bit length have
consecutive values, in the same order as the
they represent

* Shorter codes lexicographically precede longer codes












Deutsch Informational [Page 7]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


We could recode the example above to follow this rule
follows, assuming that the order of the alphabet is ABCD

Symbol
------ ----
A 10
B 0
C 110
D 111

I.e., 0 precedes 10 which precedes 11x, and 110 and 111
lexicographically consecutive

Given this rule, we can define the Huffman code for an
just by giving the bit lengths of the codes for each symbol
the alphabet in order; this is sufficient to determine
actual codes. In our example, the code is completely
by the sequence of bit lengths (2, 1, 3, 3). The
algorithm generates the codes as integers, intended to be
from most- to least-significant bit. The code lengths
initially in tree[I].Len; the codes are produced
tree[I].Code

1) Count the number of codes for each code length.
bl_count[N] be the number of codes of length N, N >= 1.

2) Find the numerical value of the smallest code for
code length

code = 0;
bl_count[0] = 0;
for (bits = 1; bits <= MAX_BITS; bits++) {
code = (code + bl_count[bits-1]) << 1;
next_code[bits] = code
}

3) Assign numerical values to all codes, using
values for all codes of the same length with the
values determined at step 2. Codes that are never
(which have a bit length of zero) must not be assigned
value

for (n = 0; n <= max_code; n++) {
len = tree[n].Len
if (len != 0) {
tree[n].Code = next_code[len];
next_code[len]++;
}



Deutsch Informational [Page 8]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


}

Example

Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
3, 2, 4, 4). After step 1, we have

N bl_count[N
- -----------
2 1
3 5
4 2

Step 2 computes the following next_code values

N next_code[N
- ------------
1 0
2 0
3 2
4 14

Step 3 produces the following code values

Symbol Length
------ ------ ----
A 3 010
B 3 011
C 3 100
D 3 101
E 3 110
F 2 00
G 4 1110
H 4 1111

3.2.3. Details of block

Each block of compressed data begins with 3 header
containing the following data

first bit
next 2 bits

Note that the header bits do not necessarily begin on a
boundary, since a block does not necessarily occupy an
number of bytes





Deutsch Informational [Page 9]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


BFINAL is set if and only if this is the last block of the
set

BTYPE specifies how the data are compressed, as follows

00 - no
01 - compressed with fixed Huffman
10 - compressed with dynamic Huffman
11 - reserved (error

The only difference between the two compressed cases is how
Huffman codes for the literal/length and distance alphabets
defined

In all cases, the decoding algorithm for the actual data is
follows


read block header from input stream
if stored with no
skip any remaining bits in current
processed
read LEN and NLEN (see next section
copy LEN bytes of data to

if compressed with dynamic Huffman
read representation of code trees (
subsection below
loop (until end of block code recognized
decode literal/length value from input
if value < 256
copy value (literal byte) to output

if value = end of block (256)
break from
otherwise (value = 257..285)
decode distance from input

move backwards distance bytes in the
stream, and copy length bytes from
position to the output stream
end
while not last

Note that a duplicated string reference may refer to a
in a previous block; i.e., the backward distance may cross
or more block boundaries. However a distance cannot refer
the beginning of the output stream. (An application using



Deutsch Informational [Page 10]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


preset dictionary might discard part of the output stream;
distance can refer to that part of the output stream anyway
Note also that the referenced string may overlap the
position; for example, if the last 2 bytes decoded have
X and Y, a string reference with distance = 2>
adds X,Y,X,Y,X to the output stream

We now specify each compression method in turn

3.2.4. Non-compressed blocks (BTYPE=00)

Any bits of input up to the next byte boundary are ignored
The rest of the block consists of the following information

0 1 2 3 4...
+---+---+---+---+================================+
| LEN | NLEN |... LEN bytes of literal data...|
+---+---+---+---+================================+

LEN is the number of data bytes in the block. NLEN is
one's complement of LEN

3.2.5. Compressed blocks (length and distance codes

As noted above, encoded data blocks in the "deflate"
consist of sequences of symbols drawn from three
distinct alphabets: either literal bytes, from the alphabet
byte values (0..255), or backward distance> pairs
where the length is drawn from (3..258) and the distance
drawn from (1..32,768). In fact, the literal and
alphabets are merged into a single alphabet (0..285),
values 0..255 represent literal bytes, the value 256
end-of-block, and values 257..285 represent length
(possibly in conjunction with extra bits following the
code) as follows
















Deutsch Informational [Page 11]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


Extra Extra
Code Bits Length(s) Code Bits Lengths Code Bits Length(s
---- ---- ------ ---- ---- ------- ---- ---- -------
257 0 3 267 1 15,16 277 4 67-82
258 0 4 268 1 17,18 278 4 83-98
259 0 5 269 2 19-22 279 4 99-114
260 0 6 270 2 23-26 280 4 115-130
261 0 7 271 2 27-30 281 5 131-162
262 0 8 272 2 31-34 282 5 163-194
263 0 9 273 3 35-42 283 5 195-226
264 0 10 274 3 43-50 284 5 227-257
265 1 11,12 275 3 51-58 285 0 258
266 1 13,14 276 3 59-66

The extra bits should be interpreted as a machine
stored with the most-significant bit first, e.g., bits 1110
represent the value 14.

Extra Extra
Code Bits Dist Code Bits Dist Code Bits
---- ---- ---- ---- ---- ------ ---- ---- --------
0 0 1 10 4 33-48 20 9 1025-1536
1 0 2 11 4 49-64 21 9 1537-2048
2 0 3 12 5 65-96 22 10 2049-3072
3 0 4 13 5 97-128 23 10 3073-4096
4 1 5,6 14 6 129-192 24 11 4097-6144
5 1 7,8 15 6 193-256 25 11 6145-8192
6 2 9-12 16 7 257-384 26 12 8193-12288
7 2 13-16 17 7 385-512 27 12 12289-16384
8 3 17-24 18 8 513-768 28 13 16385-24576
9 3 25-32 19 8 769-1024 29 13 24577-32768

3.2.6. Compression with fixed Huffman codes (BTYPE=01)

The Huffman codes for the two alphabets are fixed, and are
represented explicitly in the data. The Huffman code
for the literal/length alphabet are

Lit Value Bits
--------- ---- -----
0 - 143 8 00110000
10111111
144 - 255 9 110010000
111111111
256 - 279 7 0000000
0010111
280 - 287 8 11000000
11000111



Deutsch Informational [Page 12]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


The code lengths are sufficient to generate the actual codes
as described above; we show the codes in the table for
clarity. Literal/length values 286-287 will never
occur in the compressed data, but participate in the
construction

Distance codes 0-31 are represented by (fixed-length) 5-
codes, with possible additional bits as shown in the
shown in Paragraph 3.2.5, above. Note that distance codes 30-
31 will never actually occur in the compressed data

3.2.7. Compression with dynamic Huffman codes (BTYPE=10)

The Huffman codes for the two alphabets appear in the
immediately after the header bits and before the
compressed data, first the literal/length code and then
distance code. Each code is defined by a sequence of
lengths, as discussed in Paragraph 3.2.2, above. For
greater compactness, the code length sequences themselves
compressed using a Huffman code. The alphabet for code
is as follows

0 - 15: Represent code lengths of 0 - 15
16: Copy the previous code length 3 - 6 times
The next 2 bits indicate repeat
(0 = 3, ... , 3 = 6)
Example: Codes 8, 16 (+2 bits 11),
16 (+2 bits 10) will expand
12 code lengths of 8 (1 + 6 + 5)
17: Repeat a code length of 0 for 3 - 10 times
(3 bits of length
18: Repeat a code length of 0 for 11 - 138
(7 bits of length

A code length of 0 indicates that the corresponding symbol
the literal/length or distance alphabet will not occur in
block, and should not participate in the Huffman
construction algorithm given earlier. If only one
code is used, it is encoded using one bit, not zero bits;
this case there is a single code length of one, with one
code. One distance code of zero bits means that there are
distance codes used at all (the data is all literals).

We can now define the format of the block

5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
5 Bits: HDIST, # of Distance codes - 1 (1 - 32)
4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)



Deutsch Informational [Page 13]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


(HCLEN + 4) x 3 bits: code lengths for the code
alphabet given just above, in the order: 16, 17, 18,
0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15

These code lengths are interpreted as 3-bit
(0-7); as above, a code length of 0 means
corresponding symbol (literal/length or distance
length) is not used

HLIT + 257 code lengths for the literal/length alphabet
encoded using the code length Huffman

HDIST + 1 code lengths for the distance alphabet
encoded using the code length Huffman

The actual compressed data of the block
encoded using the literal/length and distance


The literal/length symbol 256 (end of data),
encoded using the literal/length Huffman

The code length repeat codes can cross from HLIT + 257 to
HDIST + 1 code lengths. In other words, all code lengths
a single sequence of HLIT + HDIST + 258 values

3.3.

A compressor may limit further the ranges of values specified
the previous section and still be compliant; for example, it
limit the range of backward pointers to some value smaller
32K. Similarly, a compressor may limit the size of blocks so
a compressible block fits in memory

A compliant decompressor must accept the full range of
values defined in the previous section, and must accept blocks
arbitrary size

4. Compression algorithm

While it is the intent of this document to define the "deflate
compressed data format without reference to any
compression algorithm, the format is related to the
formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below);
since many variations of LZ77 are patented, it is
recommended that the implementor of a compressor follow the
algorithm presented here, which is known not to be patented per se
The material in this section is not part of the definition of



Deutsch Informational [Page 14]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


specification per se, and a compressor need not follow it in order
be compliant

The compressor terminates a block when it determines that starting
new block with fresh trees would be useful, or when the block
fills up the compressor's block buffer

The compressor uses a chained hash table to find duplicated strings
using a hash function that operates on 3-byte sequences. At
given point during compression, let XYZ be the next 3 input bytes
be examined (not necessarily all different, of course). First,
compressor examines the hash chain for XYZ. If the chain is empty
the compressor simply writes out X as a literal byte and advances
byte in the input. If the hash chain is not empty, indicating
the sequence XYZ (or, if we are unlucky, some other 3 bytes with
same hash function value) has occurred recently, the
compares all strings on the XYZ hash chain with the actual input
sequence starting at the current point, and selects the
match

The compressor searches the hash chains starting with the most
strings, to favor small distances and thus take advantage of
Huffman encoding. The hash chains are singly linked. There are
deletions from the hash chains; the algorithm simply discards
that are too old. To avoid a worst-case situation, very long
chains are arbitrarily truncated at a certain length, determined by
run-time parameter

To improve overall compression, the compressor optionally defers
selection of matches ("lazy matching"): after a match of length N
been found, the compressor searches for a longer match starting
the next input byte. If it finds a longer match, it truncates
previous match to a length of one (thus producing a single
byte) and then emits the longer match. Otherwise, it emits
original match, and, as described above, advances N bytes
continuing

Run-time parameters also control this "lazy match" procedure.
compression ratio is most important, the compressor attempts
complete second search regardless of the length of the first match
In the normal case, if the current match is "long enough",
compressor reduces the search for a longer match, thus speeding
the process. If speed is most important, the compressor inserts
strings in the hash table only when no match was found, or when
match is not "too long". This degrades the compression ratio
saves time since there are both fewer insertions and fewer searches





Deutsch Informational [Page 15]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


5.

[1] Huffman, D. A., "A Method for the Construction of
Redundancy Codes", Proceedings of the Institute of
Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.

[2] Ziv J., Lempel A., "A Universal Algorithm for Sequential
Compression", IEEE Transactions on Information Theory, Vol. 23,
No. 3, pp. 337-343.

[3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources
available in ftp://ftp.uu.net/pub/archiving/zip/doc

[4] Gailly, J.-L., and Adler, M., GZIP documentation and sources
available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu

[5] Schwartz, E. S., and Kallick, B. "Generating a canonical
encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.

[6] Hirschberg and Lelewer, "Efficient decoding of prefix codes,"
Comm. ACM, 33,4, April 1990, pp. 449-459.

6. Security

Any data compression method involves the reduction of redundancy
the data. Consequently, any corruption of the data is likely to
severe effects and be difficult to correct. Uncompressed text,
the other hand, will probably still be readable despite the
of some corrupted bytes

It is recommended that systems using this data format provide
means of validating the integrity of the compressed data.
reference [3], for example

7. Source

Source code for a C language implementation of a "deflate"
compressor and decompressor is available within the zlib package
ftp://ftp.uu.net/pub/archiving/zip/zlib/.

8.

Trademarks cited in this document are the property of
respective owners

Phil Katz designed the deflate format. Jean-Loup Gailly and
Adler wrote the related software described in this specification
Glenn Randers-Pehrson converted this document to RFC and HTML format



Deutsch Informational [Page 16]

RFC 1951 DEFLATE Compressed Data Format Specification May 1996


9. Author's

L. Peter
Aladdin
203 Santa Margarita Ave
Menlo Park, CA 94025

Phone: (415) 322-0103 (AM only
FAX: (415) 322-1734
EMail:
Questions about the technical content of this specification can
sent by email to

Jean-Loup Gailly
Mark Adler
Editorial comments on this specification can be sent by email to

L. Peter Deutsch
Glenn Randers-Pehrson





























Deutsch Informational [Page 17]








if you see any problems within the linking, don't worry be happy,
this is version 0.1 of the Relevance System and you gotta expect some crappy subroutines sometimes,
just be content we did not write this in Java, which would have made this "bigger and better" HAHAHHA.




RFC documents can be found at I.E.T.F.



Relevance System Copyright © 2002 Spectrum WorldResearch
other technical nosh by ServerMasters Corporation
collaboration of BobX







Spectrum