diff options
Diffstat (limited to 'zlib/doc')
| -rw-r--r-- | zlib/doc/algorithm.txt | 209 | ||||
| -rw-r--r-- | zlib/doc/rfc1950.txt | 619 | ||||
| -rw-r--r-- | zlib/doc/rfc1951.txt | 955 | ||||
| -rw-r--r-- | zlib/doc/rfc1952.txt | 675 | ||||
| -rw-r--r-- | zlib/doc/txtvsbin.txt | 107 | 
5 files changed, 2565 insertions, 0 deletions
| diff --git a/zlib/doc/algorithm.txt b/zlib/doc/algorithm.txt new file mode 100644 index 000000000..c97f49502 --- /dev/null +++ b/zlib/doc/algorithm.txt @@ -0,0 +1,209 @@ +1. Compression algorithm (deflate) + +The deflation algorithm used by gzip (also zip and zlib) is a variation of +LZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in +the input data.  The second occurrence of a string is replaced by a +pointer to the previous string, in the form of a pair (distance, +length).  Distances are limited to 32K bytes, and lengths are limited +to 258 bytes. When a string does not occur anywhere in the previous +32K bytes, it is emitted as a sequence of literal bytes.  (In this +description, `string' must be taken as an arbitrary sequence of bytes, +and is not restricted to printable characters.) + +Literals or match lengths are compressed with one Huffman tree, and +match distances are compressed with another tree. The trees are stored +in a compact form at the start of each block. The blocks can have any +size (except that the compressed data for one block must fit in +available memory). A block is terminated when deflate() determines that +it would be useful to start another block with fresh trees. (This is +somewhat similar to the behavior of LZW-based _compress_.) + +Duplicated strings are found using a hash table. All input strings of +length 3 are inserted in the hash table. A hash index is computed for +the next 3 bytes. If the hash chain for this index is not empty, all +strings in the chain are compared with the current input string, and +the longest match is selected. + +The hash chains are searched starting with the most recent strings, to +favor small distances and thus take advantage of the Huffman encoding. +The hash chains are singly linked. There are no deletions from the +hash chains, the algorithm simply discards matches that are too old. + +To avoid a worst-case situation, very long hash chains are arbitrarily +truncated at a certain length, determined by a runtime option (level +parameter of deflateInit). So deflate() does not always find the longest +possible match but generally finds a match which is long enough. + +deflate() also defers the selection of matches with a lazy evaluation +mechanism. After a match of length N has been found, deflate() searches for +a longer match at the next input byte. If a longer match is found, the +previous match is truncated to a length of one (thus producing a single +literal byte) and the process of lazy evaluation begins again. Otherwise, +the original match is kept, and the next match search is attempted only N +steps later. + +The lazy match evaluation is also subject to a runtime parameter. If +the current match is long enough, deflate() reduces the search for a longer +match, thus speeding up the whole process. If compression ratio is more +important than speed, deflate() attempts a complete second search even if +the first match is already long enough. + +The lazy match evaluation is not performed for the fastest compression +modes (level parameter 1 to 3). For these fast modes, new strings +are inserted in the hash table only when no match was found, or +when the match is not too long. This degrades the compression ratio +but saves time since there are both fewer insertions and fewer searches. + + +2. Decompression algorithm (inflate) + +2.1 Introduction + +The key question is how to represent a Huffman code (or any prefix code) so +that you can decode fast.  The most important characteristic is that shorter +codes are much more common than longer codes, so pay attention to decoding the +short codes fast, and let the long codes take longer to decode. + +inflate() sets up a first level table that covers some number of bits of +input less than the length of longest code.  It gets that many bits from the +stream, and looks it up in the table.  The table will tell if the next +code is that many bits or less and how many, and if it is, it will tell +the value, else it will point to the next level table for which inflate() +grabs more bits and tries to decode a longer code. + +How many bits to make the first lookup is a tradeoff between the time it +takes to decode and the time it takes to build the table.  If building the +table took no time (and if you had infinite memory), then there would only +be a first level table to cover all the way to the longest code.  However, +building the table ends up taking a lot longer for more bits since short +codes are replicated many times in such a table.  What inflate() does is +simply to make the number of bits in the first table a variable, and  then +to set that variable for the maximum speed. + +For inflate, which has 286 possible codes for the literal/length tree, the size +of the first table is nine bits.  Also the distance trees have 30 possible +values, and the size of the first table is six bits.  Note that for each of +those cases, the table ended up one bit longer than the ``average'' code +length, i.e. the code length of an approximately flat code which would be a +little more than eight bits for 286 symbols and a little less than five bits +for 30 symbols. + + +2.2 More details on the inflate table lookup + +Ok, you want to know what this cleverly obfuscated inflate tree actually +looks like.  You are correct that it's not a Huffman tree.  It is simply a +lookup table for the first, let's say, nine bits of a Huffman symbol.  The +symbol could be as short as one bit or as long as 15 bits.  If a particular +symbol is shorter than nine bits, then that symbol's translation is duplicated +in all those entries that start with that symbol's bits.  For example, if the +symbol is four bits, then it's duplicated 32 times in a nine-bit table.  If a +symbol is nine bits long, it appears in the table once. + +If the symbol is longer than nine bits, then that entry in the table points +to another similar table for the remaining bits.  Again, there are duplicated +entries as needed.  The idea is that most of the time the symbol will be short +and there will only be one table look up.  (That's whole idea behind data +compression in the first place.)  For the less frequent long symbols, there +will be two lookups.  If you had a compression method with really long +symbols, you could have as many levels of lookups as is efficient.  For +inflate, two is enough. + +So a table entry either points to another table (in which case nine bits in +the above example are gobbled), or it contains the translation for the symbol +and the number of bits to gobble.  Then you start again with the next +ungobbled bit. + +You may wonder: why not just have one lookup table for how ever many bits the +longest symbol is?  The reason is that if you do that, you end up spending +more time filling in duplicate symbol entries than you do actually decoding. +At least for deflate's output that generates new trees every several 10's of +kbytes.  You can imagine that filling in a 2^15 entry table for a 15-bit code +would take too long if you're only decoding several thousand symbols.  At the +other extreme, you could make a new table for every bit in the code.  In fact, +that's essentially a Huffman tree.  But then you spend too much time +traversing the tree while decoding, even for short symbols. + +So the number of bits for the first lookup table is a trade of the time to +fill out the table vs. the time spent looking at the second level and above of +the table. + +Here is an example, scaled down: + +The code being decoded, with 10 symbols, from 1 to 6 bits long: + +A: 0 +B: 10 +C: 1100 +D: 11010 +E: 11011 +F: 11100 +G: 11101 +H: 11110 +I: 111110 +J: 111111 + +Let's make the first table three bits long (eight entries): + +000: A,1 +001: A,1 +010: A,1 +011: A,1 +100: B,2 +101: B,2 +110: -> table X (gobble 3 bits) +111: -> table Y (gobble 3 bits) + +Each entry is what the bits decode as and how many bits that is, i.e. how +many bits to gobble.  Or the entry points to another table, with the number of +bits to gobble implicit in the size of the table. + +Table X is two bits long since the longest code starting with 110 is five bits +long: + +00: C,1 +01: C,1 +10: D,2 +11: E,2 + +Table Y is three bits long since the longest code starting with 111 is six +bits long: + +000: F,2 +001: F,2 +010: G,2 +011: G,2 +100: H,2 +101: H,2 +110: I,3 +111: J,3 + +So what we have here are three tables with a total of 20 entries that had to +be constructed.  That's compared to 64 entries for a single table.  Or +compared to 16 entries for a Huffman tree (six two entry tables and one four +entry table).  Assuming that the code ideally represents the probability of +the symbols, it takes on the average 1.25 lookups per symbol.  That's compared +to one lookup for the single table, or 1.66 lookups per symbol for the +Huffman tree. + +There, I think that gives you a picture of what's going on.  For inflate, the +meaning of a particular symbol is often more than just a letter.  It can be a +byte (a "literal"), or it can be either a length or a distance which +indicates a base value and a number of bits to fetch after the code that is +added to the base value.  Or it might be the special end-of-block code.  The +data structures created in inftrees.c try to encode all that information +compactly in the tables. + + +Jean-loup Gailly        Mark Adler +jloup@gzip.org          madler@alumni.caltech.edu + + +References: + +[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data +Compression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3, +pp. 337-343. + +``DEFLATE Compressed Data Format Specification'' available in +http://tools.ietf.org/html/rfc1951 diff --git a/zlib/doc/rfc1950.txt b/zlib/doc/rfc1950.txt new file mode 100644 index 000000000..ce6428a0f --- /dev/null +++ b/zlib/doc/rfc1950.txt @@ -0,0 +1,619 @@ + + + + + + +Network Working Group                                         P. Deutsch +Request for Comments: 1950                           Aladdin Enterprises +Category: Informational                                      J-L. Gailly +                                                                Info-ZIP +                                                                May 1996 + + +         ZLIB Compressed Data Format Specification version 3.3 + +Status of This Memo + +   This memo provides information for the Internet community.  This memo +   does not specify an Internet standard of any kind.  Distribution of +   this memo is unlimited. + +IESG Note: + +   The IESG takes no position on the validity of any Intellectual +   Property Rights statements contained in this document. + +Notices + +   Copyright (c) 1996 L. Peter Deutsch and Jean-Loup Gailly + +   Permission is granted to copy and distribute this document for any +   purpose and without charge, including translations into other +   languages and incorporation into compilations, provided that the +   copyright notice and this notice are preserved, and that any +   substantive changes or deletions from the original are clearly +   marked. + +   A pointer to the latest version of this and related documentation in +   HTML format can be found at the URL +   <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>. + +Abstract + +   This specification defines a lossless compressed data format.  The +   data can be produced or consumed, even for an arbitrarily long +   sequentially presented input data stream, using only an a priori +   bounded amount of intermediate storage.  The format presently uses +   the DEFLATE compression method but can be easily extended to use +   other compression methods.  It can be implemented readily in a manner +   not covered by patents.  This specification also defines the ADLER-32 +   checksum (an extension and improvement of the Fletcher checksum), +   used for detection of data corruption, and provides an algorithm for +   computing it. + + + + +Deutsch & Gailly             Informational                      [Page 1] + +RFC 1950       ZLIB Compressed Data Format Specification        May 1996 + + +Table of Contents + +   1. Introduction ................................................... 2 +      1.1. Purpose ................................................... 2 +      1.2. Intended audience ......................................... 3 +      1.3. Scope ..................................................... 3 +      1.4. Compliance ................................................ 3 +      1.5.  Definitions of terms and conventions used ................ 3 +      1.6. Changes from previous versions ............................ 3 +   2. Detailed specification ......................................... 3 +      2.1. Overall conventions ....................................... 3 +      2.2. Data format ............................................... 4 +      2.3. Compliance ................................................ 7 +   3. References ..................................................... 7 +   4. Source code .................................................... 8 +   5. Security Considerations ........................................ 8 +   6. Acknowledgements ............................................... 8 +   7. Authors' Addresses ............................................. 8 +   8. Appendix: Rationale ............................................ 9 +   9. Appendix: Sample code ..........................................10 + +1. Introduction + +   1.1. Purpose + +      The purpose of this specification is to define a lossless +      compressed data format that: + +          * Is independent of CPU type, operating system, file system, +            and character set, and hence can be used for interchange; + +          * Can be produced or consumed, even for an arbitrarily long +            sequentially presented input data stream, using only an a +            priori bounded amount of intermediate storage, and hence can +            be used in data communications or similar structures such as +            Unix filters; + +          * Can use a number of different compression methods; + +          * Can be implemented readily in a manner not covered by +            patents, and hence can be practiced freely. + +      The data format defined by this specification does not attempt to +      allow random access to compressed data. + + + + + + + +Deutsch & Gailly             Informational                      [Page 2] + +RFC 1950       ZLIB Compressed Data Format Specification        May 1996 + + +   1.2. Intended audience + +      This specification is intended for use by implementors of software +      to compress data into zlib format and/or decompress data from zlib +      format. + +      The text of the specification assumes a basic background in +      programming at the level of bits and other primitive data +      representations. + +   1.3. Scope + +      The specification specifies a compressed data format that can be +      used for in-memory compression of a sequence of arbitrary bytes. + +   1.4. Compliance + +      Unless otherwise indicated below, a compliant decompressor must be +      able to accept and decompress any data set that conforms to all +      the specifications presented here; a compliant compressor must +      produce data sets that conform to all the specifications presented +      here. + +   1.5.  Definitions of terms and conventions used + +      byte: 8 bits stored or transmitted as a unit (same as an octet). +      (For this specification, a byte is exactly 8 bits, even on +      machines which store a character on a number of bits different +      from 8.) See below, for the numbering of bits within a byte. + +   1.6. Changes from previous versions + +      Version 3.1 was the first public release of this specification. +      In version 3.2, some terminology was changed and the Adler-32 +      sample code was rewritten for clarity.  In version 3.3, the +      support for a preset dictionary was introduced, and the +      specification was converted to RFC style. + +2. Detailed specification + +   2.1. Overall conventions + +      In the diagrams below, a box like this: + +         +---+ +         |   | <-- the vertical bars might be missing +         +---+ + + + + +Deutsch & Gailly             Informational                      [Page 3] + +RFC 1950       ZLIB Compressed Data Format Specification        May 1996 + + +      represents one byte; a box like this: + +         +==============+ +         |              | +         +==============+ + +      represents a variable number of bytes. + +      Bytes stored within a computer do not have a "bit order", since +      they are always treated as a unit.  However, a byte considered as +      an integer between 0 and 255 does have a most- and least- +      significant bit, and since we write numbers with the most- +      significant digit on the left, we also write bytes with the most- +      significant bit on the left.  In the diagrams below, we number the +      bits of a byte so that bit 0 is the least-significant bit, i.e., +      the bits are numbered: + +         +--------+ +         |76543210| +         +--------+ + +      Within a computer, a number may occupy multiple bytes.  All +      multi-byte numbers in the format described here are stored with +      the MOST-significant byte first (at the lower memory address). +      For example, the decimal number 520 is stored as: + +             0     1 +         +--------+--------+ +         |00000010|00001000| +         +--------+--------+ +          ^        ^ +          |        | +          |        + less significant byte = 8 +          + more significant byte = 2 x 256 + +   2.2. Data format + +      A zlib stream has the following structure: + +           0   1 +         +---+---+ +         |CMF|FLG|   (more-->) +         +---+---+ + + + + + + + + +Deutsch & Gailly             Informational                      [Page 4] + +RFC 1950       ZLIB Compressed Data Format Specification        May 1996 + + +      (if FLG.FDICT set) + +           0   1   2   3 +         +---+---+---+---+ +         |     DICTID    |   (more-->) +         +---+---+---+---+ + +         +=====================+---+---+---+---+ +         |...compressed data...|    ADLER32    | +         +=====================+---+---+---+---+ + +      Any data which may appear after ADLER32 are not part of the zlib +      stream. + +      CMF (Compression Method and flags) +         This byte is divided into a 4-bit compression method and a 4- +         bit information field depending on the compression method. + +            bits 0 to 3  CM     Compression method +            bits 4 to 7  CINFO  Compression info + +      CM (Compression method) +         This identifies the compression method used in the file. CM = 8 +         denotes the "deflate" compression method with a window size up +         to 32K.  This is the method used by gzip and PNG (see +         references [1] and [2] in Chapter 3, below, for the reference +         documents).  CM = 15 is reserved.  It might be used in a future +         version of this specification to indicate the presence of an +         extra field before the compressed data. + +      CINFO (Compression info) +         For CM = 8, CINFO is the base-2 logarithm of the LZ77 window +         size, minus eight (CINFO=7 indicates a 32K window size). Values +         of CINFO above 7 are not allowed in this version of the +         specification.  CINFO is not defined in this specification for +         CM not equal to 8. + +      FLG (FLaGs) +         This flag byte is divided as follows: + +            bits 0 to 4  FCHECK  (check bits for CMF and FLG) +            bit  5       FDICT   (preset dictionary) +            bits 6 to 7  FLEVEL  (compression level) + +         The FCHECK value must be such that CMF and FLG, when viewed as +         a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG), +         is a multiple of 31. + + + + +Deutsch & Gailly             Informational                      [Page 5] + +RFC 1950       ZLIB Compressed Data Format Specification        May 1996 + + +      FDICT (Preset dictionary) +         If FDICT is set, a DICT dictionary identifier is present +         immediately after the FLG byte. The dictionary is a sequence of +         bytes which are initially fed to the compressor without +         producing any compressed output. DICT is the Adler-32 checksum +         of this sequence of bytes (see the definition of ADLER32 +         below).  The decompressor can use this identifier to determine +         which dictionary has been used by the compressor. + +      FLEVEL (Compression level) +         These flags are available for use by specific compression +         methods.  The "deflate" method (CM = 8) sets these flags as +         follows: + +            0 - compressor used fastest algorithm +            1 - compressor used fast algorithm +            2 - compressor used default algorithm +            3 - compressor used maximum compression, slowest algorithm + +         The information in FLEVEL is not needed for decompression; it +         is there to indicate if recompression might be worthwhile. + +      compressed data +         For compression method 8, the compressed data is stored in the +         deflate compressed data format as described in the document +         "DEFLATE Compressed Data Format Specification" by L. Peter +         Deutsch. (See reference [3] in Chapter 3, below) + +         Other compressed data formats are not specified in this version +         of the zlib specification. + +      ADLER32 (Adler-32 checksum) +         This contains a checksum value of the uncompressed data +         (excluding any dictionary data) computed according to Adler-32 +         algorithm. This algorithm is a 32-bit extension and improvement +         of the Fletcher algorithm, used in the ITU-T X.224 / ISO 8073 +         standard. See references [4] and [5] in Chapter 3, below) + +         Adler-32 is composed of two sums accumulated per byte: s1 is +         the sum of all bytes, s2 is the sum of all s1 values. Both sums +         are done modulo 65521. s1 is initialized to 1, s2 to zero.  The +         Adler-32 checksum is stored as s2*65536 + s1 in most- +         significant-byte first (network) order. + + + + + + + + +Deutsch & Gailly             Informational                      [Page 6] + +RFC 1950       ZLIB Compressed Data Format Specification        May 1996 + + +   2.3. Compliance + +      A compliant compressor must produce streams with correct CMF, FLG +      and ADLER32, but need not support preset dictionaries.  When the +      zlib data format is used as part of another standard data format, +      the compressor may use only preset dictionaries that are specified +      by this other data format.  If this other format does not use the +      preset dictionary feature, the compressor must not set the FDICT +      flag. + +      A compliant decompressor must check CMF, FLG, and ADLER32, and +      provide an error indication if any of these have incorrect values. +      A compliant decompressor must give an error indication if CM is +      not one of the values defined in this specification (only the +      value 8 is permitted in this version), since another value could +      indicate the presence of new features that would cause subsequent +      data to be interpreted incorrectly.  A compliant decompressor must +      give an error indication if FDICT is set and DICTID is not the +      identifier of a known preset dictionary.  A decompressor may +      ignore FLEVEL and still be compliant.  When the zlib data format +      is being used as a part of another standard format, a compliant +      decompressor must support all the preset dictionaries specified by +      the other format. When the other format does not use the preset +      dictionary feature, a compliant decompressor must reject any +      stream in which the FDICT flag is set. + +3. References + +   [1] Deutsch, L.P.,"GZIP Compressed Data Format Specification", +       available in ftp://ftp.uu.net/pub/archiving/zip/doc/ + +   [2] Thomas Boutell, "PNG (Portable Network Graphics) specification", +       available in ftp://ftp.uu.net/graphics/png/documents/ + +   [3] Deutsch, L.P.,"DEFLATE Compressed Data Format Specification", +       available in ftp://ftp.uu.net/pub/archiving/zip/doc/ + +   [4] Fletcher, J. G., "An Arithmetic Checksum for Serial +       Transmissions," IEEE Transactions on Communications, Vol. COM-30, +       No. 1, January 1982, pp. 247-252. + +   [5] ITU-T Recommendation X.224, Annex D, "Checksum Algorithms," +       November, 1993, pp. 144, 145. (Available from +       gopher://info.itu.ch). ITU-T X.244 is also the same as ISO 8073. + + + + + + + +Deutsch & Gailly             Informational                      [Page 7] + +RFC 1950       ZLIB Compressed Data Format Specification        May 1996 + + +4. Source code + +   Source code for a C language implementation of a "zlib" compliant +   library is available at ftp://ftp.uu.net/pub/archiving/zip/zlib/. + +5. Security Considerations + +   A decoder that fails to check the ADLER32 checksum value may be +   subject to undetected data corruption. + +6. Acknowledgements + +   Trademarks cited in this document are the property of their +   respective owners. + +   Jean-Loup Gailly and Mark Adler designed the zlib format and wrote +   the related software described in this specification.  Glenn +   Randers-Pehrson converted this document to RFC and HTML format. + +7. Authors' Addresses + +   L. Peter Deutsch +   Aladdin Enterprises +   203 Santa Margarita Ave. +   Menlo Park, CA 94025 + +   Phone: (415) 322-0103 (AM only) +   FAX:   (415) 322-1734 +   EMail: <ghost@aladdin.com> + + +   Jean-Loup Gailly + +   EMail: <gzip@prep.ai.mit.edu> + +   Questions about the technical content of this specification can be +   sent by email to + +   Jean-Loup Gailly <gzip@prep.ai.mit.edu> and +   Mark Adler <madler@alumni.caltech.edu> + +   Editorial comments on this specification can be sent by email to + +   L. Peter Deutsch <ghost@aladdin.com> and +   Glenn Randers-Pehrson <randeg@alumni.rpi.edu> + + + + + + +Deutsch & Gailly             Informational                      [Page 8] + +RFC 1950       ZLIB Compressed Data Format Specification        May 1996 + + +8. Appendix: Rationale + +   8.1. Preset dictionaries + +      A preset dictionary is specially useful to compress short input +      sequences. The compressor can take advantage of the dictionary +      context to encode the input in a more compact manner. The +      decompressor can be initialized with the appropriate context by +      virtually decompressing a compressed version of the dictionary +      without producing any output. However for certain compression +      algorithms such as the deflate algorithm this operation can be +      achieved without actually performing any decompression. + +      The compressor and the decompressor must use exactly the same +      dictionary. The dictionary may be fixed or may be chosen among a +      certain number of predefined dictionaries, according to the kind +      of input data. The decompressor can determine which dictionary has +      been chosen by the compressor by checking the dictionary +      identifier. This document does not specify the contents of +      predefined dictionaries, since the optimal dictionaries are +      application specific. Standard data formats using this feature of +      the zlib specification must precisely define the allowed +      dictionaries. + +   8.2. The Adler-32 algorithm + +      The Adler-32 algorithm is much faster than the CRC32 algorithm yet +      still provides an extremely low probability of undetected errors. + +      The modulo on unsigned long accumulators can be delayed for 5552 +      bytes, so the modulo operation time is negligible.  If the bytes +      are a, b, c, the second sum is 3a + 2b + c + 3, and so is position +      and order sensitive, unlike the first sum, which is just a +      checksum.  That 65521 is prime is important to avoid a possible +      large class of two-byte errors that leave the check unchanged. +      (The Fletcher checksum uses 255, which is not prime and which also +      makes the Fletcher check insensitive to single byte changes 0 <-> +      255.) + +      The sum s1 is initialized to 1 instead of zero to make the length +      of the sequence part of s2, so that the length does not have to be +      checked separately. (Any sequence of zeroes has a Fletcher +      checksum of zero.) + + + + + + + + +Deutsch & Gailly             Informational                      [Page 9] + +RFC 1950       ZLIB Compressed Data Format Specification        May 1996 + + +9. Appendix: Sample code + +   The following C code computes the Adler-32 checksum of a data buffer. +   It is written for clarity, not for speed.  The sample code is in the +   ANSI C programming language. Non C users may find it easier to read +   with these hints: + +      &      Bitwise AND operator. +      >>     Bitwise right shift operator. When applied to an +             unsigned quantity, as here, right shift inserts zero bit(s) +             at the left. +      <<     Bitwise left shift operator. Left shift inserts zero +             bit(s) at the right. +      ++     "n++" increments the variable n. +      %      modulo operator: a % b is the remainder of a divided by b. + +      #define BASE 65521 /* largest prime smaller than 65536 */ + +      /* +         Update a running Adler-32 checksum with the bytes buf[0..len-1] +       and return the updated checksum. The Adler-32 checksum should be +       initialized to 1. + +       Usage example: + +         unsigned long adler = 1L; + +         while (read_buffer(buffer, length) != EOF) { +           adler = update_adler32(adler, buffer, length); +         } +         if (adler != original_adler) error(); +      */ +      unsigned long update_adler32(unsigned long adler, +         unsigned char *buf, int len) +      { +        unsigned long s1 = adler & 0xffff; +        unsigned long s2 = (adler >> 16) & 0xffff; +        int n; + +        for (n = 0; n < len; n++) { +          s1 = (s1 + buf[n]) % BASE; +          s2 = (s2 + s1)     % BASE; +        } +        return (s2 << 16) + s1; +      } + +      /* Return the adler32 of the bytes buf[0..len-1] */ + + + + +Deutsch & Gailly             Informational                     [Page 10] + +RFC 1950       ZLIB Compressed Data Format Specification        May 1996 + + +      unsigned long adler32(unsigned char *buf, int len) +      { +        return update_adler32(1L, buf, len); +      } + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Deutsch & Gailly             Informational                     [Page 11] + diff --git a/zlib/doc/rfc1951.txt b/zlib/doc/rfc1951.txt new file mode 100644 index 000000000..403c8c722 --- /dev/null +++ b/zlib/doc/rfc1951.txt @@ -0,0 +1,955 @@ + + + + + + +Network Working Group                                         P. Deutsch +Request for Comments: 1951                           Aladdin Enterprises +Category: Informational                                         May 1996 + + +        DEFLATE Compressed Data Format Specification version 1.3 + +Status of This Memo + +   This memo provides information for the Internet community.  This memo +   does not specify an Internet standard of any kind.  Distribution of +   this memo is unlimited. + +IESG Note: + +   The IESG takes no position on the validity of any Intellectual +   Property Rights statements contained in this document. + +Notices + +   Copyright (c) 1996 L. Peter Deutsch + +   Permission is granted to copy and distribute this document for any +   purpose and without charge, including translations into other +   languages and incorporation into compilations, provided that the +   copyright notice and this notice are preserved, and that any +   substantive changes or deletions from the original are clearly +   marked. + +   A pointer to the latest version of this and related documentation in +   HTML format can be found at the URL +   <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>. + +Abstract + +   This specification defines a lossless compressed data format that +   compresses data using a combination of the LZ77 algorithm and Huffman +   coding, with efficiency comparable to the best currently available +   general-purpose compression methods.  The data can be produced or +   consumed, even for an arbitrarily long sequentially presented input +   data stream, using only an a priori bounded amount of intermediate +   storage.  The format can be implemented readily in a manner not +   covered by patents. + + + + + + + + +Deutsch                      Informational                      [Page 1] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +Table of Contents + +   1. Introduction ................................................... 2 +      1.1. Purpose ................................................... 2 +      1.2. Intended audience ......................................... 3 +      1.3. Scope ..................................................... 3 +      1.4. Compliance ................................................ 3 +      1.5.  Definitions of terms and conventions used ................ 3 +      1.6. Changes from previous versions ............................ 4 +   2. Compressed representation overview ............................. 4 +   3. Detailed specification ......................................... 5 +      3.1. Overall conventions ....................................... 5 +          3.1.1. Packing into bytes .................................. 5 +      3.2. Compressed block format ................................... 6 +          3.2.1. Synopsis of prefix and Huffman coding ............... 6 +          3.2.2. Use of Huffman coding in the "deflate" format ....... 7 +          3.2.3. Details of block format ............................. 9 +          3.2.4. Non-compressed blocks (BTYPE=00) ................... 11 +          3.2.5. Compressed blocks (length and distance codes) ...... 11 +          3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12 +          3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13 +      3.3. Compliance ............................................... 14 +   4. Compression algorithm details ................................. 14 +   5. References .................................................... 16 +   6. Security Considerations ....................................... 16 +   7. Source code ................................................... 16 +   8. Acknowledgements .............................................. 16 +   9. Author's Address .............................................. 17 + +1. Introduction + +   1.1. Purpose + +      The purpose of this specification is to define a lossless +      compressed data format that: +          * Is independent of CPU type, operating system, file system, +            and character set, and hence can be used for interchange; +          * Can be produced or consumed, even for an arbitrarily long +            sequentially presented input data stream, using only an a +            priori bounded amount of intermediate storage, and hence +            can be used in data communications or similar structures +            such as Unix filters; +          * Compresses data with efficiency comparable to the best +            currently available general-purpose compression methods, +            and in particular considerably better than the "compress" +            program; +          * Can be implemented readily in a manner not covered by +            patents, and hence can be practiced freely; + + + +Deutsch                      Informational                      [Page 2] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +          * Is compatible with the file format produced by the current +            widely used gzip utility, in that conforming decompressors +            will be able to read data produced by the existing gzip +            compressor. + +      The data format defined by this specification does not attempt to: + +          * Allow random access to compressed data; +          * Compress specialized data (e.g., raster graphics) as well +            as the best currently available specialized algorithms. + +      A simple counting argument shows that no lossless compression +      algorithm can compress every possible input data set.  For the +      format defined here, the worst case expansion is 5 bytes per 32K- +      byte block, i.e., a size increase of 0.015% for large data sets. +      English text usually compresses by a factor of 2.5 to 3; +      executable files usually compress somewhat less; graphical data +      such as raster images may compress much more. + +   1.2. Intended audience + +      This specification is intended for use by implementors of software +      to compress data into "deflate" format and/or decompress data from +      "deflate" format. + +      The text of the specification assumes a basic background in +      programming at the level of bits and other primitive data +      representations.  Familiarity with the technique of Huffman coding +      is helpful but not required. + +   1.3. Scope + +      The specification specifies a method for representing a sequence +      of bytes as a (usually shorter) sequence of bits, and a method for +      packing the latter bit sequence into bytes. + +   1.4. Compliance + +      Unless otherwise indicated below, a compliant decompressor must be +      able to accept and decompress any data set that conforms to all +      the specifications presented here; a compliant compressor must +      produce data sets that conform to all the specifications presented +      here. + +   1.5.  Definitions of terms and conventions used + +      Byte: 8 bits stored or transmitted as a unit (same as an octet). +      For this specification, a byte is exactly 8 bits, even on machines + + + +Deutsch                      Informational                      [Page 3] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +      which store a character on a number of bits different from eight. +      See below, for the numbering of bits within a byte. + +      String: a sequence of arbitrary bytes. + +   1.6. Changes from previous versions + +      There have been no technical changes to the deflate format since +      version 1.1 of this specification.  In version 1.2, some +      terminology was changed.  Version 1.3 is a conversion of the +      specification to RFC style. + +2. Compressed representation overview + +   A compressed data set consists of a series of blocks, corresponding +   to successive blocks of input data.  The block sizes are arbitrary, +   except that non-compressible blocks are limited to 65,535 bytes. + +   Each block is compressed using a combination of the LZ77 algorithm +   and Huffman coding. The Huffman trees for each block are independent +   of those for previous or subsequent blocks; the LZ77 algorithm may +   use a reference to a duplicated string occurring in a previous block, +   up to 32K input bytes before. + +   Each block consists of two parts: a pair of Huffman code trees that +   describe the representation of the compressed data part, and a +   compressed data part.  (The Huffman trees themselves are compressed +   using Huffman encoding.)  The compressed data consists of a series of +   elements of two types: literal bytes (of strings that have not been +   detected as duplicated within the previous 32K input bytes), and +   pointers to duplicated strings, where a pointer is represented as a +   pair <length, backward distance>.  The representation used in the +   "deflate" format limits distances to 32K bytes and lengths to 258 +   bytes, but does not limit the size of a block, except for +   uncompressible blocks, which are limited as noted above. + +   Each type of value (literals, distances, and lengths) in the +   compressed data is represented using a Huffman code, using one code +   tree for literals and lengths and a separate code tree for distances. +   The code trees for each block appear in a compact form just before +   the compressed data for that block. + + + + + + + + + + +Deutsch                      Informational                      [Page 4] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +3. Detailed specification + +   3.1. Overall conventions In the diagrams below, a box like this: + +         +---+ +         |   | <-- the vertical bars might be missing +         +---+ + +      represents one byte; a box like this: + +         +==============+ +         |              | +         +==============+ + +      represents a variable number of bytes. + +      Bytes stored within a computer do not have a "bit order", since +      they are always treated as a unit.  However, a byte considered as +      an integer between 0 and 255 does have a most- and least- +      significant bit, and since we write numbers with the most- +      significant digit on the left, we also write bytes with the most- +      significant bit on the left.  In the diagrams below, we number the +      bits of a byte so that bit 0 is the least-significant bit, i.e., +      the bits are numbered: + +         +--------+ +         |76543210| +         +--------+ + +      Within a computer, a number may occupy multiple bytes.  All +      multi-byte numbers in the format described here are stored with +      the least-significant byte first (at the lower memory address). +      For example, the decimal number 520 is stored as: + +             0        1 +         +--------+--------+ +         |00001000|00000010| +         +--------+--------+ +          ^        ^ +          |        | +          |        + more significant byte = 2 x 256 +          + less significant byte = 8 + +      3.1.1. Packing into bytes + +         This document does not address the issue of the order in which +         bits of a byte are transmitted on a bit-sequential medium, +         since the final data format described here is byte- rather than + + + +Deutsch                      Informational                      [Page 5] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +         bit-oriented.  However, we describe the compressed block format +         in below, as a sequence of data elements of various bit +         lengths, not a sequence of bytes.  We must therefore specify +         how to pack these data elements into bytes to form the final +         compressed byte sequence: + +             * Data elements are packed into bytes in order of +               increasing bit number within the byte, i.e., starting +               with the least-significant bit of the byte. +             * Data elements other than Huffman codes are packed +               starting with the least-significant bit of the data +               element. +             * Huffman codes are packed starting with the most- +               significant bit of the code. + +         In other words, if one were to print out the compressed data as +         a sequence of bytes, starting with the first byte at the +         *right* margin and proceeding to the *left*, with the most- +         significant bit of each byte on the left as usual, one would be +         able to parse the result from right to left, with fixed-width +         elements in the correct MSB-to-LSB order and Huffman codes in +         bit-reversed order (i.e., with the first bit of the code in the +         relative LSB position). + +   3.2. Compressed block format + +      3.2.1. Synopsis of prefix and Huffman coding + +         Prefix coding represents symbols from an a priori known +         alphabet by bit sequences (codes), one code for each symbol, in +         a manner such that different symbols may be represented by bit +         sequences of different lengths, but a parser can always parse +         an encoded string unambiguously symbol-by-symbol. + +         We define a prefix code in terms of a binary tree in which the +         two edges descending from each non-leaf node are labeled 0 and +         1 and in which the leaf nodes correspond one-for-one with (are +         labeled with) the symbols of the alphabet; then the code for a +         symbol is the sequence of 0's and 1's on the edges leading from +         the root to the leaf labeled with that symbol.  For example: + + + + + + + + + + + +Deutsch                      Informational                      [Page 6] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +                          /\              Symbol    Code +                         0  1             ------    ---- +                        /    \                A      00 +                       /\     B               B       1 +                      0  1                    C     011 +                     /    \                   D     010 +                    A     /\ +                         0  1 +                        /    \ +                       D      C + +         A parser can decode the next symbol from an encoded input +         stream by walking down the tree from the root, at each step +         choosing the edge corresponding to the next input bit. + +         Given an alphabet with known symbol frequencies, the Huffman +         algorithm allows the construction of an optimal prefix code +         (one which represents strings with those symbol frequencies +         using the fewest bits of any possible prefix codes for that +         alphabet).  Such a code is called a Huffman code.  (See +         reference [1] in Chapter 5, references for additional +         information on Huffman codes.) + +         Note that in the "deflate" format, the Huffman codes for the +         various alphabets must not exceed certain maximum code lengths. +         This constraint complicates the algorithm for computing code +         lengths from symbol frequencies.  Again, see Chapter 5, +         references for details. + +      3.2.2. Use of Huffman coding in the "deflate" format + +         The Huffman codes used for each alphabet in the "deflate" +         format have two additional rules: + +             * All codes of a given bit length have lexicographically +               consecutive values, in the same order as the symbols +               they represent; + +             * Shorter codes lexicographically precede longer codes. + + + + + + + + + + + + +Deutsch                      Informational                      [Page 7] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +         We could recode the example above to follow this rule as +         follows, assuming that the order of the alphabet is ABCD: + +            Symbol  Code +            ------  ---- +            A       10 +            B       0 +            C       110 +            D       111 + +         I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are +         lexicographically consecutive. + +         Given this rule, we can define the Huffman code for an alphabet +         just by giving the bit lengths of the codes for each symbol of +         the alphabet in order; this is sufficient to determine the +         actual codes.  In our example, the code is completely defined +         by the sequence of bit lengths (2, 1, 3, 3).  The following +         algorithm generates the codes as integers, intended to be read +         from most- to least-significant bit.  The code lengths are +         initially in tree[I].Len; the codes are produced in +         tree[I].Code. + +         1)  Count the number of codes for each code length.  Let +             bl_count[N] be the number of codes of length N, N >= 1. + +         2)  Find the numerical value of the smallest code for each +             code length: + +                code = 0; +                bl_count[0] = 0; +                for (bits = 1; bits <= MAX_BITS; bits++) { +                    code = (code + bl_count[bits-1]) << 1; +                    next_code[bits] = code; +                } + +         3)  Assign numerical values to all codes, using consecutive +             values for all codes of the same length with the base +             values determined at step 2. Codes that are never used +             (which have a bit length of zero) must not be assigned a +             value. + +                for (n = 0;  n <= max_code; n++) { +                    len = tree[n].Len; +                    if (len != 0) { +                        tree[n].Code = next_code[len]; +                        next_code[len]++; +                    } + + + +Deutsch                      Informational                      [Page 8] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +                } + +         Example: + +         Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3, +         3, 2, 4, 4).  After step 1, we have: + +            N      bl_count[N] +            -      ----------- +            2      1 +            3      5 +            4      2 + +         Step 2 computes the following next_code values: + +            N      next_code[N] +            -      ------------ +            1      0 +            2      0 +            3      2 +            4      14 + +         Step 3 produces the following code values: + +            Symbol Length   Code +            ------ ------   ---- +            A       3        010 +            B       3        011 +            C       3        100 +            D       3        101 +            E       3        110 +            F       2         00 +            G       4       1110 +            H       4       1111 + +      3.2.3. Details of block format + +         Each block of compressed data begins with 3 header bits +         containing the following data: + +            first bit       BFINAL +            next 2 bits     BTYPE + +         Note that the header bits do not necessarily begin on a byte +         boundary, since a block does not necessarily occupy an integral +         number of bytes. + + + + + +Deutsch                      Informational                      [Page 9] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +         BFINAL is set if and only if this is the last block of the data +         set. + +         BTYPE specifies how the data are compressed, as follows: + +            00 - no compression +            01 - compressed with fixed Huffman codes +            10 - compressed with dynamic Huffman codes +            11 - reserved (error) + +         The only difference between the two compressed cases is how the +         Huffman codes for the literal/length and distance alphabets are +         defined. + +         In all cases, the decoding algorithm for the actual data is as +         follows: + +            do +               read block header from input stream. +               if stored with no compression +                  skip any remaining bits in current partially +                     processed byte +                  read LEN and NLEN (see next section) +                  copy LEN bytes of data to output +               otherwise +                  if compressed with dynamic Huffman codes +                     read representation of code trees (see +                        subsection below) +                  loop (until end of block code recognized) +                     decode literal/length value from input stream +                     if value < 256 +                        copy value (literal byte) to output stream +                     otherwise +                        if value = end of block (256) +                           break from loop +                        otherwise (value = 257..285) +                           decode distance from input stream + +                           move backwards distance bytes in the output +                           stream, and copy length bytes from this +                           position to the output stream. +                  end loop +            while not last block + +         Note that a duplicated string reference may refer to a string +         in a previous block; i.e., the backward distance may cross one +         or more block boundaries.  However a distance cannot refer past +         the beginning of the output stream.  (An application using a + + + +Deutsch                      Informational                     [Page 10] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +         preset dictionary might discard part of the output stream; a +         distance can refer to that part of the output stream anyway) +         Note also that the referenced string may overlap the current +         position; for example, if the last 2 bytes decoded have values +         X and Y, a string reference with <length = 5, distance = 2> +         adds X,Y,X,Y,X to the output stream. + +         We now specify each compression method in turn. + +      3.2.4. Non-compressed blocks (BTYPE=00) + +         Any bits of input up to the next byte boundary are ignored. +         The rest of the block consists of the following information: + +              0   1   2   3   4... +            +---+---+---+---+================================+ +            |  LEN  | NLEN  |... LEN bytes of literal data...| +            +---+---+---+---+================================+ + +         LEN is the number of data bytes in the block.  NLEN is the +         one's complement of LEN. + +      3.2.5. Compressed blocks (length and distance codes) + +         As noted above, encoded data blocks in the "deflate" format +         consist of sequences of symbols drawn from three conceptually +         distinct alphabets: either literal bytes, from the alphabet of +         byte values (0..255), or <length, backward distance> pairs, +         where the length is drawn from (3..258) and the distance is +         drawn from (1..32,768).  In fact, the literal and length +         alphabets are merged into a single alphabet (0..285), where +         values 0..255 represent literal bytes, the value 256 indicates +         end-of-block, and values 257..285 represent length codes +         (possibly in conjunction with extra bits following the symbol +         code) as follows: + + + + + + + + + + + + + + + + +Deutsch                      Informational                     [Page 11] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +                 Extra               Extra               Extra +            Code Bits Length(s) Code Bits Lengths   Code Bits Length(s) +            ---- ---- ------     ---- ---- -------   ---- ---- ------- +             257   0     3       267   1   15,16     277   4   67-82 +             258   0     4       268   1   17,18     278   4   83-98 +             259   0     5       269   2   19-22     279   4   99-114 +             260   0     6       270   2   23-26     280   4  115-130 +             261   0     7       271   2   27-30     281   5  131-162 +             262   0     8       272   2   31-34     282   5  163-194 +             263   0     9       273   3   35-42     283   5  195-226 +             264   0    10       274   3   43-50     284   5  227-257 +             265   1  11,12      275   3   51-58     285   0    258 +             266   1  13,14      276   3   59-66 + +         The extra bits should be interpreted as a machine integer +         stored with the most-significant bit first, e.g., bits 1110 +         represent the value 14. + +                  Extra           Extra               Extra +             Code Bits Dist  Code Bits   Dist     Code Bits Distance +             ---- ---- ----  ---- ----  ------    ---- ---- -------- +               0   0    1     10   4     33-48    20    9   1025-1536 +               1   0    2     11   4     49-64    21    9   1537-2048 +               2   0    3     12   5     65-96    22   10   2049-3072 +               3   0    4     13   5     97-128   23   10   3073-4096 +               4   1   5,6    14   6    129-192   24   11   4097-6144 +               5   1   7,8    15   6    193-256   25   11   6145-8192 +               6   2   9-12   16   7    257-384   26   12  8193-12288 +               7   2  13-16   17   7    385-512   27   12 12289-16384 +               8   3  17-24   18   8    513-768   28   13 16385-24576 +               9   3  25-32   19   8   769-1024   29   13 24577-32768 + +      3.2.6. Compression with fixed Huffman codes (BTYPE=01) + +         The Huffman codes for the two alphabets are fixed, and are not +         represented explicitly in the data.  The Huffman code lengths +         for the literal/length alphabet are: + +                   Lit Value    Bits        Codes +                   ---------    ----        ----- +                     0 - 143     8          00110000 through +                                            10111111 +                   144 - 255     9          110010000 through +                                            111111111 +                   256 - 279     7          0000000 through +                                            0010111 +                   280 - 287     8          11000000 through +                                            11000111 + + + +Deutsch                      Informational                     [Page 12] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +         The code lengths are sufficient to generate the actual codes, +         as described above; we show the codes in the table for added +         clarity.  Literal/length values 286-287 will never actually +         occur in the compressed data, but participate in the code +         construction. + +         Distance codes 0-31 are represented by (fixed-length) 5-bit +         codes, with possible additional bits as shown in the table +         shown in Paragraph 3.2.5, above.  Note that distance codes 30- +         31 will never actually occur in the compressed data. + +      3.2.7. Compression with dynamic Huffman codes (BTYPE=10) + +         The Huffman codes for the two alphabets appear in the block +         immediately after the header bits and before the actual +         compressed data, first the literal/length code and then the +         distance code.  Each code is defined by a sequence of code +         lengths, as discussed in Paragraph 3.2.2, above.  For even +         greater compactness, the code length sequences themselves are +         compressed using a Huffman code.  The alphabet for code lengths +         is as follows: + +               0 - 15: Represent code lengths of 0 - 15 +                   16: Copy the previous code length 3 - 6 times. +                       The next 2 bits indicate repeat length +                             (0 = 3, ... , 3 = 6) +                          Example:  Codes 8, 16 (+2 bits 11), +                                    16 (+2 bits 10) will expand to +                                    12 code lengths of 8 (1 + 6 + 5) +                   17: Repeat a code length of 0 for 3 - 10 times. +                       (3 bits of length) +                   18: Repeat a code length of 0 for 11 - 138 times +                       (7 bits of length) + +         A code length of 0 indicates that the corresponding symbol in +         the literal/length or distance alphabet will not occur in the +         block, and should not participate in the Huffman code +         construction algorithm given earlier.  If only one distance +         code is used, it is encoded using one bit, not zero bits; in +         this case there is a single code length of one, with one unused +         code.  One distance code of zero bits means that there are no +         distance codes used at all (the data is all literals). + +         We can now define the format of the block: + +               5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286) +               5 Bits: HDIST, # of Distance codes - 1        (1 - 32) +               4 Bits: HCLEN, # of Code Length codes - 4     (4 - 19) + + + +Deutsch                      Informational                     [Page 13] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +               (HCLEN + 4) x 3 bits: code lengths for the code length +                  alphabet given just above, in the order: 16, 17, 18, +                  0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 + +                  These code lengths are interpreted as 3-bit integers +                  (0-7); as above, a code length of 0 means the +                  corresponding symbol (literal/length or distance code +                  length) is not used. + +               HLIT + 257 code lengths for the literal/length alphabet, +                  encoded using the code length Huffman code + +               HDIST + 1 code lengths for the distance alphabet, +                  encoded using the code length Huffman code + +               The actual compressed data of the block, +                  encoded using the literal/length and distance Huffman +                  codes + +               The literal/length symbol 256 (end of data), +                  encoded using the literal/length Huffman code + +         The code length repeat codes can cross from HLIT + 257 to the +         HDIST + 1 code lengths.  In other words, all code lengths form +         a single sequence of HLIT + HDIST + 258 values. + +   3.3. Compliance + +      A compressor may limit further the ranges of values specified in +      the previous section and still be compliant; for example, it may +      limit the range of backward pointers to some value smaller than +      32K.  Similarly, a compressor may limit the size of blocks so that +      a compressible block fits in memory. + +      A compliant decompressor must accept the full range of possible +      values defined in the previous section, and must accept blocks of +      arbitrary size. + +4. Compression algorithm details + +   While it is the intent of this document to define the "deflate" +   compressed data format without reference to any particular +   compression algorithm, the format is related to the compressed +   formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below); +   since many variations of LZ77 are patented, it is strongly +   recommended that the implementor of a compressor follow the general +   algorithm presented here, which is known not to be patented per se. +   The material in this section is not part of the definition of the + + + +Deutsch                      Informational                     [Page 14] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +   specification per se, and a compressor need not follow it in order to +   be compliant. + +   The compressor terminates a block when it determines that starting a +   new block with fresh trees would be useful, or when the block size +   fills up the compressor's block buffer. + +   The compressor uses a chained hash table to find duplicated strings, +   using a hash function that operates on 3-byte sequences.  At any +   given point during compression, let XYZ be the next 3 input bytes to +   be examined (not necessarily all different, of course).  First, the +   compressor examines the hash chain for XYZ.  If the chain is empty, +   the compressor simply writes out X as a literal byte and advances one +   byte in the input.  If the hash chain is not empty, indicating that +   the sequence XYZ (or, if we are unlucky, some other 3 bytes with the +   same hash function value) has occurred recently, the compressor +   compares all strings on the XYZ hash chain with the actual input data +   sequence starting at the current point, and selects the longest +   match. + +   The compressor searches the hash chains starting with the most recent +   strings, to favor small distances and thus take advantage of the +   Huffman encoding.  The hash chains are singly linked. There are no +   deletions from the hash chains; the algorithm simply discards matches +   that are too old.  To avoid a worst-case situation, very long hash +   chains are arbitrarily truncated at a certain length, determined by a +   run-time parameter. + +   To improve overall compression, the compressor optionally defers the +   selection of matches ("lazy matching"): after a match of length N has +   been found, the compressor searches for a longer match starting at +   the next input byte.  If it finds a longer match, it truncates the +   previous match to a length of one (thus producing a single literal +   byte) and then emits the longer match.  Otherwise, it emits the +   original match, and, as described above, advances N bytes before +   continuing. + +   Run-time parameters also control this "lazy match" procedure.  If +   compression ratio is most important, the compressor attempts a +   complete second search regardless of the length of the first match. +   In the normal case, if the current match is "long enough", the +   compressor reduces the search for a longer match, thus speeding up +   the process.  If speed is most important, the compressor inserts new +   strings in the hash table only when no match was found, or when the +   match is not "too long".  This degrades the compression ratio but +   saves time since there are both fewer insertions and fewer searches. + + + + + +Deutsch                      Informational                     [Page 15] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +5. References + +   [1] Huffman, D. A., "A Method for the Construction of Minimum +       Redundancy Codes", Proceedings of the Institute of Radio +       Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101. + +   [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data +       Compression", IEEE Transactions on Information Theory, Vol. 23, +       No. 3, pp. 337-343. + +   [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources, +       available in ftp://ftp.uu.net/pub/archiving/zip/doc/ + +   [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources, +       available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/ + +   [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix +       encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169. + +   [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes," +       Comm. ACM, 33,4, April 1990, pp. 449-459. + +6. Security Considerations + +   Any data compression method involves the reduction of redundancy in +   the data.  Consequently, any corruption of the data is likely to have +   severe effects and be difficult to correct.  Uncompressed text, on +   the other hand, will probably still be readable despite the presence +   of some corrupted bytes. + +   It is recommended that systems using this data format provide some +   means of validating the integrity of the compressed data.  See +   reference [3], for example. + +7. Source code + +   Source code for a C language implementation of a "deflate" compliant +   compressor and decompressor is available within the zlib package at +   ftp://ftp.uu.net/pub/archiving/zip/zlib/. + +8. Acknowledgements + +   Trademarks cited in this document are the property of their +   respective owners. + +   Phil Katz designed the deflate format.  Jean-Loup Gailly and Mark +   Adler wrote the related software described in this specification. +   Glenn Randers-Pehrson converted this document to RFC and HTML format. + + + +Deutsch                      Informational                     [Page 16] + +RFC 1951      DEFLATE Compressed Data Format Specification      May 1996 + + +9. Author's Address + +   L. Peter Deutsch +   Aladdin Enterprises +   203 Santa Margarita Ave. +   Menlo Park, CA 94025 + +   Phone: (415) 322-0103 (AM only) +   FAX:   (415) 322-1734 +   EMail: <ghost@aladdin.com> + +   Questions about the technical content of this specification can be +   sent by email to: + +   Jean-Loup Gailly <gzip@prep.ai.mit.edu> and +   Mark Adler <madler@alumni.caltech.edu> + +   Editorial comments on this specification can be sent by email to: + +   L. Peter Deutsch <ghost@aladdin.com> and +   Glenn Randers-Pehrson <randeg@alumni.rpi.edu> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Deutsch                      Informational                     [Page 17] + diff --git a/zlib/doc/rfc1952.txt b/zlib/doc/rfc1952.txt new file mode 100644 index 000000000..a8e51b456 --- /dev/null +++ b/zlib/doc/rfc1952.txt @@ -0,0 +1,675 @@ + + + + + + +Network Working Group                                         P. Deutsch +Request for Comments: 1952                           Aladdin Enterprises +Category: Informational                                         May 1996 + + +               GZIP file format specification version 4.3 + +Status of This Memo + +   This memo provides information for the Internet community.  This memo +   does not specify an Internet standard of any kind.  Distribution of +   this memo is unlimited. + +IESG Note: + +   The IESG takes no position on the validity of any Intellectual +   Property Rights statements contained in this document. + +Notices + +   Copyright (c) 1996 L. Peter Deutsch + +   Permission is granted to copy and distribute this document for any +   purpose and without charge, including translations into other +   languages and incorporation into compilations, provided that the +   copyright notice and this notice are preserved, and that any +   substantive changes or deletions from the original are clearly +   marked. + +   A pointer to the latest version of this and related documentation in +   HTML format can be found at the URL +   <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>. + +Abstract + +   This specification defines a lossless compressed data format that is +   compatible with the widely used GZIP utility.  The format includes a +   cyclic redundancy check value for detecting data corruption.  The +   format presently uses the DEFLATE method of compression but can be +   easily extended to use other compression methods.  The format can be +   implemented readily in a manner not covered by patents. + + + + + + + + + + +Deutsch                      Informational                      [Page 1] + +RFC 1952             GZIP File Format Specification             May 1996 + + +Table of Contents + +   1. Introduction ................................................... 2 +      1.1. Purpose ................................................... 2 +      1.2. Intended audience ......................................... 3 +      1.3. Scope ..................................................... 3 +      1.4. Compliance ................................................ 3 +      1.5. Definitions of terms and conventions used ................. 3 +      1.6. Changes from previous versions ............................ 3 +   2. Detailed specification ......................................... 4 +      2.1. Overall conventions ....................................... 4 +      2.2. File format ............................................... 5 +      2.3. Member format ............................................. 5 +          2.3.1. Member header and trailer ........................... 6 +              2.3.1.1. Extra field ................................... 8 +              2.3.1.2. Compliance .................................... 9 +      3. References .................................................. 9 +      4. Security Considerations .................................... 10 +      5. Acknowledgements ........................................... 10 +      6. Author's Address ........................................... 10 +      7. Appendix: Jean-Loup Gailly's gzip utility .................. 11 +      8. Appendix: Sample CRC Code .................................. 11 + +1. Introduction + +   1.1. Purpose + +      The purpose of this specification is to define a lossless +      compressed data format that: + +          * Is independent of CPU type, operating system, file system, +            and character set, and hence can be used for interchange; +          * Can compress or decompress a data stream (as opposed to a +            randomly accessible file) to produce another data stream, +            using only an a priori bounded amount of intermediate +            storage, and hence can be used in data communications or +            similar structures such as Unix filters; +          * Compresses data with efficiency comparable to the best +            currently available general-purpose compression methods, +            and in particular considerably better than the "compress" +            program; +          * Can be implemented readily in a manner not covered by +            patents, and hence can be practiced freely; +          * Is compatible with the file format produced by the current +            widely used gzip utility, in that conforming decompressors +            will be able to read data produced by the existing gzip +            compressor. + + + + +Deutsch                      Informational                      [Page 2] + +RFC 1952             GZIP File Format Specification             May 1996 + + +      The data format defined by this specification does not attempt to: + +          * Provide random access to compressed data; +          * Compress specialized data (e.g., raster graphics) as well as +            the best currently available specialized algorithms. + +   1.2. Intended audience + +      This specification is intended for use by implementors of software +      to compress data into gzip format and/or decompress data from gzip +      format. + +      The text of the specification assumes a basic background in +      programming at the level of bits and other primitive data +      representations. + +   1.3. Scope + +      The specification specifies a compression method and a file format +      (the latter assuming only that a file can store a sequence of +      arbitrary bytes).  It does not specify any particular interface to +      a file system or anything about character sets or encodings +      (except for file names and comments, which are optional). + +   1.4. Compliance + +      Unless otherwise indicated below, a compliant decompressor must be +      able to accept and decompress any file that conforms to all the +      specifications presented here; a compliant compressor must produce +      files that conform to all the specifications presented here.  The +      material in the appendices is not part of the specification per se +      and is not relevant to compliance. + +   1.5. Definitions of terms and conventions used + +      byte: 8 bits stored or transmitted as a unit (same as an octet). +      (For this specification, a byte is exactly 8 bits, even on +      machines which store a character on a number of bits different +      from 8.)  See below for the numbering of bits within a byte. + +   1.6. Changes from previous versions + +      There have been no technical changes to the gzip format since +      version 4.1 of this specification.  In version 4.2, some +      terminology was changed, and the sample CRC code was rewritten for +      clarity and to eliminate the requirement for the caller to do pre- +      and post-conditioning.  Version 4.3 is a conversion of the +      specification to RFC style. + + + +Deutsch                      Informational                      [Page 3] + +RFC 1952             GZIP File Format Specification             May 1996 + + +2. Detailed specification + +   2.1. Overall conventions + +      In the diagrams below, a box like this: + +         +---+ +         |   | <-- the vertical bars might be missing +         +---+ + +      represents one byte; a box like this: + +         +==============+ +         |              | +         +==============+ + +      represents a variable number of bytes. + +      Bytes stored within a computer do not have a "bit order", since +      they are always treated as a unit.  However, a byte considered as +      an integer between 0 and 255 does have a most- and least- +      significant bit, and since we write numbers with the most- +      significant digit on the left, we also write bytes with the most- +      significant bit on the left.  In the diagrams below, we number the +      bits of a byte so that bit 0 is the least-significant bit, i.e., +      the bits are numbered: + +         +--------+ +         |76543210| +         +--------+ + +      This document does not address the issue of the order in which +      bits of a byte are transmitted on a bit-sequential medium, since +      the data format described here is byte- rather than bit-oriented. + +      Within a computer, a number may occupy multiple bytes.  All +      multi-byte numbers in the format described here are stored with +      the least-significant byte first (at the lower memory address). +      For example, the decimal number 520 is stored as: + +             0        1 +         +--------+--------+ +         |00001000|00000010| +         +--------+--------+ +          ^        ^ +          |        | +          |        + more significant byte = 2 x 256 +          + less significant byte = 8 + + + +Deutsch                      Informational                      [Page 4] + +RFC 1952             GZIP File Format Specification             May 1996 + + +   2.2. File format + +      A gzip file consists of a series of "members" (compressed data +      sets).  The format of each member is specified in the following +      section.  The members simply appear one after another in the file, +      with no additional information before, between, or after them. + +   2.3. Member format + +      Each member has the following structure: + +         +---+---+---+---+---+---+---+---+---+---+ +         |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->) +         +---+---+---+---+---+---+---+---+---+---+ + +      (if FLG.FEXTRA set) + +         +---+---+=================================+ +         | XLEN  |...XLEN bytes of "extra field"...| (more-->) +         +---+---+=================================+ + +      (if FLG.FNAME set) + +         +=========================================+ +         |...original file name, zero-terminated...| (more-->) +         +=========================================+ + +      (if FLG.FCOMMENT set) + +         +===================================+ +         |...file comment, zero-terminated...| (more-->) +         +===================================+ + +      (if FLG.FHCRC set) + +         +---+---+ +         | CRC16 | +         +---+---+ + +         +=======================+ +         |...compressed blocks...| (more-->) +         +=======================+ + +           0   1   2   3   4   5   6   7 +         +---+---+---+---+---+---+---+---+ +         |     CRC32     |     ISIZE     | +         +---+---+---+---+---+---+---+---+ + + + + +Deutsch                      Informational                      [Page 5] + +RFC 1952             GZIP File Format Specification             May 1996 + + +      2.3.1. Member header and trailer + +         ID1 (IDentification 1) +         ID2 (IDentification 2) +            These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139 +            (0x8b, \213), to identify the file as being in gzip format. + +         CM (Compression Method) +            This identifies the compression method used in the file.  CM +            = 0-7 are reserved.  CM = 8 denotes the "deflate" +            compression method, which is the one customarily used by +            gzip and which is documented elsewhere. + +         FLG (FLaGs) +            This flag byte is divided into individual bits as follows: + +               bit 0   FTEXT +               bit 1   FHCRC +               bit 2   FEXTRA +               bit 3   FNAME +               bit 4   FCOMMENT +               bit 5   reserved +               bit 6   reserved +               bit 7   reserved + +            If FTEXT is set, the file is probably ASCII text.  This is +            an optional indication, which the compressor may set by +            checking a small amount of the input data to see whether any +            non-ASCII characters are present.  In case of doubt, FTEXT +            is cleared, indicating binary data. For systems which have +            different file formats for ascii text and binary data, the +            decompressor can use FTEXT to choose the appropriate format. +            We deliberately do not specify the algorithm used to set +            this bit, since a compressor always has the option of +            leaving it cleared and a decompressor always has the option +            of ignoring it and letting some other program handle issues +            of data conversion. + +            If FHCRC is set, a CRC16 for the gzip header is present, +            immediately before the compressed data. The CRC16 consists +            of the two least significant bytes of the CRC32 for all +            bytes of the gzip header up to and not including the CRC16. +            [The FHCRC bit was never set by versions of gzip up to +            1.2.4, even though it was documented with a different +            meaning in gzip 1.2.4.] + +            If FEXTRA is set, optional extra fields are present, as +            described in a following section. + + + +Deutsch                      Informational                      [Page 6] + +RFC 1952             GZIP File Format Specification             May 1996 + + +            If FNAME is set, an original file name is present, +            terminated by a zero byte.  The name must consist of ISO +            8859-1 (LATIN-1) characters; on operating systems using +            EBCDIC or any other character set for file names, the name +            must be translated to the ISO LATIN-1 character set.  This +            is the original name of the file being compressed, with any +            directory components removed, and, if the file being +            compressed is on a file system with case insensitive names, +            forced to lower case. There is no original file name if the +            data was compressed from a source other than a named file; +            for example, if the source was stdin on a Unix system, there +            is no file name. + +            If FCOMMENT is set, a zero-terminated file comment is +            present.  This comment is not interpreted; it is only +            intended for human consumption.  The comment must consist of +            ISO 8859-1 (LATIN-1) characters.  Line breaks should be +            denoted by a single line feed character (10 decimal). + +            Reserved FLG bits must be zero. + +         MTIME (Modification TIME) +            This gives the most recent modification time of the original +            file being compressed.  The time is in Unix format, i.e., +            seconds since 00:00:00 GMT, Jan.  1, 1970.  (Note that this +            may cause problems for MS-DOS and other systems that use +            local rather than Universal time.)  If the compressed data +            did not come from a file, MTIME is set to the time at which +            compression started.  MTIME = 0 means no time stamp is +            available. + +         XFL (eXtra FLags) +            These flags are available for use by specific compression +            methods.  The "deflate" method (CM = 8) sets these flags as +            follows: + +               XFL = 2 - compressor used maximum compression, +                         slowest algorithm +               XFL = 4 - compressor used fastest algorithm + +         OS (Operating System) +            This identifies the type of file system on which compression +            took place.  This may be useful in determining end-of-line +            convention for text files.  The currently defined values are +            as follows: + + + + + + +Deutsch                      Informational                      [Page 7] + +RFC 1952             GZIP File Format Specification             May 1996 + + +                 0 - FAT filesystem (MS-DOS, OS/2, NT/Win32) +                 1 - Amiga +                 2 - VMS (or OpenVMS) +                 3 - Unix +                 4 - VM/CMS +                 5 - Atari TOS +                 6 - HPFS filesystem (OS/2, NT) +                 7 - Macintosh +                 8 - Z-System +                 9 - CP/M +                10 - TOPS-20 +                11 - NTFS filesystem (NT) +                12 - QDOS +                13 - Acorn RISCOS +               255 - unknown + +         XLEN (eXtra LENgth) +            If FLG.FEXTRA is set, this gives the length of the optional +            extra field.  See below for details. + +         CRC32 (CRC-32) +            This contains a Cyclic Redundancy Check value of the +            uncompressed data computed according to CRC-32 algorithm +            used in the ISO 3309 standard and in section 8.1.1.6.2 of +            ITU-T recommendation V.42.  (See http://www.iso.ch for +            ordering ISO documents. See gopher://info.itu.ch for an +            online version of ITU-T V.42.) + +         ISIZE (Input SIZE) +            This contains the size of the original (uncompressed) input +            data modulo 2^32. + +      2.3.1.1. Extra field + +         If the FLG.FEXTRA bit is set, an "extra field" is present in +         the header, with total length XLEN bytes.  It consists of a +         series of subfields, each of the form: + +            +---+---+---+---+==================================+ +            |SI1|SI2|  LEN  |... LEN bytes of subfield data ...| +            +---+---+---+---+==================================+ + +         SI1 and SI2 provide a subfield ID, typically two ASCII letters +         with some mnemonic value.  Jean-Loup Gailly +         <gzip@prep.ai.mit.edu> is maintaining a registry of subfield +         IDs; please send him any subfield ID you wish to use.  Subfield +         IDs with SI2 = 0 are reserved for future use.  The following +         IDs are currently defined: + + + +Deutsch                      Informational                      [Page 8] + +RFC 1952             GZIP File Format Specification             May 1996 + + +            SI1         SI2         Data +            ----------  ----------  ---- +            0x41 ('A')  0x70 ('P')  Apollo file type information + +         LEN gives the length of the subfield data, excluding the 4 +         initial bytes. + +      2.3.1.2. Compliance + +         A compliant compressor must produce files with correct ID1, +         ID2, CM, CRC32, and ISIZE, but may set all the other fields in +         the fixed-length part of the header to default values (255 for +         OS, 0 for all others).  The compressor must set all reserved +         bits to zero. + +         A compliant decompressor must check ID1, ID2, and CM, and +         provide an error indication if any of these have incorrect +         values.  It must examine FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC +         at least so it can skip over the optional fields if they are +         present.  It need not examine any other part of the header or +         trailer; in particular, a decompressor may ignore FTEXT and OS +         and always produce binary output, and still be compliant.  A +         compliant decompressor must give an error indication if any +         reserved bit is non-zero, since such a bit could indicate the +         presence of a new field that would cause subsequent data to be +         interpreted incorrectly. + +3. References + +   [1] "Information Processing - 8-bit single-byte coded graphic +       character sets - Part 1: Latin alphabet No.1" (ISO 8859-1:1987). +       The ISO 8859-1 (Latin-1) character set is a superset of 7-bit +       ASCII. Files defining this character set are available as +       iso_8859-1.* in ftp://ftp.uu.net/graphics/png/documents/ + +   [2] ISO 3309 + +   [3] ITU-T recommendation V.42 + +   [4] Deutsch, L.P.,"DEFLATE Compressed Data Format Specification", +       available in ftp://ftp.uu.net/pub/archiving/zip/doc/ + +   [5] Gailly, J.-L., GZIP documentation, available as gzip-*.tar in +       ftp://prep.ai.mit.edu/pub/gnu/ + +   [6] Sarwate, D.V., "Computation of Cyclic Redundancy Checks via Table +       Look-Up", Communications of the ACM, 31(8), pp.1008-1013. + + + + +Deutsch                      Informational                      [Page 9] + +RFC 1952             GZIP File Format Specification             May 1996 + + +   [7] Schwaderer, W.D., "CRC Calculation", April 85 PC Tech Journal, +       pp.118-133. + +   [8] ftp://ftp.adelaide.edu.au/pub/rocksoft/papers/crc_v3.txt, +       describing the CRC concept. + +4. Security Considerations + +   Any data compression method involves the reduction of redundancy in +   the data.  Consequently, any corruption of the data is likely to have +   severe effects and be difficult to correct.  Uncompressed text, on +   the other hand, will probably still be readable despite the presence +   of some corrupted bytes. + +   It is recommended that systems using this data format provide some +   means of validating the integrity of the compressed data, such as by +   setting and checking the CRC-32 check value. + +5. Acknowledgements + +   Trademarks cited in this document are the property of their +   respective owners. + +   Jean-Loup Gailly designed the gzip format and wrote, with Mark Adler, +   the related software described in this specification.  Glenn +   Randers-Pehrson converted this document to RFC and HTML format. + +6. Author's Address + +   L. Peter Deutsch +   Aladdin Enterprises +   203 Santa Margarita Ave. +   Menlo Park, CA 94025 + +   Phone: (415) 322-0103 (AM only) +   FAX:   (415) 322-1734 +   EMail: <ghost@aladdin.com> + +   Questions about the technical content of this specification can be +   sent by email to: + +   Jean-Loup Gailly <gzip@prep.ai.mit.edu> and +   Mark Adler <madler@alumni.caltech.edu> + +   Editorial comments on this specification can be sent by email to: + +   L. Peter Deutsch <ghost@aladdin.com> and +   Glenn Randers-Pehrson <randeg@alumni.rpi.edu> + + + +Deutsch                      Informational                     [Page 10] + +RFC 1952             GZIP File Format Specification             May 1996 + + +7. Appendix: Jean-Loup Gailly's gzip utility + +   The most widely used implementation of gzip compression, and the +   original documentation on which this specification is based, were +   created by Jean-Loup Gailly <gzip@prep.ai.mit.edu>.  Since this +   implementation is a de facto standard, we mention some more of its +   features here.  Again, the material in this section is not part of +   the specification per se, and implementations need not follow it to +   be compliant. + +   When compressing or decompressing a file, gzip preserves the +   protection, ownership, and modification time attributes on the local +   file system, since there is no provision for representing protection +   attributes in the gzip file format itself.  Since the file format +   includes a modification time, the gzip decompressor provides a +   command line switch that assigns the modification time from the file, +   rather than the local modification time of the compressed input, to +   the decompressed output. + +8. Appendix: Sample CRC Code + +   The following sample code represents a practical implementation of +   the CRC (Cyclic Redundancy Check). (See also ISO 3309 and ITU-T V.42 +   for a formal specification.) + +   The sample code is in the ANSI C programming language. Non C users +   may find it easier to read with these hints: + +      &      Bitwise AND operator. +      ^      Bitwise exclusive-OR operator. +      >>     Bitwise right shift operator. When applied to an +             unsigned quantity, as here, right shift inserts zero +             bit(s) at the left. +      !      Logical NOT operator. +      ++     "n++" increments the variable n. +      0xNNN  0x introduces a hexadecimal (base 16) constant. +             Suffix L indicates a long value (at least 32 bits). + +      /* Table of CRCs of all 8-bit messages. */ +      unsigned long crc_table[256]; + +      /* Flag: has the table been computed? Initially false. */ +      int crc_table_computed = 0; + +      /* Make the table for a fast CRC. */ +      void make_crc_table(void) +      { +        unsigned long c; + + + +Deutsch                      Informational                     [Page 11] + +RFC 1952             GZIP File Format Specification             May 1996 + + +        int n, k; +        for (n = 0; n < 256; n++) { +          c = (unsigned long) n; +          for (k = 0; k < 8; k++) { +            if (c & 1) { +              c = 0xedb88320L ^ (c >> 1); +            } else { +              c = c >> 1; +            } +          } +          crc_table[n] = c; +        } +        crc_table_computed = 1; +      } + +      /* +         Update a running crc with the bytes buf[0..len-1] and return +       the updated crc. The crc should be initialized to zero. Pre- and +       post-conditioning (one's complement) is performed within this +       function so it shouldn't be done by the caller. Usage example: + +         unsigned long crc = 0L; + +         while (read_buffer(buffer, length) != EOF) { +           crc = update_crc(crc, buffer, length); +         } +         if (crc != original_crc) error(); +      */ +      unsigned long update_crc(unsigned long crc, +                      unsigned char *buf, int len) +      { +        unsigned long c = crc ^ 0xffffffffL; +        int n; + +        if (!crc_table_computed) +          make_crc_table(); +        for (n = 0; n < len; n++) { +          c = crc_table[(c ^ buf[n]) & 0xff] ^ (c >> 8); +        } +        return c ^ 0xffffffffL; +      } + +      /* Return the CRC of the bytes buf[0..len-1]. */ +      unsigned long crc(unsigned char *buf, int len) +      { +        return update_crc(0L, buf, len); +      } + + + + +Deutsch                      Informational                     [Page 12] + diff --git a/zlib/doc/txtvsbin.txt b/zlib/doc/txtvsbin.txt new file mode 100644 index 000000000..3d0f0634f --- /dev/null +++ b/zlib/doc/txtvsbin.txt @@ -0,0 +1,107 @@ +A Fast Method for Identifying Plain Text Files +============================================== + + +Introduction +------------ + +Given a file coming from an unknown source, it is sometimes desirable +to find out whether the format of that file is plain text.  Although +this may appear like a simple task, a fully accurate detection of the +file type requires heavy-duty semantic analysis on the file contents. +It is, however, possible to obtain satisfactory results by employing +various heuristics. + +Previous versions of PKZip and other zip-compatible compression tools +were using a crude detection scheme: if more than 80% (4/5) of the bytes +found in a certain buffer are within the range [7..127], the file is +labeled as plain text, otherwise it is labeled as binary.  A prominent +limitation of this scheme is the restriction to Latin-based alphabets. +Other alphabets, like Greek, Cyrillic or Asian, make extensive use of +the bytes within the range [128..255], and texts using these alphabets +are most often misidentified by this scheme; in other words, the rate +of false negatives is sometimes too high, which means that the recall +is low.  Another weakness of this scheme is a reduced precision, due to +the false positives that may occur when binary files containing large +amounts of textual characters are misidentified as plain text. + +In this article we propose a new, simple detection scheme that features +a much increased precision and a near-100% recall.  This scheme is +designed to work on ASCII, Unicode and other ASCII-derived alphabets, +and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.) +and variable-sized encodings (ISO-2022, UTF-8, etc.).  Wider encodings +(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however. + + +The Algorithm +------------- + +The algorithm works by dividing the set of bytecodes [0..255] into three +categories: +- The white list of textual bytecodes: +  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255. +- The gray list of tolerated bytecodes: +  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC). +- The black list of undesired, non-textual bytecodes: +  0 (NUL) to 6, 14 to 31. + +If a file contains at least one byte that belongs to the white list and +no byte that belongs to the black list, then the file is categorized as +plain text; otherwise, it is categorized as binary.  (The boundary case, +when the file is empty, automatically falls into the latter category.) + + +Rationale +--------- + +The idea behind this algorithm relies on two observations. + +The first observation is that, although the full range of 7-bit codes +[0..127] is properly specified by the ASCII standard, most control +characters in the range [0..31] are not used in practice.  The only +widely-used, almost universally-portable control codes are 9 (TAB), +10 (LF) and 13 (CR).  There are a few more control codes that are +recognized on a reduced range of platforms and text viewers/editors: +7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these +codes are rarely (if ever) used alone, without being accompanied by +some printable text.  Even the newer, portable text formats such as +XML avoid using control characters outside the list mentioned here. + +The second observation is that most of the binary files tend to contain +control characters, especially 0 (NUL).  Even though the older text +detection schemes observe the presence of non-ASCII codes from the range +[128..255], the precision rarely has to suffer if this upper range is +labeled as textual, because the files that are genuinely binary tend to +contain both control characters and codes from the upper range.  On the +other hand, the upper range needs to be labeled as textual, because it +is used by virtually all ASCII extensions.  In particular, this range is +used for encoding non-Latin scripts. + +Since there is no counting involved, other than simply observing the +presence or the absence of some byte values, the algorithm produces +consistent results, regardless what alphabet encoding is being used. +(If counting were involved, it could be possible to obtain different +results on a text encoded, say, using ISO-8859-16 versus UTF-8.) + +There is an extra category of plain text files that are "polluted" with +one or more black-listed codes, either by mistake or by peculiar design +considerations.  In such cases, a scheme that tolerates a small fraction +of black-listed codes would provide an increased recall (i.e. more true +positives).  This, however, incurs a reduced precision overall, since +false positives are more likely to appear in binary files that contain +large chunks of textual data.  Furthermore, "polluted" plain text should +be regarded as binary by general-purpose text detection schemes, because +general-purpose text processing algorithms might not be applicable. +Under this premise, it is safe to say that our detection method provides +a near-100% recall. + +Experiments have been run on many files coming from various platforms +and applications.  We tried plain text files, system logs, source code, +formatted office documents, compiled object code, etc.  The results +confirm the optimistic assumptions about the capabilities of this +algorithm. + + +-- +Cosmin Truta +Last updated: 2006-May-28 | 
