Compression theory reference?

John Denker jsd at av8n.com
Thu Sep 2 13:26:41 EDT 2004


Matt Crawford wrote:

 >>Plus a string of log(N) bits telling you how many times to apply the
 >>decompression function!
 >>Uh-oh, now goes over the judge's head ...

Hadmut Danisch wrote:

> The problem is that if you ask for a string of log(N) bits, then 
> someone else could take this as a proof that this actually works, 
> because a string of log(N) bits is obviously shorter than the 
> message of N bits, thus the compression scheme is working. Hooray!

That misses the point of the construction that was the subject of
Matt's remark.  The point was (and remains) that the compressed
output (whether it be 1 bit, or 1+log(N) bits, or 1+log^*(N) bits)
is ridiculous because it is manifestly undecodeable.  It is far, far
too good to be true.

The only question is whether the construction is simple enough
for the judge to understand.

There is no question whether the construction is a valid _reductio
ad absurdum_.

   While we are on the subject, I recommend the clean and elegant
   argument submitted by Victor Duchovni (08/31/2004 03:50 PM) and
   also in more detail by Matt Crawford (08/31/2004 06:04 PM).  It
   uses mathematical induction rather than proof-by-contradiction.
   It is a less showy argument, but probably it is understandable
   by a wider audience.  It proves a less-spectacular point, but it
   is quite sufficient to show that the we-can-compress-anything
   claim is false.  (Although with either approach, at least *some*
   mathematical sophistication is required.  Neither PbC nor MI will
   give you any traction with ten-year-olds.)

   So it appears we have many different ways of approaching things:

    1) The pigeon-hole argument.  (This disproves the claim that all
     N-bit strings are compressible ... even if the claim is restricted
     to a particular fixed N.  A small counting sketch appears below.)

    2) The mathematical induction argument.  (Requires the claimed
     algorithm to be non-expansive for a range of N.)

    3) The proof-by-contradiction.  (Requires the claimed algorithm
     to be compressive -- not just non-expansive -- for a range of N.)

    4) Readily-demonstrable failure of *any* particular claimed example,
     including Lempel-Ziv and all the rest.

    *) Others?

   Harumph.  That really ought to be enough.  Indeed *one* disproof
   should have been enough.
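
For concreteness, here is a minimal Python sketch of the counting
behind approach (1).  It is only an illustration (N=4 is an arbitrary
choice), but it shows why no lossless scheme can shorten every
N-bit string:

  # Pigeon-hole counting: there are 2**N distinct N-bit strings but
  # only 2**N - 1 strings shorter than N bits, so any map into strictly
  # shorter strings must send two inputs to the same output, and
  # decompression cannot recover both.
  from itertools import product

  N = 4   # arbitrary illustrative choice
  inputs  = [''.join(bits) for bits in product('01', repeat=N)]
  outputs = [''.join(bits) for k in range(N) for bits in product('01', repeat=k)]

  print(len(inputs), "inputs;", len(outputs), "possible shorter outputs")
  # prints: 16 inputs; 15 possible shorter outputs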

> The problem is that the number of iterations is not in the order of
> N, but in the order of 2^N, so it takes log2(around 2^N) = around N bits to
> store the number of iterations.

I don't see why the number of iterations should be exponential in
the length (N) of the input string.  A compression function is
supposed to decrease N.  It is not supposed to decrement the
number represented by some N-bit numeral .... after all, the string
might not represent a number at all.

Also I repeat that there exist special cases (e.g. inputs of
known fixed length) for which no extra bits need be represented,
as I explained previously.

> The recursion converts a message of
> N bits recursively into a message of 1 or zero bit length (to your
> taste), *and* a number which takes around N bits to be stored.
> Nothing is won. But prove that.

I disagree, for the reasons given above.

In the worst case, you need log^*(N) extra bits, not N bits.  In
special cases, you don't need any extra bits at all.  Either way
the "win" is very substantial, indeed extreme.
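
To put rough numbers on that, here is a small Python sketch (my own
illustration) of the log-star sum described further below, i.e.
log2(N) + log2(log2(N)) + ..., summed while the terms stay positive.
Even for very long inputs it is tiny compared to N:

  import math

  def log_star(n):
      # Sum the iterated base-2 logarithms of n while they stay positive.
      total, x = 0.0, float(n)
      while x > 1.0:
          x = math.log2(x)
          total += x
      return total

  for n in (1_000, 1_000_000, 2**64):
      print(f"N = {n}:  roughly {log_star(n):.1f} extra bits")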

> This recursion game is far more complicated than it appears to be. 

Maybe.  But there's no need to make it more complicated than it
really is.

> Note also that storing a number takes in reality more than log(N)
> bits. Why? Because you don't know N in advance. We don't have any
> limit for the message length. 

For general N, that's true.

> So your counting register needs
> theoretically infinitely many bits.

Maybe.  For many practical purposes, the required number of bits
is considerably less than infinity.

 > When you're finished you know
> how many bits your number took. But storing your number needs an 
> end symbol or a tristate-bit (0,1,void). That's a common mistake. 

We agree that there are many common mistakes.  We agree that
it is a mistake to have undelimited strings.  But it is also a
mistake to think that you need to reserve a special symbol to
mark the end of the string.  Yes, that is one option, but from a
data-compression point of view it is an inefficient option.
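
Here is a tiny Python sketch of the alternative: a prefix-free
(instantaneously decodable) codebook, in which the codewords delimit
themselves and no reserved end-symbol is needed.  The particular
codebook is made up purely for illustration:

  # No codeword is a prefix of any other, so each codeword announces
  # its own end; the stream needs no terminator and no reserved symbol.
  CODE   = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
  DECODE = {v: k for k, v in CODE.items()}

  def decode(bits):
      out, word = [], ''
      for b in bits:
          word += b
          if word in DECODE:        # codeword complete: emit and reset
              out.append(DECODE[word])
              word = ''
      return ''.join(out)

  message = 'badcab'
  stream  = ''.join(CODE[ch] for ch in message)
  assert decode(stream) == message  # round-trips with no end marker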

Anybody who is interested in this stuff reeeally ought to read
Chaitin's work.  He makes a big fuss about the existence of
self-delimiting strings and self-delimiting programs.  There
are many examples of such:
  -- The codewords of many commonly-used compression algorithms
   are self-delimiting.  This is related to the property of being
   "instantaneously decodable".
  -- As Chaitin points out, you can set up a dialect of Lisp such
   that Lisp programs are self-delimiting.
  -- If you want a self-delimiting representation of M, where M is
   *any* N-bit number, you need more than log(M) bits (i.e. more
   than N bits).  That's because in addition to the roughly log(M)
   bits for M itself, you need to specify how many bits are used
   to represent M, which adds another log(log(M)) bits.  On top of
   that, you need to specify how many bits are used to represent
   *that* count, which adds another ..... you get the idea.
   Fortunately the series converges verrry fast.  This series
   defines a new function, named log^*(), pronounced log-star.
   Constructive schemes are known that give a self-delimiting
   representation of *any* integer, no matter how large.  These
   are called 'universal' representations.  (A sketch of one such
   scheme follows this list.)
  -- etc. etc. etc.
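
For anyone who wants to see a universal representation in action,
here is a short Python sketch of one well-known construction, the
Elias omega code.  (Chaitin's own constructions differ in detail;
this is just an illustration of the idea.)

  def omega_encode(n):
      # Elias omega code: recursively prepend the binary form of each
      # length, then terminate with '0'.  Self-delimiting for any n >= 1.
      code = '0'
      while n > 1:
          b = format(n, 'b')
          code = b + code
          n = len(b) - 1
      return code

  def omega_decode(bits, i=0):
      # Decode one integer starting at position i; return (value, next i).
      n = 1
      while bits[i] == '1':
          width = n + 1
          n = int(bits[i:i + width], 2)
          i += width
      return n, i + 1               # consume the terminating '0'

  # Several integers concatenated with no separators still decode cleanly.
  values = [1, 2, 7, 1000000]
  stream = ''.join(omega_encode(v) for v in values)
  pos, decoded = 0, []
  while pos < len(stream):
      v, pos = omega_decode(stream, pos)
      decoded.append(v)
  assert decoded == values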

I'm not suggesting that the judge needs to understand the log^*
function ... just that it is a mistake to assume that a special
end-marker is needed.  Also I'm suggesting that if you give
examples of the encoding/decoding process, make sure you
indicate whatever metadata (if any!) is required to make the
input and output words properly delimited.  (You will notice
that in the example I posted of diagramming the pigeon-hole
argument, all my words were instantaneously decodable and
therefore self-delimiting.  No metadata required.  No reserved
end-symbol required.)  You don't need to make a big fuss about
it, just make sure the requirements are met, in case anybody
decides to check.  In most cases you can duck the issue entirely
by saying that you are compressing "objects".  It is often not
necessary to specify which part of the object is considered
metadata and which part is considered payload.

> When determining the compression rate for a file people often
> forget that some information is not stored in the file itself, but in
> the file system, e.g. the file length (telling where the compressed
> data stops) and the file name (telling you that the file was
> compressed). That's basically the same problem.

Yes, it's the same problem, or at least a closely-related problem.

In any case it is a solvable problem.

Note that the *filesystem* itself must be self-delimiting.  So it
must have a self-delimiting way of representing the name of the file
and the length of the file, et cetera.  In many cases the filesystem
wimps out and places artificial (perhaps huge) limits on the filesize,
but one could readily fix things up to use a universal representation.

When comparing compressors, IMHO the only fair thing to do is to
consider this metadata in the filesystem to be part of the *object*
to be compressed.  Theoretically this is more-or-less tantamount to
restricting the discussion to inputs and outputs that are
self-delimiting.
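
As a rough sketch of what that accounting looks like (my own
illustration, with a deliberately crude length header), one can
charge every object for a self-delimiting length field and then
compare totals:

  import os, zlib

  def self_delimited_size(payload):
      # Payload size plus a crude self-delimiting length header: the
      # length in bytes, preceded by a one-byte count of how many bytes
      # the length field itself occupies.
      n = len(payload)
      length_bytes = max(1, (n.bit_length() + 7) // 8)
      return 1 + length_bytes + n

  random_blob = os.urandom(100_000)   # incompressible input
  text_blob   = b'abc' * 33_000       # highly redundant input

  for name, blob in [('random', random_blob), ('redundant', text_blob)]:
      raw    = self_delimited_size(blob)
      packed = self_delimited_size(zlib.compress(blob))
      print(f"{name}: raw object {raw} bytes vs compressed object {packed} bytes")
  # The random blob typically comes out slightly *larger* after
  # compression, once the delimiting metadata is counted.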

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com


