cbloom rants


The Natural Lambda

Every compressor has a natural, implicit "lambda" , the Lagrange parameter for space-speed optimization.

Excluding compressors like cmix that are purely aiming for the smallest size, every other compressor is balancing ratio vs. complexity in some way. The author has made countless decisions about space-speed tradeoffs in the design of their codec; are they using huffman or ANS or arithmetic coding? are they using LZ77 or ROLZ or PPM? are they optimal parsing? etc. etc.

There are always ways to spend more CPU time and get more compression, so you choose some balance that you are targeting and try to make decisions that optimize for that balance. (as usual when I talk about "time" or "speed" here I'm talking about decode time ; you can of course also look at balancing encode time vs. ratio, or memory use vs. ratio, or code size, etc.)

Assuming you have done everything well, you may produce a compressor which is "pareto optimal" over some range of speeds. That is if you do something like this : 03-02-15 - Oodle LZ Pareto Frontier , you make a speedup graph over various simulated disk speeds, you are on the pareto frontier if your compressor's curve is the topmost for some range. If your compressor is nowhere topmost, as in eg see the graph with zlib at the bottom : Introducing Oodle Mermaid and Selkie , then try again before proceeding.

The range over which your compressor is pareto optimal is the "natural range" for it. That is a range where it is working better than any other compressor in the world.
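A rough sketch of finding that range, assuming the simple load-time model that these speedup charts are usually built on (time to read the compressed bytes from disk plus time to decode, compared against reading the raw bytes). The codec names and numbers are made up for illustration, not measurements:

```python
def speedup(ratio, decode_mbps, disk_mbps):
    """Load-time speedup vs. reading the raw data, per MB of raw data."""
    raw_time = 1.0 / disk_mbps
    compressed_time = (1.0 / ratio) / disk_mbps + 1.0 / decode_mbps
    return raw_time / compressed_time


def natural_ranges(codecs, disk_speeds):
    """Map each codec name to the simulated disk speeds where its curve is topmost."""
    topmost = {name: [] for name in codecs}
    for disk_mbps in disk_speeds:
        best = max(codecs, key=lambda name: speedup(*codecs[name], disk_mbps))
        topmost[best].append(disk_mbps)
    return topmost


# Hypothetical (compression ratio, decode MB/s) pairs.
codecs = {
    "small_slow": (3.0, 100.0),
    "balanced": (2.2, 1000.0),
    "dominated": (2.0, 500.0),  # worse than "balanced" on both axes
}
disk_speeds = [2.0 ** k for k in range(14)]  # 1 .. 8192 MB/s
ranges = natural_ranges(codecs, disk_speeds)
```

A codec whose list comes back empty is nowhere on the frontier; the others get the set of disk speeds that make up their natural range.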

Outside of that range, the optimal way to use your compressor is to *not* use it. A client that wants optimal behavior outside of that range should simply switch to a different compressor.

As an example of this - people often take a compressor which is designed for a certain space-speed performance, and then ask it questions that are outside of its natural range, such as "well if I change the settings, just how small of a file can it produce?" , or "just how fast can it decode at maximum speed?". These are bogus questions because they are outside of the operating range. It's like asking how delicate an operation can you do with a hammer, or how much weight can you put on an egg. You're using the tool wrong and should switch to a different tool.

(perhaps a better analogy that's closer to home is taking traditional JPEG Huff and trying to use it for very low bit rates (as many ass-hat authors do when trying to show how much better they are than JPEG), like 0.1 bpp, or at very high bit rates trying to get near-lossless quality. It's simply not built to run outside of its target range, and any examination outside of that range is worse than useless. (worse because it can lead you to conclusions that are entirely wrong))

The "natural lambda" for a compressor is at the center of its natural range, where it is pareto optimal. This is the space-speed tradeoff point where the compressor is working its best, where all those implicit decisions are right.

Say you chose to do an LZ compressor with Huffman and to forbid overlapping matches with offset less than 16 (as in Kraken) - those decisions are wrong at some lambda. If you are lucky they are right decisions at the natural lambda.

Of course if you have developed a compressor in an ad-hoc way, it is inevitable that you have made many decisions wrong. The way this has been traditionally done in the past is for the developer to just try some ideas, often they don't really even measure the alternatives. (for example they might choose to pack their LZ offsets in a certain way, but don't really try other ways). If they do seriously try alternatives, they just look at the numbers for compression & speed and sort of eye ball them and make a judgement call. They do not actually have a well defined space-speed goal and a way to measure if they are improving that score.

The previously posted "correct" Weissman Score provides a systematic way of identifying the space-speed range you wish to target and thus identifying if your decisions are right in a space-speed sense.

Most ad-hoc compressors have illogical, contradictory decisions in the sense that they have chosen to do something that makes them slower (to decode), but have failed to do something else which would be a better space-speed step. That is, say you have a compressor, and there are various decisions you face as an implementor. The implicit "natural lambda" for your compressor gives you a desired slope for space-speed tradeoffs. Any decision you can make (either adding a new mode, or disabling a current mode; for example adding context mixing, or disabling arithmetic coding) should be measured against that lambda.

Over the past few years, we at RAD have been trying to go about compressor development in this new systematic way. People often ask us for tips on their compressor, hoping that we have some magical answers, like "oh you just do this and that makes it work perfectly". They're disappointed when we don't really have those answers. Often my advice leads nowhere; I suggest "try this and try that" and they don't really pan out and they think "this guy doesn't know anything". But that's what we do - we just try this and that, and usually they don't work, so we try something else. What we have is a way of working that is careful and methodical and thorough. For every decision, you legitimately try it both ways with well-optimized implementations. You measure that decision with a space-speed score that is targeted at the correct range for your compressor. You measure on a variety of test sets and look at randomized subsets of those test sets, not averages or totals. There's no magic, there's just doing the work to try to make these decisions correctly.

Usually the way that compressor development works starts from a spark of inspiration.

That is, you don't set out by saying "I want to develop a compressor for the Weissman range of 10 - 100 mb/s" and then proceed to make decisions that optimize that score. Instead it starts with some grain of an idea, like "what if I send matches with LZSA and code literals with context mixed RANS". So you start going about that grain of idea and get something working. Thus all compressor development starts out ad hoc.

At some point (if you work at RAD), you need to transition to having a more defined target range for your compressor so you can begin making well-justified decisions about its implementation. Here are a few ways that I have used to do that in the past.

One is to look at a space-speed "speedup" curve, such as the "pareto charts" I linked to above. You can visually pick out the range where your compressor is performing well. If you are pareto optimal vs the competition, you can start from that range. Perhaps early in development you are not yet pareto optimal, but you can still see where you are close, you can spot the area where if you improved a bit you would be pareto optimal. I then take that range and expand it slightly to account for the fact that the compressor is likely to be used slightly outside of its natural range. eg. if it was optimal from 10 - 100 mb/s , I might expand that to 5 - 200 mb/s.

Once you have a Weissman range you can now measure your compressor's "correct" Weissman score. You then construct your compressor to make decisions using a Lagrange parameter, such as :

J = size + lambda * time

For each decision, the compressor tries to minimize J. But you don't know what lambda to use. One way to find it is to numerically optimize for the lambda that maximizes the Weissman score over your desired performance interval.
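As a minimal sketch of a single Lagrangian decision (the option names and the size/time numbers are hypothetical, chosen only to show the choice flipping as lambda moves):

```python
def lagrange_cost(size_bytes, time_cycles, lam):
    """J = size + lambda * time."""
    return size_bytes + lam * time_cycles


def choose(options, lam):
    """options maps a name to (size_bytes, time_cycles); return the name with minimum J."""
    return min(options, key=lambda name: lagrange_cost(*options[name], lam))


# Hypothetical encodings of the same block of literals.
options = {
    "huffman_literals": (1000, 4000),  # smaller, slower to decode
    "raw_literals": (1400, 500),       # larger, faster to decode
}
size_favoring = choose(options, lam=0.01)
speed_favoring = choose(options, lam=1.0)
```

The break-even lambda here is (1400 - 1000) / (4000 - 500), roughly 0.114; below it the smaller encoding wins, above it the faster one does.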

When I first developed Kraken, Mermaid & Selkie, I found their lambdas by using the method of transition points.

Kraken, Mermaid & Selkie are all pareto optimal over large ranges. At some point as you dial lambda on Kraken to favor more speed, you should instead switch to Mermaid. As you dial lambda further towards speed in Mermaid, you should switch to Selkie. We can find what this lambda is where the transition point occurs.

It's simply where the J for each compressor is equal. That is, an outer meta-compressor (like Oodle Hydra ) which could choose between them would have an even choice at this point.

how to set the K/M/S lambdas

    measure space & speed of each

    find the lambda point at which J(Kraken) = J(Mermaid)
        that's the switchover of compressor choice

    call that lambda_KM

    size_K + lambda_KM * time_K = size_M + lambda_KM * time_M
    lambda_KM = (size_M - size_K) / (time_K - time_M)

    same for lambda_MS between Mermaid and Selkie

    choose Mermaid to be at the geometric mean of the two transitions :

    lambda_M = sqrt( lambda_KM * lambda_MS )

    then set lambda_K and lambda_S such that the transitions are at the geometric mean
        between the points, eg :

    lambda_KM = sqrt( lambda_K * lambda_M )
    lambda_K = lambda_KM^2 / lambda_M = lambda_KM * sqrt(lambda_KM / lambda_MS)
    lambda_S = lambda_MS^2 / lambda_M = lambda_MS * sqrt(lambda_MS / lambda_KM)

Of course always beware that these depend on a size & time measurement on a specific file on a specific machine, so gather a bunch on different samples and take the median, discard outliers, etc.
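The steps above can be sketched as follows. The sample sizes and times are invented placeholders; in practice each sample would be a measured (size, time) for one file on one machine, and the median is taken over the per-sample transition lambdas:

```python
import math
from statistics import median


def transition_lambda(size_a, time_a, size_b, time_b):
    """Lambda where J(a) == J(b); a is the smaller/slower codec, b the larger/faster one."""
    return (size_b - size_a) / (time_a - time_b)


def kms_lambdas(samples):
    """samples: list of dicts mapping 'K', 'M', 'S' to (size, time) on one file."""
    lam_km = median(transition_lambda(*s["K"], *s["M"]) for s in samples)
    lam_ms = median(transition_lambda(*s["M"], *s["S"]) for s in samples)
    lam_m = math.sqrt(lam_km * lam_ms)  # Mermaid at the geometric mean
    lam_k = lam_km ** 2 / lam_m         # so that sqrt(lam_k * lam_m) == lam_km
    lam_s = lam_ms ** 2 / lam_m         # so that sqrt(lam_m * lam_s) == lam_ms
    return lam_k, lam_m, lam_s


# Hypothetical measurements (size, time) in arbitrary consistent units.
samples = [
    {"K": (400, 10), "M": (450, 5), "S": (550, 3)},
    {"K": (400, 10), "M": (440, 5), "S": (520, 3)},
    {"K": (400, 10), "M": (460, 5), "S": (580, 3)},
]
lam_k, lam_m, lam_s = kms_lambdas(samples)
```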

(the geometric mean seems to work out well because lambdas seem to like to go in a logarithmic way. This is because the time that we spend to save a byte seems to increase in a power scale way. That is, saving each additional byte in a compressor is 2X harder than the last, so compressors that decrease size more by some linear amount will have times that increase by an exponential amount. Put another way, if J was instead defined to be J = size + lambda * log(time) , with logarithmic time, then lambda would be linear and arithmetic mean would be appropriate here.)

It's easy to get these things wrong, so it's important to do many and frequent sanity checks.

An important one is that : more options should always make the compressor better. That is, giving the encoder more choices of encoding, if it makes a space-speed decision with the right lambda, if the rate measurement is correct, if the speed estimation is correct, if there are no unaccounted for side effects - having that choice should lead to a better space-speed performance. Any time you enable a new option and performance gets worse, something is wrong.

There are many ways this can go wrong. Non-local side effects of decisions (such as parse-statistics feedback) are a confounding one. Another common problem is if the speed estimate doesn't match the actual performance of the code, or if it doesn't match the particular machine you are testing on.

An important sanity check is to have macro-scale options. For example if you have other compressors to switch to and consider J against them - if they are being chosen in the range where you think you should be optimal, something is wrong. One "compressor" that should always be included in this is the "null" compressor (memcpy). If you're targeting high speed, you should always be space-speed testing against memcpy and if you can't beat that, either your compressor is not working well or your supposed speed target range is not right.

In the high compression range, you can sanity check by comparing to preprocessors. eg. rather than spend N more cycles on making your compressor slower, compare to other ways of trading off time for compression, such as perhaps word-dictionary preprocessing on text, delta filtering on wav, etc. If those are better options, then they should be done instead.

One of the things I'm getting at is that every compressor has a "natural lambda" , even ones that are not conscious of it. The fundamental way they function implies certain decisions about space-speed tradeoffs and where they are suitable for use.

You can deduce the natural lambda of a compressor by looking at the choices and alternatives in its design.

For example, you can take a compressor like ZStd. ZStd entropy codes literals with Huffman and other things (ML, LRL, offset log2) with TANS (FSE). You can look at switching those choices - switch the literals to coding with TANS. That will cost you some speed and gain some compression. There is a certain lambda where that decision is moot - either way gives equal J. If ZStd is logically designed, then its natural lambda must lie to the speed side of that decision. Similarly try switching coding ML to Huffman, again you will find a lambda where that decision is moot, and ZStd's natural lambda should lie on the slower side of that decision.

You can look at all the options in a codec and consider alternatives and find this whole set of lambda points where that decision makes sense. Some will constrain your natural lambda from above, some below, and you should lie somewhere in the middle there.
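One way to sketch that bracketing, using J = size + lambda * time: an option that saves some bytes at some extra decode time is worth taking only when lambda is below bytes_saved / extra_time. Options the codec does take bound its natural lambda from above, options it declines bound it from below. The decision names and numbers here are hypothetical, not measurements of ZStd:

```python
def break_even_lambda(bytes_saved, extra_time):
    """Lambda at which taking the size-favoring option leaves J unchanged."""
    return bytes_saved / extra_time


def natural_lambda_bounds(decisions):
    """decisions: list of (name, bytes_saved, extra_time, taken).

    taken=True means the codec pays the extra decode time to get the savings.
    Returns (lower, upper); lower > upper means the design is self-contradictory.
    """
    upper = min(
        (break_even_lambda(b, t) for _, b, t, taken in decisions if taken),
        default=float("inf"),
    )
    lower = max(
        (break_even_lambda(b, t) for _, b, t, taken in decisions if not taken),
        default=0.0,
    )
    return lower, upper


# Hypothetical per-block deltas for two entropy coder choices.
decisions = [
    ("tans_for_literals", 50, 2000, False),      # declined: too slow for the gain
    ("tans_for_match_lengths", 30, 300, True),   # taken: cheap enough for the gain
]
lower, upper = natural_lambda_bounds(decisions)
```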

This is true and possible even for codecs that are not pareto. For however good they are, their construction implies a set of decisions that target an implicit natural lambda.


Oodle tuneability with space-speed tradeoff

Oodle's modern encoders take a parameter called the "space-speed tradeoff". (specifically OodleLZ_CompressOptions::spaceSpeedTradeoffBytes).

"speed" here always refers to decode speed - this is about the encoder making choices about how it forms the compressed bit stream.

This parameter allows the encoders to make decisions that optimize for a space-speed goal which is of your choosing. You can make those decisions favor size more, or you can favor decode speed more.

If you like, a modern compressor is a bit like a compiler. The compressed data is a kind of program in bytecode, and the decompressor is just an interpreter that runs that bytecode. An optimal parser is like an optimizing compiler; you're considering different programs that produce the same output, and trying to find the program that maximizes some metric. The "space-speed tradeoff" parameter is a bit like -Ox vs -Os, optimize for speed vs size in a compiler.

Oodle of course includes Hydra (the many headed beast) which can tune performance by selecting compressors based on their space-speed performance.

But even without Hydra the individual compressors are tuneable, none more so than Mermaid. Mermaid can stretch itself from Selkie-like (LZ4 domain) up to standard LZH compression (ZStd domain).

I thought I would show an example of how flexible Mermaid is. Here's Mermaid level 4 (Normal) with some different space-speed tradeoff parameters :

sstb = space speed tradeoff bytes

sstb 32 :  ooMermaid4  :  2.29:1 ,   33.6 enc mbps , 1607.2 dec mbps
sstb 64 :  ooMermaid4  :  2.28:1 ,   33.8 enc mbps , 1675.4 dec mbps
sstb 128:  ooMermaid4  :  2.23:1 ,   34.1 enc mbps , 2138.9 dec mbps
sstb 256:  ooMermaid4  :  2.19:1 ,   33.9 enc mbps , 2390.0 dec mbps
sstb 512:  ooMermaid4  :  2.05:1 ,   34.3 enc mbps , 2980.5 dec mbps
sstb 1024: ooMermaid4  :  1.89:1 ,   34.4 enc mbps , 3637.5 dec mbps

compare to : (*)

zstd9       :  2.18:1 ,   37.8 enc mbps ,  590.2 dec mbps
lz4hc       :  1.67:1 ,   29.8 enc mbps , 2592.0 dec mbps

(* MSVC build of ZStd/LZ4 , not a fair speed measurement (they're faster in GCC), just use as a general reference point)

Point being - not only can Mermaid span a large range of performance but it's *good* at both ends of that range, it's not getting terrible as it gets out of its comfort zone.

You may notice that as sstb goes below 128 you're losing a lot of decode speed and not gaining much size. The problem is you're trying to squeeze a lot of ratio out of a compressor that just doesn't target high ratio. As you get into that domain you need to switch to Kraken. That is, there comes a point where the space-speed benefit of squeezing the last drop out of Mermaid is harder than just making the jump to Kraken. And that's where Hydra comes in, it will do that for you at the right spot.
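To put rough numbers on that, here is a small sketch that takes the Mermaid level 4 rows from the table above and computes, for each step down in sstb, how much extra decode time is paid per unit of compressed size saved. The normalization (per 1 MB of raw data, time in milliseconds) is my own choice for illustration:

```python
# (sstb, compression ratio, decode MB/s) copied from the table above.
MERMAID4 = [
    (32, 2.29, 1607.2),
    (64, 2.28, 1675.4),
    (128, 2.23, 2138.9),
    (256, 2.19, 2390.0),
    (512, 2.05, 2980.5),
    (1024, 1.89, 3637.5),
]


def marginal_costs(rows):
    """Extra decode ms per MB of compressed size saved, per MB of raw data,
    for each step from one sstb setting down to the next-lower one."""
    rows = sorted(rows)
    costs = {}
    for (lo_sstb, lo_ratio, lo_mbps), (hi_sstb, hi_ratio, hi_mbps) in zip(rows, rows[1:]):
        mb_saved = 1.0 / hi_ratio - 1.0 / lo_ratio
        extra_ms = 1000.0 / lo_mbps - 1000.0 / hi_mbps
        costs[(hi_sstb, lo_sstb)] = extra_ms / mb_saved
    return costs


costs = marginal_costs(MERMAID4)
for step, cost in sorted(costs.items(), reverse=True):
    print(step, round(cost, 2))
```

On these rows the price of each saved byte at the low-sstb end comes out several times higher than at the high-sstb end, which is the diminishing return described above.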

ADD : Put another way, in Oodle there are *two* speed-ratio tradeoff dials. Most people are just familiar with the compression "level" dial, as in Zip, where higher levels = slower to encode, but more compression ratio. In Oodle you have that, but also a dial for decode time :

CompressionLevel = trade off encode time for compression ratio

SpaceSpeedTradeoffBytes = trade off decode time for compression ratio

Perhaps I'll show some sample use cases :

Default initial setting :

CompressionLevel = Normal (4)
SpaceSpeedTradeoffBytes = 256

Reasonably fast encode & decode.  This is a balance between caring about encode time, decode time,
and compression ratio.  Tries to do a decent job of all 3.

To maximize compression ratio, when you don't care about encode time or decode time :

CompressionLevel = Optimal4 (8)
SpaceSpeedTradeoffBytes = 1

You want every possible byte of compression and you don't care how much time it costs you to encode or
decode.  In practice this is a bit silly, rather like the "placebo" mode in x264.  You're spending
potentially a lot of CPU time for very small gains.

A more reasonable very high compression setting :

CompressionLevel = Optimal3 (7)
SpaceSpeedTradeoffBytes = 16

This still says you strongly value ratio over encode time or decode time, but you don't want to chase
tiny gains in ratio that cost a huge amount of decode time.

If you care about decode time but not encode time :

CompressionLevel = Optimal4 (8)
SpaceSpeedTradeoffBytes = 256

Crank up the encode level to spend lots of time making the best possible compressed stream, but make
decisions in the encoder that balance decode time.


The SpaceSpeedTradeoffBytes is a number of bytes that Oodle must be able to save in order to accept a certain time increase in the decoder. In Kraken that unit of time is 25600 cycles on the artificial machine model that we use. (that's 8.53 microseconds at 3 GHz). So at the default value of 256, it must save 1 byte in compressed size to take an increased time of 100 cycles.
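The arithmetic in that paragraph can be sketched like this. Only the 25600-cycle unit and the 3 GHz clock come from the text; the function names and the accept rule as written are an illustrative reading of it, not Oodle's internal code:

```python
KRAKEN_TIME_UNIT_CYCLES = 25600  # time unit quoted in the text for Kraken


def cycles_per_byte_saved(sstb):
    """Decode cycles the encoder will spend to save one byte at this setting."""
    return KRAKEN_TIME_UNIT_CYCLES / sstb


def accepts(bytes_saved, extra_decode_cycles, sstb):
    """True if the savings justify the extra decode time under this reading."""
    return bytes_saved * cycles_per_byte_saved(sstb) >= extra_decode_cycles


unit_microseconds = KRAKEN_TIME_UNIT_CYCLES / 3.0e9 * 1.0e6  # at 3 GHz
default_cycles = cycles_per_byte_saved(256)
```

At sstb = 1 a single saved byte justifies the full 25600 cycles, which is why that setting chases every last byte.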

Some learnings from ZStd

I've spent some time in the last month looking into cases where ZStd beats Kraken & Mermaid.

Most of the time Kraken gets better ratio than ZStd, but there were exceptions to that (mainly text), and it always kind of bothered me, since Kraken is roughly a superset of ZStd (not exactly), and the differences are small, it shouldn't have been winning by more than 1% (which is the variation I'd expect due to small differences). On text files, I have no edge over ZStd, all my advantages are moot, so we're reduced to both being pretty basic LZ-Huffs; so we should be equal, but I was losing. So I dug in to see what was going on.

Thanks of course to Yann for making his great work open source so that I'm able to look at it; open source and sharing code is a wonderful and helpful thing when people choose to do so voluntarily, not so nice when your work is stolen from you against your will and shown to the world like phone-hacked dick-pics *cough* *assholes*. Since I'm learning from open source, I figured I should give back, so I'm posting what I learned.

A lot of the differences are a question of binary vs. text focus. ZStd has some tweaking that clearly comes from testing on text and corpora with a lot of text (like silesia). On the other hand, I've been focusing very much on binary and that has caused me to miss some important things that only show up when you look closely at text performance.

This is what I found :

Long hashes are good for text, bad for binary

ZStd non-optimal levels use hash lengths of 5 or even 6 or 7 at the fastest levels. This helps on text because text has many long matches, so it's important to have a hash long enough that it can differentiate between "boogie" and "booger" and put them in different hash table bins. (this is most important at the fastest levels which are cache table with no ways).

On binary you really want to hash len 4 because there are important matches of exactly len 4, and longer hashes can make you miss them.

zstd2 hash len 6 :
PD3D    : zstd2 : 31,941,800 ->11,342,055 =  2.841 bpb =  2.816 to 1 

zstd2 hash len 4 :
PD3D    : zstd2 : 31,941,800 ->10,828,309 =  2.712 bpb =  2.950 to 1 

zstd2 hash len 6 :
dickens : zstd2 : 10,192,446 -> 3,909,882 =  3.069 bpb =  2.607 to 1 

zstd2 hash len 4 :
dickens : zstd2 : 10,192,446 -> 4,387,536 =  3.444 bpb =  2.323 to 1 

Longer hashes help the fast modes a *lot* on text. If you care about fast compression of text you really want those longer hashes.

This is a big issue and because of it ZStd fast modes will continue to be better than Oodle on text (and Oodle will be better on binary); or we have to find a good way to detect the data type and tune the hash length to match.
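A tiny sketch of why hash length cuts both ways. The multiplicative hash and the table size are arbitrary stand-ins, not what ZStd or Oodle actually use; the point is only which bytes feed the hash:

```python
def hash_key(buf, pos, hash_len):
    """The bytes that feed the hash at this position."""
    return bytes(buf[pos:pos + hash_len])


def hash_bin(buf, pos, hash_len, table_bits=16):
    """Illustrative multiplicative hash of the first hash_len bytes at pos."""
    x = int.from_bytes(hash_key(buf, pos, hash_len), "little")
    product = (x * 0x9E3779B185EBCA87) & 0xFFFFFFFFFFFFFFFF
    return product >> (64 - table_bits)


# Text: a len 4 hash sees only "boog", so both words land in the same bin
# and evict each other in a cache table with no ways.
same_bin_len4 = hash_bin(b"boogie", 0, 4) == hash_bin(b"booger", 0, 4)

# Binary: these two positions share an exact 4-byte match, but a len 6 hash
# keys on different bytes, so that match is only found by a lucky collision.
keys_differ_len6 = hash_key(b"abcdXY", 0, 6) != hash_key(b"abcdZW", 0, 6)
```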

lazy2 is helpful on text

Standard lazy parsing looks for a match at ptr, if one is found it also looks at ptr+1 to see if something better is there. Lazy2 also looks at ptr+2.
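A toy sketch of the idea, with a brute-force match finder and a "longer match wins" rule standing in for the real heuristics (actual codecs weigh match gain against the cost of the extra literals). lookahead=0 is greedy, 1 is standard lazy, 2 is lazy2:

```python
def longest_match(data, pos, min_len=4):
    """Brute-force longest earlier match for data[pos:]; returns (length, offset) or (0, 0)."""
    best_len, best_off = 0, 0
    for start in range(pos):
        length = 0
        while pos + length < len(data) and data[start + length] == data[pos + length]:
            length += 1
        if length >= min_len and length > best_len:
            best_len, best_off = length, pos - start
    return best_len, best_off


def lazy_parse(data, lookahead=1, min_len=4):
    """Return a list of ("lit", byte) and ("match", length, offset) tokens."""
    pos, tokens = 0, []
    while pos < len(data):
        length, offset = longest_match(data, pos, min_len)
        defer = False
        if length:
            # Peek at pos+1 .. pos+lookahead for something longer.
            for step in range(1, lookahead + 1):
                if pos + step >= len(data):
                    break
                if longest_match(data, pos + step, min_len)[0] > length:
                    defer = True
                    break
        if length and not defer:
            tokens.append(("match", length, offset))
            pos += length
        else:
            tokens.append(("lit", data[pos]))
            pos += 1
    return tokens


def decode(tokens):
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, length, offset = token
            for _ in range(length):
                out.append(out[-offset])
    return bytes(out)
```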

I wasn't doing 2-ahead lazy parsing, because on binary it doesn't help much. But on text it's a nice little win :

Zstd level 9 has 2-step lazy normally :

zstd9 : 41,458,703 ->10,669,424 =  2.059 bpb =  3.886 to 1 

disabled : (1-step lazy) :

zstd9 : 41,458,703 ->10,825,637 =  2.089 bpb =  3.830 to 1 

optimal parser all len reductions helps on text

I once wrote that in codecs that do strong rep0 exclusion (rep0len1 literal can't occur immediately after a match), that you can just always send max-length matches, and not have to consider match length reductions. (because max-length matches maintain rep0 exclusion but shorter ones violate it).

That is not quite right. It tends to be true on binary, but is wrong on text. The issue is that you only get the rep0 exclusion benefit if you actually send a literal after the match.

That happens often on binary. Binary frequently goes match-literal-match-literal , with some near-random bytes between predictable regions. Text has very few literals. Many text files go match-match-match which means the rep0 literal exclusion does nothing for you.

On text files you often have many short & medium length overlapping matches, and trying len reductions is important to find the parse that traces through them optimally.


and the optimal parse might be


which you would only find if you tried the len reduction of A

this kind of thing. Text is all about making the best normal-match decisions.
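A toy illustration of the len reduction effect, under assumptions that are mine and not from any real codec: fixed costs of 8 bits per literal and 20 bits per match, minimum match length 4, only the longest match at each position as a candidate, and no rep0 exclusion modeled. The test string is constructed so that match A (len 7) overlaps the start of match B (len 6):

```python
def longest_match(data, pos, min_len=4):
    """Brute-force longest earlier match for data[pos:]; returns (length, offset) or (0, 0)."""
    best_len, best_off = 0, 0
    for start in range(pos):
        length = 0
        while pos + length < len(data) and data[start + length] == data[pos + length]:
            length += 1
        if length >= min_len and length > best_len:
            best_len, best_off = length, pos - start
    return best_len, best_off


def parse_cost(data, all_lens, min_len=4, lit_bits=8, match_bits=20):
    """Cheapest parse cost in bits via forward dynamic programming."""
    n = len(data)
    cost = [float("inf")] * (n + 1)
    cost[0] = 0
    for pos in range(n):
        # Option 1: send a literal.
        cost[pos + 1] = min(cost[pos + 1], cost[pos] + lit_bits)
        # Option 2: send a match, either max length only or every reduced length.
        max_len, _ = longest_match(data, pos, min_len)
        if max_len:
            lengths = range(min_len, max_len + 1) if all_lens else [max_len]
            for length in lengths:
                cost[pos + length] = min(cost[pos + length], cost[pos] + match_bits)
    return cost[n]


# "abcdefg" (A) and "efgXYZ" (B) both occur earlier; the tail is "abcdefgXYZ".
data = b"abcdefg1" + b"9efgXYZ2" + b"abcdefgXYZ"
bits_max_len_only = parse_cost(data, all_lens=False)
bits_all_lens = parse_cost(data, all_lens=True)
```

With max-length matches only, the best parse of the tail is A(7) plus three literals; with reductions it is A(4) followed by B(6), which is 4 bits cheaper under these made-up costs.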

with all len reductions :

zstd22 : 10,000,000 -> 2,800,209 =  2.240 bpb =  3.571 to 1 

without :

zstd22 : 10,000,000 -> 2,833,168 =  2.267 bpb =  3.530 to 1 

Getting len 3 matches right in the optimal parser is really important on text

Part of the "text is all matches" issue. My codecs are mostly MML 4 in the non-optimal modes, then I switch to MML3 at level 7 (Optimal3). Adding MML3 generally lets you get a bit more compression ratio, but hurts decode speed a bit.

(BTW MML3 in the non-optimal modes generally *hurts* compression ratio, because they can't make the decision correctly about when to use it. A len 3 match is always marginal, it's only slightly cheaper than 3 literals (depending on the literals), and you probably don't want it if you can find any longer match within those next 3 bytes. Non-optimal parsers just make these decisions wrong and muck it all up, they do better with MML 4 or even higher sometimes. (there are definitely files where you can crank up MML to 6 or 8 and improve ratio))

So, I was doing that *but* I was using the statistics from a greedy pre-pass to seed the optimal parse decisions, and the greedy pre-pass was MML 4, which was biasing the optimal against len 3 matches. It was just a fuckup, and it wasn't hurting me on binary, but when I compared to ZStd's optimal parse on text I could immediately see it had a lot more len 3 matches than me.

(this is also an example of the parse-statistics feedback problem, which I believe is the most important problem in LZ compression)


zstd22 : 10,192,446 -> 2,856,038 =  2.242 bpb =  3.569 to 1

before :
ooKraken7 : 10,192,446 -> 2,905,719 =  2.281 bpb =  3.508 to 1

after  :
ooKraken7 : 10,192,446 -> 2,862,710 =  2.247 bpb =  3.560 to 1 

ZStd is full of small clever bits

There are lots of little clever nuggets that are hard to see. They aren't generally commented and they're buried in chunks of copy-pasted code that all looks the same so it's easy to gloss over the variations.

I looked over this code many times :

        if ((offset_1 > 0) & (MEM_read32(ip+1-offset_1) == MEM_read32(ip+1))) {
            mLength = ZSTD_count(ip+1+4, ip+1+4-offset_1, iend) + 4;
            ZSTD_storeSeq(seqStorePtr, ip-anchor, anchor, 0, mLength-MINMATCH);
        } else {
            U32 offset;
            if ( (matchIndex <= lowestIndex) || (MEM_read32(match) != MEM_read32(ip)) ) {
                ip += ((ip-anchor) >> g_searchStrength) + 1;
            // [ got match etc... ]

and I thought - okay, look for a 4 byte rep match, if found take it unconditionally and don't look for normal match. That's the same thing I do (I think it came from me?), no biggie.

But there's a wrinkle. The rep check is not at the same position as the normal match. It's at pos+1.

This is actually a mini-lazy-parse. It doesn't do a full match & rep find at pos & (pos+1). It's just scanning through, at each pos it only does one rep find and one match find, but the rep find is offset forward by +1. That means it will take {literal + rep} even if match is available, which a normal non-lazy parser can't do.

(aside : you might think that this misses a rep find, when the literal run starts, right after a match, it starts finding the first rep at pos+1 so there's a spot where it does no rep find. But that spot is where the rep0 exclusion applies - there can be no rep there, so it's all good!)

This is a solid win and it's totally for free, so very cool.

Seven testset 

with rep-ahead search :

total : zstd3       : 80,000,000 ->34,464,878 =  3.446 bpb =  2.321 to 1 

with rep at same pos as match :

total : zstd3       : 80,000,000 ->34,521,261 =  3.452 bpb =  2.317 to 1 

The end.

ADD : a couple more notes on ZStd (that aren't from the recent investigation) while I'm at it :

ZStd uses a unique approach to the lrl0-rep0 exclusion

After a match (of full length), that same offset cannot match again. If your offsets are in a rep match cache, the most recently used offset is the top (0th) entry, rep0. This is the lrl0-rep0 exclusion.

rep0 is usually the most likely match, so it will get the largest share of the entropy coder probability space. Therefore if you're in an exclusion where that symbol is impossible, you're wasting a lot of bits.

There are two ways that I would call "traditional" or straightforward data compression ways to model the lrl0-rep0 exclusion. One is to use a single bit for (lrl == 0) as context for the rep-index coding event. eg. you have two entropy coding states for offsets, one for lrl == 0 and one for lrl != 0. The other classical method would be to combine lrl with rep-index in a larger alphabet, which allows you to model their correlation using only order-0 entropy coding. The minimum alphabet size here is only 2 bits, 1 bit for (lrl == 0) or not, and one for (match == rep0) or not.

ZStd does not use either of these methods. Instead it shifts the rep index by (lrl == 0). That is, ZStd has 3 reps, and normally they are in match offset slots 0,1,2. But right after the end of a match (when lrl is 0) those offset values change to mean rep 1,2,3 ; and there is no rep3, that's a virtual offset equal to (rep0 - 1).
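A minimal sketch of that index shift, written with the 0-based rep numbering used in this post. It only resolves a rep slot to an offset; how the rep cache is reordered after the match is left out, and the function name is made up:

```python
def resolve_rep_slot(slot, lrl, reps):
    """Map a coded rep slot (0, 1 or 2) to an actual offset.

    reps is [rep0, rep1, rep2]. When lrl == 0 the slots shift by one,
    and the slot that would mean "rep3" is the virtual offset rep0 - 1.
    """
    index = slot + (1 if lrl == 0 else 0)
    if index < 3:
        return reps[index]
    return reps[0] - 1


reps = [100, 37, 8]
after_literals = [resolve_rep_slot(s, lrl=5, reps=reps) for s in range(3)]
after_match = [resolve_rep_slot(s, lrl=0, reps=reps) for s in range(3)]
```

Note that rep0 itself is never produced when lrl is 0, so no code space is spent on the excluded case.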

The ZStd format documentation is a good reference for these things.

I can't say how well the ZStd method here compares to the alternatives as it's a bit more effort to check than I'd like to do. (if you want to try it, you could double the size of ZStd's offset coding alphabet to put 1 bit of lrl == 0 into the offset coding; then the decode sequence grabs an offset and only pulls an lrl code if the offset bit says so).

ZStd uses TANS in a limited and efficient way

ZStd does not use TANS (FSE) on its literals, which are the largest class of entropy coded symbols. Presumably Yann found, like us, that the compression gains on literals (over Huffman) are small, and the speed cost is not worth it. ZStd only uses TANS on the LZ match components - LRL, offset, ML.

Each of these has a small alphabet (max symbol values 52, 35, 28 for ML, LRL, offset), and therefore can use a small # of bits for the TANS tables (9, 9, 8 respectively). This is a sweet spot for TANS, so it works well in ZStd.

For large alphabets (eg. 256 for literals), TANS needs a higher # of bits for its code tables (at least 11), which means 2048 entries being filled. This makes the table setup time rather large. By cutting the table size to 8 or 9 bits you cut that down by 4-8X. With large alphabets you also may as well just go Huff. But with small alphabets, Huff gets worse and worse. Consider the extreme - in an alphabet of 2 symbols Huff becomes no compression at all, while TANS can still do entropy coding. With small alphabets to use Huffman you need to combine symbols (eg. in a 2-bit alphabet you would code 4 at once as an 8-bit symbol). BUT that means going up to big decoder tables again, which adds to your constant overhead.
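The small-alphabet point is easy to check numerically. This sketch takes the extreme case from the text, a skewed 2-symbol source (p = 0.9 is just an example value), and compares the order-0 entropy against the average Huffman code length when coding one symbol at a time versus blocks of 4 symbols combined into one 16-symbol alphabet:

```python
import heapq
import itertools
import math


def entropy_bits(probs):
    """Order-0 entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_avg_bits(probs):
    """Average Huffman code length; equals the sum of all merged node weights."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total


def huffman_bits_per_symbol(p, block):
    """Huffman cost per binary symbol when `block` symbols are coded as one."""
    block_probs = [math.prod(c) for c in itertools.product((p, 1.0 - p), repeat=block)]
    return huffman_avg_bits(block_probs) / block


ideal = entropy_bits([0.9, 0.1])
one_at_a_time = huffman_bits_per_symbol(0.9, 1)
four_at_a_time = huffman_bits_per_symbol(0.9, 4)
```

Coding one symbol at a time Huffman is stuck at 1 bit per symbol; combining symbols gets closer to the entropy, at the price of a larger decode table.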

FSE uses the prime-scatter method to fill the TANS decode table. (this is using a relatively-prime step to just walk around the circular array, using the property that you can just keep stepping that way and you will eventually hit every slot once and only once). I evaluated the prime-scatter method before and concluded that the compression penalty was unacceptably large. I was mistaken. I had just implemented it wrong, so my results were much worse than they should be.

(the mistake I made was that I did the prime-scatter in one pass; for each symbol, take the steps and fill table entries, increment "from_state" as you step, "to_state" steps around with the prime-modulo. This causes a non-monotonic relationship between from_state and to_state which is very bad. The right way to do it (the way ZStd/FSE does it) is to use some kind of two-pass scheme, so that you do the shuffle-scatter first (which can step around the loop non-monotonically) but then assign the from_state relationship in a second pass which ensures the monotonic relationship).

With a correct implementation, prime-scatter's compression ratio is totally fine (*). The two-pass method that ZStd/FSE uses would be slow for large alphabets or large L, but ZStd only uses FSE for small alphabets and small L. The entropy coder and application are well matched. (* = if you special case singletons, as below)

The worst case for prime-scatter is low counts, and counts of 1 are the worst. ZStd/FSE uses a special case for counts of 1 that are "below 1". Back in the "Understanding TANS" series I looked at the "precise sort" method of table building and found that artificially skewing the bias to put counts of 1 at the end was a big win in practice. The issue there is that the counts we see at that point are normalized, and zeros were forced up to 1 for codeability. The true count might be much lower. Say you're coding an array of size 64k and symbol 'x' only occurs 1 time. If you have a TANS L of 1024 , the true probability should be 1/64k , but normalized forces it up to 1/1024. Putting the singleton counts at the end of the TANS array gives them the maximum codelen (end of the array has maximum fractional bits). The sort bias I did before was a hack that relies on the fact that most singleton counts come from below-1 normalized probabilities. ZStd/FSE explicitly signals the difference, it can send a "true 1" (eg. closest normalized probability really is 1/1024 ; eg. in the 64k array, count is near 64), or a "below 1" , some very low count that got forced up to 1. The "below 1" symbols are forced to the end of the TANS array while the true 1's are allowed to prime-scatter like other symbols.
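Here is a sketch of a two-pass build in that spirit, modeled on my reading of how FSE fills its decode table. The step formula, the use of -1 to mark a "below 1" count, and the state update are written from memory and should be checked against the FSE source before relying on them; the counts are made up:

```python
def spread_symbols(counts, table_log):
    """Pass 1: scatter symbols around the table with a step coprime to its size.

    counts[s] is the normalized count of symbol s; -1 marks a "below 1" symbol,
    which takes one slot at the end of the table instead of being scattered.
    Assumes table_log >= 4 so that the step is odd.
    """
    size = 1 << table_log
    if sum(1 if c == -1 else c for c in counts) != size:
        raise ValueError("normalized counts must sum to the table size")
    step = (size >> 1) + (size >> 3) + 3
    table = [None] * size
    high = size - 1
    for sym, count in enumerate(counts):
        if count == -1:
            table[high] = sym
            high -= 1
    pos = 0
    for sym, count in enumerate(counts):
        for _ in range(max(count, 0)):
            table[pos] = sym
            pos = (pos + step) & (size - 1)
            while pos > high:  # skip slots reserved for "below 1" symbols
                pos = (pos + step) & (size - 1)
    return table


def build_decode_entries(table, counts, table_log):
    """Pass 2: walk slots in order so each symbol's states are assigned monotonically."""
    size = 1 << table_log
    next_state = [1 if c == -1 else c for c in counts]
    entries = []
    for sym in table:
        x = next_state[sym]
        next_state[sym] += 1
        nb_bits = table_log - (x.bit_length() - 1)
        entries.append((sym, nb_bits, (x << nb_bits) - size))
    return entries


counts = [100, 80, 50, 25, -1]  # hypothetical normalized counts, table size 256
table = spread_symbols(counts, table_log=8)
entries = build_decode_entries(table, counts, table_log=8)
```

The "below 1" symbol ends up in the last slot and reads a full table_log bits, the maximum code length.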

The end.


Oodle 2.5.5 - encoder bug fix

Oodle 2.5.5 fixes a bug in the Kraken & Mermaid encoders which could cause them to make compressed data that decodes incorrectly (producing output different than the original) or could cause the decoder to return failure.

This bug was present from Oodle 2.5.0 to 2.5.4 ; if you use those versions you should update to 2.5.5

When the bug occurs, the OodleLZ_Compress call returns success, thinking it made valid compressed data, but it has actually made a damaged bit stream. When you call Decompress it might return failure, or it might return success but produce decompressed output that does not match the original bits.

Any compressed data that you have made which decodes successfully (and matches the original uncompressed data) is fine. The presence of the bug can only be detected by attempting to decode compressed data and checking that it matches the original uncompressed data.

The decoder is not affected by this bug, so if you have shipped user installations that only do decoding, they don't need to be updated. If you have compressed files which were made incorrectly because of this bug, you can patch only those individual compressed files.

Technical details :

This bug was caused by one of the internal bit stream write pointers writing past the end of its bits, potentially over-writing another previously written bit stream. This caused some of the previously written bits to become garbage, causing them to decode into something other than what they had been encoded from.

This only occurred with 64-bit encoders. Any data written by 32-bit encoders is not affected by this bug.

This bug could in theory occur on any Kraken & Mermaid compressed data. In practice it's very rare and I've only seen it in one particular case - "whole huff chunks" on data that is only getting a little bit of compression, with uncompressed data that has a trinary byte structure (such as 24-bit RGB). It's also much more likely in pre-2.3.0 compatibility mode (eg. with OodleLZ_BackwardsCompatible_MajorVersion=2 or lower).

BTW it's probably a good idea in general to decode and verify the data after every compress.

I don't do it automatically in Oodle because it would add to encode time, but on second thought that might be a mistake. Pretty much all the Oodle codecs are so asymmetric, that doing a full decode every time wouldn't add much to the encode time. For example :

Kraken Normal level encodes at 50 MB/s
Kraken decodes at 1000 MB/s

To encode 1 MB is 0.02 s
To decode 1 MB is 0.001 s

To decode after every encode changes the encode time to 0.021 s = 47.6 MB/s

it's not a very significant penalty to encode time, and it's worth it to verify that your data definitely decodes correctly. I think it's a good idea to go ahead and add this to your tools.
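A minimal sketch of that kind of tool-side check, using Python's zlib purely as a stand-in for the real compress and decompress calls (the function names here are made up for the example, this is not Oodle API). The second function just redoes the arithmetic above.

```python
import zlib


def compress_verified(raw, level=6):
    """Compress, then immediately decode and compare against the input.

    zlib stands in for the real codec; swap in your own compress and
    decompress calls.
    """
    comp = zlib.compress(raw, level)
    try:
        decoded = zlib.decompress(comp)
    except zlib.error as err:
        raise RuntimeError("decoder returned failure") from err
    if decoded != raw:
        raise RuntimeError("decoded output does not match the original")
    return comp


def rate_with_verify(enc_mb_per_s, dec_mb_per_s):
    """Effective encode rate (MB/s) when every encode is followed by a decode."""
    seconds_per_mb = 1.0 / enc_mb_per_s + 1.0 / dec_mb_per_s
    return 1.0 / seconds_per_mb


if __name__ == "__main__":
    print(len(compress_verified(b"hello hello hello " * 1000)))
    print(round(rate_with_verify(50, 1000), 1))
```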

I may add a "verify" option to the Compress API in the future to automate this.


Oodle 2.5.4 - now with Windows UWP

Oodle 2.5.4 is out. There's now a separate Windows UWP SDK (separate from Win32).

Oodle for Windows UWP comes with only the "core" library that does memory to memory compression. The Oodle Core library uses no threads, has minimal dependencies (just the CRT), no funny business, making it very portable.

For full details see the Oodle Change Log


Well Crap

I was cleaning my blog, deleting a bunch of old posts, and accidentally deleted some I didn't want to. I'm going to repost a few, so if you have a subscription you may see odd old posts floating in because of that.

Unfortunately Blogger has no recover or trash can feature that would let me just undo the delete. Frowny face. Also, while I can repost them, the comments are gone. And unfortunately it seems I can't post them to the same URL. The blogger post URL seems to be irrevocably marked with the post date, and even if I retro-date the post, it munges the URL to not be the same as the original.

ADD : I reposted a few of the ones I wanted to save. The new links are :

cbloom rants 09-27-08 On LZ and ACB
cbloom rants 10-05-08 Rant on New Arithmetic Coders
cbloom rants 10-06-08 Followup on the Russian Range Coder
cbloom rants 10-07-08 Random file stuff I've learned
cbloom rants 10-07-08 A little more on arithmetic coding ...
cbloom rants 10-08-08 Arithmetic coders throw away accuracy in lots of little places.
cbloom rants 10-10-08 On LZ Optimal Parsing
cbloom rants 10-10-08 On the Art of Good Arithmetic Coder Use


Oodle Perf with Chunking and Dictionary Size

I get a lot of customers that want to cut their data into small blocks for paging, who ask "what's the benefit of using larger blocks" ?

Larger blocks = more compression, and they can also help throughput (decode speed).

Obviously larger block = longer latency (to load & decode one whole block).

(though you can get data out incrementally, you don't have to wait for the whole decode to get the first byte out; but if you only needed the last byte of the block, it's strictly longer latency).

If you need fine grain paging, you have to trade off the desire to get precise control of your loading with small blocks & the benefits of larger blocks.

(obviously always follow general good paging practice, like amortize disk seeks, combine small resources into paging units, don't load a 256k chunk and just keep 1k of it and throw the rest away, etc.)

As a reference point, here's Kraken on Silesia with various chunk sizes :

Silesia : (Kraken Normal -z4)

 16k : ooKraken    : 211,938,580 ->75,624,641 =  2.855 bpb =  2.803 to 1 
 16k : decode           : 264.190 millis, 4.24 c/b, rate= 802.22 mb/s

 32k : ooKraken    : 211,938,580 ->70,906,686 =  2.676 bpb =  2.989 to 1 
 32k : decode           : 217.339 millis, 3.49 c/b, rate= 975.15 mb/s

 64k : ooKraken    : 211,938,580 ->67,562,203 =  2.550 bpb =  3.137 to 1 
 64k : decode           : 195.793 millis, 3.14 c/b, rate= 1082.46 mb/s

128k : ooKraken    : 211,938,580 ->65,274,250 =  2.464 bpb =  3.247 to 1 
128k : decode           : 183.232 millis, 2.94 c/b, rate= 1156.67 mb/s

256k : ooKraken    : 211,938,580 ->63,548,390 =  2.399 bpb =  3.335 to 1 
256k : decode           : 182.080 millis, 2.92 c/b, rate= 1163.99 mb/s

512k : ooKraken    : 211,938,580 ->61,875,640 =  2.336 bpb =  3.425 to 1 
512k : decode           : 182.018 millis, 2.92 c/b, rate= 1164.38 mb/s

1024k: ooKraken    : 211,938,580 ->60,602,177 =  2.288 bpb =  3.497 to 1 
1024k: decode           : 181.486 millis, 2.91 c/b, rate= 1167.80 mb/s

files: ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files: decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s

16k   :  2.80:1 ,   15.7 enc mbps ,  802.2 dec mbps
32k   :  2.99:1 ,   19.7 enc mbps ,  975.2 dec mbps
64k   :  3.14:1 ,   22.8 enc mbps , 1082.5 dec mbps
128k  :  3.25:1 ,   24.6 enc mbps , 1156.7 dec mbps
256k  :  3.34:1 ,   25.5 enc mbps , 1164.0 dec mbps
512k  :  3.43:1 ,   25.4 enc mbps , 1164.4 dec mbps
1024k :  3.50:1 ,   24.6 enc mbps , 1167.8 dec mbps
files :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

(note these are *chunks* not a window size; no carry-over of compressor state or dictionary is allowed across chunks. "files" means compress the individual files of silesia as whole units, but reset compressor between files.)
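If you want to run the same kind of experiment on your own data, a rough sketch looks like this. zlib is only a stand-in codec here and the helper names are made up for the example; the point is just that every chunk is compressed from scratch with no state carried over.

```python
import zlib


def compress_chunks(raw, chunk_size, level=6):
    """Compress each chunk independently; no dictionary carries across chunks."""
    return [zlib.compress(raw[i:i + chunk_size], level)
            for i in range(0, len(raw), chunk_size)]


def decompress_chunks(chunks):
    """Any chunk can be decoded on its own; here we decode them all in order."""
    return b"".join(zlib.decompress(c) for c in chunks)


def chunked_size(raw, chunk_size):
    """Total compressed size when raw is cut into chunk_size pieces."""
    return sum(len(c) for c in compress_chunks(raw, chunk_size))


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        data = f.read()
    for kb in (16, 32, 64, 128, 256, 512, 1024):
        size = chunked_size(data, kb * 1024)
        print(f"{kb:5d}k : {len(data):,} -> {size:,}")
```

With zlib the curve flattens out early because its window is only 32k; a codec with a larger window keeps gaining as chunks get bigger, as in the Kraken numbers above.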

You may have noticed that the chunked files (once you get past the very small 16k,32k) are somewhat faster to decode. This is due to keeping match references in the CPU cache in the decoder.

Limiting the match window (OodleLZ_CompressOptions::dictionarySize) gives the same speed benefit for staying in cache, but with a smaller compression win.

window 128k : ooKraken    : 211,938,580 ->61,939,885 =  2.338 bpb =  3.422 to 1 
window 128k : decode           : 181.967 millis, 2.92 c/b, rate= 1164.71 mb/s

window 256k : ooKraken    : 211,938,580 ->60,688,467 =  2.291 bpb =  3.492 to 1 
window 256k : decode           : 182.316 millis, 2.93 c/b, rate= 1162.48 mb/s

window 512k : ooKraken    : 211,938,580 ->59,658,759 =  2.252 bpb =  3.553 to 1 
window 512k : decode           : 184.702 millis, 2.97 c/b, rate= 1147.46 mb/s

window 1M : ooKraken    : 211,938,580 ->58,878,065 =  2.222 bpb =  3.600 to 1 
window 1M : decode           : 184.912 millis, 2.97 c/b, rate= 1146.16 mb/s

window 2M :  ooKraken    : 211,938,580 ->58,396,432 =  2.204 bpb =  3.629 to 1 
window 2M :  decode           : 182.231 millis, 2.93 c/b, rate= 1163.02 mb/s

window 4M :  ooKraken    : 211,938,580 ->58,018,936 =  2.190 bpb =  3.653 to 1 
window 4M : decode           : 182.950 millis, 2.94 c/b, rate= 1158.45 mb/s

window 8M : ooKraken    : 211,938,580 ->57,657,484 =  2.176 bpb =  3.676 to 1 
window 8M : decode           : 189.241 millis, 3.04 c/b, rate= 1119.94 mb/s

window 16M: ooKraken    : 211,938,580 ->57,525,174 =  2.171 bpb =  3.684 to 1 
window 16M: decode           : 202.384 millis, 3.25 c/b, rate= 1047.21 mb/s

files     : ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files     : decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s

window 128k:  3.42:1 ,   20.1 enc mbps , 1164.7 dec mbps
window 256k:  3.49:1 ,   20.1 enc mbps , 1162.5 dec mbps
window 512k:  3.55:1 ,   20.1 enc mbps , 1147.5 dec mbps
window 1M  :  3.60:1 ,   20.0 enc mbps , 1146.2 dec mbps
window 2M  :  3.63:1 ,   19.7 enc mbps , 1163.0 dec mbps
window 4M  :  3.65:1 ,   19.3 enc mbps , 1158.5 dec mbps
window 8M  :  3.68:1 ,   18.9 enc mbps , 1119.9 dec mbps
window 16M :  3.68:1 ,   18.8 enc mbps , 1047.2 dec mbps
files      :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

WARNING : tuning perf to cache size is obviously very machine dependent; I don't really recommend fiddling with it unless you know the exact hardware you will be decoding on. The test machine here has a 4 MB L3, so speed falls off slightly as window size approaches 4 MB.
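For comparison with chunking, here is a toy sketch of limiting the match window instead. It uses zlib's window-bits setting as a loose analogue of a dictionary size limit; that mapping is an assumption for illustration only, and zlib tops out at a 32k window, so it says nothing about the cache effects measured above.

```python
import zlib


def compress_with_window(raw, window_bits, level=6):
    """Compress one continuous stream with matches limited to 2**window_bits bytes.

    window_bits must be in zlib's supported range (9 to 15 for a zlib stream).
    """
    enc = zlib.compressobj(level, zlib.DEFLATED, window_bits)
    return enc.compress(raw) + enc.flush()


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        data = f.read()
    for bits in (10, 12, 15):
        size = len(compress_with_window(data, bits))
        print(f"window {1 << bits:6d} : {len(data):,} -> {size:,}")
```

Unlike chunking, the stream is never reset; matches just can't reach further back than the window.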

If you do need to use tiny chunks with Oodle ("tiny" being 32k or smaller; 128k or above is in the normal intended operating range) here are a few tips to consider :

1. Consider pre-allocating the Decoder object and passing in the memory to the OodleLZ_Decompress calls. This avoids doing a malloc per call, which may or may not be significant overhead.

2. Consider changing OodleConfigValues::m_OodleLZ_Small_Buffer_LZ_Fallback_Size . The default is 2k bytes. Buffers smaller than that will use LZB16 instead of the requested compressor, because many of the new ones don't do well on tiny buffers. If you want to have control of this yourself, you can set this to 0.

3. Consider changing OodleLZ_CompressOptions::spaceSpeedTradeoffBytes . This is the number of bytes that must be saved from the compressed output size before the encoder will choose a slower decode mode. eg. it controls decisions like whether literals are sent raw or with entropy coding. This number is scaled for full size buffers (128k bytes or more). When using tiny buffers, it will choose to avoid entropy coding more often. You may wish to dial down this value to scale to your buffers. The default is 256 ; I recommend trying 128 to see what the effect is.


Oodle Hydra

02-01-17 | Oodle Hydra

Oodle Hydra - the many headed beast.

Hydra is a meta-compressor which selects Kraken, Mermaid, or Selkie per block. It uses the speed fit model of each compressor to do a lagrangian space-speed optimization decision about which compressor is maximizing the desired lagrange cost (size + lambda*time).

It turns out to be quite interesting.

(this is of course in addition to each of those compressors internally making space-speed decisions; each of them can enable or disable internal processing modes using the same lagrange optimization model. (eg. they can turn on and off entropy coding for various streams). And there are additional per-block implicit decisions such as choosing uncompressed blocks and huff-only blocks.)

Hydra is a single entry point to all the Oodle compressors. You simply choose how much you care about size vs. decode speed, that corresponds to a certain lagrange lambda. In Oodle this is called "spaceSpeedTradeoffBytes". It's the # of bytes that compression must save in order to take up N cycles more of decode time. You then no longer think about "do I want Kraken or Mermaid" , Oodle makes the right decision for you that optimizes the goal.
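The per-block decision can be sketched in a few lines. This is a toy model, not Oodle's implementation: the Candidate type, the function names, and the numbers in the demo are hypothetical, and lam here is simply "bytes per modeled decode cycle".

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str      # which compressor produced this encoding of the block
    size: int      # compressed bytes
    cycles: float  # modeled decode cost for the block


def choose_per_block(blocks, lam):
    """For each block, pick the candidate with the lowest size + lam * cycles."""
    return [min(cands, key=lambda c: c.size + lam * c.cycles) for cands in blocks]


def totals(chosen):
    """Whole-corpus (total size, total cycles) for a list of chosen candidates."""
    return sum(c.size for c in chosen), sum(c.cycles for c in chosen)


if __name__ == "__main__":
    # Made-up numbers for two blocks, two candidate compressors each.
    blocks = [
        [Candidate("Kraken", 1000, 4000), Candidate("Mermaid", 1100, 2000)],
        [Candidate("Kraken", 900, 4000), Candidate("Mermaid", 890, 2000)],
    ]
    for lam in (0.0, 0.01, 0.1):
        chosen = choose_per_block(blocks, lam)
        print(lam, [c.name for c in chosen], totals(chosen))
```

With these made-up numbers, lam = 0.01 keeps Kraken on the first block and switches the second to Mermaid, which ends up both smaller and faster than all-Kraken; a larger lam pushes every block to Mermaid.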

Hydra can interpolate the performance of Kraken & Mermaid to create a meta-compressor that targets the points in between. That in itself is a somewhat surprising result. Say Kraken is at 1000 mb/s , Mermaid is at 2000 mb/s decode speed, but you really want a compressor that's around 1500 mb/s with compression between the two. We don't know of a Pareto-optimal compressor that is between Kraken and Mermaid, so you're sunk, right? No, you can use Hydra.

I should note that Hydra is very much about *whole corpus* performance. That is, if your target is 1500 mb/s, you may not hit that on any one file - that file could go either all-Kraken or all-Mermaid. The target is hit overall. This is intentional and good, but if for whatever reason you are trying to hit a specific speed for an individual file then Hydra is not the way to do that.

It leads to an idea that I've tried to advocate for before : corpus lagrange optimization for bit rate allocation. If you are dealing with a limited resource that you want to allocate well, such as disk size or download size or time to load - you want to allocate that resource to the data that can make the best use of it. eg. spend your decode time where it makes the biggest size difference. (I encourage this for lossy bit rate allocation as well). So with Hydra some files decode slower and some decode faster, but when they are slower it's because the time was worth it.

And now some reports. We're going to look at 3 corpora. On Silesia and gametestset, Hydra interpolates as expected. But then on PD3D, something magic happens ...

(Oodle 2.4.2 , level 7, Core i7-3770 x64)

Silesia :

total                : Kraken     : 4.106 to 1 : 994.036 MB/s
total                : Mermaid    : 3.581 to 1 : 1995.919 MB/s
total                : Hydra200   : 4.096 to 1 : 1007.692 MB/s
total                : Hydra288   : 4.040 to 1 : 1082.211 MB/s
total                : Hydra416   : 3.827 to 1 : 1474.452 MB/s
total                : Hydra601   : 3.685 to 1 : 1780.476 MB/s
total                : Hydra866   : 3.631 to 1 : 1906.823 MB/s
total                : Hydra1250  : 3.572 to 1 : 2002.683 MB/s

gametestset :

total                : Kraken     : 2.593 to 1 : 1309.865 MB/s
total                : Mermaid    : 2.347 to 1 : 2459.442 MB/s
total                : Hydra200   : 2.593 to 1 : 1338.429 MB/s
total                : Hydra288   : 2.581 to 1 : 1397.465 MB/s
total                : Hydra416   : 2.542 to 1 : 1581.959 MB/s
total                : Hydra601   : 2.484 to 1 : 1836.988 MB/s
total                : Hydra866   : 2.431 to 1 : 2078.516 MB/s
total                : Hydra1250  : 2.366 to 1 : 2376.828 MB/s

PD3D :

total                : Kraken     : 3.678 to 1 : 1054.380 MB/s
total                : Mermaid    : 3.403 to 1 : 1814.660 MB/s
total                : Hydra200   : 3.755 to 1 : 1218.745 MB/s
total                : Hydra288   : 3.738 to 1 : 1249.838 MB/s
total                : Hydra416   : 3.649 to 1 : 1381.570 MB/s
total                : Hydra601   : 3.574 to 1 : 1518.151 MB/s
total                : Hydra866   : 3.487 to 1 : 1666.634 MB/s
total                : Hydra1250  : 3.279 to 1 : 1965.039 MB/s

Whoah! Magic!

On PD3D, Hydra finds big free wins - it not only compresses more than Kraken, it decodes significantly faster, repeating the above to point it out :

total                : Kraken     : 3.678 to 1 : 1054.380 MB/s

total                : Hydra288   : 3.738 to 1 : 1249.838 MB/s
 Kraken compression ratio is in between here, around 1300 MB/s
total                : Hydra416   : 3.649 to 1 : 1381.570 MB/s

You can see it visually in the loglog plot; if you draw a line between Kraken & Mermaid, the Hydra data points are above that line (more compression) and to the right (faster).

What's happening is that once in a while there's a block where Mermaid gets the same or more compression than Kraken. While that's rare, when it does happen you just get a big free win from switching to Mermaid on that block. More often, Mermaid only gets a little bit less compression than Kraken but a lot less decode time, so switching is advantageous in the space-speed lagrange cost.

Crucial to Hydra is having a decoder speed fit for every compressor that can simulate decoding a block and count cycles needed to decode on an imaginary machine. You need a model because you don't want to actually measure the time by running the decoder on the current machine - it would take lots of runs to get reliable timing, and it would mean that you are optimizing for the exact machine that you are encoding on. I currently use a single virtual machine that is a blend of various real platforms; in the future I might expose the ability to use virtual machines that simulate specific target machines (because Hydra might make decisions differently if it knows it is targeting PC-x64 vs. Jaguar-x64 vs. Aarch64-on-A57 , etc.).

Hydra is exciting to me as a general framework for the future of Oodle. It provides a way to add in new compression modes and be sure that they are never worse. That is, you always can start with Kraken per block, and then new modes could be picked block by block only when they are known to beat Kraken (in a space-speed sense). It lets you mix in compressors that you specifically don't expect to be good in general on all data, but that might be amazing once in a while on certain data.

(Hydra requires compressors that carry no state across blocks, so you can't naively mix in something like PPM or CM/PAQ. To optimize a switching choice with compressors that carry state requires a trellis-quantization like lattice dynamic programming optimization and is rather more complex to do quickly)


Oodle on the Nintendo Switch

Oodle is coming soon (in 2.4.2) to the Nintendo Switch (NX), an ARM A57 device.

Quick performance test vs. the software zlib (1.2.8) provided in the Nintendo SDK :

ADD : Update with new numbers from Oodle 2.6.0 pre-release (11-20-2017) :

file  : compressor  :  ratio      : decode speed

lzt99 : nn_deflate  :  1.883 to 1 : 74.750 MB/s

lzt99 : Kraken  -z8 :  2.615 to 1 : 275.75 mb/s  (threadphased 470.13 mb/s)
lzt99 : Kraken  -z6 :  2.527 to 1 : 289.06 mb/s
lzt99 : Hydra 300 z6:  2.571 to 1 : 335.68 mb/s
lzt99 : Hydra 800 z6:  2.441 to 1 : 458.66 mb/s
lzt99 : Mermaid -z6 :  2.363 to 1 : 556.85 mb/s
lzt99 : Selkie  -z6 :  1.939 to 1 : 988.04 mb/s

Kraken (z6) is 3.86X faster to decode than zlib, with way more compression (35% more).
Selkie gets a little more compression than zlib and is 13.22 X faster to decode.

All tests single threaded, 64-bit. (except "threadphased" which uses 2 threads to decode)

I've included Hydra at a space-speed tradeoff value between Kraken & Mermaid (sstb=300). It's a bit subtle, perhaps you can see it best in the loglog chart (below), but Hydra here is not just interpolating between Kraken & Mermaid performance, it's actually beating both of them in a Pareto frontier sense.


This post was originally done with a pre-release version of Oodle 2.4.2 when we had just gotten Oodle running on the NX. There was still a lot of work to be done to get it running really properly.

lzt99                : nn_deflate : 1.883 to 1 : 74.750 MB/s
lzt99                : LZNA       : 2.723 to 1 : 24.886 MB/s
lzt99                : Kraken     : 2.549 to 1 : 238.881 MB/s
lzt99                : Hydra 300  : 2.519 to 1 : 274.433 MB/s
lzt99                : Mermaid    : 2.393 to 1 : 328.930 MB/s
lzt99                : Selkie     : 1.992 to 1 : 660.859 MB/s


PNG without ZLib

If you need to send PNG images in a compressed archive, here's a tip.

PNG's are internally compressed with Zlib. When you run another compressor (such as Oodle) on an already-compressed file like PNG, it won't be able to do much with it. It might get a few bytes out of the headers, but typically the space-speed tradeoff decision in Oodle will not think that gain is worth bothering with, so the PNG will just be sent uncompressed.

There are a few reasons why you might want to use an Oodle compressor rather than the Zlib inside PNG. One is to reduce size; some of the Oodle compressors can make the files smaller than Zlib can. Another is for speed, if you use Kraken or Mermaid the decoder is much faster than the Zlib decompression in PNG.

Now obviously if you want the smallest possible lossless image, you should use an image-specific codec like webp-ll , but we will assume here that that isn't an option.

You could of course just decode the PNG to BMP or TGA or some kind of simple sample format, but that is not desirable. For one thing it changes the format, and your end usage loader might be expecting PNG. Your PNG's might be using PNG-specific features like borders or transparency or whatever that are hard to translate to other formats.

But another is that we want the PNG to keep doing its filtering. Filtered image samples from PNG will usually be more compressible by the back-end compressor than the raw samples in a BMP.

The easy way to do this all is just to take an existing PNG and set its ZLib compression level to 0 (just store). You keep all the PNG headers, and you still get the pixel filtering. But the samples are now uncompressed, so the back-end compressor (Oodle or whatever) gets to work on them instead of passing through already-ZLibbed data.


pngcp is a utility from the official libpng distribution. It reads & writes a png and can change some options.

Usage for what we want is :

pngcp --level=0 --text-level=0 from.png to.png

I have made a Win32 build with static libs of pngcp for your convenience :


I also added a --help option ; run "pngcp --help". The official pngcp seems to have no help or readme at all that explains usage.

I *think* that pngcp preserves headers & options & pixel formats BUT I'M NOT SURE, it's not my code, YMMV, don't go fuck up your pngs without testing it. If it doesn't work - hey you can get pngcp from the official distro and fix it.

I used libpng 1624. The vc7.1 project in libpng worked fine for me. pngcp needed a little bit of de-unixification to build in VC but it was straightforward. You need zlib ; I used 1.2.8 and it worked fine; you need to make a dir named "zlib" at the same level as libpng. I did "mklink /j zlib zlib-1.2.8".

* CAVEAT : this isn't really the way I'd like to do this. pngcp loads the PNG and then saves it out again, which introduces the possibility of losing metadata that was stuffed in the file or just screwing it up somehow. I'd much rather do this conversion without ever actually loading it as an image. That is, take the PNG file as just a binary blob, find the zlib streams and unpack them, store them with a level 0 header, and pass through the PNG headers totally untouched. That would be a much more robust way to ensure you don't lose anything.
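A rough sketch of that binary-blob approach in Python, under stated assumptions: it only rewrites the IDAT image stream (other zlib-compressed chunks such as zTXt or iCCP are passed through as they are), it merges all IDAT chunks into one, and the function names are made up for this example. It is not the code behind pngcp or cbpngz0.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_chunks(png):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        ctype = png[pos + 4:pos + 8]
        yield ctype, png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def make_chunk(ctype, data):
    """Serialize one chunk; the CRC covers the type and data fields."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)


def png_to_zlib_level0(png):
    """Re-store the IDAT stream at zlib level 0 without decoding the image."""
    chunks = list(iter_chunks(png))
    idat = b"".join(data for ctype, data in chunks if ctype == b"IDAT")
    filtered = zlib.decompress(idat)     # filtered scanlines, left untouched
    stored = zlib.compress(filtered, 0)  # level 0 = stored blocks only
    out = [PNG_SIGNATURE]
    wrote_idat = False
    for ctype, data in chunks:
        if ctype != b"IDAT":
            out.append(make_chunk(ctype, data))  # headers and metadata pass through
        elif not wrote_idat:
            out.append(make_chunk(b"IDAT", stored))  # one merged IDAT chunk
            wrote_idat = True
    return b"".join(out)
```

Because the pixel data is never unfiltered or re-filtered, the filter choices made by the original PNG writer are preserved byte for byte.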


cbpngz0 usage :

cbpngz0 from to

cbpngz0 uses the cblib loaders, so it can load bmp,tga,png,jpeg and so on. It writes a PNG at zlib level 0. Unlike pngcp, cbpngz0 does NOT support lots of weird formats; it only writes 8-bit gray, 24-bit RGB, and 32-bit RGBA. This is not a general purpose PNG zlib level changer!! Nevertheless I find it useful because of the wider range of formats it can load.


cbpngz0 is an x64 exe and uses the DLLs included.

Some sample results.

I take an original PNG, then try compressing it with Oodle two ways. First, convert it to a BMP and compress the BMP. Second, convert to a Zlib level 0 PNG (the "_z0.png") and then compress with Oodle. The difference between the two is that the _z0.png gets the PNG filters, and of course stays a PNG if that's what your loader expects. If you give the original PNG to Oodle, it passes it through uncompressed.

porsche640.png             529,821

porsche640.bmp             921,654

porsche640.bmp.ooz         711,273

porsche640_z0.png.ooz      508,091


blinds.png                 328,754

blinds.bmp               1,028,826

blinds.bmp.ooz             193,130

blinds_z0.png.ooz          195,558


xxx.png                    420,149

xxx.bmp                    915,054

xxx.bmp.ooz                521,861

xxx_z0.png.ooz             409,311

The ooz files are made with Oodle LZNA -z6 (level Optimal2).

You can see there are some big gains possible with replacing Zlib (on "blinds"). On normal photographic continuous tone images Zlib does okay so the gains are small. On those images, compressing the BMP without filters is very bad.

Another small note : if your end usage PNG loader supports the optional MNG format LOCO color transform, that usually helps compression.

ADD : Chris Maiwald points out that he gets better PNG filter choice by using "Z_FIXED" (which is the zlib option for fixed huffman tables instead of per-file huffman). A bit weird, but perhaps it biases the filter choice to be more consistent?

I wonder if choosing a single PNG filter for the whole image would be better than letting PNG do its per-row thing? (to try to make the post-filter residuals more consistent for the back end modeling stage). For max compression you would use something like a png optimizer that tried various filter strategies, but instead of rating them using zlib, rate with the back-end of your choice.
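As a sketch of what "rate with the back-end of your choice" could look like for the single-filter experiment: apply one PNG filter type to the whole image and keep whichever compresses smallest with the back-end. Assumptions are loud here: only filter types 0 (None), 1 (Sub) and 2 (Up) are implemented, lzma is just a stand-in back-end, and the function names are invented for the example.

```python
import lzma


def filter_image(rows, ftype, bpp):
    """Apply one PNG filter type (0=None, 1=Sub, 2=Up) to every row.

    rows is a list of equal-length bytes objects (raw scanlines);
    bpp is bytes per pixel.
    """
    out = bytearray()
    prev = bytes(len(rows[0]))  # the row above the first row counts as zeros
    for row in rows:
        out.append(ftype)  # each scanline is prefixed with its filter type
        for i, value in enumerate(row):
            if ftype == 1:
                pred = row[i - bpp] if i >= bpp else 0  # pixel to the left
            elif ftype == 2:
                pred = prev[i]                          # pixel above
            else:
                pred = 0
            out.append((value - pred) & 0xFF)
        prev = row
    return bytes(out)


def pick_single_filter(rows, bpp, backend=lzma.compress):
    """Return (best filter type, sizes per filter) rated by the back-end."""
    sizes = {f: len(backend(filter_image(rows, f, bpp))) for f in (0, 1, 2)}
    best = min(sizes, key=sizes.get)
    return best, sizes
```

A fuller experiment would add the Average and Paeth filters and compare the best single-filter result against the usual per-row heuristic, rated with the same back-end.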
