7/25/2016

Introducing Oodle Mermaid and Selkie

I'm pleased to announce the release of two new compressors in Oodle 2.3.0 : Mermaid and Selkie.

Mermaid and Selkie are the super-fast-to-decode distant relatives of Kraken. They use some of the same ideas and technology as Kraken, but are independent compressors targetted at even higher speed and lower compression. Mermaid & Selkie make huge strides in what's possible in compression in the high-speed domain, the same way that Kraken did in the high-compression domain.

Mermaid is about twice as fast as Kraken, but with compression around Zlib levels.

Selkie is one of the fastest decompressors in the world, and also gets much more compression than other very-high-speed compressors.

( Oodle is my data compression library that we sell at RAD Game Tools , read more about it there )

Kraken, Mermaid, and Selkie all use an architecture that makes space-speed decisions in the encoder to give the best tradeoff of compressed size vs decoding speed. The three compressors have different performance targets and make decisions suited for each one's usage domain (Kraken favors more compression and will give up some speed, Selkie strongly favors speed, Mermaid is in between).

For detailed information about the new Mermaid and Selkie I've written a series of posts :

cbloom rants Introducing Oodle Mermaid and Selkie
cbloom rants Oodle 2.3.0 All Test Sets
cbloom rants Oodle 2.3.0 ARM Report
cbloom rants Oodle Mermaid and Selkie on PS4
cbloom rants Oodle Mermaid
cbloom rants Oodle Selkie
RAD Game Tools - Oodle Network and Data Compression

Here are some representative numbers on the Silesia test set : (sum of time and size on individual files)


Oodle 2.3.0 Silesia -z6

Kraken     : 4.082 to 1 : 999.389 MB/s
Mermaid    : 3.571 to 1 : 2022.038 MB/s
Selkie     : 3.053 to 1 : 2929.770 MB/s

zstdmax    : 4.013 to 1 : 468.497 MB/s
zlib9      : 3.128 to 1 : 358.681 MB/s
lz4hc      : 2.723 to 1 : 2267.021 MB/s

on Win64 (Core i7-3770 3.4 GHz)

On Silesia, Mermaid is 5.65X faster to decode than zlib, and gets 14% more compression. Selkie is 1.3X faster to decode than LZ4 and gets 12% more compression.

Charts on Silesia total : (charts show time and size - lower is better!)

And the speedup chart on Silesia, which demonstrates the space-speed efficiency of a compressor in different usage domains.

Kraken was a huge step in the Pareto frontier that pushed the achievable speedup factor way up beyond what other compressers were doing. There's a pre-Kraken curve where we thought the best possible tradeoff existed, that most other compressors in the world roughly lie on (or under). Kraken set a new frontier way up on its own with nobody to join it; Mermaid & Selkie are the partners on that new curve that have their peaks at higher speeds than Kraken.

You can also see this big jump of the new family very easily in scatter plots, which we'll see in later posts .

Oodle Mermaid and Selkie on PS4

The PS4 is a lovely platform to benchmark on because it's standard. It's also very easy to run tests on. Performance of these compressors on the Xbox One is extremely similar, small variatation due to clock rate (1.75 vs 1.6 GHz) and compiler (MSVC vs clang/llvm).

Everything is slow on the PS4 in absolute terms (it's a slow chip and difficult to optimize for). The Oodle compressors do very well, even better in relative terms on PS4 than on typical PC's.

Kraken is usually around ~2X faster than ZStd on PC's, but is 3X faster on PS4. Mermaid is usually just slightly slower than LZ4 on PC's, but is solidly faster than LZ4 on PS4.


lzt99 :

Kraken     : 2.477 to 1 : 390.582 MB/s
Mermaid    : 2.279 to 1 : 749.896 MB/s
Selkie     : 1.937 to 1 : 1159.064 MB/s

zstd       : 2.374 to 1 : 133.498 MB/s
miniz      : 1.883 to 1 : 85.654 MB/s
lz4hc-safe : 1.669 to 1 : 673.616 MB/s
LZSSE8     : 1.626 to 1 : 767.106 MB/s

Mermaid is faster than LZ4 on PS4 !! Wow! And the compression level is in a totally different domain than other super-fast decompressors like LZ4 or LZSSE.

lzt99 is a good case for Selkie & Mermaid. Selkie beats zlib compression ratio while being 75% faster than LZ4.

All compressors here are fuzz-safe, and run in safe mode if they have optional safe/unsafe modes.

Charts : (showing time and size - lower is better!)

lzt99 :

the raw data :


PS4 : Oodle 230 : (-z6)

inName : lzt:/lzt99

reference :

miniz      : 24,700,820 ->13,120,668 =  4.249 bpb =  1.883 to 1
miniz_decompress_time : 288.379 millis, 18.61 c/b, rate= 85.65 mb/s

zstd       : 24,700,820 ->10,403,228 =  3.369 bpb =  2.374 to 1
zstd_decompress_time : 185.028 millis, 11.94 c/b, rate= 133.50 mb/s

lz4hc      : 24,700,820 ->14,801,510 =  4.794 bpb =  1.669 to 1
LZ4_decompress_safe_time : 36.669 millis, 2.37 c/b, rate= 673.62 mb/s

LZSSE8     : 24,700,820 ->15,190,395 =  4.920 bpb =  1.626 to 1
decode_time      : 32.200 millis, 2.08 c/b, rate= 767.11 mb/s

Oodle :

Kraken     : 24,700,820 -> 9,970,882 =  3.229 bpb =  2.477 to 1
decode           : 63.241 millis, 4.08 c/b, rate= 390.58 mb/s

Mermaid    : 24,700,820 ->10,838,455 =  3.510 bpb =  2.279 to 1
decode           : 32.939 millis, 2.13 c/b, rate= 749.90 mb/s

Selkie : 24,700,820 ->12,752,506 =  4.130 bpb =  1.937 to 1
decode           : 21.311 millis, 1.38 c/b, rate= 1159.06 mb/s

BTW for reference, the previous best compressor in Mermaid's domain was LZNIB. Before these new compressors, LZNIB was quite unique in that it got good decode speeds, much faster than the LZ-Huffs of the time (eg. 3X faster than ZStd) but with compression usually better than ZLib. Well, LZNIB is still quite good compared to other competition, but it's just clobbered by the new Oceanic Bestiary compressors. The new compressor in this domain is Mermaid and it creams LZNIB for both size and speed :


LZNIB -z6  : 24,700,820 ->12,015,591 =  3.892 bpb =  2.056 to 1 
decode           : 58.710 millis, 3.79 c/b, rate= 420.73 mb/s

Mermaid    : 24,700,820 ->10,838,455 =  3.510 bpb =  2.279 to 1
decode           : 32.939 millis, 2.13 c/b, rate= 749.90 mb/s

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

Oodle Mermaid

Mermaid is a new compressor with a unique balance of space and speed. Mermaid is very close to LZ4 decode speeds, while usually beating Zlib compression ratios.

There's really nothing even close. It's way beyond what was previously thought possible.

Mermaid supports unbounded distance match references. This is part of how it gets such high compression. It does so in a new way which reduces the speed penalty normally incurred by large-window LZ's.

Mermaid almost always compresses better than ZLib. The only exception is on small files, less than 32k or so. The whole Oceanic Bestiary family is best suited to files over 64k. They work fine on smaller files, but they lose their huge advantage. It's always best to combine small files into larger units for compression, particularly so with these compressors.

There's not really any single compressor to compare Mermaid to. What we can do is compare vs. Zlib's compression ratio and LZ4's speed. A kind of mythological hybrid like a Chimera, the head of a Zlib and the legs of an LZ4.

Tests on Win64 (Core i7-3770 3.4 GHz) :

Silesia :

On Silesia, Mermaid is just slightly slower than LZ4 but compresses much more than Zlib !!

PD3D :

On PD3D, Mermaid gets almost exactly the compression level of ZLib but the decode speed of LZ4. Magic! It turns out you *can* have your cake and eat it too.

Game Test Set :

lzt99 :

Mermaid really compresses well on lzt99 ; not only does it kill Zlib, it gets close to high compression LZ-Huffs like RAR. (RAR gets 10826108 , Mermaid 10838455 bytes).

Seven :

Because of the space-speed optimizing nature of Mermaid, it will make decisions to be slower than LZ4 when it can find big compression gains. For example if you look at the individual files of the "Seven" test set below - Mermaid is typically right around the same speed as LZ4 or even faster (baby7,dds7,exe7,game7,wad7 - all same speed or faster than LZ4). On a few files it decides to take an encoding slower to decode than LZ4 - model7,enwik7, and records7. The biggest differences are enwik7 and records7, but if you look at the compression ratios - those are all the files where it found huge size differences over LZ4. It has an internal exchange rate for time vs. bytes that it must meet in order to take that encoding, trying to optimize for its space-speed target usage.

Seven files :


Silesia              : Mermaid    : 3.571 to 1 : 2022.038 MB/s
Silesia              : lz4hc      : 2.723 to 1 : 2267.021 MB/s
Silesia              : zlib9      : 3.128 to 1 : 358.681 MB/s

GameTestSet          : Mermaid    : 2.284 to 1 : 2718.095 MB/s
GameTestSet          : lz4hc      : 1.776 to 1 : 3226.887 MB/s
GameTestSet          : zlib9      : 1.992 to 1 : 337.986 MB/s

lzt99                : Mermaid    : 2.279 to 1 : 2283.278 MB/s
lzt99                : lz4hc      : 1.669 to 1 : 2605.366 MB/s
lzt99                : zlib9      : 1.883 to 1 : 309.304 MB/s

PD3D                 : Mermaid    : 2.875 to 1 : 2308.830 MB/s
PD3D                 : lz4hc      : 2.238 to 1 : 2369.666 MB/s
PD3D                 : zlib9      : 2.886 to 1 : 382.349 MB/s

Seven                : Mermaid    : 2.462 to 1 : 2374.212 MB/s
Seven                : lz4hc      : 2.000 to 1 : 2521.296 MB/s
Seven                : zlib9      : 2.329 to 1 : 315.370 MB/s

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

Oodle Selkie

Selkie is the faster cousin of Mermaid, distant relative of Kraken.

Selkie is all about decode speed, it aims to be the fastest mainstream decompressor in the world, and still gets more compression than anything in the high-speed domain.

Selkie does not currently have a super fast encoder. It's got good optimal parse encoders that produce carefully tuned encoded file which offer excellent space-speed tradeoff.

The closest compressors to Selkie are the fast byte-wise small-window coders like LZ4 and LZSSE (and Oodle's LZB16). These are all just obsolete now (in terms of ratio vs decode speed), Selkie gets a lot more compression (sometimes close to Zlib compression levels!) and is also much faster.

Selkie will not compress tiny buffers, or files that only compress a little bit. For example if you give Selkie something like an mp3, it might be able to compress it to 95% of its original size, saving a few bytes. Selkie will refuse to do that and just give you the original uncompressed file. If you wanted that compression, that means you wanted to save only a few bytes at a large time cost, which means you don't actually want a fast compressor like Selkie. You in fact wanted a compressor that was more willing to trade time for bytes, such as Mermaid or Kraken. Selkie will not abide logical inconsistency.

Selkie generally beats LZ4 compression even on small files (under 64k) but really gets ahead on files larger than 64k where the unbounded match distances can find big wins.

As usual, I'm not picking on LZ4 here because it's bad; I'm comparing to it because it's the best of the rest, and it's widely known. Both decompressors are run fuzz-safe.


Tests on Win64 (Core i7-3770 3.4 GHz) : 
(for reference, this machine runs memcpy at roughly 8 GB/s)
(total of time & size on each test set)

gametestset : ooSelkie    : 143,579,361 ->70,716,380 =  3.940 bpb =  2.030 to 1 
gametestset : decode      : 29.239 millis, 0.69 c/b, rate= 4910.61 mb/s

gametestset : lz4hc       : 143,579,361 ->80,835,018 =  4.504 bpb =  1.776 to 1 
gametestset : decode      : 44.495 millis, 1.05 c/b, rate= 3226.89 mb/s

pd3d : ooSelkie    : 31,941,800 ->13,428,298 =  3.363 bpb =  2.379 to 1 
pd3d : decode      : 8.381 millis, 0.89 c/b, rate= 3811.29 mb/s

pd3d : lz4hc       : 31,941,800 ->14,273,195 =  3.575 bpb =  2.238 to 1 
pd3d : decode      : 13.479 millis, 1.44 c/b, rate= 2369.67 mb/s

seven : ooSelkie    : 80,000,000 ->36,460,084 =  3.646 bpb =  2.194 to 1 
seven : decode      : 21.458 millis, 0.91 c/b, rate= 3728.26 mb/s

seven : lz4hc       : 80,000,000 ->39,990,656 =  3.999 bpb =  2.000 to 1 
seven : decode      : 31.730 millis, 1.35 c/b, rate= 2521.30 mb/s

silesia : ooSelkie    : 211,938,580 ->69,430,966 =  2.621 bpb =  3.053 to 1 
silesia : decode      : 72.340 millis, 1.16 c/b, rate= 2929.77 mb/s

silesia : lz4hc       : 211,938,580 ->77,841,566 =  2.938 bpb =  2.723 to 1 
silesia : decode      : 93.488 millis, 1.50 c/b, rate= 2267.02 mb/s

The edge that Selkie has over LZ4 is even greater on more difficult platforms like the PS4.

To get a better idea of the magic of Selkie it's useful to look at the other Oodle compressors that are similar to Selkie.

LZB16 is Oodle's LZ4 variant; it gets slightly more compression and slightly more decode speed, but they're roughly equal. It's included here for comparison to LZBLW.

Oodle's LZBLW is perhaps the most similar compressor to Selkie. It's like LZB16 (LZ4) but adds large-window matches. That ability to do long-distance matches hurts speed a tiny bit (2873 mb/s -> 2596 mb/s), but helps compression a lot.

Oodle's LZNIB is nibble-wise, with unbounded offsets and a rep match. It gets good compression, generally better than Zlib, with speed much higher than any LZ-Huff. LZNIB is in a pretty unique space speed tradeoff zone without much competition outside of Oodle.


lz4hc     : 24,700,820 ->14,801,510 =  4.794 bpb =  1.669 to 1 
decode    : 9.481 millis, 1.31 c/b, rate= 2605.37 mb/s

ooLZB16   : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1
decode    : 8.597 millis, 1.18 c/b, rate= 2873.17 mb/s

ooLZNIB   : 24,700,820 ->12,014,368 =  3.891 bpb =  2.056 to 1
decode    : 17.420 millis, 2.40 c/b, rate= 1417.93 mb/s

ooLZBLW   : 24,700,820 ->13,349,800 =  4.324 bpb =  1.850 to 1
decode    : 9.512 millis, 1.31 c/b, rate= 2596.80 mb/s

ooSelkie  : 24,700,820 ->12,752,506 =  4.130 bpb =  1.937 to 1 
decode    : 6.410 millis, 0.88 c/b, rate= 3853.57 mb/s

LZNIB and LZBLW were both pretty cool before Selkie, but now they're just obsolete.

LZBLW gets a nice compression gain over LZB16, but Selkie gets even more, and is way faster!

LZNIB beats Selkie compression, but is way slower, around 3X slower, in fact it's slower than Mermaid (2283.28 mb/s and compresses to 10,838,455 = 3.510 bpb = 2.279 to 1).

You can see from the curves that Selkie just completely covers the curves of LZB16,LZBLW, and LZ4. When a curve is completely covered like that, it means that it was beaten for both space and speed, so there is no domain where that compressor is ever better. LZNIB just peeks out of the Selkie curve because it gets higher compression (albeit at lower speed), so there is a domain where it is the better choice - but in that domain Mermaid just completely dominates LZNIB, so it too is obsolete.

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

Oodle 2.3.0 ARM Report

I prepared a detailed report of Oodle's performance on ARM mobile devices (Android and iOS).

The full report is here :

oodle_arm_report on cbloom.com

It's a thorough test on many devices and several corpora. See the full details there.

Cliff notes is : Oodle's great on ARM.

For example on the iPadAir2 64-bit , on Silesia :

We found that the iOS devices are generally very good and easy to program for. They're more like desktop Intel chips; they don't have any terrible performance cliffs. The Android ARM devices we tested on were rather more difficult. For one thing they have horrible thermal saturation problems that makes testing on them very difficult. They also have some odd performance quirks.

I'm sure we could get a lot more speed on ARM, but it's rather nasty to optimize for. For one thing the thermal problems mean that iterating and getting good numbers is a huge pain. It's hard to tell if a change helped or not. For another, there's a wide variety of devices and it's hard to tell which to optimize for, and they have different performance shortfalls. So there's definitely a lot left on the table here.

Mermaid & Selkie are quite special on ARM. Many of these devices have small caches (as small as 512k L2) and very slow main memory (slow wrst latency; they often have huge bandwidth, but latency is what I need). Mermaid & Selkie are able to use unbounded windows for LZ without suffering a huge speed hit, due to the unique way they are structured. Kraken doesn't have the same magic trick so it benefits from a limited window, as demonstrated in the report.

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

Oodle 2.3.0 All Test Sets

Putting the total performance on various testsets together in one place. Tests on Win64 Core i7-3770 3.4 GHz as usual.

Showing speed & ratio here, higher is better.

As usual the total on a test set is total size of all individually compressed files, and total time.

I think the scatter plot most clearly shows the way Kraken, Mermaid & Selkie are just on a whole new Pareto Frontier than the older compressors. You can connect the dots of K-M-S performance for each test set and they form a very consistent space-speed tradeoff curve that's way above the previous best.

The raw numbers :

gametestset          : Kraken     : 2.566 to 1 : 1363.283 MB/s
gametestset          : Mermaid    : 2.284 to 1 : 2711.458 MB/s
gametestset          : Selkie     : 2.030 to 1 : 4870.413 MB/s
gametestset          : lz4hc      : 1.776 to 1 : 3223.279 MB/s
gametestset          : zlib9      : 1.992 to 1 : 338.063 MB/s
gametestset          : lzma       : 2.756 to 1 : 43.782 MB/s

pd3d                 : Kraken     : 3.647 to 1 : 1072.833 MB/s
pd3d                 : Mermaid    : 2.875 to 1 : 2299.860 MB/s
pd3d                 : Selkie     : 2.379 to 1 : 3784.850 MB/s
pd3d                 : lz4hc      : 2.238 to 1 : 2370.193 MB/s
pd3d                 : zlib9      : 2.886 to 1 : 382.226 MB/s
pd3d                 : lzma       : 4.044 to 1 : 63.878 MB/s

seven                : Kraken     : 2.914 to 1 : 1053.961 MB/s
seven                : Mermaid    : 2.462 to 1 : 2374.796 MB/s
seven                : Selkie     : 2.194 to 1 : 3717.074 MB/s
seven                : lz4hc      : 2.000 to 1 : 2522.824 MB/s
seven                : zlib9      : 2.329 to 1 : 315.344 MB/s
seven                : lzma       : 3.186 to 1 : 52.660 MB/s

silesia              : Kraken     : 4.082 to 1 : 1004.014 MB/s
silesia              : Mermaid    : 3.571 to 1 : 2002.079 MB/s
silesia              : Selkie     : 3.053 to 1 : 2889.536 MB/s
silesia              : lz4hc      : 2.723 to 1 : 2269.788 MB/s
silesia              : zlib9      : 3.128 to 1 : 358.593 MB/s
silesia              : lzma       : 4.369 to 1 : 78.655 MB/s

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

7/24/2016

Oodle 2.3.0 : Kraken Improvement

Oodle 2.3.0 includes some pretty solid improvements to Kraken. The result is around a 10% gain in decode speed.

There were two major factors in the gains. One was just some more time optimizing some inner loops (including some new super-tight pathways from Fabian).

The other was more rigorous analysis of the space-speed tradeoff decisions inside Kraken. One of the fundamental things that makes Kraken work is the fact that it consider space-speed when making its internal decisions, but before 230 those decisions were made in a rather ad-hoc way. Making those decisions better means that even with the same decoder, the new encoder is able to create files that are the same size but decode faster.

The tradeoff point (technically, the lagrange lambda, or the exchange rate from time to bytes) that's used by Oodle to make space-speed decisions is exposed to the client in the OodleLZ_CompressOptions so you can adjust it to bias for compression or decode speed. Each compressor sets what I believe to be a reasonable default for its usage domain, so adjustments to this value should typically be small (you can't massively change behavior with it; Kraken won't start arithmetic coding things if you set the tradeoff really small, for example, there's a small window where the compressor works well and you can just bias sightly within that window).

Some dry numbers for reference :

On PS4 :


Oodle 230 Kraken -zl4 : 24,700,820 ->10,377,556 =  3.361 bpb =  2.380 to 1
decode only      : 65.547 millis, 4.23 c/b, rate= 376.84 mb/s

Oodle 230 Kraken -zl6 : 24,700,820 -> 9,970,882 =  3.229 bpb =  2.477 to 1
decode           : 63.453 millis, 4.09 c/b, rate= 389.28 mb/s

Oodle 230 Kraken -zl7 : 24,700,820 -> 9,734,771 =  3.153 bpb =  2.537 to 1
decode           : 67.915 millis, 4.38 c/b, rate= 363.70 mb/s

Oodle 220 Kraken -zl4 : 24,700,820 ->10,326,584 =  3.345 bpb =  2.392 to 1
decode only      : 0.073 seconds, 211.30 b/kc, rate= 336.76 mb/s

Oodle 220 Kraken -zl6 : 24,700,820 ->10,011,486 =  3.242 bpb =  2.467 to 1
decode           : 0.074 seconds, 208.83 b/kc, rate= 332.82 mb/s

Oodle 220 Kraken -zl7 : 24,700,820 -> 9,773,112 =  3.165 bpb =  2.527 to 1
decode           : 0.079 seconds, 196.70 b/kc, rate= 313.49 mb/s

On Win64 (Core i7-3770 3.4 GHz) :


Oodle 2.3.0 :

Silesia Kraken -z6

total   : 211,938,580 ->51,918,269 =  1.960 bpb =  4.082 to 1
decode           : 210.685 millis, 3.38 c/b, rate= 1005.95 mb/s
Weissman 1-256 : [8.575]

mozilla : 51,220,480 ->14,410,181 =  2.251 bpb =  3.554 to 1
decode only      : 51.280 millis, 3.41 c/b, rate= 998.83 mb/s

lzt99 : 24,700,820 -> 9,970,882 =  3.229 bpb =  2.477 to 1
decode only      : 20.943 millis, 2.89 c/b, rate= 1179.44 mb/s

win81 : 104,857,600 ->38,222,311 =  2.916 bpb =  2.743 to 1
decode only      : 108.344 millis, 3.52 c/b, rate= 967.82 mb/s

Oodle 2.2.0 :

Silesia Kraken -z6

total : 211,938,580 ->51,857,427 =  1.957 bpb =  4.087 to 1
decode   : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s
Weissman 1-256 : [8.431]

"silesia_mozilla"

Kraken 230  :  3.55:1 ,   998.8 dec mb/s
Kraken 220  :  3.60:1 ,   896.5 dec mb/s
Kraken 215  :  3.51:1 ,   928.0 dec mb/s

"lzt99"

Kraken 230  :  2.48:1 ,   998.8 dec mb/s
Kraken 220  :  2.53:1 ,   912.0 dec mb/s
Kraken 215  :  2.46:1 ,   957.1 dec mb/s

"win81"

Kraken 230  :  2.74:1 ,   967.8 dec mb/s
Kraken 220  :  2.77:1 ,   818.0 dec mb/s
Kraken 215  :  2.70:1 ,   877.0 dec mb/s

NOTE : Oodle 2.3.0 Kraken data cannot be read by Oodle 2.2.0 or earlier. Oodle 230 can load all old Oodle data (new versions of Oodle can always load all data created by older versions). If you need to make data that be loaded with an older version using Oodle 230, then you can set the minimum decoder version to something lower (by default it's the current version). Contact Oodle support for details.

Some of the biggest gains were found on ARM, which I'll post about more in the future.

5/27/2016

PS4 Battle : MiniZ vs Zlib-NG vs ZStd vs Brotli vs Oodle

(see charts at the bottom)

Everything run at max compression options, level 99, max dict size. All libs are the latest on github, downloaded today. Zlib-NG has the arch/x86 stuff enabled. PS4 is AMD Jaguar , x64.

I'm going to omit encode speeds on the per-file results for simplicity, these are pretty representative :


aow3_skin_giants.clb :
zlib-ng encode   : 2.699 seconds, 1.65 b/kc, rate= 2.63 mb/s
miniz encode     : 2.950 seconds, 1.51 b/kc, rate= 2.41 mb/s
zstd encode      : 5.464 seconds, 0.82 b/kc, rate= 1.30 mb/s
brotli-9  encode    : 23.110 seconds, 0.19 b/kc, rate= 307.44 kb/s
brotli-10 encode    : 68.072 seconds, 0.07 b/kc, rate= 104.38 kb/s
brotli-11 encode    : 79.844 seconds, 0.06 b/kc, rate= 88.99 kb/s

Results :

PS4 clang-3.5.0

-------------

lzt99 :

MiniZ : 24,700,820 ->13,120,668 =  4.249 bpb =  1.883 to 1
miniz_decompress_time : 0.292 seconds, 53.15 b/kc, rate= 84.71 mb/s

zlib-ng : 24,700,820 ->13,158,385 =  4.262 bpb =  1.877 to 1
miniz_decompress_time : 0.226 seconds, 68.58 b/kc, rate= 109.30 mb/s

ZStd : 24,700,820 ->10,403,228 =  3.369 bpb =  2.374 to 1
zstd_decompress_time : 0.184 seconds, 84.12 b/kc, rate= 134.07 mb/s

Brotli-9 : 24,700,820 ->10,473,560 =  3.392 bpb =  2.358 to 1
brotli_decompress_time : 0.259 seconds, 59.83 b/kc, rate= 95.36 mb/s

Brotli-10 : 24,700,820 -> 9,949,740 =  3.222 bpb =  2.483 to 1
brotli_decompress_time : 0.319 seconds, 48.54 b/kc, rate= 77.36 mb/s

Brotli-11 : 24,700,820 -> 9,833,023 =  3.185 bpb =  2.512 to 1
brotli_decompress_time : 0.317 seconds, 48.84 b/kc, rate= 77.84 mb/s

Oodle Kraken -zl4 : 24,700,820 ->10,326,584 =  3.345 bpb =  2.392 to 1
encode only      : 4.139 seconds, 3.74 b/kc, rate= 5.97 mb/s
decode only      : 0.073 seconds, 211.30 b/kc, rate= 336.76 mb/s

Oodle Kraken -zl6 : 24,700,820 ->10,011,486 =  3.242 bpb =  2.467 to 1
decode           : 0.074 seconds, 208.83 b/kc, rate= 332.82 mb/s

Oodle Kraken -zl7 : 24,700,820 -> 9,773,112 =  3.165 bpb =  2.527 to 1
decode           : 0.079 seconds, 196.70 b/kc, rate= 313.49 mb/s

Oodle LZNA : lzt99 : 24,700,820 -> 9,068,880 =  2.937 bpb =  2.724 to 1
decode           : 0.643 seconds, 24.12 b/kc, rate= 38.44 mb/s

-------------

normals.bc1 :

miniz :   524,316 ->   291,697 =  4.451 bpb =  1.797 to 1
miniz_decompress_time : 0.008 seconds, 39.86 b/kc, rate= 63.53 mb/s

zlib-ng :   524,316 ->   292,541 =  4.464 bpb =  1.792 to 1
zlib_ng_decompress_time : 0.007 seconds, 47.32 b/kc, rate= 75.41 mb/s

zstd :   524,316 ->   273,642 =  4.175 bpb =  1.916 to 1
zstd_decompress_time : 0.007 seconds, 49.64 b/kc, rate= 79.13 mb/s

brotli-9 :   524,316 ->   289,778 =  4.421 bpb =  1.809 to 1
brotli_decompress_time : 0.010 seconds, 31.70 b/kc, rate= 50.52 mb/s

brotli-10 :   524,316 ->   259,772 =  3.964 bpb =  2.018 to 1
brotli_decompress_time : 0.011 seconds, 28.65 b/kc, rate= 45.66 mb/s

brotli-11 :   524,316 ->   253,625 =  3.870 bpb =  2.067 to 1
brotli_decompress_time : 0.011 seconds, 29.74 b/kc, rate= 47.41 mb/s

Oodle Kraken -zl6 :    524,316 ->   247,217 =  3.772 bpb =  2.121 to 1
decode           : 0.002 seconds, 135.52 b/kc, rate= 215.95 mb/s

Oodle Kraken -zl7 :    524,316 ->   238,844 =  3.644 bpb =  2.195 to 1
decode           : 0.003 seconds, 123.96 b/kc, rate= 197.56 mb/s

Oodle BitKnit :    524,316 ->   225,884 =  3.447 bpb =  2.321 to 1
decode only      : 0.010 seconds, 31.67 b/kc, rate= 50.47 mb/s

-------------

lightmap.bc3 :

miniz :  4,194,332 ->   590,448 =  1.126 bpb =  7.104 to 1 
miniz_decompress_time : 0.025 seconds, 105.14 b/kc, rate= 167.57 mb/s

zlib-ng : 4,194,332 ->   584,107 =  1.114 bpb =  7.181 to 1
zlib_ng_decompress_time : 0.019 seconds, 137.77 b/kc, rate= 219.56 mb/s

zstd :  4,194,332 ->   417,672 =  0.797 bpb = 10.042 to 1 
zstd_decompress_time : 0.014 seconds, 182.53 b/kc, rate= 290.91 mb/s

brotli-9 : 4,194,332 ->   499,120 =  0.952 bpb =  8.403 to 1 
brotli_decompress_time : 0.022 seconds, 118.64 b/kc, rate= 189.09 mb/s

brotli-10 : 4,194,332 ->   409,907 =  0.782 bpb = 10.232 to 1 
brotli_decompress_time : 0.021 seconds, 125.20 b/kc, rate= 199.54 mb/s

brotli-11 : 4,194,332 ->   391,576 =  0.747 bpb = 10.711 to 1 
brotli_decompress_time : 0.021 seconds, 127.12 b/kc, rate= 202.61 mb/s

Oodle Kraken -zl6 :   4,194,332 ->   428,737 =  0.818 bpb =  9.783 to 1 
decode           : 0.009 seconds, 308.45 b/kc, rate= 491.60 mb/s

Oodle BitKnit :   4,194,332 ->   416,208 =  0.794 bpb = 10.077 to 1
decode only      : 0.021 seconds, 122.59 b/kc, rate= 195.39 mb/s

Oodle LZNA :  4,194,332 ->   356,313 =  0.680 bpb = 11.771 to 1 
decode           : 0.033 seconds, 79.51 b/kc, rate= 126.71 mb/s

----------------

aow3_skin_giants.clb

Miniz : 7,105,158 -> 3,231,469 =  3.638 bpb =  2.199 to 1
miniz_decompress_time : 0.070 seconds, 63.80 b/kc, rate= 101.69 mb/s

zlib-ng : 7,105,158 -> 3,220,291 =  3.626 bpb =  2.206 to 1
zlib_ng_decompress_time : 0.056 seconds, 80.14 b/kc, rate= 127.71 mb/s

Zstd : 7,105,158 -> 2,700,034 =  3.040 bpb =  2.632 to 1
zstd_decompress_time : 0.050 seconds, 88.69 b/kc, rate= 141.35 mb/s

brotli-9 :  7,105,158 -> 2,671,237 =  3.008 bpb =  2.660 to 1
brotli_decompress_time : 0.080 seconds, 55.84 b/kc, rate= 89.00 mb/s

brotli-10 : 7,105,158 -> 2,518,315 =  2.835 bpb =  2.821 to 1
brotli_decompress_time : 0.098 seconds, 45.54 b/kc, rate= 72.58 mb/s

brotli-11 : 7,105,158 -> 2,482,511 =  2.795 bpb =  2.862 to 1
brotli_decompress_time : 0.097 seconds, 45.84 b/kc, rate= 73.05 mb/s

Oodle Kraken -zl6 : aow3_skin_giants.clb :  7,105,158 -> 2,638,490 =  2.971 bpb =  2.693 to 1
decode           : 0.023 seconds, 195.25 b/kc, rate= 311.19 mb/s

Oodle BitKnit : 7,105,158 -> 2,623,466 =  2.954 bpb =  2.708 to 1
decode only      : 0.095 seconds, 47.11 b/kc, rate= 75.08 mb/s

Oodle LZNA : aow3_skin_giants.clb :  7,105,158 -> 2,394,871 =  2.696 bpb =  2.967 to 1
decode           : 0.170 seconds, 26.26 b/kc, rate= 41.85 mb/s

--------------------

silesia_mozilla

MiniZ : 51,220,480 ->19,141,389 =  2.990 bpb =  2.676 to 1
miniz_decompress_time : 0.571 seconds, 56.24 b/kc, rate= 89.63 mb/s

zlib-ng : 51,220,480 ->19,242,520 =  3.005 bpb =  2.662 to 1
zlib_ng_decompress_time : 0.457 seconds, 70.31 b/kc, rate= 112.05 mb/s

zstd : malloc failed

brotli-9 : 51,220,480 ->15,829,463 =  2.472 bpb =  3.236 to 1
brotli_decompress_time : 0.516 seconds, 62.27 b/kc, rate= 99.24 mb/s

brotli-10 : 51,220,480 ->14,434,253 =  2.254 bpb =  3.549 to 1
brotli_decompress_time : 0.618 seconds, 52.00 b/kc, rate= 82.88 mb/s

brotli-11 : 51,220,480 ->14,225,511 =  2.222 bpb =  3.601 to 1
brotli_decompress_time : 0.610 seconds, 52.72 b/kc, rate= 84.02 mb/s

Oodle Kraken -zl6 : 51,220,480 ->14,330,298 =  2.238 bpb =  3.574 to 1
decode           : 0.200 seconds, 160.51 b/kc, rate= 255.82 mb/s

Oodle Kraken -zl7 : 51,220,480 ->14,222,802 =  2.221 bpb =  3.601 to 1
decode           : 0.201 seconds, 160.04 b/kc, rate= 255.07 mb/s

Oodle LZNA : silesia_mozilla : 51,220,480 ->13,294,622 =  2.076 bpb =  3.853 to 1
decode           : 1.022 seconds, 31.44 b/kc, rate= 50.11 mb/s

I tossed in tests of BitKnit & LZNA in some cases after I realized that the Brotli decode speeds are more comparable to BitKnit than Kraken, and even LZNA isn't that far off (usually less than a factor of 2). eg. you could do half your files in LZNA and half in Kraken and that would be about the same total time as doing them all in Brotli.


Here are charts of the above data :

(silesia_mozilla omitted due to lack of zstd results)

(I'm trying an experiment and showing inverted scales, which are more proportional to what you care about. I'm showing seconds per gigabyte, and percent out of output size, which are proportional to *time* not speed, and *size* not ratio. So, lower is better.)

log-log speed & ratio :

Time and size are just way better scales. Looking at "speed" and "ratio" can be very misleading, because big differences in speed at the high end (eg. 2000 mb/s vs 2200 mb/s) don't translate into a very big time difference, and *time* is what you care about. On the other hand, small differences in speed at the low end *are* important - (eg. 30 mb/s vs 40 mb/s) because those mean a big difference in time.

I've been doing mostly "speed" and "ratio" because it reads better to the novice (higher is better! I want the one with the biggest bar!), but it's so misleading that I think going to time & size is worth it.

5/12/2016

Oodle Kraken Thread-Phased Decoding

Oodle Kraken is already by far the fastest-to-decode high-compression compressor in the world (that's a mouthful!). But it's got a magic trick that makes it even faster :

Oodle Kraken can decode its normal compressed data on multiple threads.

This is different than what a lot of compressors do (and what Oodle has done in the past), which is to split the data into independent chunks so that each chunk can be decompressed on its own thread. All compressors can do that in theory, Oodle makes it easy in practice with the "seek chunk" decodes. But that requires special encoding that does the chunking, and it hurts compression ratio by breaking up where matches can be found.

The Oodle Kraken threaded decode is different. To distinguish it I call it "Thread-Phased" decode. It runs on normal compressed data - no special encoding flags are needed. It has no compressed size penalty because it's the same normal single-thread compressed data.

The Oodle Kraken Thread-Phased decode gets most of its benefit with just 2 threads (if you like, the calling thread + 1 more). The exact speedup varies by file, usually in the 1.4X - 1.9X range. The results here are all for 2-thread decode.

For example on win81, 2-thread Oodle Kraken is 1.7X faster than 1-thread : (with some other compressors for reference)


win81 :

Kraken 2-thread  : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.075 seconds, 410.98 b/kc, rate= 1398.55 M/s

Kraken           : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.127 seconds, 243.06 b/kc, rate= 827.13 M/s

zstdmax          : 104,857,600 ->39,768,086 =  3.034 bpb =  2.637 to 1 
decode           : 0.251 seconds, 122.80 b/kc, rate= 417.88 M/s

lzham            : 104,857,600 ->37,856,839 =  2.888 bpb =  2.770 to 1 
decode           : 0.595 seconds, 51.80 b/kc, rate= 176.27 M/s

lzma             : 104,857,600 ->35,556,039 =  2.713 bpb =  2.949 to 1 
decode           : 2.026 seconds, 15.21 b/kc, rate= 51.76 M/s

Charts on a few files :

Oodle 2.2.0 includes helper functions that will just run a Thread-Phased decode for you on Oodle's own thread system, as well as example code that runs the entire Thread-Phased decode client-side so you can do it on your own threads however you like.

Performance on the Silesia set for reference :


Silesia total :

Oodle Kraken -z6 : 211,938,580 ->51,857,427 =  1.957 bpb =  4.087 to 1

single threaded decode   : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s

two threaded decode      : 0.158 seconds, 394.55 b/kc, rate= 1342.64 M/s

Note that because the Kraken Thread-Phased decode is a true threaded decode of individual compressed buffers that means it is a *latency* reduction for decoding individual blocks, not just a *throughput* reduction. For example, if you were really decoding the whole Silesia set, you might just run the decompression of each file on its own thread. That is a good thing to do, and it would give you a near 2X speedup (with two threads). But that's a different kind of threading - that gives you a throughput improvement of 2X but the latency to decode any individual file is not improved at all. Kraken Thread-Phased decode reduces the latency of each independent decode, and of course it can also be used with chunking or multiple-file decoding to get further speedups.

Oodle 2.2.0 Kraken Optimal Parse improvements

Oodle 2.2.0 is about to ship, with some improvements to the Kraken optimal parse compression ratios. Compressed size is improved by around 1%. Speed is approximately the same at -z6 (previous max level for Kraken) but there's a new -z7 mode that's slightly slower and even higher compression.

I think we'll continue to find improvements in the optimal parsers over the coming months (optimal parsing is hard!) which should lead to some more tiny gains in the compression ratio in the slow encoder modes.


Silesia , sum of all files

uncompressed : 211,938,580

Kraken 2.1.5 -z6 : 52,366,897
Kraken 2.2.0 -z6 : 51,857,427
Kraken 2.2.0 -z7 : 51,625,488

Oodle Kraken 2.1.5 topped out at -z6 (Optimal2). There's a new -z7 (Optimal3) mode which gets a bit more compression at the cost of a bit of speed, which is why it's on a separate option instead of just part of -z6.

Results on some individual files (Kraken 220 is -z7) :

-------------------------------------------------------
"silesia_mozilla"

by ratio:
lzma        :  3.88:1 ,    2.0 enc mb/s ,   63.7 dec mb/s
Kraken 220  :  3.60:1 ,    1.1 enc mb/s ,  896.5 dec mb/s
lzham       :  3.56:1 ,    1.5 enc mb/s ,  186.4 dec mb/s
Kraken 215  :  3.51:1 ,    1.2 enc mb/s ,  928.0 dec mb/s
zstdmax     :  3.24:1 ,    2.8 enc mb/s ,  401.0 dec mb/s
zlib9       :  2.51:1 ,   12.4 enc mb/s ,  291.5 dec mb/s
lz4hc       :  2.32:1 ,   36.4 enc mb/s , 2351.6 dec mb/s

-------------------------------------------------------
"lzt99"

by ratio:
lzma        :  2.65:1 ,    3.1 enc mb/s ,   42.3 dec mb/s
Kraken 220  :  2.53:1 ,    2.0 enc mb/s ,  912.0 dec mb/s
Kraken 215  :  2.46:1 ,    2.3 enc mb/s ,  957.1 dec mb/s
lzham       :  2.44:1 ,    1.9 enc mb/s ,  166.0 dec mb/s
zstdmax     :  2.27:1 ,    3.8 enc mb/s ,  482.3 dec mb/s
zlib9       :  1.77:1 ,   13.3 enc mb/s ,  286.2 dec mb/s
lz4hc       :  1.67:1 ,   30.3 enc mb/s , 2737.4 dec mb/s

-------------------------------------------------------
"all_dds"

by ratio:
lzma        :  2.37:1 ,    2.1 enc mb/s ,   40.8 dec mb/s
Kraken 220  :  2.23:1 ,    1.0 enc mb/s ,  650.6 dec mb/s
Kraken 215  :  2.18:1 ,    1.0 enc mb/s ,  684.6 dec mb/s
lzham       :  2.17:1 ,    1.3 enc mb/s ,  127.7 dec mb/s
zstdmax     :  2.02:1 ,    3.3 enc mb/s ,  289.4 dec mb/s
zlib9       :  1.83:1 ,   13.3 enc mb/s ,  242.9 dec mb/s
lz4hc       :  1.67:1 ,   20.4 enc mb/s , 2226.9 dec mb/s

-------------------------------------------------------
"baby_robot_shell.gr2"

by ratio:
lzma        :  4.35:1 ,    3.1 enc mb/s ,   59.3 dec mb/s
Kraken 220  :  3.82:1 ,    1.4 enc mb/s ,  837.2 dec mb/s
Kraken 215  :  3.77:1 ,    1.5 enc mb/s ,  878.3 dec mb/s
lzham       :  3.77:1 ,    1.6 enc mb/s ,  162.5 dec mb/s
zstdmax     :  2.77:1 ,    5.7 enc mb/s ,  405.7 dec mb/s
zlib9       :  2.19:1 ,   13.9 enc mb/s ,  332.9 dec mb/s
lz4hc       :  1.78:1 ,   40.1 enc mb/s , 2364.4 dec mb

-------------------------------------------------------
"win81"

by ratio:
lzma        :  2.95:1 ,    2.5 enc mb/s ,   51.9 dec mb/s
lzham       :  2.77:1 ,    1.6 enc mb/s ,  177.6 dec mb/s
Kraken 220  :  2.77:1 ,    1.0 enc mb/s ,  818.0 dec mb/s
Kraken 215  :  2.70:1 ,    1.0 enc mb/s ,  877.0 dec mb/s
zstdmax     :  2.64:1 ,    3.5 enc mb/s ,  417.8 dec mb/s
zlib9       :  2.07:1 ,   16.8 enc mb/s ,  269.6 dec mb/s
lz4hc       :  1.91:1 ,   28.8 enc mb/s , 2297.6 dec mb/s

-------------------------------------------------------
"enwik7"

by ratio:
lzma        :  3.64:1 ,    1.8 enc mb/s ,   79.5 dec mb/s
lzham       :  3.60:1 ,    1.4 enc mb/s ,  196.5 dec mb/s
zstdmax     :  3.56:1 ,    2.2 enc mb/s ,  394.6 dec mb/s
Kraken 220  :  3.51:1 ,    1.4 enc mb/s ,  702.8 dec mb/s
Kraken 215  :  3.49:1 ,    1.5 enc mb/s ,  789.7 dec mb/s
zlib9       :  2.38:1 ,   22.2 enc mb/s ,  234.3 dec mb/s
lz4hc       :  2.35:1 ,   27.5 enc mb/s , 2059.6 dec mb/s

-------------------------------------------------------
You can see that encode & decode speed is slightly worse at level -z7 , and compression ratio si improved. (most of the other compression levels have roughly the same decode speed; -z7 enables some special options that can hurt decode speed a bit). Of course even at -z7 Kraken is way faster than anything else comparable!

Tips for benchmarking a compressor

You're about to evaluate Oodle (thanks for having a look!) or some other compressor. Before you start, consider these tips :

1. Time only the compressor.

Place your time measurements only around the compressor. Not IO, not your parsing, not mallocs, just the compress or decompress calls. I understand that in the end what you care about is total time to load, but there can be a lot of issues there that need fixing, and they can cloud the comparison of just the compression part. eg. if your parsing is really slow, that will dominate the CPU time and hide the differences between the compressors.

2. Time what you actually care about.

If you care about decode time, time the decompression. If you care about encode time, time compression. If you care about round-trip time, add the two times. Compressors are not just "fast" or "slow" at both ends, you can't time encoding and decide that it's a fast or slow compressor if what you care about is decoding.

3. Choose the right options.

Most compressors have the ability to target slightly different use cases. The most common option is the ability to trade off encode time vs. compression ratio. So, if what you care about is smallest size, then run the compressor at its highest encode effort level. It can be tricky to get the options right in most compression libraries; we are woefully non-standardized and not well documented. Aside from the simple "level" parameter, there may be other options that are relevant to your goals, perhaps trading off decompressor memory usage, or decompression speed. With Oodle the best option is always to email us and ask what options will best suit your goals.

4. Run apples-to-apples (threads-to-threads) comparisons.

It can be tricky to compare compressors fairly. As much as possible they should be run in the same way, and they should be run in the way that you will actually use them in your final application. Don't profile them with threads if you will not use them threaded in your shipping application.

Threads are a common problem. Compressors should either be tested all threaded (if you will use threads in your final application), or all non-threaded. Unfortunately the defaults are not the same. "lzma" (7z) and LZHAM create threads by default. You have to change their options to tell them to *not* create threads. The normal Oodle_Compress calls will not use threads by default, you have to specifically call one of the _Async threaded routines. (my personal preference is to benchmark everything without threads to compare single-threaded performance, and you can always add threads for production use)

5. Take the MIN of N run times.

To get reliable timing, you need to run the loop many times, and take the MIN of all times. The min will give you the time it takes when the OS isn't interrupting you with task switches, the CPU isn't clocking-down for speedstep, etc. I usually do 30 *per core* but you can probably get a way with a bit less.

6. Wipe the cache.

Assuming you are now doing N loops, you need to invalidate the cache between iterations. If you don't, you will be running the compressor in a "hot cache" scenario, with some buffers already in cache.

7. Don't pack a bunch of files together in a tar if that's not how you load.

It may seem like a good way to test to grab your bunch of test files and pack them together in a tar (or zip -0 or similar package) and run the compression tests on that tar. That's a fine option if that's really how you load data in your final application - as one big contiguous chunk that must be loaded in one big blob. But most people don't. You need to test the compressors in the same way they will be used in the final application. If you load whole file at a time, test the compressors on whole file units. Many people do loading on some kind of paging unit, like perhaps 1 MB chunks. If you do that, then test the compressor on the same thing.

8. Choose your test set.

If you could test on the entire set of buffers that your final application will load, that would be an accurate test. (though actually, even that is a bit subtle, since some buffers are more latency sensitive than others, so for example you might care more about the first few things you load to get into a running application as quickly as possible). That's probably not practical, so you want to choose a set that is representative of what you will actually load. Don't exclude things like already-compressed files (JPEGs and so on) *if* you will be running them through the compressor. (though consider *not* running them your compressed-file loading path, in which case you should exclude them from testing). It's pretty hard to get an accurate representative sample, so it's generally best to just get a variety of files and look at individual per-file results.

9. Look at the spectrum of results, not the sum.

After you run on your test set, don't just add up the compressed sizes and times to make a "total" result. Sums can be misleading. One issue is there are some large incompressible files, they can hide the differences on the more compressible files. But a bigger and more subtle trap is the way that sums weight the combination of results. A sum is a weighting by the size of each file in the test set. That's fine if your test set is all of your data, or is a perfectly proportionally representative sampling of all of your data (a subset which acts like the whole). But most likely it's not. It's best to keep the results per file separate and just have a look at individual cases to see what's going on, how the results differ, and try not to simplify to just looking at the sum.

10. If you do sum, sum *time* not speed, sum *size* not ratio.

Speed (like mb/s) and ratio (raw size/comp size) are inverted measures and shouldn't be summed. What you actually care about is total compressed size, and total time to decode. So if you run over a set of files, don't look at "average speed" or "average ratio" , because those are inverted meaures that will oddly weight the accumulation. Instead accumulate total time to decode, total raw size, and total compressed size, and then if you like you can make "overall speed" and "overall ratio" from those total.

11. Try not to malloc in the timing loop.

Your malloc might be fast, it might be slow, it's best to not have that as a variable in the timing. In general try to allocate the memory for the compressor or decompressor outside of the timing loop. (In Oodle this is done by passing in your own pointer for the "decoderMemory" argument of OodleLZ_Decompress). That would be an unfair test if you didn't also do that in the final application - so do it in the final application too! (similarly, make sure there's no logging inside the timing loop).

12. Consider excluding almost-incompressible files.

This is something you should consider for final shipping application, and if you do it in your shipping application, then you should do it for the benchmark too. The most common case is already-compressed files like JPEG images and MP3 audio. These files can usually be compressed slightly, maybe saving 1% of their size, but the time to decode them is not worth it overall - you can get more total size savings by running a more powerful compressor on other files. So it's most efficient to just send them uncompressed.

13. Tiny files should either be excluded or packed together.

There's almost never a use case where you really want to compress tiny files (< 16k bytes or so) as independent units. There's too much per-unit overhead in the compressor, and more importantly there's too much per-unit overhead in IO - you don't want to eat a disk seek to just to get one tiny file. So in a real application tiny files should always be grouped into paging units that are 256k or more, a size where loading them won't just be a total waste of disk seek time. So, when benchmarking compressors you also shouldn't run them on tiny independent files, because you will never do that in a shipping application.

4/30/2016

Oodle Kraken Pareto Frontier

The whole idea of Kraken is to excel in the size-decodespeed tradeoff. From the beginning of its design, every decision was considered in terms of the size vs decodespeed.

So how does it do?

To visualize the size-decodespeed Pareto frontier, I like to use an imaginary loading problem. You want to load compressed data and then decompress it. One option is the null compressor (or "memcpy") that just loads uncompressed data and then memcpys it. Or you can use compressors that give smaller file sizes, thus quicker loads, but take more time to decompress. By dialing the virtual disk speed, you can see which compressor is best at different tradeoffs of space vs. speed.

Of course you may not actually be just trying to optimize load time, but you can still use this imaginary loading problem as a way to study the space-speed tradeoff. If you care more about size, you look at lower imaginary disk speeds, if you core more about CPU use, you look at higher imaginary disk speeds.

I like to show "speedup" which is the factor of increase in speed using a certain compressor to do load+decomp vs. the baseline (memcpy). So the left hand y-intercepts (disk speed -> 0) show the compression ratio, and the right hand y-intercepts side show the decompression speed (disk speed -> inf), and in between shows the tradeoff. (By "show" I mean "is linearly proportional to, so you can actually look at the ratio of Y-intercepts in the Pareto curve and it tells you compression ratio on the left and decompression speed on the right).

03-02-15 - Oodle LZ Pareto Frontier and 05-13-15 - Skewed Pareto Chart

The chart showing "millis" shows time, so obviously lower is better. I show a few ways to combine load & decomp. Sum is the time to load then decomp sequentially. Max is the larger of the two, which is the time if they were perfectly overlapped (not usually possible). Personally I think "l2 sum" is most useful. This is the sqrt of sum of squares, which is a kind of sum that biases towards the larger; it's kind of like a mix of "sum" and "max", it means you want the sum to be small, but you also want them to similar times.

Kraken tops the Pareto curve for decodespeed vs size for a huge range of virtual disk speed; from around 2 mb/s up to around 300 mb/s.

Of course the Pareto curve doesn't tell the whole picture. For one thing I don't have encode speed in the equation here at all (and sometimes you need to look at the memory use tradeoff too, so it's really a size-decodespeed-encodespeed-memoryuse surface). For another you sometimes have strict requirements, like I must hit a certain file size, and then you just pick the fastest decoder that meets that requirement. One thing the curves do clearly tell you is when a curve just completely covers another (that is, is greater in Y for all values of X), then that compressor is just never needed (in space & decodespeed).


Silesia :


Game Test Set :


Just for comparison to earlier posts on my Blog about the other Oodle compressors, here's a run of lzt99 with the full Oodle compressor set.

I've included ZStd for reference here just to show that Kraken jumping way up is not because the previous Oodle compressors were bad - far from it, they were already Pareto optimal. (I like to compare against ZStd not because it's bad, but because it's excellent; it's the only other compressor in the world I know of that's even close to Oodle in terms of good compression ratio and high speed (ZStd also has nice super-fast encode speeds, and it's targetted slightly differently, it's better on text and Oodle is better on binary, it's not a direct apples comparison)).

You can see that LZHLW is just completely covered by Kraken, and so is now deprecated in Oodle. Even LZNIB and BitKnit are barely peeking out of the curve, so the range where they are the right answer is greatly reduced to more specialized needs. (for example BitKnit still romps on strongly structured data, and is useful if you just need a size smaller than Kraken)

4/28/2016

Performance of Oodle Kraken

A continuing series reporting on the performance of recently released Oodle Kraken

First some general notes about Oodle before the big dump of numbers. (skip to the bottom for charts)

Oodle is not intended to solve every problem in data compression. Oodle Data Compression is mainly designed for compress-once load-many usage patterns, where the compressed size and decode speed is important, but encode speed is not as important. Things like distribution, packaging, when you bake content into a compressed archive and then serve it to many people. I do consider it a requirement to keep the encoders faster than 1 MB/s on my ancient laptop, because less than that is just too slow even for content baking.

Like most data compressors, Oodle has various compression level options that trade off encoder speed for compressed size. So there are faster and slower encoders; some of the fast modes are good tradeoffs for space-speed. Kraken turns out to be pretty good at getting high compression even in the fast modes. I'll do a post that goes into this in the future.

Oodle is mainly intended for packing binary data. Because Oodle is built for games, we focus on the types of data that games ship, such as textures, models, animations, levels, and other binary structured data that often contains various types of packed numeric data. Oodle is good on text but is not really built for text. In general if you are focusing on a specific type of data (eg. text/html/xml) you will get the best results with domain-specific preprocessors, such as xwrt or wbpe. Oodle+wbpe is an excellent text compressor.

Oodle Kraken is intended for high compression use cases, where you care about size. There's a whole family of compressors now that live in the "just below lzma (7zip)" compression ratio domain, such LZHAM, Brotli, Oodle BitKnit, and ZStd. In general the goal of these is to get as close to lzma compression ratio as possible while providing much better speed. This is where Kraken achieves far more than anyone has before, far more than we thought possible.

Today I will be focusing on high compression modes, looking at decode speed vs. compression ratio. All compressors will generally be run in their max compression mode, without going into the "ridiculously slow" below 1 MB/s range. (for Oodle this means running -zl6 Optimal2 but not -zl7 or higher).

Okay, on to lots of numbers.


Rather than post an average on a test set, which can be misleading due to the selection of the test set or the way the results are averaged (total compressed size or average ratio?), I'll post results on a selection of individual files :

-------------------------------------------------------
"silesia_mozilla"

by ratio:
lzma        :  3.88:1 ,    2.0 enc mbps ,   63.7 dec mbps
lzham       :  3.56:1 ,    1.5 enc mbps ,  186.4 dec mbps
Oodle Kraken:  3.51:1 ,    1.2 enc mbps ,  928.0 dec mbps
zstdmax     :  3.24:1 ,    2.8 enc mbps ,  401.0 dec mbps
zlib9       :  2.51:1 ,   12.4 enc mbps ,  291.5 dec mbps
lz4hc       :  2.32:1 ,   36.4 enc mbps , 2351.6 dec mbps

by encode speed:
lz4hc       :  2.32:1 ,   36.4 enc mbps , 2351.6 dec mbps
zlib9       :  2.51:1 ,   12.4 enc mbps ,  291.5 dec mbps
zstdmax     :  3.24:1 ,    2.8 enc mbps ,  401.0 dec mbps
lzma        :  3.88:1 ,    2.0 enc mbps ,   63.7 dec mbps
lzham       :  3.56:1 ,    1.5 enc mbps ,  186.4 dec mbps
Oodle Kraken:  3.51:1 ,    1.2 enc mbps ,  928.0 dec mbps

by decode speed:
lz4hc       :  2.32:1 ,   36.4 enc mbps , 2351.6 dec mbps
Oodle Kraken:  3.51:1 ,    1.2 enc mbps ,  928.0 dec mbps
zstdmax     :  3.24:1 ,    2.8 enc mbps ,  401.0 dec mbps
zlib9       :  2.51:1 ,   12.4 enc mbps ,  291.5 dec mbps
lzham       :  3.56:1 ,    1.5 enc mbps ,  186.4 dec mbps
lzma        :  3.88:1 ,    2.0 enc mbps ,   63.7 dec mbps
-------------------------------------------------------
"lzt99"

by ratio:
lzma        :  2.65:1 ,    3.1 enc mbps ,   42.3 dec mbps
Oodle Kraken:  2.46:1 ,    2.3 enc mbps ,  957.1 dec mbps
lzham       :  2.44:1 ,    1.9 enc mbps ,  166.0 dec mbps
zstdmax     :  2.27:1 ,    3.8 enc mbps ,  482.3 dec mbps
zlib9       :  1.77:1 ,   13.3 enc mbps ,  286.2 dec mbps
lz4hc       :  1.67:1 ,   30.3 enc mbps , 2737.4 dec mbps

by encode speed:
lz4hc       :  1.67:1 ,   30.3 enc mbps , 2737.4 dec mbps
zlib9       :  1.77:1 ,   13.3 enc mbps ,  286.2 dec mbps
zstdmax     :  2.27:1 ,    3.8 enc mbps ,  482.3 dec mbps
lzma        :  2.65:1 ,    3.1 enc mbps ,   42.3 dec mbps
Oodle Kraken:  2.46:1 ,    2.3 enc mbps ,  957.1 dec mbps
lzham       :  2.44:1 ,    1.9 enc mbps ,  166.0 dec mbps

by decode speed:
lz4hc       :  1.67:1 ,   30.3 enc mbps , 2737.4 dec mbps
Oodle Kraken:  2.46:1 ,    2.3 enc mbps ,  957.1 dec mbps
zstdmax     :  2.27:1 ,    3.8 enc mbps ,  482.3 dec mbps
zlib9       :  1.77:1 ,   13.3 enc mbps ,  286.2 dec mbps
lzham       :  2.44:1 ,    1.9 enc mbps ,  166.0 dec mbps
lzma        :  2.65:1 ,    3.1 enc mbps ,   42.3 dec mbps
-------------------------------------------------------
"all_dds"

by ratio:
lzma        :  2.37:1 ,    2.1 enc mbps ,   40.8 dec mbps
Oodle Kraken:  2.18:1 ,    1.0 enc mbps ,  684.6 dec mbps
lzham       :  2.17:1 ,    1.3 enc mbps ,  127.7 dec mbps
zstdmax     :  2.02:1 ,    3.3 enc mbps ,  289.4 dec mbps
zlib9       :  1.83:1 ,   13.3 enc mbps ,  242.9 dec mbps
lz4hc       :  1.67:1 ,   20.4 enc mbps , 2226.9 dec mbps

by encode speed:
lz4hc       :  1.67:1 ,   20.4 enc mbps , 2226.9 dec mbps
zlib9       :  1.83:1 ,   13.3 enc mbps ,  242.9 dec mbps
zstdmax     :  2.02:1 ,    3.3 enc mbps ,  289.4 dec mbps
lzma        :  2.37:1 ,    2.1 enc mbps ,   40.8 dec mbps
lzham       :  2.17:1 ,    1.3 enc mbps ,  127.7 dec mbps
Oodle Kraken:  2.18:1 ,    1.0 enc mbps ,  684.6 dec mbps

by decode speed:
lz4hc       :  1.67:1 ,   20.4 enc mbps , 2226.9 dec mbps
Oodle Kraken:  2.18:1 ,    1.0 enc mbps ,  684.6 dec mbps
zstdmax     :  2.02:1 ,    3.3 enc mbps ,  289.4 dec mbps
zlib9       :  1.83:1 ,   13.3 enc mbps ,  242.9 dec mbps
lzham       :  2.17:1 ,    1.3 enc mbps ,  127.7 dec mbps
lzma        :  2.37:1 ,    2.1 enc mbps ,   40.8 dec mbps
-------------------------------------------------------
"baby_robot_shell.gr2"

by ratio:
lzma        :  4.35:1 ,    3.1 enc mbps ,   59.3 dec mbps
Oodle Kraken:  3.77:1 ,    1.5 enc mbps ,  878.3 dec mbps
lzham       :  3.77:1 ,    1.6 enc mbps ,  162.5 dec mbps
zstdmax     :  2.77:1 ,    5.7 enc mbps ,  405.7 dec mbps
zlib9       :  2.19:1 ,   13.9 enc mbps ,  332.9 dec mbps
lz4hc       :  1.78:1 ,   40.1 enc mbps , 2364.4 dec mbps

by encode speed:
lz4hc       :  1.78:1 ,   40.1 enc mbps , 2364.4 dec mbps
zlib9       :  2.19:1 ,   13.9 enc mbps ,  332.9 dec mbps
zstdmax     :  2.77:1 ,    5.7 enc mbps ,  405.7 dec mbps
lzma        :  4.35:1 ,    3.1 enc mbps ,   59.3 dec mbps
lzham       :  3.77:1 ,    1.6 enc mbps ,  162.5 dec mbps
Oodle Kraken:  3.77:1 ,    1.5 enc mbps ,  878.3 dec mbps

by decode speed:
lz4hc       :  1.78:1 ,   40.1 enc mbps , 2364.4 dec mbps
Oodle Kraken:  3.77:1 ,    1.5 enc mbps ,  878.3 dec mbps
zstdmax     :  2.77:1 ,    5.7 enc mbps ,  405.7 dec mbps
zlib9       :  2.19:1 ,   13.9 enc mbps ,  332.9 dec mbps
lzham       :  3.77:1 ,    1.6 enc mbps ,  162.5 dec mbps
lzma        :  4.35:1 ,    3.1 enc mbps ,   59.3 dec mbps
-------------------------------------------------------
"win81"

by ratio:
lzma        :  2.95:1 ,    2.5 enc mbps ,   51.9 dec mbps
lzham       :  2.77:1 ,    1.6 enc mbps ,  177.6 dec mbps
Oodle Kraken:  2.70:1 ,    1.0 enc mbps ,  877.0 dec mbps
zstdmax     :  2.64:1 ,    3.5 enc mbps ,  417.8 dec mbps
zlib9       :  2.07:1 ,   16.8 enc mbps ,  269.6 dec mbps
lz4hc       :  1.91:1 ,   28.8 enc mbps , 2297.6 dec mbps

by encode speed:
lz4hc       :  1.91:1 ,   28.8 enc mbps , 2297.6 dec mbps
zlib9       :  2.07:1 ,   16.8 enc mbps ,  269.6 dec mbps
zstdmax     :  2.64:1 ,    3.5 enc mbps ,  417.8 dec mbps
lzma        :  2.95:1 ,    2.5 enc mbps ,   51.9 dec mbps
lzham       :  2.77:1 ,    1.6 enc mbps ,  177.6 dec mbps
Oodle Kraken:  2.70:1 ,    1.0 enc mbps ,  877.0 dec mbps

by decode speed:
lz4hc       :  1.91:1 ,   28.8 enc mbps , 2297.6 dec mbps
Oodle Kraken:  2.70:1 ,    1.0 enc mbps ,  877.0 dec mbps
zstdmax     :  2.64:1 ,    3.5 enc mbps ,  417.8 dec mbps
zlib9       :  2.07:1 ,   16.8 enc mbps ,  269.6 dec mbps
lzham       :  2.77:1 ,    1.6 enc mbps ,  177.6 dec mbps
lzma        :  2.95:1 ,    2.5 enc mbps ,   51.9 dec mbps
-------------------------------------------------------
"enwik7"

by ratio:
lzma        :  3.64:1 ,    1.8 enc mbps ,   79.5 dec mbps
lzham       :  3.60:1 ,    1.4 enc mbps ,  196.5 dec mbps
zstdmax     :  3.56:1 ,    2.2 enc mbps ,  394.6 dec mbps
Oodle Kraken:  3.49:1 ,    1.5 enc mbps ,  789.7 dec mbps
zlib9       :  2.38:1 ,   22.2 enc mbps ,  234.3 dec mbps
lz4hc       :  2.35:1 ,   27.5 enc mbps , 2059.6 dec mbps

by encode speed:
lz4hc       :  2.35:1 ,   27.5 enc mbps , 2059.6 dec mbps
zlib9       :  2.38:1 ,   22.2 enc mbps ,  234.3 dec mbps
zstdmax     :  3.56:1 ,    2.2 enc mbps ,  394.6 dec mbps
lzma        :  3.64:1 ,    1.8 enc mbps ,   79.5 dec mbps
Oodle Kraken:  3.49:1 ,    1.5 enc mbps ,  789.7 dec mbps
lzham       :  3.60:1 ,    1.4 enc mbps ,  196.5 dec mbps

by decode speed:
lz4hc       :  2.35:1 ,   27.5 enc mbps , 2059.6 dec mbps
Oodle Kraken:  3.49:1 ,    1.5 enc mbps ,  789.7 dec mbps
zstdmax     :  3.56:1 ,    2.2 enc mbps ,  394.6 dec mbps
zlib9       :  2.38:1 ,   22.2 enc mbps ,  234.3 dec mbps
lzham       :  3.60:1 ,    1.4 enc mbps ,  196.5 dec mbps
lzma        :  3.64:1 ,    1.8 enc mbps ,   79.5 dec mbps
-------------------------------------------------------

In chart form :

(lz4 decode speed is off the top of the chart)

I'm not including the other Oodle compressors here just to keep things as simple as possible. If you do want more compression than Kraken, and care about decode speed, then Oodle LZNA or BitKnit are much faster to decode than lzma (7zip) at comparable or better compression ratios.

Visit radgametools.com to learn more about Kraken and Oodle.

4/27/2016

Release the Kraken!

Announcing Oodle Kraken! - a new compression algorithm in the Oodle Data Compression library.

Oodle Kraken offers high compression with incredible decode speed, the likes of which has never been seen before.

Oodle Kraken decodes 3X faster than ZLib, and gets way more compression. There's really nothing even close to Kraken, its performance is revolutionary - it's never been possible to get such high compression and blazing fast decode speed before.

On the "Silesia" test set, the performance is :

Kraken :  4.05 to 1 compression ratio,   919.8 decode mb/s
zlib9  :  2.74 to 1 compression ratio,   306.9 decode mb/s
lzma   :  4.37 to 1 compression ratio,    78.8 decode mb/s

(speed measured on a Core i7-3770 3.4 Ghz , x64, single threaded)

Oodle Kraken's high compression makes it great for downloadables and distribution. Kraken's fast decode speed makes it amazing for in-game loading, paging and streaming, with less CPU load taking time away from your game. There's no need to decompress Kraken data when you install - just load compressed data directly. Kraken is so fast that loading compressed data and then decompressing is faster than just loading uncompressed data off disk!

Visit radgametools.com to learn more about Kraken and Oodle.

I'll be writing more about Oodle Kraken in the coming weeks, such as performance analysis on other data sets and vs. other compressors.

The overall performance on Silesia :

And a chart of the performance on each file in Silesia. This is a loglog scatter plot, with log2 of decode speed on the x axis and log2 of compression ratio on the y axis. (so for example from LZMA to Kraken on "sao" is 4.68 steps on the x axis, that means 25.76X faster)

Oodle Kraken is superb on the AMD Jaguar chip that's in the PS4 and XBox One. Most decompressors really struggle to run fast on those platforms, but not Kraken.

It's hard to write about Oodle Kraken without sounding ridiculous. It's really a huge step in data compression, where huge steps almost never happen.

old rants