1/09/2012

01-09-12 - LZ Optimal Parse with A Star Part 5

Wrapping up the series with lots of numbers.

Previous parts :

cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 4
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 3
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 2
cbloom rants 10-24-11 - LZ Optimal Parse with A Star Part 1

I'm delirious with fever right now so I might write something inane, but I'm so bored of lying in bed so I'm trying to wrap this up. Anyhoo..

So first of all we have to talk a bit about what we're comparing the A Star parse to.

"Normal" is a complex forward lazy parse using heuristics to guide parsing, as described in Part 1. "Fast" is like Normal but uses simpler heuristics and simpler match finder.

"Chain" is more interesting. Chain is a complex "lazy"-type parser which considers N decisions ahead (eg. Chain 4 considers 4 decisions ahead). It works thusly :

Chain Parse : first do a full parse of the file using some other parser; this provides with a baseline cost to end from each point. Now do a forward parse. At each position, consider all match and literal options. For each option, step ahead by that option and consider all the options at the next position. Add up the cost of each coding step. After N steps (for chain N) add on the cost to end from the first baseline parse. Go back to the original position and finalize the choice with the lowest cost. Basically it's a full graph walk for N steps, then use an estimate of the cost to the end from the final nodes of that sub-graph.

To make Chain parsing viable you have to reduce the number of match options to a maximum of 8 or so. Still Chain N has a complexity of 8^N , so it becomes slow very quickly as N grows.

Chain forward parse is significantly better than LZSS style backwards optimal parse for these LZ coders that have important adaptive state. The baseline parse I use for Chain actually is a backwards LZSS optimal parse, so you can see how it does by looking at the "Chain 0" results.


First overall results. Chain 6 is the most amount of steps I can run in reasonable time, and AStar 2048 means the quantum length for dividing up the file for AStar was 2048.

raw Fast Normal Chain 6 AStar 2048
lzt00 16914 5179 5016 4923 4920
lzt01 200000 198313 198321 198312 198312
lzt02 755121 181109 177792 173220 173315
lzt03 3471552 1746443 1713023 1698949 1690655
lzt04 48649 13088 12412 10407 10249
lzt05 927796 368346 367598 355804 354230
lzt06 563160 352827 351051 344721 343173
lzt07 500000 226533 215996 209133 208566
lzt08 355400 250503 249987 230541 230220
lzt09 786488 302927 287479 268544 265525
lzt10 154624 11508 10958 10307 10291
lzt11 58524 20553 19628 19139 19087
lzt12 164423 29001 26488 23966 23622
lzt13 1041576 935484 931415 924510 922745
lzt14 102400 47690 47298 46417 46350
lzt15 34664 10832 10688 10269 10260
lzt16 21504 10110 10055 9952 9927
lzt17 53161 19526 18514 17971 17970
lzt18 102400 64280 63251 59772 59635
lzt19 768771 322951 288872 269132 269162
lzt20 1179702 888881 872315 856369 855588
lzt21 679936 91677 88011 83529 83184
lzt22 400000 287715 284378 279674 279459
lzt23 1048576 807253 804048 798369 798334
lzt24 3471552 1418076 1411387 1399197 1388105
lzt25 1029744 113085 107882 97320 100175
lzt26 262144 212445 210836 207701 207552
lzt27 857241 237253 235137 222023 220837
lzt28 1591760 332660 308940 260547 252808
lzt29 3953035 1193914 1180823 1147160 1135603
lzt30 100000 100001 100001 100001 100001
10800163 10609600 10337879 10289860


Now number of Chain steps for the chain parser : (that's O0 - O6)

U N O0 O1 O2 O3 O4 O5 O6
lzt00 16914 5016 5024 4922 4922 4922 4922 4923 4923
lzt01 200000 198321 198321 198312 198312 198312 198312 198312 198312
lzt02 755121 177792 177877 175905 174835 174073 173759 173509 173220
lzt03 3471552 1713023 1712337 1704417 1703873 1702651 1701635 1700282 1698949
lzt04 48649 12412 11315 10516 10481 10457 10427 10416 10407
lzt05 927796 367598 368729 365743 364332 360630 356403 355968 355804
lzt06 563160 351051 350995 346856 345500 344778 344739 344702 344721
lzt07 500000 215996 215644 211336 209481 209259 209244 209138 209133
lzt08 355400 249987 249372 239375 237320 231554 231435 233324 230541
lzt09 786488 287479 284875 280683 275679 270721 269754 269107 268544
lzt10 154624 10958 10792 10367 10335 10330 10311 10301 10307
lzt11 58524 19628 19604 19247 19175 19225 19162 19159 19139
lzt12 164423 26488 25644 24217 24177 24094 24108 24011 23966
lzt13 1041576 931415 931415 929713 927841 926162 924515 924513 924510
lzt14 102400 47298 47300 46518 46483 46461 46437 46429 46417
lzt15 34664 10688 10656 10317 10301 10275 10278 10267 10269
lzt16 21504 10055 10053 9960 9966 9959 9952 9948 9952
lzt17 53161 18514 18549 17971 17970 17974 17971 17973 17971
lzt18 102400 63251 63248 59863 59850 59799 59790 59764 59772
lzt19 768771 288872 281959 277661 273316 269157 269141 269133 269132
lzt20 1179702 872315 872022 868088 865376 863236 859727 856408 856369
lzt21 679936 88011 88068 84848 83851 83733 83674 83599 83529
lzt22 400000 284378 284297 281902 279711 279685 279689 279696 279674
lzt23 1048576 804048 804064 802742 801324 799891 798367 798368 798369
lzt24 3471552 1411387 1410226 1404736 1403314 1402345 1401064 1400193 1399197
lzt25 1029744 107882 107414 99839 100154 99710 98552 98132 97320
lzt26 262144 210836 210855 207775 207763 207738 207725 207706 207701
lzt27 857241 235137 236568 233524 228073 223123 222884 222540 222023
lzt28 1591760 308940 295072 286018 276905 273520 269611 264726 260547
lzt29 3953035 1180823 1183407 1180733 1177854 1170944 1162310 1152482 1147160
lzt30 100000 100001 100001 100001 100001 100001 100001 100001 100001
10609600 10585703 10494105 10448475 10404719 10375899 10355030 10337879

Some notes : up to 6 (the most I can run) more chain steps is better - for the sum, but not for all files. In some cases, more steps is worse, which should never really happen, but it's an issue of approximate optimal parsers I'll discuss later. (*)

On most files, going past 4 chain steps helps very little, but on some files it seems to monotonically keep improving. For example lzt29 stands out. Those files are ones that get helped the most by AStar.


Now the effect on quantum size on AStar. In all cases I only output codes from the first 3/4 of each quantum.

raw 256 512 1024 2048 4096 8192 16384
lzt00 16914 4923 4923 4920 4920 4920 4921 4921
lzt01 200000 198312 198312 198312 198312 198312 198314 198314
lzt02 755121 175242 173355 173368 173315 173331 173454 173479
lzt03 3471552 1699795 1691530 1690878 1690655 1690594 1690603 1690617
lzt04 48649 10243 10245 10234 10249 10248 10241 10241
lzt05 927796 357166 354629 354235 354230 354233 354242 354257
lzt06 563160 346663 343202 343139 343173 343194 343263 343238
lzt07 500000 209934 208669 208584 208566 208556 208553 208562
lzt08 355400 228389 229447 229975 230220 230300 230374 230408
lzt09 786488 266571 265564 265487 265525 265559 265542 265527
lzt10 154624 10701 10468 10330 10291 10273 10273 10272
lzt11 58524 19139 19123 19096 19087 19085 19084 19084
lzt12 164423 23712 23654 23616 23622 23628 23630 23627
lzt13 1041576 923258 922853 922747 922745 922753 922751 922753
lzt14 102400 46397 46364 46351 46350 46350 46348 46350
lzt15 34664 10376 10272 10260 10260 10251 10258 10254
lzt16 21504 9944 9931 9926 9927 9927 9927 9927
lzt17 53161 17937 17970 17968 17970 17969 17969 17969
lzt18 102400 59703 59613 59632 59635 59637 59640 59640
lzt19 768771 269213 269151 269128 269162 269193 269218 269229
lzt20 1179702 855992 855580 855478 855588 855671 855685 855707
lzt21 679936 83882 83291 83215 83184 83172 83171 83169
lzt22 400000 279803 279368 279414 279459 279605 279630 279647
lzt23 1048576 798325 798319 798321 798334 798354 798357 798358
lzt24 3471552 1393742 1388636 1388031 1388105 1388317 1388628 1388671
lzt25 1029744 97910 101246 101302 100175 100484 100272 100149
lzt26 262144 207779 207563 207541 207552 207559 207577 207576
lzt27 857241 222229 220832 220770 220837 220773 220756 220757
lzt28 1591760 256404 253257 252933 252808 252737 252735 252699
lzt29 3953035 1136193 1135442 1135543 1135603 1135710 1135689 1135713
lzt30 100000 100001 100001 100001 100001 100001 100001 100001
10319878 10292810 10290735 10289860 10290696 10291106 10291116

The best sum is at 2048, but 1024 is a lot faster and almost the same.

Again, as the previous note at (*), we should really see just improvement with larger quantum sizes, but past 2048 we start seeing it go backwards in some cases.


Lastly a look at where the AStar parse is spending its time. This is for a 1024 quantum.

The x axis here is the log2 of the number of nodes visited to parse a quantum. So, log2=20 means a million nodes were needed to parse that quantum. So for speed purposes a cell one to the right is twice as bad. The values in the cells are the percentage of quanta in the file that needed that number of nodes.

(note : log2=20 means one million nodes were visited to output 768 bytes worth of codes, so it's quite a lot)

log2 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
lzt00 0 0 0 18.18 59.09 18.18 4.55
lzt01 3.75 0.75 41.2 34.08 13.86 5.62 0.75
lzt02 1.81 1.36 25.37 34.09 13.59 13.02 8.15 1.93 0.23 0.23
lzt03 1.46 1.18 17.51 18.46 14.16 13.17 6.95 4.81 3.66 4.81 9.54 2.79 0.96 0.11 0.03
lzt04 1.67 0 0 1.67 0 21.67 5 18.33 3.33 10 16.67 16.67 5
lzt05 0.59 0.25 4.41 10.77 9.92 18.32 13.23 10.09 9.67 6.02 12.47 3.22 0.51 0.08 0.08
lzt06 0.8 0.93 6.81 23.77 14.69 16.96 21.09 11.48 2.67 0.8
lzt07 0.46 0.46 8.66 7.88 6.8 15.3 17 14.53 5.56 9.58 8.19 4.79 0.31 0.31
lzt08 0 0 0 0 0 0 1.68 1.68 1.47 27.67 53.88 11.95 1.68
lzt09 0.29 0.48 0.76 0.86 0.95 3.9 28.07 47.76 16.18 0.38
lzt10 0 0.56 10.17 12.99 9.04 9.04 10.17 41.24 4.52 0.56 1.13
lzt11 0 0 7.89 10.53 14.47 17.11 6.58 9.21 21.05 10.53 2.63
lzt12 0 0 0 0 0 4.27 28.91 59.24 7.58
lzt13 0 0 0.07 0.14 0.57 1.72 3.36 5.72 39.24 42.03 7.08 0.07
lzt14 0 0.83 0 2.5 8.33 34.17 42.5 5 2.5 1.67 0.83 0 0.83
lzt15 0 2.27 4.55 15.91 13.64 15.91 13.64 6.82 11.36 11.36 4.55
lzt16 0 0 3.57 0 14.29 42.86 32.14 3.57
lzt17 1.39 1.39 2.78 1.39 4.17 75 13.89
lzt18 0 0 0 0 0 0.72 0 2.17 2.9 11.59 56.52 23.19 2.9
lzt19 0 0 1.26 2.81 0.39 7.56 87.11 0.87
lzt20 0 0.13 2.08 2.02 4.29 67.07 24.29 0.06
lzt21 0.2 0.78 6.07 6.07 5.28 19.77 35.62 22.9 1.96 0.2 0.2
lzt22 0 0.56 2.98 5.59 26.82 62.94 1.12
lzt23 0 0 0 0 0 0.07 1.35 2.63 0.92 70.88 23.15 0.14 0.36 0.5
lzt24 0.44 0.61 4.14 37.41 7.62 12.68 12.72 8.52 6.11 5.19 3.11 0.94 0.31 0.04
lzt25 0.22 0.43 1.52 1.74 2.68 6.44 15.69 27.19 30.22 13.09 0.72
lzt26 0 0 0 1.15 3.15 2.58 77.65 14.61 0.57
lzt27 0.61 0.1 7.55 6.53 1.22 4.39 5 4.08 7.76 44.8 16.43 1.43
lzt28 0.25 0.1 3.71 0.94 0.74 6.77 15.56 10.08 10.97 14.82 18.68 11.41 4.05 1.24 0.1
lzt29 0.3 0.73 1.61 22.37 5.28 6.16 26.34 2.97 0.48 0.85 19.63 12.47 0.73
lzt30 3.7 0.74 47.41 34.07 12.59 0.74

Well there's no easy answer, the character of the files are all very different.

In many cases the A Star parse is reasonably fast (comparable to Chain 3 or something). But in some cases it's quite slow, eg. lzt04, lzt08, lzt28.


Okay, I think that's all the data. We have one point to discuss :

(*) = in all these type of endeavors, we see these anomolies where as we give the optimizer more space to make decisions, it gets better for a while, then starts getting worse. I saw the same thing, but more extreme, with video coding.

Basically what causes this is that you aren't optimizing for your real final goal. If you were optimizing for the total output size, then giving it more freedom should never hurt. But you aren't. With Chain N or with A Star in both cases you are optimizing just some local portion, and it turns out that if you let it make really aggressive decisions trying to optimize the local bit, that can hurt overall.

A similar issue happens with an Huffman optimal parse, becuase you are using the huffman code lengths from the previous parse to do the current parse. That's fine as long as your parse is reasonably similar, but if you let the optimal parser really go nuts, it can start to get pretty far off those statistics, which makes it wrong, so that more optimizing actually gives worse results.

With video coding the main issue I had was that the optimization was generally local (eg. just on one macro block at a time or some such), but it of course affects the future as a source for motion compensation (and in other ways), and it turns out if you do really aggressive optimization on the local decisions, that can wind up hurting overall.

A similar thing can happen in image and video coding if you let optimization proceed very aggressively, because you have to use some simple analytic criterion (such as RMSE - though even if you use a fancier metric the same problems arise). The issue is that the coder can wind up finding strange states that are a good trade-off for RMSE, but wind up looking just horrible visually.

Obviously the correct solution is to optimize with the true final goal in mind. But that's not always possible, either computationally, or because the final goal is subjective.

Generally the solution is to moderate the optimization in some way. You have some heuristic idea of what kind of solutions will provide good globally optimal solutions. (for example, in image/video coding, you might require that the bit rate allocation not create too big of a difference between adjacent blocks). So you sort of want to guide your optimization to start around where you suspect the answer to be, and then you tune it so that you don't allow it to be too aggressive in making whatever decision it thinks is locally optimal.

5 comments:

ryg said...

What's the relative speed of the Chain* variants compared to A*, and how much slower is ChainN compared to Chain1 (assuming a Suffix Tree or similarly fast match finder)?

cbloom said...

"What's the relative speed of the Chain* variants compared to A*"

A* is a lot more variable. It can be as fast as Chain3 or so, but sometimes as bad as Chain6 or so.

"and how much slower is ChainN compared to Chain1 (assuming a Suffix Tree or similarly fast match finder)? "

First of all, something I maybe didn't point out is that both for ChainN and A* to be reasonably fast you have to pre-find all matches. So I decide in advance that I will consider at most 8 matches per position, and I go find all those matches and save them in an array. So you never redo match finding as you parse.

If it's 8 matches, then Chain N+1 is 8X slower than Chain N.

I should make clear that even if Chain was fast enough, it's not super awesome because it uses a pretty rough approximation for the cost to end after the N steps which doesn't include the current state at all. A* doesn't have this drawback.

At the moment Chain is more usable in production because I haven't found a good way to control the occasional very long time that A* takes.

Yann Collet said...

So it basically means that current LZMA could get even better compression performance by using one of your methodologies ?

cbloom said...

Yes!

cbloom said...

Though only something like 1-3%

old rants