01-22-10 - Exponential

One of my all time pet-peeves is people who say things are "increased exponentially". Of course the worst use of all is just when people use it to mean "a lot" , eg. they're not even talking about a trend or a response curve, eg. "the 911 Turbo provides an exponential increase in power over the base spec". This abortion of usage is becoming more and more common. But even scientists and mathematicians will use "exponential" to describe a trend that's faster than linear (it's quite common in NYT math/economics/science articles for example).

Today I was reading this blog on software development organizations and my hackles immediately went up when I read :

"The real cost of complexity increases exponentially."

I started to write a snarky post about how he almost certainly meant "geometrically". But then I started thinking about it a bit more. (* correction : by "geometrically" I mean "polynomially").

Maybe software development time actually is exponential in number of features?

If you're trying to write N features and they are all completely independent, then time is O(N), eg. linear.

If each feature can be used with only one other feature, and that interaction is completely custom but independent, then time is O(N^2), eg. geometric.

Already I think there's a bit of a myth we can tackle. A lot of optimistic software devs think that they can get under even this O(N^2) complexity by making the features interact through some generic well defined pathways. eg. rather than specificially code how feature (A) and feature (B) and feature (A+B) work, they try to just write (A) and (B) but make them both aware of their circumstances and handle various cases so that (A+B) works right. The problem is - you didn't really avoid the O(N^2). At the very least, you still had to test (A+B) to make sure it worked, which meant trying all N^2 ways, so your time is still O(N^2). The code might look like it's O(N) because there's only a function for each feature, but within each function is implicit O(N^2) logic !!

What I mean by implicit logic is the blank lines which testing reveals you don't have to write! eg. :

void Feature_A( )


    if ( SelectionMode() )
        // ... custom stuff here to make (A+C) and (A+D) work

    // !!! blank lines here !
    //  this is implicit knowledge that (A+B) and (A+E) don't need custom code


You might argue that this is slightly less than quadratic complexity for the developer, and tester time is cheaper so we don't care if that's quadratic. But in any case it's geometric and above linear.

But is it actually exponential? What if all the features could be enabled or disabled, so the state of your code is a binary string indicating what features are on or off; eg. 1100 = A on, B on, C off, D off. Then there are in fact 2^N feature states, which is in fact exponential.

Another possibility is if the features can only be enabled one by one, but they have lingering effects. You have some shared state, like a data file you're processing, and you can do A then C then B then E on it. In that case the number of sequences is something like N! which is exponential for large N (actually super-exponential)

Let's concretely consider a real video game coding scenario.

You're trying to write N features. You are "sophisticated" so rather than writing N^2 hard-coded interactions, you make each feature interact with a shared world state via C "channels". (in the old Looking Glass speak, these C channels might be a set of standard "properites" on objects and ways to interact with those channels; in the old Oddworld Munch codebase there were C "component" types that could be on objects such as "SoundTrigger "Pressable" etc.). So your initial code writing time something like O(N*C).

But for the N features to really be meaningful, the C is ~= N (roughly proportional). (or at least C ~= log(N) , but certainly C is not constant as N increases - as you add features you always find you need more channels for them to communicate with each other). So just code writing time is something between O(NlogN) and O(N^2).

But your features also affect shared state - e.g. the "world" that the game takes place in, be that physical state, or state variables that can be flipped on other objects. If you have N objects each with K internal states, this creates K^N world states that have to be tested. Even with very small world state, if the features are order dependent, you're back to N! test cases.

If the bug rate was a constant percentage of test cases (eg. 0.1% of test cases produce a bug), then you are back to exponential number of bugs = exponential coder time. But I'm not sure that model of bugs is correct. If the bug rate was a constant percentage of lines of code, then bug rate would only be geometric.


01-17-10 - Nob or Knob -

Is there a difference between Nob and Knob or are they just alternate spellings of the same thing ?

It appears that "knob" is generally considered more correct now, though in old english "nob" was more common. There are many meanings, I'll show with the spelling I prefer :

knob : dial/wheel control
knob : bump or protuberance
knob : small amount, usually of butter
knob : head of the penis
nob  : head of a man (archaic)
nob  : wealthy or upper class person (archaic)

Some people seem to think that either "nob" or "knob" are exclusively correct for the slang meaning of penis (they also disagree on whether it refers to the whole penis or just the head). It's unclear to me where the origin of this slang came from, since "nob" can either mean a person's head (archaic), which would suggest the slang came to refer to head of the penis, or "knob" can mean any protuberance.

Wiktionary seems to think "knob" is a common way to refer to a hill; perhaps this is British, I've never heard it in America.

Some weird uses :

"Nob Hill" - my guess is this common name refers to a hill where the wealthy people lived, not the fact that the hill was a knob, but I could be totally wrong about that. Some people seem to point Nob Hill at nabob but since nabob basically means the same thing as nob I don't see why you would point "Nob" at "nabob" when you could just point it at "nob".

"Hob Nob" - apparently this is completely unrelated if you believe this etymology it came from habbe nabbe

"For his nobs" in Cribbage is a funny one ; it appears to also be completely unrelated, coming from the game noddy which means simpleton, and since the knave of the same suit was important to the game it was referred to as the "knave noddy" or just "noddy" which must have become nob. (unrelated but the story of John Suckling, purported inventor of modern Cribbage, seems pretty fascinating; he was apparently a master gambler and cheater at cards who used his skill to get money beyond his station; he was involved in a plot to spring a prisoner from the tower of london, and received at least one beating at the handle of a nobleman tradgames ; ezinearticles ; wikipedia )

There are some funny uses : Nob Hill Knob Set and Nifty Nob Inc. maker of fine Knobs good job on the consistency, guys.


01-14-10 - A small note on Trellis quantization

See reference .

I guess this is obvious, but you do get a pretty nice win from using the true floating point Dct results rather than the quantized Dct results when you do TQ.

I believe the standard practice (what I was doing before anyway) is to do your normal fast Dct + quantization, which takes your integer pixels and makes quantized integer post-Dct output. You then apply TQ on that quantized-and-transformed output, which means instead of sending the true output { 4,3,0,0 } you also consider {4,2,0,0 } and {4,0,0,0 } and so on, measure J(R,D) of each and pick the best.

Okay, but the distortion of changing "1" to 0 is not the same as the distortion of another 1 to 0 , if those were not really the same 1 before quantization.

For example, say you're quantizing with a quantizer = 1.0 for simplicity, and no deadzone, even bucket sizes, so you have quantization buckets :

[ -0.5, 0.5 ] -> 0
[ 0.5 , 1.5 ] -> 1

In that case, when TQ decides to take a quantized "1" and send instead a 0, if the true value was 0.51 , that's not so bad. If the true original value was 1.49 , that's a lot worse.

However, interestingly, if the true original value was 1.49 , then we could send it as a "2". If the value is near a quantization boundary, then the distortion doesn't care whether you kick the value up or down, but the rate might be significantly different, in which case you should make a choice based on J and get a win.

So my Dct now also does the true float -> float transform just for use in the distortion measurement for TQ. It's also useful in this application to make sure your Dct doesn't do any scaling, so that the transform is Unitary, that is, L2 norm is preserved. That way the distortion measure in post-Dct space is the same as the distortion measure in pre-Dct space which means you can use the same lambda for lagrange J decisions.


01-13-10 - Oodle Revisited

I got some emails from a friend recently that made me start thinking about Oodle again. Friend is an indie 360 developer who just shot me a query like (paraphrasing) :

"Hey, I'm loading stuff in my game and only getting like 20 MB/sec , what gives?"

So we started trying to dig into what he was doing exactly - are you opening/reading the files with all the right flags, are you on 4k alignment, are you on a thread so you aren't just stalling out for the seeks, etc. ? In the end I think we figured out that his problem was he was reading too many small files, to which he replied :

"Oh yeah, duh, I've always packed files into bundles and loaded big chunks in previous games, but I just hadn't gotten around to it yet in this and it was bugging me that my level loads were taking so long."

! Ah ha ! To me this is where Oodle comes in as a product. It's not super hard stuff, but it's something that people always put off until the very end, which is kind of a shame because it means their level loads take forever during dev. So for three years while you're making your game you suffer through annoying slow level loads. Instead you buy Oodle on day 1 and your level loads are automagically fast.

I believe in this as a product, in the sense that I think we can make it, and it would be extremely valuable, but I'm not convinced that the game industry is mature enough to buy it. The game industry has always suffered from the mistaken thinking of "I could write that myself, so I won't pay for it". That's retarded. What you should look at is what's the full cost of writing it yourself, including debugging, and perhaps most importantly including the opportunity cost of spending that time on this instead of something else, or the opportunity cost of not having feature X until you've gotten around to writing it.

For example, say you're on a one man dev team and you lay out your coding tasks for your project in order {A,B,C,D,E} . All during dev you are suffering a penalty from not having feature E done. If it would make dev a lot easier to have feature E done up front, you should pay a *lot* of money to get it immediately. Some of the classic mistakes in this vein are things like profile HUDs which people put off until the end, but would provide huge benefit if you had from the beginning; another one people aren't aware of is level save/load and memory card support - you think of that as a minor detail to do at the end, but as soon as you do it level designers are walking around with scenarios saved on memory cards to show to each other and they get a huge productivity boost, so it would have been a nice win to do early (though then you have maintenance pain).

To me the big win of Oodle is :

Client just writes code to load files one by one with plain old fopen/fread/whatever.

Behind the scenes Oodle magically robustly incrementally syncs PC files to consoles, packages files into bundles, makes prefetch lists and prefetches bundles, compresses & decompresses bundles on threads. You have ship quality fast loading from the beginning.

That's a small, simple, great product I think. The problem is to make it something compelling enough to sell we start to add lots of features : high performance paging in/out for seamless worlds, hot reloading of changed content, smooth IO integration with streaming data files like audio/video, DVD layout optimization, texture compressors, etc. which really just wind up clouding the tiny little valuable product at the core.

01-13-10 - Lagrange Rate Control Part 3

Lambda seeking. How do we choose lambda to match a desired rate? I have a few ideas.

One idea is to train up a statistical fit to make good guesses. For any given video (or other data to compress), you can create a relationship between lambda and R that's a pretty simple functional fit. There are a few ways to do this. One way would be to just go ahead and compress the video 16 times with different lambda values, observe R each time, that gives you a bunch of data points, now use those to fit a function to lambda(R) , now you can dial a lambda for any rate. If you're going to be working on the same video a lot, this might not be so bad; in particular in the games or DVD publishing business this might be tolerable.

A slight tweak on that idea is instead of just making a fit per video, instead fit to some characteristics of that video. Sample some kind of moments from the video, like do one pass with simple mocomp and measure the L2 norm of the movec and the L2 norm of the residuals. Those are just two moments. Use those to build up a fit lambda(R,m1,m2). To make this fit, we could run on a variety of sample videos to build a database of some good seeds. Furthermore, whenever a user does a run, it would add a seed to their local database. That way if they run on the same video multiple times, those moments would be matched exactly and the fit would get better. If we don't find a close enough sample point in the database to make a good fit, then we just force a test compression of the video at a given lambda and measure the R. So the end result is that rather than doing a bunch of test compressions to build a model, we just do a quick scan to sample some moments, then take the R parameter and look up the fit and get a guess for lambda. Note that rather than doing this fit per video, you could do it *per frame* which would give you a lot more dense sample data very quickly.

Another way to get a decent guess for lambda is with interpolation search. The procedure would be something like this :

pick good brackets for lambda
    high and low lambdas that you're pretty sure the desired lambda is between, but are as close together as possible
    (requires some kind of moment sampling or heuristic training)

evalue R(lambda) at the two end points and half way between them
    that's 3 compression runs

from 3 points, fit a quadratic R(lambda)

use quadratic to choose 4th point lambda
    (bias it so its not too close to the 3 we already did)

run compression at 4th point
    if that R(lambda) is close enough to target -> done

make line between 4th point and neighbor that straddles target R
use lerp to choose 5th lambda

4 or 5 evaluations of R(lambda)

In many cases with 4 runs your R may be close enough to desired rate and you are just done. The quality of interpolation search depends on how simple/smooth the R(lambda) function is. In general it is a good fit to simple curves, but you can get unlucky in your sample point choices.

Finally, if you want to hit a rate more exactly, you can do so more efficiently by relaxing the global uniform lambda requirement. Say you have some lambda which is pretty close to giving you the right rate and you want to encode to hit that rate more exactly. You let lambda drift a little bit as you code to try to adjust towards hitting the exact bit rate.

I'm not sure exactly what the ideal algorithm is for this, so if you have better ideas let me know. It's a little bit of a nasty black box feedback problem, because I'm trying to hit R by dialing L, and I have this unknown function R(L). One assumption that we will use, which most video coders use in one way or another, is that the video is locally self similar, that is, frame N is usually similar to frame N+1 , so if I observe something about the function R(L) on frame N, it is a good guess that it behaves the same way on frame N+1 ; obviously this is grossly untrue for major cuts, but those are "rare" and we just accept the innaccuracy there (you could also have a panic mode when you see you got it really wrong).

So the idea is that rather than search L around on a given frame, we use previous frames to just make a guess for L and only encode the frame once. If we get it wrong by a little bit, no big deal, we'll make up for it on the next frame. This only works when we are only trying to make small corrections.

Specifically : you did a previous pass at lambda L1 and that gave you total size R1. It also gave you a size for each frame : F1(i). You now want to hit size R2 (which is very close to R1).

At frame i you have already written W2 bits, so you have (R2 - W2) remaining. In pass one you had (R1 - W1) remaining at the same spot. Compute the desired size for the current frame as :

F2 = F1 * (R2 - W2) / (R1 - W1)

To hit frame size F2 , you know if F2 is close to F1 then you should use L = L1. If F2 is a little bigger or smaller either way you should adjust L slightly. To do that, we track a running estimate of dL/dF , call it M for the slope. So we use :

L = L1 + (F2 - F1) * M

We then compress with this L which gives us a frame size F(L). We then compute the actual observed slope :

S = (L - L1) / ( F(L) - F1 )

and then update M using S via IIR or FIR.

There are a lot of kludgy things you'd need to do here, like seed M with a decent guess, don't blend in S updates if you get a weird result ( like F(L) = F1 ), forbid L from making big steps away from L1 even if the estimate really wants to, etc.

Also I'm not sure if dF/dL is really the right variable to estimate ; maybe it would be better to do d(F/F1)/ d(L/L1) , or something. That is, for frames of different characters, what is the most consistent response of frame size to L variation?


01-12-10 - Lagrange Rate Control Part 2

Okay, so we've talked a bit about lagrange coding decisions. What I haven't mentioned yet is that we're implicitly talking about rate control for video coders. "Rate control" just means deciding how many bits to put in each frame. That's just a coding decision. If the frames were independent (eg. as in Motion JPEG - no mocomp) then we know that lagrange multiplier R/D decisions would in fact be the optimal way to allocate bits to frames. Thus, if we ignore the frame-dependence issue, and if we pretend that all distortions are equal - lagrange rate control should be optimal.

What does lagrange rate control mean for video? It means you pick a single global lambda for the whole video. This lambda tells you how much a bit of rate should be worth in terms of distortion gain. Any bit which doesn't give you at least lambda worth of distortion gain will not be coded. (we assume bits are of monotonically decreasing value - the first bit is the most important, the more bits you send the less value they have). On each frame of video you want to code the frame to maximize J. The biggest single control parameter to do this is the quantizer. The quantizer will largely control the size of the frame. So you dial Q to maximize J on that frame, this sets a frame size. (maximizing J will also affect movec choices, macroblock mode choices, etc).

Frames of different content will wind up getting different Q's and very different sizes. So this is very far from constant bit rate or constant quantizer. What it is is "constant bit value". That is, all frames are of the size where adding one more bit does not help by lambda or more. Harder to code (noisy, fast motion) frames will thus be coded with more error, because it takes more bits to get the same amount of gain in that type of frame. Easy to code (smooth, flat) frames will be coded with much less error. Whether or not this is good perceptually is unclear, but it's how you get optimal D or a given R assuming our D choice is what we want.

Ideally you want to use a single global lambda for your whole video. In practice that might not be possible. Usually the user wants to specificy a certain bit rate, either because they actually need to meet a bit rate maximum (eg. for DVD streaming) , or because they want to specify a maximum total size (eg. for fitting on your game's ship DVD), or simply because that's an intuitive way of specifying "quality" that people are familiar with from MP3 audio and such. So your goal is to hit a rate. To do that with a single global lambda, you would have to try various lambdas, search them up and down, re-encode the whole video each time. You could use binary search (or maybe interpolation search), but this is still a lot of re-encodings of the whole video to try to hit rate. (* more on this later)

Aside : specifying lambda is really how people should encode videos for distribution as downloads via torrents or whatever. When I'm storing a bunch of videos on my hard disk, the limitting factor is my total disk size and the total download time - I don't need to limit how big each individual movie is. What I want is for the bits to go where they help me most. That's exactly what lambda does for you. It makes no sense that I have some 700 MB half hour show that would look just fine in 400 MB , while I have some other 700 MB show that looks like shit and could really use some more bits. Lambda is the right way to allocate hard drive bytes for maximum total viewing quality.

Okay. The funny thing is that I can't find anyone else on the web or in papers talking about lagrange video rate control. It's such an obvious thing that I expected it to be the standard way, but it's not.

What do other people do? The de-facto standard seems to be what x264 and FFMPEG do, which I'll try to roughly outline (though I can't say I get all the details since the only documentation is the very messy source code). Their good mode is two pass, so I'll only talk about that.

The primary thing they do is pick a size for each frame somehow, and then try to hit that size. To hit that frame size, they search QP ( the quantization parameter) a bit. The specifically only search QP in the local neighborhood of the previous QP because they want to limit QP variation between frames (the range of search is a command line parameter - in fact almost everything in this is a command line parameter so I'll stop saying that). When they choose a QP, there's a heuristic formula for H264 which specifies a lambda for lagrange decisions that corresponds to that QP. Note that this lambda is only used for inside-the-frame coding decisions, not for choosing QP or global rate allocation. Also note that the lambda-QP relationship is not necessarily optimal; it's a formula (there are a bunch of H264 papers about making good lambda-QP functional fits and searches). They also do additional funny things like run a blurring pass on QP to smooth out variation; presumably this is a heuristic for perceptual purposes.

So the main issue is how do they pick this frame size to target? So far as I can tell it's a mess of heuristics. For each frame they have a "complexity" measure C. On the first pass C is computed from entropy of the delta or some such thing, raised to the 0.8 power (purely heuristic I believe). The C's are then munged by some fudge factors (that can be set on the command line) - I frame sizes are multiplied by a factor > 1 that makes them bigger, B frame sizes are multipled by a factor < 1 that makes them smaller. Once all the "complexities" are chosen, they are all scaled by (target total size) / (total complexity) to make them add up to the total target size. This sets the desired size for each frame.

Note that this heuristic method has many of the same qualitative properties as full lagrangian allocation - that is, more complex frames will get more bytes than simpler frames, but not *enough* more bytes to give them the same error, so more complex frames will have larger errors than simpler frames. However, quantitatively there's no gaurantee that it's doing the same thing.

So far as I can tell the lagrange method is just better (I'm a little concerned about this because it just seems to vastly obviously better that it disturbs me that not everyone is doing it). Ignoring real world issues we've glossed over, the next big problem is the fact that we have to do this search for lambda, so we'll talk about that next time.

ADDENDUM : x264/ffmpeg rate control references :

ratecontrol.txt - Loren is the main dev but this is very cursory
FFmpeg RateControlContext Struct Reference
FFmpeg libavcodecratecontrol.c Source File
[Ffmpeg-user] changing bitrate on the fly - detailed rate control document, but not written by one of the main devs so beware
x264 Macroblock Tree Ratecontrol testing (committed) - Doom9's Forum - this is about the dependency issue that we haven't discussed

01-12-10 - Lagrange Rate Control Part 1

I'm going to talk about video coding and code stream design. It's important to remember the end goal - we want to make a coded bit stream which is <= N bytes and plays back with the highest quality possible. In all cases I assume the decoder is a fixed known quantity.

First of all it's useful to consider what you would do if you had infinite CPU time. The answer is you should try all coded bit streams. That is, to make the optimal N byte stream, you should make 256 ^ N streams, run each through the decoder, measure the error somehow, and choose the best. That sounds ridiculous, but it is our goal; everything else we talk about will be approximations that are trying to get as close as possible to that.

Let's try to get into the realm of reality. You have some encoder that you think produces good code streams for your given decoder. In various places in your encoder you have decision points - what quantizer to use, what macroblock mode to use, whether to send this value as its true self or as any other value, etc. If you had lots of CPU power you should do a brute force search on all decisions - try every decision all possible ways; if there are M decisions and each has K choices this is K^M. Reject code streams produced that are bigger than N. Measure the error of the results and choose the best.

Okay, that's still ridiculously impractical. But let's consider a simpler case. Say you only have two coding decisions, and most importantly - they are independent, that is one decision does not affect the other. For example say you are coding two separate images, and you are only allowed to choose the quantizer on each image, and your goal is to minimize the total error. If there are K quantizer choices, a full search would mean trying K^2 ways. But in that's not necessary because of independence. All you have to do code the first image K ways & remember the rate (R) and distortion (D) of each way, then the second image K ways and remember R/D, so you have 2K codings. Now to find the optimal combination you could do K^2 table lookups, but even here you can usually do much less, because the R/D tables are monotonic (or very nearly monotonic) which means rather than doing K^2 you only need to pick an entry on each side and then slide each one up and down to search for improved D given the R constraint, which is O(K). This obviously extends to N independent coding choices ; rather than K^N we can do K*N codings.

The simplest way to do this in practice is with a lagrange multiplier framework. Rather than look for optimal D given R, we look for an optimal lagrange cost : J = D + lambda * R.

If J is maximized by our coding choices, it means dJ/dCode = 0. That means 0 = dD/dC + lambda * dR/dC , or dD/dR = - lambda

That is, lambda has selected a slope of the D(R) curve. If D(R) and dD/dR are both monotonic, then this has uniquely selected a rate & encoding. We shall henceforth assume that this monotonicity requirement is true. In practice it is not quite true but it is "mostly" true which we can deal with in hacky ways (* perhaps more on this niggle some day). With this assumption, if your goal is to hit a certain rate, you can monotonically search lambda to find an encoding that maximizes J at that rate.

Now the key point is that maximizing J for a given rate gives you the optimal encoding for independent coding choices. Consider our example of the two images above. Instead I make a quantizer choice from the K choices on each image independently. On each one I measure J and choose the one that maximizes J. (note that since R and D are monotonic I can actually just do a binary search here that's O( log(K) ) not O(K) ). Because they are independent, I have found the maximum J for the full stream. This is the optimal encoding. Why? Because it has made the slope dD/dR equal at each of the coding choices. That is, I can't move bits from one image to the other and get a better distortion at the same rate. If the slopes were not equal, then I would get a win from moving the bits away from where they affect D less to where they affect D more.

This is the same as before, but the win in terms of code flow and structure should be obvious. Before I had to do a bunch of encodings and then go back and consider different bit allocations to find the best. Now I can do one linear pass through my codings and make the optimal J decision at each step, and then after a decision is made I can forget about it and move on to the next and never revisit it. The disadvantage is now that my encoding is parameterized by the unintuitive lambda parameter rather than just R, the rate; if I want to hit a specific rate I have to search lambda in an outer loop (* perhaps more on this later).

Now we're going to make a big leap of faith and see what happens if we try to use this sort of one-pass lagrange optimization on more complicated real world scenarios. What goes wrong? Two big issues. One is that D might not be independent, the other is that coding is not independent.

To be clear, our proposal is to do one-pass lagrange coding decision making. Encode the data in streaming one pass scan. At each decision point, you try all possible ways for the current subset of the data only. You measure J on the current subset and take the decision that maximizes J on that subset. You then do the encoding using that decision and output the encoded bits, and move on to the next subset. R (the rate) can of course be measured on each subset independently (if arithmetic coding, you can count the fractional bits left in the state as well). D (the distortion) must be some measure that can be done locally.

What about D (the distortion measure) ? Well, if D is SSD (sum of square differences) (aka L2 error or MSE), then it is in fact independent from pixel to pixel or block to block, which means it is fully correct to measure D only locally on each decision. But you might object that this is not a good way to measure distortion. I'm going to ignore this for now and just posit that we choose D to be SSD and we're going to optimize for that. (* more on this later).

The big issue is that in the real world our coding decisions are not independent. A huge affect with video coding is of course motion compensation - blocks can be used as reference for later blocks. That means a coding decision now can have affects far in the future. Even if we ignore that and just talk about image coding, blocks affect the coding of other blocks through statistical coding - be it huffman, arithmetic, or context coding. Typically this has an affect of the form : if I chose mode A for the current block, that has the side effect of making mode A cheaper for future blocks, and all other modes slightly more expensive. I contend that the statistical affects on the future are not a huge issue. (you may recall when we talked about LZ77 optimal parsing, I made the same hand-wavey argument to dismiss this issue). There is one big issue about the statistical effects - early on in coding when statistics are sparse, decision 1 can have a huge affect on the statistics which heavily biases decision 2 and leads you far away from the true optimal choice. To stop this, the statistics for decision making purposes should be preconditioned with some "normal" data to reduce the affect of strange early decisions. Note that the statistics used for actual coding need not be the same as the statistics for decision making.

Sometimes a purely greedy coding decision can be very bad, though. Consider the case of something like trellis quantization of DCT coefficients. The true output of the DCT is X,Y,Z. But you need not transmit those values, you can transmit anything and it will just have some rate and distortion. For example, say the output is 0,2,3,0,X,0,0,0,0 . How should you transmit X ? If you're using something like run-length coding or end-of-block signalling, there is a big advantage to sending X as 0. You can't know that unless you could see the future after X. If you only saw 0,2,3,0,X, and didn't know what followed you wouldn't see that. This is a well known issue with quantization of trailing DCT coefficients, but it's also an issue with things like macroblock mode decisions in video coding. The reason is that mode equal to previous mode is coded so much smaller than any other mode, when you make a decision you very strongly affect not only your current rate, but also the following block. One remedy to this is "semi-non-greedy lookahead" type coding as is well known from LZ77 coders; that is, instead of just doing one coding decision, you try all ways for the next 2 or 3 or 4, choose the one that optimizes all of them together, and then discard all but the current one and step ahead one step. Okay I'm going to sort of ignore this issue for now, but we should keep it in the back of our mind that purely greedy forward-scan decision making can be pretty far from optimal.

The other big issue for video is motion compensation. Yeah, that's a big issue, but it's not specific to this type of code stream optimization, so I'm not going to discuss it now. It's an issue that you must address in any video coder that makes decisions. The way to address it in this framework is to choose some scaling factor for "D" on each block. You compute J = c * D + lambda * R , where normally c would be 1, instead you bias it up or down depending on whether you decide the current block is more or less important than average. (blocks which will be sources for future motion compensation should be considered more important in the forward greedy pass decision making). (* perhaps more on this later).

To be continued ...

old rants