I was thinking about it recently, and I wanted to do some vague rambling about the overall issue.
First of all, the Lagrange method for image & video coding is fundamentally wrong in general. The main problem is that it assumes every coding decision is independent - that distortions are isolated and additive, which they aren't.
The core of the Lagrange method is that you set a "bit usefulness" value (lambda) and then make each coding decision independently by minimizing D + lambda*R - that is, you spend more bits on a block only if they improve D by at least lambda per bit.
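As a toy sketch, the per-block decision looks something like this (the mode names and (rate, distortion) numbers here are invented purely for illustration):

```python
# Minimal sketch of a per-block Lagrangian mode decision:
# pick the candidate minimizing cost = D + lambda * R.
def pick_mode(candidates, lam):
    """candidates: list of (name, rate_bits, distortion) tuples."""
    return min(candidates, key=lambda c: c[2] + lam * c[1])

# Made-up candidates for one block:
modes = [
    ("skip",    2, 900.0),   # nearly free, poor quality
    ("inter",  40, 250.0),   # moderate bits, moderate quality
    ("intra", 120,  60.0),   # expensive, high quality
]

print(pick_mode(modes, lam=0.5))   # bits are cheap -> "intra" wins
print(pick_mode(modes, lam=20.0))  # bits are dear  -> "skip" wins
```

The whole point (and the whole problem) is that each call to `pick_mode` sees only its own block's local D.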
But that's just wrong, because distortions are *not* localized and independent. I've mentioned the issue of quality variation a few times recently: if you make one block in an image blurry and leave the others at high detail, that looks far worse than the localized D value tells you, because it's different and stands out. If you have a big patch of similar blocks, then making any of them different in any way is very noticeable and ugly. There are simple non-local effects: if the current block is part of a smooth/gradient area, then blending smoothly with its neighbors in the output is crucial, and the localized D won't tell you that. And there are difficult non-local effects: if the exact same kind of texture occurs in multiple parts of the image, then coding those parts differently makes the viewer go "WTF" - a quality penalty worse than the local D would tell you.
In video, the non-local D effects are even more extreme due to temporal coherence. Any change of quality over time that's due to the coder (and not due to motion) is very ugly - like I-frames coming in with too few bits and then being corrected over time, or, even worse, the horrible MPEG pop when the I-frame doesn't match the cut. Flickering of blocks that change coding quality over time is horrific. Etc. etc. None of this is measurable in a localized Lagrangian decision.
(I'm even ignoring for the moment the fact that the encoding itself is non-local; e.g. coding of the current block affects the coding of future blocks, due to context modeling or value prediction or whatever. I'm just talking about the fact that D is highly non-local.)
The correct thing to do is to have a total-image (or total-video) perceptual quality metric, and make each coding decision based on how it affects the total quality. But this is impossible.
So the funny thing is that the lagrange method actually gets you some global perceptual quality by accident.
Assume we are using quite a simple local D metric like SSD or SAD, possibly with SATD or something. Just in images, perceptually what you want is for smooth blocks to be preserved quite well, and for very noisy/random blocks to be allowed more error. Constant quantizer doesn't do that, but constant lambda does! Because the random-ish blocks are much harder to code - they cost more bits per unit of quality - they get coded at lower quality.
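You can see this fall out of the math with a crude rate model. Suppose (purely as an invented illustration) each block's rate curve is R(D) = k/D, where k measures how hard the block is to code, so noisy blocks have larger k. Minimizing D + lambda*R(D) then gives D* = sqrt(lambda*k): at constant lambda, the hard block automatically lands at higher distortion than the easy one.

```python
import math

# Toy rate model: R(D) = k / D per block, k = coding difficulty (invented).
# Minimizing D + lam * k / D over D gives the closed form D* = sqrt(lam * k).
def best_distortion(k, lam):
    return math.sqrt(lam * k)

lam = 4.0
smooth_k, noisy_k = 100.0, 2500.0      # noisy block is 25x harder to code
print(best_distortion(smooth_k, lam))  # 20.0 - smooth block kept clean
print(best_distortion(noisy_k, lam))   # 100.0 - noisy block allowed more error
```

The actual rate curves in a real coder are nothing like this, but the qualitative behavior - harder blocks settle at higher D under one shared lambda - is the point.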
In video, it's even more extreme and kind of magical. Blocks with a lot of temporal change are not as important visually - it's okay to have high error where there's major motion - and they are harder to code, so they get worse quality. Blocks that stay still need high quality, but they are also easier to code, so that happens automatically.
That's just within a frame, but frame-to-frame - which is what I was talking about as "lagrange rate control" - the same sort of magic comes out. Frames with lots of detail and motion are harder to code, so they get lower quality. Chunks of the video that are still are easier to code, so they get higher quality. The high-motion frames still get more bits than the low-motion frames, just not as many more bits as they would at constant quality.
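That "more bits, but not as many more" behavior also falls out of a toy rate model. Using the same invented per-frame model R(D) = k/D (k = frame complexity), constant lambda gives each frame R* = sqrt(k/lambda), so a 9x harder frame gets only 3x the bits; constant quality (fixed D for every frame) would give it the full 9x.

```python
import math

# Invented per-frame rate model R(D) = k / D, k = frame complexity.
# Constant lambda: minimizing D + lam*k/D gives D* = sqrt(lam*k),
# hence R* = k / D* = sqrt(k / lam).
def bits_const_lambda(k, lam):
    return math.sqrt(k / lam)

# Constant quality: every frame coded to the same fixed D.
def bits_const_quality(k, d):
    return k / d

easy_k, hard_k = 100.0, 900.0  # hard frame is 9x more complex
print(bits_const_lambda(hard_k, 1.0) / bits_const_lambda(easy_k, 1.0))  # 3.0
print(bits_const_quality(hard_k, 10.0) / bits_const_quality(easy_k, 10.0))  # 9.0
```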
It can sort of all seem well justified.
But it's not. The funny thing is that we're optimizing a non-perceptual local D. This D doesn't account for things like the fact that errors in high-motion blocks are less noticeable. It's just a happy accident that optimizing a non-perceptual D winds up producing a pretty good perceptual bit allocation.
Lagrange rate control is sort of neat because it gets you started with pretty good bit allocation without any obvious heuristic tweakage. But that goes away pretty fast. You find that using the L1 vs. L2 norm for D makes a big difference in perceptual quality (maybe L1 squared?); other powers of D change bit allocation a lot. And then you want to do something like MB-tree to push bits backward in time; for example, the I-frame at a cut should get a bigger chunk of bits so that quality pops in rather than trickling in, etc.
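The L1 vs. L2 sensitivity is easy to see on made-up residuals: a block with one big spike and a block with many small errors get ranked in opposite order by the two norms, so the coder spends its bits on different blocks depending on which D you picked.

```python
# Sketch: the same two residual blocks ranked differently by
# L1 (SAD) vs L2 (SSD) distortion. Residual values are invented.
def sad(res):
    return sum(abs(x) for x in res)

def ssd(res):
    return sum(x * x for x in res)

spike  = [8, 0, 0, 0, 0, 0, 0, 0]   # one big error
spread = [2, 2, 2, 2, 2, 2, 2, 2]   # many small errors

print(sad(spike), sad(spread))  # 8 16 -> L1 says the spread block is worse
print(ssd(spike), ssd(spread))  # 64 32 -> L2 says the spike block is worse
```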
I was thinking of this because I mentioned to ryg the other day that I never got B frames working well in my video coder. They worked, and they helped in terms of naive distortion measures, but they created an ugly perceptual quality problem: they had a slightly different look and quality than the P frames, so in a PBPBP sequence you would see a pulsing of quality that was really awful.
The problem is they didn't have uniform perceptual quality. There were a few nasty issues.
One is that at low bit rates, the "B skip" block becomes very desirable in B frames. (For me, "B skip" = send no movec or residual; use predicted movecs to the future and past frames to make an interpolated output block.) The "B skip" is very cheap to send and has pretty decent quality. As you lower the bit rate, suddenly the B frames start picking "B skip" all over, and they actually wind up at lower quality than the P frames. This is an example of a problem I mentioned in the PVQ posts: if you don't have a very smooth continuum of R/D choices, then an RD-optimizing coder will get stuck in holes, and there will be sudden pops of quality that are very ugly.
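Here's a toy illustration of that hole (the two modes and their numbers are invented): with only "B skip" and one coded mode far apart in rate, sweeping lambda makes the chosen distortion jump discontinuously instead of degrading smoothly.

```python
# Two widely separated R/D choices for a block (invented numbers):
# a nearly-free skip with poor quality, and an expensive coded mode.
modes = [("B skip", 1, 500.0), ("coded", 60, 100.0)]

def choose(lam):
    """Standard Lagrangian pick: minimize D + lam * R."""
    return min(modes, key=lambda m: m[2] + lam * m[1])

for lam in (1.0, 4.0, 7.0, 10.0):
    print(lam, choose(lam)[0])
# As lambda crosses ~6.8 the block snaps from "coded" (D=100)
# straight to "B skip" (D=500) -- a sudden visible pop in quality,
# because there is no intermediate R/D choice to land on.
```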
At higher bit rates, the B frames are easier to code to high quality (among other things, the P frame is using mocomp from further in the past), so the pulsing is high-quality B's and lower-quality P's.
It's just an issue that Lagrange rate control can't handle. You either need a very good real perceptual quality metric to do B-P rate control, or you just need well-tweaked heuristics, which seems to be what most people do.