10/30/2010

10-30-10 - Detail Preservation in Images

I've been thinking about how to do a simple measure of "detail" in images that could be used as a metric for lossy compressors.

First of all, one simple thing to ask is: what is the "sharpness" of the image? One way to measure this is to compare the image with a blurred (lowpassed, if you prefer) version of itself. Basically if an image is already blurry (eg. smooth ramps) and you run a blur filter on it, it doesn't change much, but if it was very sharp (eg. a black and white pixel checkerboard), it changes a lot. See for example this page on MTF with nice pictures.

One nice way to measure this is the ratio of the highpass energy over the lowpass energy. To motivate that, here is what we are basically doing :


I = original image
L = lowpass of I

Sharpness = Energy[I] / Energy[L]

H = highpass = I - L
I = L + H

Energy[I] = Energy[L] + Energy[H]

Sharpness = 1 + Energy[H] / Energy[L]

where "Energy" might be something like transform to Fourier domain and sum the power spectrum. Now, this Sharpness has an implicit cutoff frequency f that is the parameter of the lowpass. So it's really S(f) and we could scan f around and make a chart of the sharpness at various frequencies. To measure preservation of detail, you want to compare S(f) at all f's.
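To make this concrete, here's a tiny Python sketch of S(f) on a 1-D signal; the moving-average lowpass and the L2 "energy" are just placeholder choices, and the lowpass radius stands in for the cutoff frequency f :

```python
def lowpass(signal, radius=1):
    # simple moving-average blur, clamped at the edges;
    # the radius stands in for the lowpass cutoff frequency f
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def energy(signal):
    # L2 energy = sum of squares (by Parseval, same as summing the power spectrum)
    return sum(x * x for x in signal)

def sharpness(signal, radius=1):
    # Sharpness = Energy[I] / Energy[L]
    return energy(signal) / energy(lowpass(signal, radius))

# a sharp checkerboard changes a lot under blur; a smooth ramp barely changes
checker = [0.0, 1.0] * 8
ramp = [i / 15.0 for i in range(16)]
assert sharpness(checker) > sharpness(ramp)
```

Note that the identity Energy[I] = Energy[L] + Energy[H] only holds exactly for ideal (orthogonal) filters, so with a crude box blur this ratio is only approximately 1 + E[H]/E[L].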

Now we'd like to have something like this that is more discrete and also localized. We want to ask if the detail at a specific spot is preserved.

A natural idea is the (abs or square) sum of laplacian filters. Something like :


In a local neighborhood of I :
Energy = L1 or L2 sum of Laplacian filters on I
L = blur of I

Sharpness = Energy[I] / Energy[L]

Instead of scanning the lowpass cutoff around, we just picked some single blur amount, but then we can do this in a multi-scale way. Let I0 = original image, I1 = blur I0, I2 = blur I1, etc. , then S0 = E[I0]/E[I1], S1 = E[I1]/E[I2], etc.. To measure preservation of detail at various scales, we compare S0,S1,S2.. from each image to S0,S1,S2.. of the other image (on each local neighborhood). That is, we require the detail level is preserved in that area in the same frequency band.

That is, we make a Gaussian pyramid of images that are blurred more and more, and then we take the energy in each level vs the energy in the parent.

But the laplacian is just the delta of each level from its parent (roughly), something like I0 - I1. So we can just make these delta images, D0 = I0 - I1, D1 = I1 - I2, and then S0 = |D0|/|D1| (just magnitudes, not "energy" measures).
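Continuing the sketch in Python: build the blur pyramid, take the deltas, and compare band magnitudes (again with a throwaway moving-average blur standing in for a Gaussian) :

```python
def blur(signal):
    # one pyramid step: radius-1 moving average (a crude Gaussian stand-in)
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - 1), min(n, i + 2)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def detail_ratios(signal, levels=3):
    # I0 = original, I_{k+1} = blur(I_k) ; D_k = I_k - I_{k+1}
    # S_k = |D_k| / |D_{k+1}| : detail in band k relative to its parent band
    pyramid = [signal]
    for _ in range(levels + 1):
        pyramid.append(blur(pyramid[-1]))
    deltas = [[a - b for a, b in zip(pyramid[k], pyramid[k + 1])]
              for k in range(levels + 1)]
    mags = [sum(abs(x) for x in d) for d in deltas]  # L1 magnitudes per band
    return [mags[k] / (mags[k + 1] + 1e-12) for k in range(levels)]
```

To compare detail preservation between two images you'd compute these ratios on matching local neighborhoods of each and diff them, band by band.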

By now the similarity to wavelets should be obvious. The fine detail images are just the high pass parts of the wavelet. So really all we're doing is looking at the L1 or L2 sum of coefficients in each band pass of a wavelet and comparing to the sum in the parent.

But wavelets also suggest something more that we could have done from the beginning - instead of a symmetric lowpass/highpass we can do separate ones for horizontal, vertical, and diagonal. This tells us not just the amount of energy but a bit more about its shape. So instead of just a sharpness Sn we could measure Sn_H, Sn_V and Sn_D using a wavelet. This would be like using a horizontal laplacian [-1, 2, -1], a vertical one, and an "X"-shaped diagonal one.
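For example, the three small filters might look like this (my guess at the exact filter shapes; applied per-pixel and then abs-or-square summed over the neighborhood) :

```python
def directional_laplacians(img, x, y):
    # H : [-1, 2, -1] along a row, V : the same down a column,
    # D : an "X" shape using the four diagonal neighbors
    p = lambda dx, dy: img[y + dy][x + dx]
    h = 2 * p(0, 0) - p(-1, 0) - p(1, 0)
    v = 2 * p(0, 0) - p(0, -1) - p(0, 1)
    d = 4 * p(0, 0) - p(-1, -1) - p(1, -1) - p(-1, 1) - p(1, 1)
    return h, v, d

# vertical stripes: H responds, V does not
img = [[float(x % 2) for x in range(4)] for _ in range(4)]
h, v, d = directional_laplacians(img, 1, 1)
assert h == 2.0 and v == 0.0
```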

And wavelets suggest something more - we could just use block transforms to do the same thing. An 8x8 Haar is a wavelet on a local chunk (and an 8x8 DCT has "wavelet structure" too). In particular you can arrange it into frequency bands like so :


01223333
11223333
22223333
22223333
33333333
33333333
33333333
33333333

and then take the L1 or L2 sum in each region and ask for preservation.
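A little helper for that grouping (matching the diagram: 0 is the DC, then octave-ish bands by max coordinate) :

```python
def band(i, j):
    # group index for coefficient (i, j) of an 8x8 block, per the diagram
    if i == 0 and j == 0:
        return 0
    m = max(i, j)
    if m < 2:
        return 1
    if m < 4:
        return 2
    return 3

def band_sums(coeffs):
    # L1 sum of coefficients in each of the 4 groups
    sums = [0.0] * 4
    for i in range(8):
        for j in range(8):
            sums[band(i, j)] += abs(coeffs[i][j])
    return sums
```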

The similarity to x264's SATD energy metric is obvious. They use Haar and take the L1 sum of the energy in all the frequency bands to measure the total energy in the block. But we can be a lot more specific. In fact it suggests a whole "multi-accuracy" kind of delta.


Do 8x8 Haar or DCT.
Compare 8x8 blocks A and B.

Add terms :

1. each of the 64 coefficients should be the same :

 += |A_ij - B_ij|

2. the sum of each wavelet band should be the same; that is,
if you use the diagram above for groups 0-3, then within each group
there is H, V, and D, add :

    S(g,H/V/D) = Sum{in group g, H,V or D} 
    
 += | S(A,g,H) - S(B,g,H) |
 += | S(A,g,V) - S(B,g,V) |
 += | S(A,g,D) - S(B,g,D) |

  for g in 0-3

3. ignore the H,V,D and do the same thing just for the frequency
 subband sums :

 += | S(A,g) - S(B,g) |

4. ignore the frequency subbands and do the sum of all the coefficients :

 += | S(A) - S(B) |

These are error terms that go from fine to coarse. This last one (#4) is the most coarse and is the "SATD". Adding the multiple terms together means that if we have errors that screw up the highest precision test (#1) but preserve the other measures, we prefer that kind of error. eg. we prefer the energy to move somewhere nearby in frequency space rather than just disappear.
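Here's a sketch of terms 1, 3 and 4 in Python (term 2, the H/V/D split, is omitted for brevity, and the relative weighting of the terms is a tweakable that I'm ignoring) :

```python
def band(i, j):
    # frequency group of coefficient (i, j), as in the earlier diagram
    if i == 0 and j == 0:
        return 0
    m = max(i, j)
    return 1 if m < 2 else (2 if m < 4 else 3)

def multi_accuracy_delta(A, B):
    # A, B : 8x8 blocks of transform coefficients (Haar or DCT output)
    err = 0.0
    sa, sb = [0.0] * 4, [0.0] * 4
    for i in range(8):
        for j in range(8):
            err += abs(A[i][j] - B[i][j])           # term 1 : per-coefficient
            g = band(i, j)
            sa[g] += abs(A[i][j])
            sb[g] += abs(B[i][j])
    err += sum(abs(x - y) for x, y in zip(sa, sb))  # term 3 : per-band sums
    err += abs(sum(sa) - sum(sb))                   # term 4 : whole-block "SATD"
    return err
```

As desired, energy that moves to a nearby coefficient in the same band only trips term 1, while energy that disappears trips all the terms, so it costs more.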

Now obviously if your coder works on something like 8x8 blocks then you don't want to run a test metric that is also 8x8 block based, certainly not if it's aligned with the coding blocks. You could run 8x8 test metric blocks that are offset by 4 so they straddle neighbors, or you could do 16x16 test blocks centered on each 8x8 code block, or you could do 8x8 test blocks but do one for every pixel instead of just at aligned locations.

10/28/2010

10-28-10 - Game Review - Uncharted 1

After about two hours of jumping around on rocks and running through jungles and shooting brown people in the face, running around picking up their ammo, it all grows rather repetitive. Climb around, shoot brown people, climb around, bleh, this again?

The one sentence Hollywood style pitch would be "it's Tomb Raider, but instead of playing as a scantily clad hot chick, you play as a smug douchebag". Brilliant! Green light that concept!

There's some mildly bad game design in that the world is not at all a "simulation", but they also don't make the gameyness clear - there are tons of things in the world that you should be able to interact with but can't, and then they also don't make it clear at all which things you *can* interact with, and they add new ones all the time to make the "puzzles" (which are not so much puzzles as: stand around mystified until the narrator gives you hints, or randomly walk around until you trigger the cinematic, or randomly press triangle until you find the interactive piece that looks just like all the non-interactive pieces).

The combat is not very well designed, you feel like you don't really have choices or much cleverness you can use in dealing with enemies, and it's not very varied. It makes me feel like the difficulty comes from me fighting the controls, not the enemies. On the plus side, the basic platforming/climbing controls are really good, it's the first game I've ever played where I can do cliff climbing and it just does what I want it to, I don't feel like I'm fighting the controls all the time (as in most ledge-hangers like Ico, PoP, etc).

There are some nice kind of cinematic moments where you transition in and out of cut scenes nicely, though it all feels a bit too "Dragon's Lair" , like I'm just pressing A to watch a movie, and the opportunity for beauty is ruined by the monotony of the environments.

But the real problem is that Nathan Drake is a huge douchebag. He's obviously modeled after J Rubin (real life douchebag extraordinaire), and I just don't want to play as a douchebag. It's not my escapist character roleplaying fantasy to be a self-satisfied smug wisecracker. Any time it went into dialog sequences I'd just be like "groan, eye roll, I hate you".

10-28-10 - Game Review - Fable 2

I beat Fable 2 a while ago and thought it was pretty rubbish (but amusing enough to play through to the end of the main story line). The whole time I was playing I was thinking "man this is rubbish" but I also didn't stop playing, because it was so ridiculously easy and mostly non-frustrating that it never gave me a point where I threw down the controller in disgust and stopped playing.

(the closest it got to making me quit was when I was in the spire and had basically a 30 minute non-interactive sequence where they make you press A every 5 minutes and pretend you're playing, but it's all forced and canned. I hate this trend in games; I think it comes from Valve and their "cinematics while the player still has control" which everyone seems to like so much; I fucking hate it, there's no point in giving me control and saying "walk over to the reactor" when that's my only choice, just make it a fucking non-interactive sequence and walk me over to the reactor for me, thank you. That way I can at least set down my controller and go to the bathroom or get a drink, this fucking semi-interactive canned sequence thing is real shit. If it's a fucking cinematic, make it a fucking cinematic, and you know what, fucking pre-render it too, I don't want god damn low poly in game character models with shitty lip sync, if you're showing me a movie, show me something good)

Games like Fable 2 are really tragic to me, because I thought the art was pretty cool and some of the play mechanics were okay (or could have been okay with a bit of better guidance), but the design and direction was just so completely retarded that it spoils the game.

I like the game world, sort of semi-realistic recent past but with magic, and very British, but I don't find all the stealing from pop culture very amusing (Star Wars, Pirates of the Caribbean, etc.). Make up your own damn fantasy world, or use historical references. I like the fact that it's colorful and playful and cheery and whimsical and all that, it's a nice break from all the horrible gray and "gritty" and grim and gruesome.

There's all this work on the expression system and the town interactions, but it actually does almost nothing for the game play, it's not actually integrated into the story or anything like that, so it's just pointless. Obviously the way your character changes over time is a complete gimmick and completely irrelevant. Like it affects the way townspeople interact with you, but you always have such a huge excess of money that it doesn't matter if they give you good prices or not. It's classic pot-smoking game "director" style design where they have a bunch of "cool ideas" but don't actually think about how they will be used in game balance and play dynamics.

It's ridiculously badly balanced. They use the classic "I'm a bad designer" RPG trick of making each step up the experience ladder like 10X more expensive than the last. This is a standard crutch and it just ruins RPGs, but it's the only way for designers to fix broken games at the last minute. Let me do an aside on this for a moment because it's such a general problem.

The classic problem with RPGs which makes them difficult to develop is that players have so much freedom in how they develop their character, what side quests they do, etc. that when they reach story point X in the game, two different players could have just massively different development. This makes it almost impossible to tweak for difficulty statically. There are a few good solutions : 1. dynamic difficulty based on player development, 2. just a massive amount of side quests so that if a player is insufficiently developed they are sent off to buff up in side quests. But most shitty developers use the "I'm a bad designer" solution - make each level of advancement so big that it makes all previous advancement irrelevant, and then give out huge gold/exp rewards along the main quest. This means that player A who did a bunch of side quests and highly developed his character is only slightly more powerful than player B who was a moron at choosing upgrades. You can tell that a designer is doing this to you when the next step up of a weapon is like 2X as good as the previous weapon (eg. you go from 10 damage to 20 damage, as opposed to like 10 to 12), also when the exp for a level is just massive, which makes it irrelevant whether you got the exp from side quests - eg. in the beginning of the game you're fighting mobs that give you 50 exp, then suddenly you start meeting guys who give you 500. It makes it pretty irrelevant whether you killed a bunch of extra early mobs or not. It's a way of wiping out the past, and as soon as you detect this in an RPG you know that you can just stop doing any side quests or worrying about your purchases.

The spells are terribly designed; the vast majority of them are completely identical (all the damage spells), but then Raise Dead is absurdly out of balance. All you need is level 1 raise dead and you can beat the game easily with no other magic.

The main quest is really short, I guess there are a lot of side quests but they just feel so pointless because the rewards are unnecessary. In a proper RPG the main quest should feel really difficult if you don't do any side quests, and the side quests should feel like they give you something useful that makes the main quest easier.

Generally it's just way way too easy. You can beat the game by just mashing X for all the combat, so there's no need to ever bother to learn any of the combo moves or strategy.

The controls are also just horribly broken; obviously the spell selector is just super rubbish, but the worst sin is that they massively miss button presses. This was the game that pushed me over the edge and made me write the game controls rant. I was constantly holding B to cast spells or holding A to run, and my guy's not actually doing it, and I'm like "friggling fracking WTF" and have to let go of the button and press it again to make it register.

10/27/2010

10-27-10 - Image Comparison - JPEG-XR

I'm testing JPEG-XR (aka HD Photo, aka PTC, aka the WPF or WIC codec) using the maximum-size chunk width option and with the optional arithmetic coder (instead of the faster RLE/Rice type coder), both of which are supposed to give me maximum quality/compression. I let it do its color conversion (I believe they use the lossless YCoCg).

On the PDI 1200 image :

On the plus side, the RMSE chart reveals that their color conversion & related code is not just plain broken the way it is in so many coders.

On the minus side, the SSIM SCIELAB score is just *abysmal* at the crucial area of logbpp = 0 , even worse than Hipix, and in fact much worse than good old JPEG-Huff. The bad perceptual score is confirmed by personal evaluation - JPEG-XR seems to be just the worst of both worlds, it is *both* ringy like JPEG and blurry like JPEG-2000, plus it has a new type of artifact, a kind of blobby ringing (if you are familiar with lapped transforms you have seen these before).

It's just by far the worst thing I've seen yet and it's not even close. Here's the image : PDI JPEG-XR 116,818 bytes

The only thing going for JPEG-XR is that they standardized adding alpha channels, color spaces and HDR. Of course as others have pointed out you could have just added that to JPEG.

BTW the JPEG XR Wikipedia Page reads like an advertisement and should be edited. In particular "JPEG XR file format supports higher compression ratios in comparison to JPEG for encoding an image with equivalent quality." is manifestly not true.

10/26/2010

10-26-10 - Image Comparison - Hipix vs PDI

If you actually want to make a fair test vs. JPEG :

1. Don't test JPEG at quality levels below 40, it doesn't work.

2. Don't compare to "JPEG" when you use a bad encoder and use basic huff-only. At least use IJG and use -optimize and -progressive.

3. Don't test on a 2400 line image when you will be viewing it on a 1200 line monitor. Always view your tests at native pixel res, that's what the compressors are designed for.

So anyway, I found a "PDI-Target" image that's 2297 x 3600 in a jpeg ( here among other places ). I scaled it to 766x1200 so it would fit on my monitor. I ran Hipix "Good" to set the target file size - it used 117050 bytes, which is 1.019 bpp , a reasonable target for compression. Then I ran JPEG and Kakadu to try to get the same file sizes.

Here are the images and you can look at them with your own eyes :

PDI 766 x 1200 original
JPEG Arith , 116818 bytes
JPEG PackJPG , 116457 bytes
Kakadu JPEG 2000 , 117120 bytes
Hipix , 117050 bytes

Note : do NOT zoom in when comparing them! Also, it's easier to see differences in A/B toggle tests, so before you post a comment, download the above and do A/B toggle testing with something like ACDSee in full screen mode please.

(BTW I'm using PackJPG now instead of PAQ for the "modern entropy backend of JPEG" ; PackJPG is very good, it's fast, and it also doesn't have the bug that PAQ has for some small size files; it usually compresses slightly larger than PAQ, but pretty close; also I've switched JPEG-Huff to progressive as it helps slightly (doesn't help JPEG-ari or JPEG-pack))

My subjective conclusion :

Overall I think Hipix is the worst quality. It's the only one that badly screws up parts of the faces, messes up some large-scale DC, and just generally loses tons of detail. JPEG preserves detail and sharpness way better than any other, and is only really bad in one way - ringing artifacts, it has lots of ringing artifacts. Kakadu is amazingly free of ringing (some early JPEG2000 coders suffered from bad ringing), but it also just blurs the hell out of the image. If you look at the Kakadu output without comparing to the original, it looks pretty nice and artifact free, but when compared to the original it looks just like a gaussian blur has been run on the whole image.

Basically Kakadu has the least visually annoying "artifacts" , but at the cost of pretty severe blurring and a general chunky blobby look everywhere. JPEG is great except for ringing artifacts. Hipix is somewhere between the two (it both blurs and rings) but is just not a good middle ground, it's worse than an average of the two.

Some portions. The order in these images is :

[ original , hipix, kakadu, jpeg pack, jpeg ari ]

Fruit :

Bad chunkiness in the hipix image. Kakadu also has nasty edges on the apple. JPEG looks like the winner, despite some ringing around the lemon.

Leather satchel thingy :

Hipix and Kakadu both completely destroy the detail of the leather texture, lots of blurring.

Black baby's hair :

This one might be the clearest win for JPEG. Excellent preservation of the detail in both JPEGs. Kakadu and Hipix both blur the bang wisps to hell. Hipix also creates a bad overall change of color and brightness; this is easiest to see by toggling the original vs the hipix version.

Sunflower :

Note how blurry Kakadu is, especially the nasty chunky blurs on the lower stem area and the curve of the leaf on the right. Some bad ringing in JPEG around the stem and leaves.

Gear and circuit board :

Hipix and Kakadu again just toss out detail on the gear like crazy. Kakadu blurs the circuit board to all hell. The JPEGs actually add detail to the circuit board that shouldn't be there by ringing ;)

Hand and corn :

Hipix stands out here by completely screwing up the back of the hand, throwing away all detail and changing overall luma and adding weird chunkies. The JPEGs as usual do great with detail, the back of the hand is best on the JPEGS, but the lower edge and the fingers show some bad ringing.

CDs :

Again Hipix stands out as the only one that makes the rainbow patterns all chunky. Kakadu does okay on the interior rainbows but ruins the edges of the left CD with blurry chunks. The JPEG does well except for some ringing on the inside circular edge of the CD.

Robots :

JPEG ringing is really bad on these, notice the black disease all over the robots body, and chroma distortion on the robot's left hand. Hipix makes the diagonal edge in the lower left all chunky and has a little ringing. Kakadu is probably best here.

Color Boxes :

JPEG is the only one that does really badly on these, creating ringing ghosts in the colors from the black bars. Hipix does very well on this type of "graphic arts" material (just as WebP does BTW), so if you are doing graphic-design type images it might be a win there (though I'm guessing x264 probably does that better, or you know, you could just use PNG). ( Color boxes shows [ original , hipix, kakadu, jpeg-xr, jpeg pack, jpeg ari ] )

Some charts for good measure :

Kakadu is by far the best numeric performer. Its one big fault is making everything blurry. Since our perceptual metric so far does not have any measure of detail preservation, Kakadu gets away with it (SSIM doesn't do much for us here).

You can really see the way JPEG works from these test sets. If you take any of them and zoom up a lot, the JPEG just looks horrible. But at correct pixel size, they look great. This is because JPEG is intentionally allowing errors that are just under the threshold of visibility.

In normal viewing conditions, JPEG is just great. One usage in which it is not great is for video game textures, because those often get sheared, zoomed, colored, etc. which ruins the JPEG perceptual model, which means they may have much larger visible artifacts than other compressors.

What are some valid complaints about JPEG ?

1. Yes there are a lot of bad encoders out there and the average JPEG that's out on the net is probably pretty far from optimal. In the WebP recompression project, you could easily replace that with just re-jpegging the JPEGs. (this includes people using grossly wrong quality settings, or not down-scaling images that will be shown very small on the web page).

2. It falls apart at very low quality. If for some reason you really need super low bit rates, JPEG is not for you. (However, the common test that people do of very large images at very low bit rates is not a valid test, nor is cranking down the quality to "see the difference").

3. JPEG needs to be viewed at native res with the original pixel intensities. The whole way it works is based on the human-optical model, so if your image will be stretched or shown in some weird way, JPEG is not for you.

4. It does create a lot of ringing. This is sort of an inherent trade off in signal processing - when you represent a signal with a truncated basis set, you can either get smoothing or ringing. JPEG is way towards the choice of ringing, not smoothing, it might be slightly more ideal to be able to get somewhere in between.

10/22/2010

10-22-10 - Some notes on Chroma Sampling

First some motivation before we dig into this :

Take a bmp, downsample/convert to YUV 420 (specifically YCbCr 601), upsample and convert back to RGB, measure RMSE. (yes I know RGB RMSE is not really what you want, more on this later).


Testing "my_soup" :

ffmpeg bmp -> y4m -> bmp , default options :
rmse : 5.7906 

ffmpeg -sws_flags +accurate_rnd+full_chroma_int+full_chroma_inp
rmse : 2.3052

my_y4m : 

RGB -> chroma then box :
rmse : 3.3310

box then RGB -> chroma :
rmse : 3.3310

box rgb FindBestY :
rmse : 3.1129

lsqr solved YCbCr for box :
rmse : 3.1129

box rgb (decoder spill) :
rmse : 3.0603

linear-down linear-up :
rmse : 2.7562

down-for-linear-up :
rmse : 2.0329

down-for-linear-up _RGB FindBestY :
rmse : 1.8951

solve lsqr for linear up :
//float solution rmse : 1.6250
rmse : 1.7400

Clearly there is a lot of win to be had from good chroma sampling. Let's talk a bit about what's going on.

(BTW the ffmpeg results don't directly compare to my_y4m, they're just there for reference and sanity check; for example I use symmetric-centered (JPEG style) 420 subsampling and I think they use offset MPEG-style 420 subsampling ; I'm also still not sure if they are in 16-235 or 0-255 , I am definitely in 0-255 ).

First of all just in terms of basic filtering, you might do your operation like this :


"encode" :

RGB -> floats
float RGB -> matrix multiply -> YUV
UV plane -> filters -> downsampled
quantize & clamp to [0,255] ints
transmit YUV 420

"decode" :

UV plane -> floats -> filters -> upsampled
YUV -> matrix multiply -> float RGB
quantize & clamp to [0,255] ints
write RGB bmp

So, you might think that's a reasonable way to go, and try various filters. I did experiments on this before (see this blog post and in particular the comment at the end). I found that fancy filters don't really help much, that the best thing is just doing bilinear reconstruction and a special optimized downsample filter that I call "down for bilinear up".

But once you think about it, there's no reason that we should follow this particular process for making our YUV. In particular, downsampling in chroma space does some very weird things that you might not expect. A lot of the weirdness comes from the fact that the RGB we care about is clamped in the range [0-255]. And our advantage is that we have Y at higher resolution.

So let's start looking at better ways to do chroma sampling.

Our first good link is this : Twibright Luminaplex and Hyperluma

His writing is very weird, so I'll paraphrase briefly. One of the problems with chroma subsampling is that it's not light linear. eg. averaging Cb and Cr does not produce a resulting color which is the average of what your eye will see. Instead of subsampling CbCr, you should instead solve for the YCbCr which produces the light-linear color that you would see for the average of those 2x2 chromas. The easiest way to do this is just to subsample CbCr in some way, and then instead of computing Y from the original RGB, you turn it into a solve to find the Y, given CbCr, that produces the best result.

The next good idea from the "Twibright" page is just to abandon the idea of computing the YCbCr from a matrix in general. We know what the decoder will do to upsample, so instead we just take the idea that our encoder should output the coefficients which will make the best result after upsampling. In the "Hyperluma" algorithms, he sets his goal as preserving constant actual luma ( luma is not "Y" , it's actual perceived brightness). Basically he does the chroma subsample using RGB, and then given the low-res chroma and the original high-res RGB, solves for the Y that gives you the correct luma.

The "Luminaplex" algorithm takes this one step further and does a brute force optimization. In the end with compression if you want absolute optimality it always comes down to this - wiggle the discrete values, see what the output is, and take the best. (we saw this with DXT1 as well).

Implementing these ideas gives me the "down-for-linear-up _RGB FindBestY" results above.
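The core of the FindBestY step is tiny. A sketch (BT.601 full-range constants, and brute force over the 256 discrete Y values for clarity; a real implementation would solve for Y directly) :

```python
def ycbcr_to_rgb(y, cb, cr):
    # BT.601 full-range (0-255 levels) conversion
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return r, g, b

def find_best_y(target_rgb, cb, cr):
    # chroma has already been subsampled and fixed; pick the discrete Y
    # that minimizes the decoder's output error vs the target RGB
    def err(y):
        clamp = lambda v: min(255.0, max(0.0, v))
        rgb = ycbcr_to_rgb(float(y), cb, cr)
        return sum((clamp(c) - t) ** 2 for c, t in zip(rgb, target_rgb))
    return min(range(256), key=err)
```

Note this directly minimizes RGB squared error; Hyperluma's luma-matching goal would just change the err() function.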

On the Twibright page he claims that he can implement Luminaplex for box upsampling and still get great results. I found that to not be true on most real images, you need at least bilinear upsampling. To solve for the optimal YCbCr for a given RGB target image, you have a large coupled linear system. That is, for each pixel, you have to consider the current CbCr and also the neighbors, and for those neighbors, you have to consider their neighbors. This is a sparse semi-diagonal linear system. In particular, you are solving a linear system like this :


find x to minimize | A x - b |

b = the target RGB values
 (b has 3 * # of pixels values)

x = the YCbCr values for the output
 w*h of Y, (w*h)/4 each of Cb and Cr

A = the YCbCr -> RGB matrix plus upsample filter
    3*w*h rows
    (3/2)*w*h columns

for each R and B row, there are 5 non-zero entries
  (1 for Y and 4 for Cr or Cb)
for each G row there are 9 non-zero entries
  (1 for Y and 4 for Cr and 4 for Cb)

you can solve this reasonably easily with a standard sparse matrix linear solver. Note that in this form we are directly solving for minimum RGB RMSE , but you can of course solve for other metrics (that are linear transformations of RGB anyway). The fours in the number of terms come from the four-tap box filter; bigger filters have more non-zero terms and so make the matrix much fatter in memory and slower to solve.
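For concreteness, here's a sketch of building one row of A for the bilinear-up decoder with JPEG-style centered chroma siting (the layout of x and the indexing conventions are my own assumptions; you'd hand rows like these to any sparse least-squares solver, eg. LSQR) :

```python
import math

def chroma_taps(p, n):
    # bilinear taps into the half-res chroma grid for full-res pixel p
    # (JPEG-style centered siting; taps clamp at the image edges)
    f = (p - 0.5) / 2.0
    i0 = int(math.floor(f))
    w1 = f - i0
    t0 = min(max(i0, 0), n - 1)
    t1 = min(max(i0 + 1, 0), n - 1)
    return [(t0, 1.0 - w1), (t1, w1)]

def r_row(px, py, w, h):
    # nonzero entries of the "R" row of A for pixel (px, py) :
    # R = Y + 1.402 * (bilinearly upsampled Cr)
    # x is laid out as [ Y (w*h) | Cb (w*h/4) | Cr (w*h/4) ]
    ny, cw = w * h, w // 2
    entries = {py * w + px: 1.0}               # the Y term
    for cy, wy in chroma_taps(py, h // 2):
        for cx, wx in chroma_taps(px, cw):
            col = ny + ny // 4 + cy * cw + cx  # index into the Cr block
            entries[col] = entries.get(col, 0.0) + 1.402 * wx * wy
    return entries

row = r_row(3, 3, 8, 8)
assert len(row) == 5   # 1 Y entry + 4 Cr entries, as counted above
```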

If you implement this you get "solve lsqr for linear up" , in the results above. Note that this has the problem mentioned in the last post. I actually want to solve for discrete and clamped YCbCr, but it's too hard, so I just solve the equation as if they are continuous, and then round to the nearest int and clamp to 0-255. To improve this, I actually re-find Y from Chroma after the round-and-clamp. The loss from going discrete is this bit :


//float solution rmse : 1.6250
rmse : 1.7400

I believe that this is as good as you can do for a decoder which operates in the simple linear way. That is, up to now we have assumed that we have a generic decoder, so we can use these techniques to make optimal baseline-compatible JPEGs or H264 videos or whatever. But there are things we can do on the decode side as well, so let's look at that.

The most interesting link I've seen is this page by Glenn Chan : Towards Better Chroma Subsampling

Glenn makes the realization that many of the YCbCr produced by normal subsampling are actually impossible. That is, they can't come from any RGB in [0-255]. When you see an impossible YCbCr, you know it was caused by downsampling, which means you know that some chroma was put on your pixel that should have been on a neighbor's pixel. For the moment let's pretend we have box sampling. In a 2x2 block we have some black pixels and some red pixels. When you box downsample you will get the average of the red chroma (call it Cr = 1) and the black chroma (Cr = 0). When you box upsample you will get Cr = 0.5 over all four pixels. But now you have a pixel with Y = 0 and Cr = 0.5 ; the only way to make a zero Y but with some red chroma would be for it to have negative G and/or B. So this must be a mistake - when we see Y = 0 and Cr = 0.5, we know that the chroma on this pixel must have "spilled" onto us from our neighbor incorrectly. To fix it, we just take our unwanted Cr and push it over to our neighbor, and we get a perfect result - the Y = 0 pixel gets Cr = 0 and is black, and the Cr = 0.5 red pixel gets pushed up to Cr = 1.

Glenn works out how much chroma a pixel can hold for a given Y. One way to think about this is to think of the RGB->YCbCr as a rotation (+ shear and scale, but you can think of it as rotation for our purposes). You've taken the RGB axial box and have put a new box around it in a rotated space. To completely cover the range of the original box, we have to use a much bigger box in this new space. The result is a large amount of empty space in the YCbCr box which did not come from the original RGB box. Handling this better is a general problem for all compressors with color conversions - we often code YCbCr as if they have full range, but in fact after we have seen Y we know that the range for CbCr might be much smaller.

There's another way of getting the same result, which is to use the fact that we know our Y is more reliable than our CbCr. That is, use your YCbCr to reproduce RGB. Now see if the RGB are all in [0-255] ; if they are you are fine. If not, you have to clamp them. Now recompute Y from RGB (something like 0.2R + 0.7G + 0.1B). Because of the clamping, this will now be wrong, eg. not match the transmitted Y. So what we are doing is ensuring that the Y of the output RGB is equal to the Y we transmitted. To achieve that, we adjust CbCr so that we are not clamping the RGB.
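One crude way to do that adjustment on a single pixel (this shrinks the chroma toward neutral until the RGB fits, which preserves the transmitted Y exactly; note that Glenn's actual method spills the excess chroma onto neighbors instead) :

```python
def decode_preserving_y(y, cb, cr):
    # BT.601 full-range decode, but instead of clamping RGB (which changes
    # the luma), bisect a chroma scale t in [0,1] so the RGB stays legal
    def convert(t):
        cb_ = 128.0 + t * (cb - 128.0)
        cr_ = 128.0 + t * (cr - 128.0)
        r = y + 1.402 * (cr_ - 128.0)
        g = y - 0.344136 * (cb_ - 128.0) - 0.714136 * (cr_ - 128.0)
        b = y + 1.772 * (cb_ - 128.0)
        return r, g, b
    legal = lambda rgb: all(-0.5 <= c <= 255.5 for c in rgb)
    if legal(convert(1.0)):
        return convert(1.0)       # in range: plain conversion, nothing to fix
    lo, hi = 0.0, 1.0
    for _ in range(30):
        t = 0.5 * (lo + hi)
        lo, hi = (t, hi) if legal(convert(t)) else (lo, t)
    return convert(lo)

# Y = 0 with red chroma is impossible; the result stays black (luma ~ 0)
r, g, b = decode_preserving_y(0.0, 128.0, 192.0)
assert abs(0.299 * r + 0.587 * g + 0.114 * b) < 0.01
```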

On some very bad cases, the win from the "spill" decoder is massive :


on "redlines" :
alternating vertical black and red lines :

ffmpeg default :
rmse : 144.9034

ffmpeg -sws_flags +accurate_rnd+full_chroma_int+full_chroma_inp
rmse : 94.7088

my_y4m box filter :
rmse : 101.9621

my_y4m bilinear :
rmse : 101.3658

my_y4m box spill :
rmse : 7.5732

WOW !

The main limitation of Glenn's method is that it only helps when you are pushing pixels into illegal values. eg. the black next to red example above was helped enormously, but if it was instead grey next to red, then no illegal value would have been made and we would have done nothing. (eg. on my_soup it only gave us 3.1129 -> 3.0603)

The other problem with Glenn's method is that it is rather slow in the decoder, too slow for something like video (but certainly okay for a JPEG loader in Photoshop or something like that).

There are some simplifications/approximations of Glenn's method which would probably be viable in video.

One is to compute an approximate "chroma capacity" based on Y, and then for each 2x2 box upsample, instead of putting the chroma onto each pixel with equal weight, you do it weighted by the chroma capacity. Chroma capacity is a triangle function of Y, something like min( Y, 255-Y ). So a 2x2 box upsample adjusted for capacity is just :


given subsampled chroma Cr & Cb
and non-subsampled Y's (Y_0,1,2,3)

unadjusted box upsample is just :
Cr_n = Cr
Cb_n = Cb

adjusted box upsample is :

CC_n = min( Y_n, 255 - Y_n )
CC_n *= 4 / ( CC_0 + CC_1 + CC_2 + CC_3 )

Cr_n = CC_n * Cr
Cb_n = CC_n * Cb

(this is similar to his "proportion" method but much simplified). On the synthetic case of the red and black quad, this produces the same results as the more expensive method. On real images it's slightly worse.

Another approach to accelerating Glenn's method would be to just go ahead and do the YCbCr->RGB on each pixel, and then when you do the clamp (which you must do anyway), use that branch to spill to neighbors, and compute the spill amount directly from how far your RGB is pushed out of [0..255] , eg. if B is -4 , then (-4 / Cb_to_B) worth of Cb goes onto the neighbor.

I've only implemented the "spill" method for box sampling, but you can do it for any type of upsampling filter. It's a little awkward though, as you have to implement your upsampler in a sort of backwards way; rather than iterating over the high res pixels, sampling from the subsampled CbCr plane with some filter, and accumulating into each output pixel only once, you instead need to iterate over the low res subsampled CbCr, create the filter output, and add it into various target pixels.

There's one final method to look at which we've not implemented yet. Glenn's method is of course a form of "luma aided chroma upsample", but it's not what I'm usually referring to when I say that. What we usually mean is using the luma edge information to select the chroma upfilter. That is, rather than always doing bilinear chroma upsample or bicubic or sinc or whatever, you do a decision tree on the luma pixels and choose one of several filters which have various shapes. This is actually a variant of the old "super resolution" problem. We have a low res signal and wish to make the best possible high res output, possibly with the help of some side information. In this case we have the high res Y plane as side information which we believe is well correlated; in many of the "super resolution" papers in the literature what they have is previous video frames, but the techniques are quite similar. I've seen some papers on this but no implementation (though I hear some TVs actually have a hacky version of this called "warpscan" or something like that).

Training filters for various edge scenarios is relatively straightforward, so I might do that in the next few days.

BTW one scary issue for all these fancy filters is if you don't control the decoder or know its exact spec. In particular, TVs and such are doing fancy chroma upsamplings of their own now, which means your manipulations could make things worse.

Also something I have observed with chroma sampling is that minimizing any simple analytic metric like RMSE can lead to ringing. This is a very generic issue in lossy compression (it's roughly the same reason that you get ringing in wavelets). In order to reduce a very large single pixel error, the encoder will apply something like a sinc filter shape to that pixel. That might cut the error at that pixel from 100 to 50, which is a huge win, but it also adds a low magnitude ringing around that pixel (maybe magnitude 5 or so). In RMSE terms this is a big win, but visually it's very bad to create that ringing around the single bad pixel, better to just leave its big error alone. (nonlinear error metrics might help this, but nonlinear error metrics don't lead to simple to solve linear matrix equations)

The best links :

Twibright Luminaplex and Hyperluma
Towards Better Chroma Subsampling

Some good links related to chroma sampling and color :

psx h4x0rz in teh wired YCbCr to RGB Conversion Showdown
psx h4x0rz in teh wired Immaculate Decoding
Marty Reddy Color FAQ
Chroma Sampling An Investigation
hometheaterhifi - chroma upsampling error
COMPSCI708S1T CBIR - Colour Features
CiteSeerX - Optimal image scaling using pixel classification

And some really unrelated links that I just happened to have found in the last few days :

LIVE lab on image quality
Live Chart Playground - Google Chart Tools Image Charts (aka Chart API) - Google Code
JPEG Post Processing
Shape-Adaptive Transforms Filtering Pointwise SA-DCT algorithms
Perl jpegrescan - Dark Shikari - Pastebin.com

10/20/2010

10-20-10 - Discrete Math is the Bane of Computer Science

We are constantly faced with these horrible discrete optimization problems.

The specific example I've been working on today is YCbCr/YUV optimization. Say you know the reconstruction matrix (YUV->RGB) and you want to find the optimal YUV coefficients for some given RGB ?

Well the standard answer is you invert the reconstruction matrix and then multiply that by your RGB. That is wrong!

In general what we are trying to solve here is a problem something like this :


13 * x + 17 * y = 200;
11 * x +  7 * y = 100;

find x & y to minimize the error
x & y are integers
and bounded in [-100,100]

I guess these types of problems are "integer programming" (this particular case is integer linear programming since my example is linear, but sometimes we have to do it on nonlinear problems), and integer programming is NP-hard, so there is no general exact method short of brute force.

(more generally, minimize error E(x,y,..) subject to various linear constraints)

The usual approach is just to wave your hands around and pretend that your variables are actually continuous, work out the math for the solution as if they were, and then round to ints at the end.

The problem is that that can be arbitrarily far from the optimal solution. Obviously in some simple problems you can bound the error of using this approach, or you may even know that this is the minimum error, but in nasty problems it might be way off.

We saw this same issue before in DXT1

I've never really seen many pages or much reference on this kind of problem. I did at one point decide that I was sick of the approximate hacky solutions and tried to find some software package to do this for real and came up empty on real usable ones.

There are a few hacky tricks you can use :

1. Find the continuous solution, then for each float value try clamping it to various integers nearby. See which of these rounds is best. The problem is for an N variable problem this takes 2^N tries if all you try is floor and ceil, and really you'd like to try going +1 and -1 beyond that a few times.

2. You can always remove one linear variable. That is, if you can reduce the problem to just something like a x = b, then the optimal integer solution is just round(b/a). What that means is if you use some other approach for all but one variable (such as brute force search), then you can solve the last variable trivially.

3. Often the importance of the various variables is not the same, their coefficients in the error term may vary quite a bit. So you may be able to brute force search on just the term with the largest contribution to the error, and use a simpler/approximate method on the other terms.

10/18/2010

10-18-10 - Frustum and RadiusInDirection

Ryg has a note on Zeux' Frustum culling series , both are good reads.

The thing that struck me when reading it is that you can get almost directly to the fastest possible code from a very generic C++ framework.

The way I've done frustum culling for a while is like this :


template < class t_Volume >
Cull::EResult Frustum::Cull(const t_Volume & vol) const
{   
    Cull::EResult eRes = Cull::eIn;

    for(int i=0;i < m_count;i++)
    {
        const Plane::ESide eSide = PlaneSide(vol,m_planes[i]);
        
        if ( eSide == Plane::eBack )
            return Cull::eOut;
        else if ( eSide == Plane::eIntersecting )
            eRes = Cull::eCrossing;
        // else Front, leave eRes alone
    }

    return eRes;
}

now any primitive class which implements "PlaneSide" can be culled. (Cull is trinary - it returns all in, all out, or crossing, similarly PlaneSide is trinary, it returns front or back or crossing).

Furthermore, PlaneSide can be overridden for classes that have their own idea of how it should be done, but almost always you can just use the generic PlaneSide :


template < class t_Volume >
Plane::ESide PlaneSide(const t_Volume & vol, const Plane & plane) const
{
    const float dist = plane.DistanceToPoint( vol.GetCenter() );
    const float radius = vol.GetRadiusInDirection( plane.GetNormal() );

    if ( dist > radius )
        return Plane::eFront;
    else if ( dist < - radius )
        return Plane::eBack;
    else
        return Plane::eIntersecting;
}

For a volume to be generically testable against a plane it has to implement GetCenter() and GetRadiusInDirection().

GetRadiusInDirection(dir) tells you the half-width of the span of the volume along direction "dir". The neat thing is that GetRadiusInDirection turns out to be a pretty simple and very useful function for most volumes.

Obviously the implementation for Sphere is the simplest, because RadiusInDirection is the same in all directions :


Sphere::GetCenter() { return m_center; }
Sphere::GetRadiusInDirection() { return m_radius; }

for an AxialBox, if you store your box as {m_center, m_halfExtent} so that min & max are { m_center - m_halfExtent, m_center + m_halfExtent } then the implementation is :

AxialBox::GetCenter() { return m_center; }
inline float AxialBox::GetRadiusInDirection(const Vec3 & dir) const
{
    return 
         fabsf(dir.x) * m_halfExtent.x +
         fabsf(dir.y) * m_halfExtent.y +
         fabsf(dir.z) * m_halfExtent.z;
}

if you now compile our test, everything plugs through - you have Ryg's method 4c. (without the precomputation of absPlane however).

Of course because we are generic and geometric, it's obvious how to extend to more primitives. For example we can make our Frustum work on oriented bounding boxes just like this :


float OrientedBox::GetRadiusInDirection(const Vec3 & dir) const
{
    const float radius = 
        fabsf((dir * m_axes.GetRowX()) * m_radii.x) +
        fabsf((dir * m_axes.GetRowY()) * m_radii.y) +
        fabsf((dir * m_axes.GetRowZ()) * m_radii.z);

    return radius;
}

(I store my OrientedBox as an m_center, an orthonormal rotation matrix m_axes, and the extent of each of the 3 axes in m_radii ; obviously you could speed up this query by storing the axes scaled by their length, but that makes some of the other operations more difficult so YMMV).

Similarly tests against cylinders and lozenges and k-Dops or convex hulls or what-have-you are pretty straightforward.

We can use GetRadiusInDirection for other things too. Say you want to find the AxialBox that wraps the old AxialBox but in a new rotated orientation ? Well with GetRadiusInDirection it's very obvious how to do it - you just take the new basis axes and query for the slab spans along them :


AxialBox AxialBox::Rotate( const Matrix & mat ) const
{
    AxialBox ret;
    ret.m_center = m_center;

    ret.m_halfExtent.x = GetRadiusInDirection( mat.GetColumnX() );
    ret.m_halfExtent.y = GetRadiusInDirection( mat.GetColumnY() );
    ret.m_halfExtent.z = GetRadiusInDirection( mat.GetColumnZ() );

    return ret;
}

And you will find that this is exactly the same as what Zeux works out for AABB rotation . But since we are doing this with only GetCenter() and GetRadiusInDirection() calls - it's obvious that we can use this to make an AxialBox around *any* volume :

template < class t_Volume >
AxialBox MakeAxialBox( const t_Volume & vol, const Matrix & mat )
{
    AxialBox ret;
    ret.m_center = vol.GetCenter();

    ret.m_halfExtent.x = vol.GetRadiusInDirection( mat.GetColumnX() );
    ret.m_halfExtent.y = vol.GetRadiusInDirection( mat.GetColumnY() );
    ret.m_halfExtent.z = vol.GetRadiusInDirection( mat.GetColumnZ() );

    return ret;
}

The nice thing about generic programming is it gives you a set of interfaces which provide a contract for geometric volumes, and then anything that implements them can be used in certain functions. You wind up doing this kind of thing where you write a routine just for AxialBox rotations, but then you see "hey I'm only using calls that are in the generic Volume spec so this is general".

Now I'm not claiming by any means that you can make C++ generic templates and they will be competitive with hand-tweaked code. For example in the case of AABB vs. Frustum you probably want to precompute absPlane , and you probably want to special case to a 5-plane frustum and unroll it (or SIMD it). Obviously when you want maximum speed, you want to look at the assembly after the C++ compiler has had its turn and make sure it's got it right, and you may still want to SIMD and whatever else you might want to do.

But as in all optimization, you want to start your assembly work from the right algorithm, and often that is the one that is most "natural". Interestingly the interfaces which are most natural for generic programming are also often ones that lead to fast code (this isn't always the case, just like the true physics equations aren't always beautiful, but it's a good starting point anyway).

BTW a few more notes on frustum culling. Frustum culling is actually a bit subtle, in that how much work you should do on it depends on how expensive the object you are culling is to render. If the object takes only 0.0001 ms to render, you shouldn't spend much time culling it. In that case you should use a simpler approximate test - for example a Cone vs. Sphere test. Or maybe no test at all - combine it into a group of objects and test the whole group with one cull. If an object is very expensive to render (like maybe it's a skinned human with very complex shaders that takes 1 ms to render), then you want to cull it very accurately indeed - in fact you may want to test against an OBB or a convex hull of the object instead of just an AABB.

Another funny thing about frustum culling in game usage is that the planes are not all equal. That is, our spaces are not isotropic. We usually have lots of geometry in the XY plane and not much above or below you. You need to take advantage of that. For example using an initial hierarchy in XY can help. Or if you are using a non-SIMD cull with early outs, order your plane tests to maximize speed (eg. the top and bottom planes of the frustum should be the last ones checked as they are least important).

10-18-10 - How to make a Perceptual Database

Well, I'm vaguely considering making my own perceptual test database since you know it's been like 20 years since it was obvious we needed this and nobody's done it. I'm going to brainstorm randomly about how it should be.

EDIT : hmm whaddaya know, I found one.

You gather something like 20 high quality images of different characteristics. The number of base images you should use depends on how many tests you can run - if you won't get a lot of tests don't use too many base images.

Create a bunch of distorted images in various ways. For each base image, you want to make something like 100 of these. You want distortions at something like 8 gross quality levels (something like "bit rates" 0.125 - 2.0 in log scale), and then a variety of distortions that look different at each quality level.

How you make these exactly is a question. You could of course run various compressors to make them, but that has some weird bias built in as testers are familiar with those compressors and their artifacts, so may have predispositions about how they view them. It might be valuable to also make some synthetically distorted images. Another idea would be to run images through multiple compressors. You could use some old/obscure compressors like fractals or VQ. One nice general way to make distortions is to fiddle with the coefficients in the transformed space, or to use transforms with synthesis filters that don't match the analysis (non-inverting transforms).

The test image resolution should be small enough that you can display two side by side without scaling. I propose that a good choice is 960 x 1080 , since then you can fit two side by side at 1920x1080 , which I believe is common enough that you can get a decent sample size. 960 divides by 64 evenly but 1080 is actually kind of gross (it only divides up to 8), so 960x1024 might be better, or 960x960. That is annoyingly small for a modern image test, but I don't see a way around that.

There are a few different types of test possible :

Distorted pair testing :

The most basic test would be to show two distorted images side by side and say "which looks better". Simply have the user click one or the other, then show another pair. This lets testers go through a lot of images very quickly which will make the test data set larger. Obviously you randomize which images you show on the left or right.

To pick two distorted images to show which are useful to test against, you would choose two images which are roughly the same quality under some analytic metric such as MS-SSIM-SCIELAB. This maximizes the amount of information you are getting out of each test, because when you put up images where one is obviously better than another you aren't learning anything (* - but this is a useful way to test the user for sanity : occasionally put up some image pairs that are purely randomly chosen, that way you get some comparisons where you know the answer and can test the viewer).

Single image no reference testing :

You just show a single distorted image and ask the viewer to rate its "quality" on a scale of 0-10. The original image is not shown.

Image toggle testing :

The distorted and original image are shown on top of each other and toggled automatically at N second intervals. The user rates it on a scale of 0-10.

Double image toggle testing :

Two different distorted images are chosen as in "distorted pair testing". Both are toggled against the original image. The user selects which one is better.

When somebody does the test, you want to record their IP or something so that you make sure the same person isn't doing it too many times, and to be able to associate all the numbers with one identity so that you can throw them out if they seem unreliable.

It seems like this should be easy to set up with the most minimal bit of web programming. You have to be able to host a lot of bandwidth because the images have to be uncompressed (PNG), and you have to provide the full set for download so people can learn from it.

Obviously once you have this data you try to make a synthetic measure that reproduces it. The binary "this is better than this" tests are easier to deal with than the numeric (0-10) ones - you can directly test against them. With the numeric tests you have to control for the bias of the rating on each image and the bias from each user (this is a lot like the Netflix Prize actually, you can see also papers on that).

10/16/2010

10-16-10 - Image Comparison Part 11 - Some Notes on the Tests

Let's stop for a second and talk about the tests we've been running. First of all, as noted in a comment, these are not the final tests. This is the preliminary round for me to work out my pipeline and make sure I'm running the compressors right, and to reduce the competitors to a smaller set. I'll run a final round on a large set of images and post a graph for each image.

What is this MS-SSIM-SCIELAB exactly? Is it a good perceptual metric?

SCIELAB (see 1 and 2 ) is a color transform that is "perceptually uniform" , eg. a delta of 1 has the same human visual importance at all locations in the color cube, and is also spatially filtered to account for the difference in human chroma spatial resolution. The filter is done in the opponent-color domain, and basically the luma gets a sharp filter and the chroma gets a wide filter. Follow the blog posts for more details.

SCIELAB is pretty good at accounting for one particular perceptual factor of error measurement - the difference in importance of luma vs. chroma and the difference in spatial resolution of luma vs. chroma. But it doesn't account for lots of other perceptual factors. Because of that, I recognize that using SCIELAB is somewhat distorting in the test results. What it does is give a bonus to compressors that are perceptually tuned for this particular issue, and doesn't care about other issues. More on this later. (*1) (ADDENDUM : of course using SCIELAB also gives an advantage to people who use a colorspace similar to LAB, such as old JPEG YUV, and penalizes people who use YCoCg which is not so close. Whether or not you consider this to be fair depends on how accurate you believe LAB to be as an approximation of how the eye sees).

MS-SSIM is multi-scale SSIM and I use it on the SCIELAB data. There are a few issues about this which are problematic.

How do you do SSIM for multi-component data?

It's not that clear. You can just do the SSIM on each component and then combine - probably with an arithmetic mean, but if you look at the SSIM a bit you might get other ideas. The basic factor in SSIM is a term like :


ss = 2 * x*y / (x*x + y*y )

If x and y are the same, this is 1.0 , the more different they are, the smaller it is. In fact this is a normalized dot product or a "triangle metric". That is :

ss = ( (x+y)^2 - x*x - y*y ) / (x*x + y*y )

if you pretend x and y are the two short edges of a triangle, this is the squared length of the long edge between them minus the squared length of each short edge, over the normalizer.

Now, this has just been scalar, but to go to multi-component SSIM you could easily imagine the right thing to do is to go to vector ops :


ss = 2 * Dot(x,y) / ( LenSqr(x) + LenSqr(y) )

That might in fact be a good way to do n-component SSIM, it's impossible to say without doing human rating tests to see if it's better or not.

Now while we're at it let's make a note on this basic piece of the SSIM.


ss = 2 * x*y / (x*x + y*y ) = 1 - (x-y)^2 /  ( x*x + y*y )

we can see that all it has done is take the normal L2 MSE term and scale it by the inverse magnitude of the values. The reason they do this is to make SSIM "scale independent" , that is if you replace x and y with sx and sy you get the same number out for SSIM. But in fact what it does is make errors in low values much more important than errors in high values.


ssim :

    delta from 1->3

    2*1*3 / ( 1*1 + 3*3 ) = 6 / 10 = 0.6

    delta from 250->252

    2*250*252 / ( 250*250 + 252*252 ) = 0.999968

Now certainly it is true that a delta of 2 at value 1 is more important than a delta of 2 at value 250 (so called "luma masking") - but is it really *this* much more important? In terms of error (1 - ssim), the difference is 0.40 vs. 0.000032 , or 1250000 % greater. I think not. My conjecture is that this scale-independence aspect of SSIM is wrong, it's over-counting low value errors vs. high-value errors. (ADDENDUM : I should note that real SSIM implementations have a hacky constant term added to the numerator and denominator which reduce this exaggeration)

As usual in the SSIM papers they show that SSIM is better at detecting true visual quality than a straw man opponent - pure RMSE. But what is in SSIM? It's plain old MSE with a scaling to make low values count more, plus a local "detail" detection term in the form of block sdevs. So if you're going to make a fair comparison, you should test against something similar. You could easily do RMSE on scaled values (perhaps log-scale values, or simple sqrt-values) to make low value errors count more, and you could easily add a detail preservation term by measuring local activity and adding an RMSE-like term for that.

What's the point of even looking at the RMSE numbers? If we just care about perceptual quality, why not just post that?

Well, a few reasons. One, as noted previously, we don't completely trust our perceptual metric, so having the RMSE numbers provide a bit of a sanity check fallback for that. For another, it lets us sort of check on the perceptual tuning of the compressor. For example if we find something that does very well on RGB-RMSE but badly on the perceptual metric, that tells us that it has not been perceptually tuned; it might actually be an okay compressor if it has good RMSE results. Having multiple metrics and multiple bit rates and multiple test images sort of let you peer into the function of the compressor a bit.

What's the point of this whole process? Well there are a few purposes for me.

One is to work out a simple reproducible pipeline in which I can test my own compressors and get a better idea of whether they are reasonably competitive. You can't just compare against other people's published results because so many of the tests are done on bad data, or without enough details to be reproducible. I'd also like to find a more perceptual metric that I can use.

Another reason is for me to actually test a lot of the claims that people bandy about without much support, like is H264-Intra really a very good still image coder? Is JPEG2000 really a lame duck that's not worth bothering with? Is JPEG woefully old and easy to beat? The answers to those and other questions are not so clear to me.

Finally, hopefully I will set up an easy to reproduce test method so that anyone at home can make these results, and then hopefully we will see other people around the web doing more responsible testing. Not bloody likely, I know, but you have to try.

(*1) : you can see this for example in the x264 -stillimage results, where they are targeting "perceptual error" in a way that I don't measure. There may be compressors for example which are successfully targeting some types of perceptual error and not targeting the color-perception issue, and I am unfairly showing them to be very poor.

However, just because this perceptual metric is not perfect doesn't mean we should just give up and use RMSE. You have to use the best thing you have available at the time.

Generally there are two classes of perceptual error which I am going to just brazenly coin terms for right now : ocular and cognitive.

The old JPEG / SCIELAB / DCTune type perceptual error optimization is pretty much all ocular. That is, they are involved in studying the spatial resolution of rods vs. cones, the optic nerve signal masking of high contrast impulses, the thresholds of visibility of various DCT shapes, etc. It's sort of a raw measure of how the optical signal gets to the brain.

These days we are more interested in the "cognitive" issues. This is more about things like "this looks smudgey" or "there's ringing here" or "this straight line became non-straight" or "this human face got scrambled". It's more about the things that the active brain focuses on and notices in an image. If you have a good model for cognitive perception, you can actually make an image that is really screwed up in an absolute "ocular" sense, but the brain will still say "looks good".

The nice thing about the ocular perceptual optimization is that we can define it exactly and go study it and come up with a bunch of numbers. The cognitive stuff is way more fuzzy and hard to study and put into specific metrics and values.

Some not very related random links :

Perceptual Image Difference Utility
New cjpeg features
NASA Vision Group - Publications
JPEG 2000 Image Codecs Comparison
IJG swings again, and misses Hardwarebug
How-To Extract images from a video file using FFmpeg - Stream #0
Goldfishy Comparison WebP, JPEG and JPEG XR
DCTune 2.0 README

A brief note on this WebP vs. JPEG test :

Real world analysis of google's webp versus jpg English Hard

First of all he uses a broken JPEG compressor (which is then later fixed). Second he's showing these huge images way scaled down, you have to dig around for a link to find them in their native sizes. He's using old JPEG-Huff without a post-unblock ; okay, that's fine if you want to compare against ancient JPEG, but you could easily test against JPEG-Arith with unblocking. But the real problem is the file sizes he compresses to. They're around 0.10 - 0.15 bpp ; to get the JPEGs down to that size he had to set "quality" to something like 15. That is way outside of the functional range of JPEG. It's abusing the format - the images are actually huge, then compressed down to a tiny number of bits, and then scaled down to display.

Despite that, it does demonstrate a case where WebP is definitely significantly better - smooth gradients with occasional edges. If WebP is competitive with JPEG on photographs but beats it solidly on digital images, that is a reasonable argument for its superiority.

10-16-10 - Image Comparison Part 9 - Kakadu JPEG2000

Kakadu JPEG2000 (v6.4) can be tuned for visual quality or for MSE (-no_weights) , so we run both :

my_soup :

Performance in general is excellent, and we can see that they did a good job with their visual tuning (according to this metric anyway). KakaduMSE is slightly worse than jpeg_paq through the [-1,1] zone, but the visually tuned one is significantly better.

moses :

Moses is one of those difficult "noisy / texturey" types of images (like "barb") that people historically say are bad for wavelets, and indeed that seems to be the case. While Kakadu still stomps on JPEG, it's not by nearly as much as on my_soup.

The old MSU test says that ACDSee and Lurawave are better than Kakadu (v4.5) so maybe I'll try those, but they're both annoyingly commercial.

10-16-10 - Image Comparison Part 10 - x264 Retry

Well I've had a little bit more success.

I still can't get x264 to do full range successfully; or maybe I did, but then I can't figure out how to make the decoder respect it.

I think the thing to do is make an AVISynth script containing something like :


AviSource("r:\my_soup.avi")
ConvertToYV12( matrix="pc.709")

The pc YV12's are supposed to be the full 0-255 ones, and then on the x264 command line you also do "--fullrange on --colormatrix bt709" , which are just info tags put into the stream, which theoretically the decoder is supposed to see so that it can do the inverse colorspace transform correctly, but that doesn't seem to be working. Sigh!

One difficulty I have is that a lot of programs don't handle these one frame videos right. MPlayer refuses to extract any frames from it, Media Player Classic won't show me the one frame. FFmpeg does succeed in outputting the one frame, so it's what I'm using to decode right now.

Anyway these are the some of the links that don't actually provide an answer :

Convert - Avisynth
Re FFmpeg-user - ffmpeg & final cut pro best format question
new x264 feature VUI parameters - Doom9's Forum
MPlayer(1) manual page
Mark's video filters
libav-user - Conversion yuvj420P to RGB24
H.264AVC intra coding and JPEG 2000 comparison
H.264 I-frames for still images [Archive] - Doom9's Forum
FFmpeg-user - ffmpeg & final cut pro best format question
FFmpeg-user - Converting high-quality raw video to high-quality DVD video
FFmpeg-devel - MJPG decoder picture quality
FFmpeg libswscaleoptions.c Source File
YCbCr - Wikipedia, the free encyclopedia

log rmse :

ms-ssim-scielab :

There's still a large constant error you can see in the RMSE graph that I believe is due to the [16-235] problem.

It should not be surprising that --tune stillimage actually hurts in both our measures, because it is tuning for "psy" quality in ways that we don't measure here. In theory it is actually the best looking of the three.

NOTE : This is counting the sizes of the .x264 output including all headers, which are rather large.


1:

call x264 -o r:\t.mkv r:\my_soup.avs --preset veryslow --tune psnr %*
call ffmpeg -i t.mkv -sws_flags +bicubic+accurate_rnd+full_chroma_int -vcodec png tf.png

2:

call ffmpeg -sws_flags +bicubic+accurate_rnd+full_chroma_int+full_chroma_inp -i my_soup.avi -vcodec libx264 -fpre c:\progs\video_tools\ffmpeg-latest\presets\libx264-lossless_slow.ffpreset r:\t.mkv
call ffmpeg -sws_flags +bicubic+accurate_rnd+full_chroma_int+full_chroma_inp -i r:\t.mkv -vcodec png r:\tf.png

I think I'll just write my own Y4M converter, since having my own direct Y4M in/out would be useful for me outside of this stupid test.

ADDENDUM : well I did.

added to the chart now is x264 with my own y4m converter :

I was actually most of the way there already with the improved ffmpeg software scaler flags. The main thing I was missing: the reason it gets so much worse than our jpegfnspaq line at high bit rate is that "fns" is short for "flat no sub", and the "no sub" is what gets you - all subsampled codecs get much worse than non-subsampled codecs at high bit rates.
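The subsampling ceiling is easy to demonstrate in isolation. This is just a toy Python sketch of mine (box-filter down, nearest-neighbor up - not any real codec's resampler): the chroma detail thrown away by 4:2:0 can't be bought back with bitrate.

```python
# Toy: 4:2:0-style horizontal chroma subsample round trip.
def down2(row):    # average pairs: the "sub" in subsampled
    return [(row[i] + row[i + 1]) / 2.0 for i in range(0, len(row), 2)]

def up2(row):      # nearest-neighbor upsample back
    out = []
    for v in row:
        out += [v, v]
    return out

chroma = [0, 255] * 8          # worst case: pixel-alternating chroma
rt = up2(down2(chroma))
print(rt[:4])  # [127.5, 127.5, 127.5, 127.5] -- the detail is gone
```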

Even using my own Y4M converter is a monstrous fucking pain in the ass, because god damn FFMPEG won't just pass through the YUVJ420P data raw from x264 out to a Y4M stream - it prints a pointless error and refuses to do it. That needs to be fixed god damn it. The workaround is to make it output to "rawvideo" with yuvj420p data, and then load that same raw video but just tell it it has yuv420p data in it to get it to write the y4m. So my test bat for x264 is now :


call dele r:\t.*
call dele r:\ttt.*
c:\src\y4m\x64\release\y4m.exe r:\my_soup.bmp r:\t.y4m
r:\x264.exe -o r:\t.mkv r:\t.y4m --fullrange on --preset veryslow --tune psnr %*
call ffmpeg.bat -i r:\t.mkv -f rawvideo r:\ttt.raw
call ffmpeg.bat -pix_fmt yuv420p -s 1920x1200 -f rawvideo -i r:\ttt.raw r:\ttt.y4m
c:\src\y4m\x64\release\y4m.exe r:\ttt.y4m r:\ttt.bmp
namebysize r:\ttt.bmp r:\t.mkv r:\xx_ .bmp

The big annoyance is that I have the resolution hard-coded in there to make rawvideo work, so I can't just run it on arbitrary images.
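The Y4M container itself carries the resolution in its header, which is the whole point of going through it; a minimal writer is only a few lines. This is a sketch of mine, not the actual y4m.exe source (frame rate and chroma tag are assumptions for the still-image case):

```python
import os
import tempfile

def write_y4m(path, w, h, frames):
    """frames: iterable of (y, u, v) byte planes, 4:2:0, full range."""
    with open(path, "wb") as f:
        # C420jpeg = 4:2:0 with JPEG-style (full-range) chroma;
        # F24:1 is an arbitrary frame rate for a one-frame still.
        f.write(b"YUV4MPEG2 W%d H%d F24:1 Ip A1:1 C420jpeg\n" % (w, h))
        for y, u, v in frames:
            f.write(b"FRAME\n")
            f.write(y)
            f.write(u)
            f.write(v)

# Demo: one 4x2 gray frame
fd, path = tempfile.mkstemp(suffix=".y4m")
os.close(fd)
write_y4m(path, 4, 2,
          [(bytes([128]) * 8, bytes([128]) * 2, bytes([128]) * 2)])
header = open(path, "rb").read().split(b"\n")[0]
print(header)  # b'YUV4MPEG2 W4 H2 F24:1 Ip A1:1 C420jpeg'
os.remove(path)
```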

FYI my_y4m is currently just doing PC.601 "YUV" which is the JPEG YCbCr. I might add support for all the matrices so that it can be a fully functional y4m converter.
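Those PC.601 / JPEG YCbCr formulas are the standard full-range JFIF ones. A quick Python sketch (mine, not the converter's actual source):

```python
# Standard JFIF RGB -> YCbCr: full-range, BT.601 luma coefficients,
# chroma centered at 128.
def rgb_to_jpeg_ycbcr(r, g, b):
    y  =         0.299    * r + 0.587    * g + 0.114    * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5      * b
    cr = 128.0 + 0.5      * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

# Gray stays neutral: chroma sits exactly at 128.
print([round(v, 4) for v in rgb_to_jpeg_ycbcr(128, 128, 128)])
# [128.0, 128.0, 128.0]
```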

I was going to use the Y4M reader/write code from MJPEGTOOLS , but it looks like it's fucking GPL which is a cancerous toxic license, so I can't use it. (I don't actually mind having to release my source code, but it makes my code infected by GPL, which then makes my code unusable by 90% of the world).

10/15/2010

10-15-10 - Image Comparison Part 8 - Hipix

Hipix is a commercial lossy image tool. I hear it is based on H264 Intra so I wanted to see if it was a better version of that idea (x264 and AIC both having let me down).

Well, at this point we should be unsurprised that it sucks balls :

log rmse :

ms-ssim-scielab :

One note : the rightmost data point from hipix is their "perfect" setting, which is very far from perfect. It's only a little over 2 bpp and the quality is shit. I feel bad for any sucker customers who are saving images as hipix "perfect" and thinking they are getting good quality.

I started to think , man maybe my ms-ssim-scielab is just way off the mark? How can everyone be so bad? Any time your test is telling you things that are hard to believe, you need to reevaluate your test. So I went and looked at the images with my own eyes.

Yep, hipix is awful. JPEG just blows it away.

A sample from the hipix image closest to the 0 on the x axis , and a JPEG of the same size : (HiPix is 230244 bytes, JPEG is 230794 bytes)

JPEG :

HiPix :

Note the complete destruction of the wood grain detail in the hipix, as well as introduction of blockiness and weird smudge shapes. Note the destruction of detail in the plate rim, and the ruining of the straight line edge of the black bowl.

BTW when you are evaluating perceptual quality, you should *NOT* zoom in! JPEG is optimized for human visual system artifact perceptibility at the given scale of the image. JPEG intentionally allows nasty artifacts that look bad when you zoom in, but not when you look at the image in its normal size.

Conclusion : Hipix needs to immediately release a "new and improved HiPix v2.0 that's way better than the last!" by just replacing it with JPEG.

Since they don't offer a command line app I won't be testing this on any more images.

ADDENDUM : Well I ran two points on Moses :

The two points are "High" and "Perfect" and perfect is way not perfect.

10-15-10 - Image Comparison Part 7 - WebP

I thought I wasn't going to be able to do this test, because damn Google has only released webpconv for Linux (or you know, you can download Cygwin and build it yourself WTFBBQ). But I found these :

WebP for .NET

webp.zip solution for VC

... both of which are actually broken. The .NET one just fails mysteriously on me. The webp.zip one has some broken Endian stuff, and even if you fix that, the BMP input & output is broken. So.. I ripped that out and replaced it with the cblib BMP in/out, and it seems to work.

(I didn't want to use the webp converter in ffmpeg because I've seen past evidence that ffmpeg doesn't do the color conversion and resampling right, and I wanted to use the Google-provided app to make sure that any bad quality was due only to them)

My build of WebP Win32 is here : webp.zip

Here are the results :

log rmse :

ms-ssim-scielab :

Now I am surprised right off the bat: the ms-ssim-scielab results are not terrible, but the rmse is not very good. I've read rumors in a few places that On2 tweaked WebP/WebM for RMSE/PSNR, so I expected the opposite.

Looking at the RMSE curve it's clear that there is a bad color conversion going on. Either too much loss in the color convert, or bad downsample code, something like that. Any time there is a broken base color space, you will see the whole error curve is a bit flatter than it should be and offset upwards in error.

The perceptual numbers are slightly worse than jpeg-huff through the "money zone" of -1 to 1. Like all modern coders it does have a flatter tail so wins at very low bit rate.

(BTW I think JPEG's shortcoming at very low bit rate is due to its very primitive DC coding, and lack of deblocking filter, but I'm not sure).

BTW I also found this pretty nice Goldfishy WebP comparison

Also some pretty good links on JPEG that I've stumbled on in the last few days :
jpeg wizard.txt
ImpulseAdventure - JPEG Quality Comparison
ImpulseAdventure - JPEG Quality and Quantization Tables for Digital Cameras, Photoshop
ImpulseAdventure - JPEG Compression and JPEG Quality

Here's how the WebP test was done :


webp_test.bat :
call dele s:\*.bmp
call dele s:\*.webp
Release\webp -output_dir s:\ -format webp -quality %1 r:\my_soup.bmp
Release\webp -output_dir s:\ -format bmp s:\my_soup.webp
namebysize s:\my_soup.bmp s:\my_soup.webp s:\webp_ .bmp
call mov s:\webp_*.bmp r:\webp_test\



md r:\webp_test
call dele r:\webp_test\*
call webp_test 5
call webp_test 10
call webp_test 15
call webp_test 20
call webp_test 25
call webp_test 30
call webp_test 40
call webp_test 50
call webp_test 60
call webp_test 65
call webp_test 70
call webp_test 75
call webp_test 80
call webp_test 85
call webp_test 90
call webp_test 95
call webp_test 100
call mov s:\webp_*.bmp r:\webp_test\
imdiff r:\my_soup.bmp r:\webp_test -cwebp_imdiff.csv
transposecsv webp_imdiff.csv webp_trans.csv

BTW the webpconv app is really annoying.

1. It fails out mysteriously in lots of places and just says "error loading" or something without telling you why.

2. It takes an "output_dir" option instead of an output file name. I guess that's nice for some uses, but you need an output file name option for people who are scripting. (you can fix this of course by making your batch rename the input file to "webp_in" or something, then you can rename the output at will)

3. It's got image format loaders for like 10 different formats, but they're all semi-broken. Don't do that. Just load one standard format (BMP is a good choice) and support it *well*, eg. be really compliant with variants of the bitstream, and let the user convert into that format using ImageMagick or something like that.

4. It won't write the output files if they already exist, and there's no "force overwrite" option. This one had me absolutely pulling out my hair as I kept running it with different options and the output files stayed the same. (you can fix this of course by making your batch delete the output first)

Despite all this negativity, I actually do think the WebP format might be okay if it had a good encoder.

ADDENDUM : WebP on Moses :

On "my_soup" it looked like WebP was at least close to competitive, but on Moses it really takes itself out of the running.
