12/08/2008

12-08-08 - DXTC Summary

I thought I should fix some things that were wrong or badly said in my original DXTC posts :

DXTC Part 1
DXTC Part 2
DXTC Part 3
DXTC Part 4

Added later :

DXTC Addendum
DXTC More Followup
DXTC is not enough
DXTC is not enough part 2

On the "ryg" coder : there was a bug/typo in the implementation I was using which gave bogus results, so you should just ignore the numbers in those tables. See for correction : Molly Rocket Forum on Ryg DXTC coder . Also I should note he does have some neat code in there. The color finder is indeed very fast; it is an approximation (not 100% right) but the quality is very close to perfect. Also his "RefineBlock" which does the Simon Brown style optimize-end-from-indices is a very clever fast implementation that collapses a lot of work. I like the way he uses the portions of one 32 bit word to accumulate three counters at a time.

I also mentioned in those posts that the optimized version of the Id "real time DXTC" bit math was "wrong". Well, yes, it is wrong in the sense that it doesn't give you exactly the same indeces, but apparently that was an intentional approximation by JMP, and in fact it's a very good one. While it does pick different indeces than the exact method, it only does so in cases where the error is zero or close to zero. On most real images the actual measured error of this approximation is exactly zero, and it is faster.

So, here are some numbers on a hard test set for different index finders :


    exact : err:  31.466375 , clocks: 1422.256522

    id    : err:  31.466377 , clocks: 1290.232239
            diff:  0.000002

    ryg   : err:  31.466939 , clocks:  723.051241
            diff:  0.000564

    ryg-t : err:  31.466939 , clocks:  645.445860
            diff:  0.000564

You can see the errors for all of them are very small. "ryg-t" is a new one I made which uses a table to turn the dot product checks into indexes, so that I can eliminate the branches. Start with the "ryg" dot product code and change the inner loop to :

    const int remap[8] = { 0 << 30,2 << 30,0 << 30,2 << 30,3 << 30,3 << 30,1 << 30,1 << 30 };

    for(int i=0;i < 16;i++)
    {
        int dot = block.colors[i].r*dirr + block.colors[i].g*dirg + block.colors[i].b*dirb;

        int bits =( (dot < halfPoint) ? 4 : 0 )
                | ( (dot < c0Point) ? 2 : 0 )
                | ( (dot < c3Point) ? 1 : 0 );

        mask >>= 2;
        mask |= remap[bits];
    }

I should note that these speed numbers are for pure C obvious implementations and if you really cared about speed you should use SSE and who knows what would win there.


Now this last bit is a little half baked but I thought I'd toss it up. It's a little bit magical to me that Ryg's Mul8Bit (which is actually Jim Blinn's) also happens to produce perfect quantization to 565 *including the MSB shifting into LSB reproduction*.

I mentioned before that the MSB shifting into LSB thing is actually "wrong" in that it would hurt RMSE on purely random data, because it is making poor use of the quantization bins. That is, for random data, to quantize [0,255] -> 32 values (5 bits) you should have quantization bins that each hold 8 values, and whose reconstruction is at the middle of each bin. That is, you should reconstruct to {4,12,20,...} Instead we reconstruct to {0,8,...247,255} - the two buckets at the edges only get 4 values, and there are some other ones that get 9 values. Now in practice this is a good thing because your original data is *not* random - it's far more likely to have exact 0 and 255 values in the input, so you want to reproduce those exactly. So anyway, it's not a uniform quantizer on the range [0,255]. In fact, it's closer to a uniform quantizer on the range [-4,259].


I think it might actually just be a numerical coincidence in the range [0,255].

The correct straight-forward quantizer for the DXTC style colors is

    return (32*(val+4))/(256+8);

for R.  Each quantization bin gets 8 values except the top and bottom which only get 4.  That's equivalent to quantizing the range [-4,256+4) with a uniform 8-bin quantizer.

Now

1/(256 + 8) = 1/256 * 1/(1 + 8/256)

We can do the Taylor series expansion of 1/(1+x) for small x on the second term and we get ( 1 - 8/256 + 64/256/256) up to powers of x^2

So we have

    ( (32*val+128) * ( 1 - 8/256 + 64/256/256) ) >> 8

And if we do a bunch of reduction we get 

    return ( 31*val+124 + ((8*val+32)>>8) ) >> 8

If we compare this to Mul8bit :

        return ( 31*val+128 + ((val*31 + 128)>>8)) >> 8;

it's not exactly the same math, but they are the same for all val in [0,255]

But I dunno. BTW another way to think about all this is to imagine your pixels are an 8.inf fixed point rep of an original floating point pixel, and you should replicate the 8 bit value continuously. So 0 = 0, 255 = FF.FFFFFF.... = 1.0 exactly , 127 = 7F.7F7F7F..

BTW this reminds me : When I do Bmp -> FloatImage conversions I used to do (int + 0.5)/256 , that is 0 -> 0.5/256 , 255 -> 255.5/256 . I no longer do that, I do 0->0, and 255 -> 1.0 , with a 1/255 quantizer.

1 comment:

castano said...

The SIMD on integers is a nice trick. It's especially useful on CUDA where there are no real SIMD instructions. That made my compressor ~20% faster.

old rants