First of all, let's dispell the illusion that some people have that DXT1 is "pretty good". DXT1 is fucking awful. It compresses
at 1.33333 bits per byte (4 bits per pixel). That's very large as far as image compressors are concerned. For typical images,
around 4.0 bpb is true lossless, around 1.5 is perceptually lossless, and around 0.5 is "very good". In fact wavelet compressors
can get as low as 0.1 bpb and acheive about the same quality as DXT1. Despite this I've heard smart people saying that "DXT1 is
pretty good". Yes, it is a convenient fixed size block, and yes the decoder is extremely fast and simple, but as far as quality
is concerned it is not even close. At 1.33333 it should be near lossless.
Here are some numbers on various DXT1 compressors. The numbers in the table are the RMSE (sqrt of the L2 error). The far right
column is a wavelet compressor for comparison; it's not the best wavelet compressor in the world by a long shot, it's "cbwave" which
is very old and which I designed for speed, not maximum quality. In any case it gives you an idea how far off DXT1 is.
(BTW I always try to show results in RMSE because it is linear in pixel magnitude, unlike MSE or PSNR which is a very weird nonlinear scale).
More discussion after the table...
file  Squish opt  Squish  CB opt  CB  ryg  D3DX8  FastDXT   cbwave

kodim01.bmp  8.2808  8.3553  8.352  8.6924  9.374  9.8466  9.9565   2.6068

kodim02.bmp  6.1086  6.2876  6.1287  6.3025  7.52  7.4308  8.456   1.6973

kodim03.bmp  4.7804  4.9181  4.8312  5.0225  5.855  6.094  6.4839   1.3405

kodim04.bmp  5.6913  5.8116  5.7285  5.9394  6.9408  7.1032  7.3189   1.8076

kodim05.bmp  9.6472  9.7223  9.707  10.112  10.8934  11.273  12.0156   2.9739

kodim06.bmp  7.1472  7.2171  7.1777  7.44  8.1005  8.5195  8.6202   2.0132

kodim07.bmp  5.7804  5.8834  5.8379  6.0583  6.8153  7.2182  7.372   1.4645

kodim08.bmp  10.2391  10.3212  10.346  10.747  11.3992  11.8703  12.2668   3.2936

kodim09.bmp  5.2871  5.3659  5.3306  5.5234  5.9884  6.5332  6.6716   1.6269

kodim10.bmp  5.2415  5.3366  5.2777  5.4633  5.9377  6.4601  6.4592   1.7459

kodim11.bmp  6.7261  6.8206  6.7643  7.0216  7.8221  8.1056  8.2492   1.8411

kodim12.bmp  4.7911  4.8718  4.8204  4.9863  5.6651  6.005  6.0748   1.5161

kodim13.bmp  10.8676  10.9428  10.925  11.4237  12.402  12.7139  12.9978   4.1355

kodim14.bmp  8.3034  8.3883  8.3398  8.6722  9.4258  9.896  10.8481   2.4191

kodim15.bmp  5.8233  5.9525  5.8568  6.0862  6.6749  7.3085  7.4932   1.6236

kodim16.bmp  5.0593  5.1629  5.0863  5.2851  5.8093  6.3361  6.1592   1.546

kodim17.bmp  5.5019  5.6127  5.5313  5.7358  6.4975  6.7395  6.8989   1.7166

kodim18.bmp  7.9879  8.0897  8.0192  8.3716  9.7744  9.5357  9.7857   2.9802

kodim19.bmp  6.5715  6.652  6.6692  6.91  8.0128  7.9229  8.0096   2.0518

kodim20.bmp  5.4533  5.5303  5.4895  5.6864  6.3457  6.4878  6.8629   1.5359

kodim21.bmp  7.1318  7.2045  7.1724  7.4582  8.1637  8.4703  8.6508   2.0659

kodim22.bmp  6.43  6.5127  6.4644  6.7137  7.8264  8.0046  7.9488   2.2574

kodim23.bmp  4.8995  5.0098  4.9244  5.0906  5.6989  6.3057  6.888   1.3954

kodim24.bmp  8.4271  8.5274  8.4699  8.8564  9.3906  9.9389  10.5156   2.4977

clegg.bmp  14.9733  15.2566  15.1755  15.7168  16.3563  21.471  32.7192   10.5426

FRYMIRE.bmp  10.7184  12.541  12.132  12.8278  12.989  16.7308  28.9283   6.2394

LENA.bmp  7.138  7.2346  7.1763  7.4264  8.1203  8.742  9.5143   4.288

MONARCH.bmp  6.5526  6.6292  6.5949  6.846  7.5162  8.1053  8.6993   1.6911

PEPPERS.bmp  6.3966  6.5208  6.4557  6.677  7.3618  8.1855  8.8893   2.3022

SAIL.bmp  8.3233  8.3903  8.3598  8.6627  9.8685  9.7838  10.5673   2.9003

SERRANO.bmp  6.3508  6.757  6.8385  7.9064  7.5303  9.0549  18.3631   4.6489

TULIPS.bmp  7.5768  7.656  7.6146  7.8786  8.4084  9.3817  10.5873   2.2228

Back to comparing the DXT1 encoders.
BTW the test set here is the Kodak image set plus the Waterloo Bragzone image set. The Kodak set is all photographs that are pretty noisy, and there's
not a huge difference in the coders.
The Bragzone image set has some synthetic images with things like gradients which are harder to compress well, and there you can really dramatically see
the bad encoders fall apart. In particular if you look at the results on "clegg" and "frymire" and "serrano" you can see how bad the "FastDXT" coder is.
The "Squish" in the table is the iterative cluster fit with uniform weighting. All coders work on linear RGB error; and the MSE is mean per pixel not per component.
The "CB" encoder is a simple 2means fit. I seed the means by doing the PCA to find the best fit line. I put a plane perpendicular to that line through
the average and take all the points on each half to be the two clusters, average the cluster to seed the means, and then iteratively refine the means by
reclustering. Once I have the 2means I do a simple search to find the best 565 DXT end points to find the two means. There are 3 cases to try :
1. put c0 and c1 = the two means
2. put c2 and c3 = the two means (so c0 and c1 are pushed out)
3. make (c0,c2) straddle mean 1 and (c3,c1) straddle mean 2  this is best for gaussian clusters around the mean
A 2means like this is slightly better than doing a pca "range fit" like Simon's fast method or the "ryg" method. If the data was all Gaussian noise, they
would be equivalent, but of course it's not. You often get blocks that have a bunch of pixels at the low end which are all exactly the same color (
for example, all perfect black), and then
a spread of a bunch of junk at the high end (like some orange, some yellow, etc.). You want to put one end point exactly on perfectly black and put the other
endpoint at the center of the cluster on the high end.
"CB opt" and "Squish opt" take the results of the CB and Squish compressors and then improve them by iterative search. Simon Brown on his page mentions
something about doing greedy endpoint refinement but claims it "doesn't converge because the indeces change". That's silly, of course it converges.
To do a greedy search : try wiggling one or both end points in 565 space. Find new best index fit for the new end points. Measure new error. If the
error is lower, take the step.
Of course that works and it does converge and it's pretty simple and improves the encoding after Squish.
In fact you can do even better by doing simulated annealing instead of a plain greedy search. We should know by now that any time we have a greedy
search like this where we can measure the error and it can get stuck in local minima, we can improve it with something like simulated annealing. I use 256
steps of annealing with a sinusoidal decreasing temperature pattern. I randomly pick a way to wiggle the two end points (you need to consider wiggling both,
not just single wiggles). I try the wiggle and measure the error delta. Negative errors (improvements) are always taken, positive errors are taken
probabilitistically based on the temperature.
This is what was done in the "opt" coders above.
Most of the time the annealing search just makes a small improvement, but once in a while (such as on "frymire" and "serrano") it really finds something
remarkable and makes a big difference.
Simon's cluster fit alg is very sweet. I didn't really understand it at first
from the description of the code, so I'm going to redescribe it for myself here.
The basic idea is you find an assignment of indices, and from that solve a best fit to give you the end points for those indices. If you think about all the
colors in the block living in 3x3 color space, assigning indices is like giving labels to each point. All the points with the same label are a "cluster".
Then each cluster is fit with either an end point or one of the interpolated points.
Once you know the clusters, putting the best two end points is a simple linear optimization. You can solve the least squares analytically, it's not like a
matrix iteration least squares problem or anything.
So the trick is how to decide the indices. If you tried them all, it would be 4^16 ways, which is way too many. So what Simon does is create a linear ordering
of the points using the PCA axis, then try all clusterings that can be created by planes perpendicular to that linear axis. That's equivalent to just trying all
groupings of the indeces sorted by their dot along the PCA axis. That is, the clusters are
[0,i) , [i,j) [j,k) [k,16)
for all i,j,k
which is a manageable amount to search, and gives you most of the interesting clusterings that you care about. Something that might improve Squish is tring a
few different axes and picking the best.
BTW this end point optimization is very approximate. One issue is that it assumes the best index for each point doesn't change. It
also of course just uses real number arithmetic to make the 1/3 points, not the actual integer rounding that's done. Those factors
are actually pretty big.