12-18-09 - The ACM

There's a bunch of hubbub lately in the game/graphics community about the anti-scientific actions of the ACM.

I find it all much ado about nothing. Of *course* the ACM is a bunch of cocks that are against scientific progress and dissemination of information. Ooo the group of evil bastards who claim all the rights to the paper that you yourself wrote are being dicks !? Well big fucking surprise. Woop-tee-do.

Everyone should boycott the ACM and the IEEE. Duh. The paper publications have absolutely zero purpose now. The conferences have almost no purpose, it's just a bunch of hob-nobbing and back-slapping. The internet is how real information is conveyed. I'm not quite sure why people are so enamored of conferences. I can get way more information in 10 minutes reading papers on the internet than you can get from a 4 day conference, because I am pulling the information I want at the pace that I want, not having it pushed to me.

Now, peer review and collection of papers does in fact provide a service. You don't just want researchers putting up papers on their own web sites with no organization and no peer review. (for example, Arxiv is great and all, but the lack of organization and peer review makes it weak as a primary publishing location).

If you actually cared about getting away from the evil iron grip of the ACM and IEEE bastards, you should work harder to organize and promote the free online journals, of which there are many, but they haven't caught on because the majority of major researchers are smitten by the prestige of the ACM/IEEE names.

Go to DOAJ , pick a journal, support that. Or you know, start a new one focused on higher performance / realtime computer graphics.


11-18-09 - Raw Conversion

CR2 format is a big mess of complication. My god I hate TIFF. The main thing is the sensor data. It appears to be stored as "lossless JPEG" which is a new format that uses the JPEG-LS predictor but then just codes the residual with normal JPEG Huffman coding. The sensor data is RGGB which they either store as a 4-channel per pixel [RGGB per pixel] or as 2-channel [GR or GB]. Either way is clearly not optimal. One interesting thing I could do if I cracked the CR2 format is store all these raws smaller with a better compressor. The RAWs from the S90 are around 11M on average, it uses the 2-channel mode; the RAWs are 1872x2784 = 3744x2784 samples and 12 bits per sample. That means the JPEG is getting to 8.85 bits per sample. Not very good.

Of course I probably have to use dcraw to read it for me, but dcraw is just about the worst piece of code I've ever seen in my life. It's a miracle to me that people are able to write functioning software from code like that.

Paul Lee has a modified dcraw and some nice sample pictures of how demosaicing can go wrong (click the Moire or Aliasing links).

My idea for high quality RAW processing :

First of all, abandon your idea of an "image" as a matrix (grid) of colors (aka a bitmap).

The S90 sensor has barrel distortion that's corrected in software.

It also samples colors in an RGGB Bayer mosaic pattern (like most cameras).

The two of those things combined mean that you really just have a collection of independent R's, G's, and B's at
irregular positions (not on a grid due to barrel distortion).

Now, you should also know that you need to do things like denoising on these original samples, NOT on
the grid of colors after conversion to a bitmap.

So I want to denoise directly on the source data of irregular color samples.
Denoising R & B should make use of the higher quality G data.

Denoising should of course use edge detection and other models of the image prior to make a Bayesian
maximum likelihood estimate of the sample without noise.

To output a bitmap you need to sample from this irregular lattice of samples (mosaic'ed and distorted).

Resampling creates aliasing and loss of information, so you only want to do it once ever on an image.

There's absolutely no a-priori reason why we should be resampling to the same resolution as the sensor
here.  You should resample at this point directly to the final resolution that you want your image.

For example with the S90 rather than outputting the stupid resolution 3648x2736, I would just output 3200x2400
which would let me view images at 1600x1200 on monitors with a box down-filter which will make them appear
much higher quality in practice (vs 3648x2736 viewed at 1600x1200 which involves a nasty blurring down-filter).

The output from this should be a floating point bitmap so that we don't throw away any color resolution

Exposure correction can then be done on the floating point bitmap without worrying about the irregular
lattice or any further resampling issues.


11-06-09 - IsSameFile

I found myself wanting to know if two file names were the same file on disk. It's hard to check that just by looking at the name. Obviously you have issues like one might be absolute, one might be relative. Even if you fix that, they could be different A-code-pageizations of unicode names. And something I hit often is one of them might be on a "subst" or even a hard link. I want to know if they are actually the same file.

This appears to work :

bool IsSameFile(char * Name1,char * Name2)
    HANDLE f1,f2;
    if ( f1 == INVALID_HANDLE_VALUE )
        return false;
    if ( f2 == INVALID_HANDLE_VALUE )
        return false;
    // BY_HANDLE_FILE_INFORMATION has a unique file ID and Volume Serial Number in it
    //  check those are the same
    //  heh fuck it just check they are all the same
    // confirmed : this does work across substs
    return memcmp(&info1,&info2,sizeof(info1)) == 0;


11-04-09 - Video is not for Windows

Holy crap video is a flaming mess.

First, if you don't know, there are two main things : packages & streams. The packages are AVI, MKV, MP4, MOV ; they put together image data and audio data in some way (the layer that unpacks these is often called a "splitter" or "demuxer"). The packages send their streams to codecs which convert them to some format for display. On Windows this is all supposed to go through DirectShow which is supposed to use the 4CC codes and some priority information to automatically find the right handler for the various streams and packages. In theory.

The first problem you hit on Windows is that AVI packages are handled pretty well, but AVI can't hold H264 video because AVI can't handle the flexible B-frame ordering that H264 can generate. (limitted profiles of H264 can be put in AVI, and there are hacks around this problem, but you're getting into a world of hurt). So you need MKV or MP4 boxes, and those are handled poorly; some apps handle them okay, some don't. Some apps "cheat" and don't trust DirectShow like they're supposed to (the cheating apps often work better).

Things I've installed lately :

MP4Box : MP4 stream boxer/unboxer ; pretty decent app, recommended, but help is poor

YAMB : GUI for MP4Box.  Useful to help figure out command lines for MP4Box because the help is bad.
    YAMB has bad bugs though and will fail to launch MP4Box, so you have to copy out the command line
    and run it yourself

MKVVerify : MKV stream checker.  Useful because MKV support is so fucking borked.

MediaInfo : Media info reporter.  Questionable usefulness because I don't trust it and don't know where
    it's getting it's info for.

Graphedit : DirectShow graph visualizer and tester from MS

GSpot : AVI info tool.  Useless.

MSU VMT : Moscow State University Video Quality Measurement Tool.  This is pretty neat when it works,
    but far too often it fails to get the frames correctly, so you get totally bogus results.

MSU LS Codec : Moscow State University Lossless Codec.  Best compressing lossless codec, seems nice
    but crashes some tools when you try to use videos compressed with this.  Thus useless.

Lagarith Codec : This appears to be the one good working lossless codec.  Recommended.

HuffYUV Codec : Videos made with this crash me on read.  Jeff says it works great for him.  Avoid.

MeGUI : GUI for "mencoder" which can driver AviSynth and x264 ; like all of these big GUIs that try to
    run a bunch of other products, this mysteriously fails for me.  It seems to set everything up right
    and then it launches ten other programs and they fail to hook up in the way MeGUI expected them to.

Handbrake : see MeGUI

FFDShow : hooks up the Linux video decoders (ffmpeg , libavc, etc.) to DirectShow.  This thing is
    pretty evil and fails to report frame rate and media info sometimes, but is also the only real

Haali Media Splitter : MKV unboxer, works with FFDShow.  Difficult to install correct manually.
    Even when installed correctly, does some weird shit with framerate; doesn't seem to report it
    correctly through DirectShow.  Probably best to get a codec pack like :

K-Lite Codec Pack : works for me but generally is considered malware
Matroska Codec Pack : didn't work for me
CCCP Codec Pack : not tried

MPlayer : Linux media player, now ported to Windows ; very flexible command line control of everything,
    alternate audio/video in/out.  Highly recommended.

MEncoder : video encode/decode partner to MPlayer.  I've had more success running mplayer and x264 manually
    than using this.  Still I can't complain about MEncoder from the command line.

MPUI : GUI for MPlayer.  This is horrific malware.  When you install it, it takes over your system without
    asking.  They do provide some tools for you to change this after the fact, but still should be avoided.
    Use Media Player Classic or VLC.

AviSynth : script thing to pipe video to other programs that read AVS scripts.  Dear lord.

Basically I've found that all the GUI's are broken, and all the video containers (AVI,MP4,MKV) are broken. The thing I've finally discovered that actually works is using MPlayer and X264 from the command line, and only working with split frames. Trying to work with video containers caused me all kinds of hurt because so many of these apps fail to unbox the containers right and screw up the frame rate or drop frames or other mistakes. Instead now if I want to work on a video I use MPlayer to convert it to raw frames.

mplayer -benchmark -ao null -vo png:z=5 video.avi

to dump frames to PNG

mplayer -benchmark -ao null -vo yuv4mpeg:file=test.y4m video.avi

to dump the video to YUV4MPEG format in "test.y4m" for input to x264

x264.exe --bitrate 10000 --output "out.mp4" test.y4m

x264 compress to "out.mp4"

Then use mp4box to put the audio back if wanted.

The cool thing about mplayer is that its audio/video decoders are the same ones used to view the video. So you can watch it, and if it plays right in the viewer, then it will extract correctly. I've found lots of videos that I can watch in MPC or VLC, but then fail to load the same way in whatever encoder/decoder when I try to process something.

The sucky thing about this method is you make ginormous temp files on your disk, which also slows things down a lot. But avoiding the fuckups of going through the borked DShow codecs and splitters is worth it.

Most of these tools now are originally Linux tools that are getting moved back to Windows. One very promising development is that many of them have the option to directly load libs for the codecs Linux-style (eg. just load libavc to play video) and avoid DirectShow completely. I haven't really tried that yet but it seems like it's almost possible to work with video just by manually picking a few of these libs and then you avoid the whole Windows borked media layer.

ADDENDUM : one of the difficulties I've seen in a lot of tools is reading the frame rate wrong. This is presumably due to the demuxers not reporting things back totally right. But there are also two related fundamental problems that make everything harder :

1. Most of these formats don't have real/useful headers. (if they do have header info, it's just added as "comment" information). This was done originally because theoretically if your AVI is being broadcast on TV and you change the channel into it, you will just start getting bytes in the middle and never see the header, thus they didn't put good headers on at all.

2. It's almost impossible to really reliably just get the frames out of video. DirectShow doesn't have a reliable call that's just "give me the next frame". Instead you have to ask for "when is the next frame" and then "give me an image at this time". The problem is that the "when" can get fucked up in various ways, and then when you say "give me an image at this time" you can either skip frames or get duplicate frames. (this is what fucks up the MSU VMT tool for me so badly, they are getting the time sampling all wrong quite often).

Even if it's not way off, this still causes subtle bugs because people don't agree exactly on how to represent the frame rate. Some people treat broadcast as exactly = 30000/1001 fps and use rational arithmetic for all timing. Some people use floats for frame rate and use 29.97003 and then wind up with floating point precision problems at high frame numbers. Many of the containers store the frame rate as a number of microseconds between frames, eg. 33367 ; so if they store "33367" in the header, should I use that as my frame time increment exactly, or should I use 33366.666666 ?

I'm guessing that tons of people get duplicate and/or dropped frames because of this and just don't notice it.


10-19-09 - Against asynchronous GUIs

Most new Microsoft products have super asynchronous GUIs. This means they do GUI refreshes and window popups and threads and so on. This sounds like a nice thing, it removes big stalls and freezes in the GUIs and makes them "respond" more quickly, in the sense that when you request some new view it pops the window up immediately and then gradually fills it in, whereas before it would just freeze and wait for the new view to pop up.

In practice for power users this just sucks absolute balls. Power users use the keyboard. We use our peripherial vision. We don't sit around and wait for windows to pop up, we hit key combos that we know and we memorize sequences. So for example you might memorize that hitting Alt-F opens a find dialog and puts the focus in the entry window, so you can highlight the text you want to find, Ctrl-C, Alt-F, Ctrl-V, Enter, all in a row boom you're searching for that text.

BUT NO! No you aren't, instead you just pasted that text into your document because the fucking finder window was being built on a thread and wasn't opened in time to receive your ctrl-v so your ctrl-v went to the old window.

There are few obvious GUI design fails here. One is that asynchronicity and focus stealing should never go together. If a window opening is async then it should not transfer focus to that window. Also, if a key pops up a new window and transfers focus, do that immediately and queue up the key presses and don't process them until the action associated with each key press is complete. eg. if you get input like A,B,C, if you process A and it fires something async, then don't respond to B until the action from A is done if it's possible that they affect each other.

More generally, this programming pattern of finding clever complicated ways to hide the fact that your systems are overly bloated and slow is just not the win. You will only make the failure cases less common but more ugly. For example in this particular case I'd rather have a slow modal popup than an unpredictable async popup.

Now, there certainly is a place for asynchronicity in GUIs, namely in tasks spawned from GUIs that run in a pane. When you start some long process, that should be async from the GUI and run on its own thread, but I would argue that the entire GUI should run on one single thread and GUI drawing updates of things like icons and buttons should not be async (not any aspect of it that affects user input - purely visual updates could be async). So like for example, when you open a new folder, the population of the list for that folder should NOT be async because the user may be typing characters to navigate to names within the folder, and if you populate async you will screw up the response to their typing (it could be okay to be async here *if* you make it respond in the same way that it would if it was synchronous!). Ideally GUI operations should just be fast and not need to be async.

How do we not have a fucking universal laptop docking multi-port thing !? WTF.

The delay of Windows/PC's recognizing USB devices is really tilting. Every time I move my laptop or plug in the keyboard, it's wait .. wait .. wait .. okay now I can type. Urg.

I'm trying this new thing where I put my computer on automatic standby after 2 hours. I've tried lots of reminder apps before to make myself get up and stretch, but I always just ignore them. Standby is a hard enough obstacle that I think it might make me actually get up - if nothing else I'll get up in anger because the standby will fuck up what I'm doing. But when I come out of standby I have to deal with the fucking USB.

Computer-laid-out maps are abominations. Hand drawn maps are beautiful and convey so much more information. (what follows are just links related to maps, not examples of great hand-drawn maps)

Metskers Maps in Seattle is pretty great.

FeetFirst is a non-profit that supports walking in Seattle. Their maps are only okay.

See also : City of Seattle Map of public art and Historic Places in Seattle .


09-29-09 - Aliasing is Pretty

I made some images of aliasing/moiray patterns : (BTW inherent to these images is the fact that they don't look at all right except at full zoom since the image structure is created by the pixel grid sampling aliasing, so click through). You may be surprised to know what these are an image of.









Answer : these are pictures of sin( x^2 + y^2 ) , that is, sin( r^2 ) a very simple radial trig function. You don't see the function at all, the images are 100% aliasing as the radial sin is scaled to very high frequency.


09-26-09 - Habits - Code Inlining

There's been an off and on discussion with some peers in email about the issue of putting small code functionality directly inline in a bigger function vs. splitting it out. I don't think it's actually an interesting topic so I don't want to repeat it here, but it made me think of something I want to write about.

Basically the contention is that even if you have a logically separate piece of work, sometimes it's better to go ahead and write it inline in the larger function, because it makes the code flow more linear and imperative and thus easier to read and debug. There are also obvious disadvantages to writing code inline - less strong separation of functional bits, temptation to copy-paste code, less ability to debug and test functions independently, etc. etc. All the pros and cons are obvious and thus not interesting.

The key point to me is the issue that while code inline may be a win sometimes - it's a big loss at other times, and it requires you to make a careful smart logical decision about which way to go.

That's very bad. I think most smart people as they get older come to the realization that you shouldn't put yourself in positions where you have to make careful logical decisions over and over. Instead you should craft habits and rules for yourself that help you do the right thing without having to make careful decisions.

One obvious case that most smart people figure out is diet and exercise, or booze or other indulgences. Yes, it's perfectly fine to have a some dessert once in a while, but if you open that door for yourself you're putting yourself in a situation where you are consciously making a decision "is it okay for me to have cake today?" and you will inevitably get lazy and make the wrong choice sometimes.

As usual the best analogy is poker, and it's how this point was really made real for me. Smart people often start out playing poker trying to logically reason out every single action and think they don't need to be constrained by simple rules or habits. That's great if you really are correctly thinking through every situation, but you inevitably get tired or lazy or make mistakes, and if you're not constraining yourself with rules, you can make huge mistakes.

For example, there might be cases where the best play is to limp aces up front, or to not reraise with aces, but correctly making that decision requires a lot of careful thought, and the upside to making that decision is pretty small, while the downside to doing it in the wrong situation is very big. It's best to just make a rule for yourself that you always raise or reraise. It simplifies things, it removes decision points and lets you focus on more important issues. It might be +EV to call raises sometimes with hands like 68o, but it's best to just give yourself a rule that you never do that.

To be clear - these rules are specifically non-optimal. By making rules for yourself you are intentionally choosing to not try to be 100% optimal, so of course someone can say "you could have done something better there". That's not the point. The point is that if you try to make perfect decisions all the time you will occasionally fail very badly.

Winning poker play (or good coding, or good life living) are largely about making it easy for yourself to do the right thing.


09-25-09 - Motion Search

I just read the paper on Patch Match and it makes me angry so I figure I'll write about the motion search method I'm developing for possible future use in the new RAD video stuff. PatchMatch is just so incredibly trivial and obvious, it's one of those things that never should have been a paper and never should have been accepted in a journal. It's a great thing for someone to write on their blog because you can describe it in about one sentence, and most experts in the field already know the idea and are probably doing it already. (I will say the good thing about the paper is they do a good job of gathering references to other papers that are related, such as stuff in texture synth and hole filling and so on which I find interesting).

Here's the one sentence version of PatchMatch : Seed your match field with some random guess or shitty initial matches; improve by incrementally propagating match offsets to neighbors and trying small random deltas to find improvements. (it's an absolute classic spin network magnetic moment relaxation kind of problem).

Here's what I've been doing : start with a match field set to all nulls (no match found yet). Then incrementally fill it in with matches and propagate them to neighbors. It proceeds in a few steps like this :

Step 1. Use computer vision methods to find feature points in a frame. Match feature points to the previous frame. This bit is a bit tricky and tweaky, you only want to make matches that you're pretty confident in. Note that this matching is done based on a "characteristic" of the feature point which has no distance limit, and is also somewhat immune to rotation and scaling and such. Sometimes this step finds some very good correspondences between the frames, but it's sparse - it only has high confidence at a few places in the frame, so you can't use it to find all the block matches (and you wouldn't want to even if you could). Generally this finds around 100 vectors.

Step 2. Find "distinctive" spots in the frame. The goal is to find some spots that are not degenerate - eg. not flat patches, not straight edges. The idea is that these are places where we can likely find a good motion vector with high confidence, unlike degenerate areas where there are lots of equally good match vectors. I use two mechanisms to find distinctive spots : one is the computer vision feature points that were not already used in the first matching step. The second is to take the "cornerness" map of the image using a Harris or Hessian operator on the derivative of gaussians (this is a lot like an edge map, but it kills straight edges). Find the top 5% highest cornerness values that are local maxima, and use those as distinctive spots. All of the distinctive spots do a long distance brute force block match (something like radius = 16 or 32) to try to find a good motion vector for them.

Step 3 : Flood fill to fill in the gaps. We now have presumably good motion vectors at a few key points in the frame. Go to their neighbors and search for match vectors that are close to the neighboring one that we already found. Put that in the blank and push its neighbors to the queue to continue the flood fill.

Step 4 : Relaxation pass. (this is not critical). We now have a motion vector everywhere in the frame. For each match vector in the frame, look at its 4 neighbors. Examine match vectors that are near my 4 neighboring vectors. If one is better, replace self. Continue to next. Theoretically you should do this pass a few times, but I find 1 or 2 is very close to infinite.

The key thing is that motion is usually semi-coherent (but not fully coherent, because we are not really trying to find true motion here, but rather just the best matching block, which is a lot more random than true motion is). By finding very good motion vectors in seed spots where we have high confidence, we can propagate that good information out to places where we don't have as much confidence. This lets us avoid doing large brute-force searches.

BTW I really do not understand the point of all the "diamond search" type shit in the video compression literature. It seems to just find really shitty motion vectors and is not making good use of the possibilities in the bit stream. Especially with GPU video encoding in this modern age, doing plain old big chunks of brute force motion search is preferrable. (yes, I know it's for speed, but it's a poor way to optimize, and the high quality encoders are still non-realtime anyway, so if you're not realtime you may as well take some more time and do better; plus the vast majority of use of non-realtime video encoders is in an encode-once decode-many type of scenario which means you should spend a lot of cpu and encode as well as possible).

With this method I find motion vectors using local searches of radius 8-16 that are the same quality as brute force searches of radius 50-100, which makes it about two orders of magnitude faster (and higher quality, since nobody does brute force searches that far).

ADDENDUM : To give this post a bit more weight, here are some numbers on quality from my video coder vs. brute force search radius :

 -s16  : rmse : 9.3725 , psnr : 28.7277
 -s26  : rmse : 9.2404 , psnr : 28.8510
 -s48  : rmse : 9.0279 , psnr : 29.0531
 -s64  : rmse : 8.9171 , psnr : 29.1603
 -s100 : rmse : 8.7842 , psnr : 29.2907
 -s9999: rmse : 8.5294 , psnr : 29.5465

(-s16 means it's searching a 33x33 grid for motion vectors) (-s9999 means it searches full frame).

The above described iterative feature point propagation method gets

 -sfast: rmse : 8.8154 , psnr : 29.2600

BTW for doing full-frame brute force search you obviously should use a block-space acceleration structure for high dimensional nearest neighbor search, like a kd-tree, a bd-tree (box decomposition) or vp-tree (vantage point). High dimensional spaces are nasty though; the typical idea of "find a cell then walk to its immediate neighbors" is not fast in high D because you have O(D) neighbors.


09-08-09 - DXTC Addendum

Ryg pointed out that there are a few very important little details that I took for granted and didn't mention in my original DXTC postings , or was just not clear about :

One is that when I try all ways of hitting two given endpoints, I try both 4-color and 3-color versions. That is, given two endpoint colors C0 and C1, I quantize them to 16 bits, then try the DXT1 palette that you get from {C0,C1} and also the one from {C1,C0} (order of DXT1 determines whether it is 3-color or 4-color).

The second and related crucial thing, is that in 3-color mode, the extra color is transparent black. If the texture has no alpha at all, I assume the user will not be using it as an alpha source, so I treat the transparent black as just black. That is, I do color palette selection with alpha just ignored.

Apparently this is pretty important. I suspect this especially helps with the "4 means" method; if a bunch of the colors are near black, you want them to be classed together and then just ignored for the endpoint selection, so that they will go to the hard-coded black in 3-color mode and your interp end points will be chosen from the remaining colors.


08-27-09 - Oodle Image Compression Looking Back Pictures

I thought for the record I should put up some pictures about what I talked about last time.

First of all the R/D trellis quantization issue. Very roughly what we're doing here is coding to a certain bit rate. The "RDO" lets us use a smaller quantization bucket size, which initially lowers distrortion and increases our rate, but then we hammer on some of the values - mainly we just force them to zero, which causes some distortion and decreases rate; we choose to hammer the values that save us the most rate per distortion. (99% of the time all you're doing is turning 1's into 0's, so it's a matter of picking the 1 to squash to 0 which saves you most the rate).

Here are the results on "Moses" at 0.5 bits per pixel :

No R/D : RMSE = 9.9381 :


Unconstrained R/D : RMSE = 9.7843 :


You should be able to see in the R/D image that some of the image looks better, but other parts look much worse. The RDO has stolen rate from places where it was expensive in terms of rate to encode a certain distortion, and moved those bits to parts of the image where you can get more distortion win at a cheaper rate. This is awesome if your goal is to minimize RMSE, but it's unclear to me whether this is *ever* good perceptually.

In this particular case, the RDO Moses image actually has a worse SSIM than the No-RD image; this type of mistake is actually something that SSIM is okay at detecting.

In practice I use some hacks to limit how much the RDO can do to any one block. With those hacks I almost always get an SSIM improvement from RDO, but it's still unclear to me whether or not it's actually a perceptual improvement on many images (in some cases it's a very clear win; images like kodim09 or kodim20 where you have big flat patches in some spots and then a lot of edge detail in other spots, the RDO does a good job of stealing from the flats to give to the edges, which the eye likes, because we don't mind it if an almost perfectly smooth area becomes perfectly smooth).

Now for the hacky perceptual smooth DC issue.

This is "kodim04" at 0.25 bpp ; no RDO ; no unblock , no perceptual DC quantization ; basically a naive DCT coder :


Now we turn on the hacky perceptual quantization that gives more precision to smooth DC's : (unblock still off) :


Note that the perceptual quant of DC means that we are using more of our bitrate for the DC band, so we give less bits to AC, which means using a larger quantizer for AC to match the bit rate constraint.

Now with unblocking , no perceptual DC quant : (RMSE = 12.8565 , SSIM = 58.62%)


With unblocking and perceptual DC quant : (RMSE = 12.9666, SSIM = 57.88%)


I think the improvement is clearest on the unblocked images - the perceptual DC quant one actually looks okay, the parts that are supposed to be smooth still look smooth. The one with uniform DC quant looks disgustingly bumpy. Note that the SSIM of the better image is actually quite a bit worse. Of course RMSE gets worse any time you do a perceptual improvement. You should also be able to see that the detail in the hat thatching is better in the nonperceptual version, but that doesn't bother the eye nearly as much as breaking smoothness.

ADDENDUM : some close up pictures of Moses' waddle area showing the R/D artifacts better. You should zoom these to full screen with a box filter and toggle between them to see most clearly. You should see the RDO killing blocks in the collar area very clearly. All you really need to do is look at the last picture of these four and you should be able to see what I'm talking about with the RDO :

Portion of Moses at 0.75 bpp : No lagrange optimization :


With Lagrange RDO :


Crop of No-L :


Crop of RDO :



08-25-09 - Oodle Image Compression Looking Back

I did a little image compressor for RAD/Oodle. The goal was to make something with quality comparable to a good modern wavelet coder, but using a block-based scheme so that it's more compact and simple in memory use so that it will be easy to stream through the SPU and SIMD and all that good stuff, we also wanted an internal floating point core algorithm so that it extends to HDR and arbitrary bit depths. I wrote about it before, see : here or here . That's been done for a while but there were some interesting bits I never wrote about so I thought I'd note them quickly :

1. I did lagrange R-D optimization to do "trellis quantization" (see previous ). There are some nasty things about this though, and it's actually turned off by default. It usually gives you a pretty nice win in terms of RMSE (because it's measuring "D" (distortion) in terms of MSE, so by design it optimizes that for a given rate), but I find in practice that it actually hurts perceptual quality pretty often. By "perceptual" here I just mean my own eyeballing (because as I'll mention later, I found SSIM to be pretty worthless). The problem is that the R-D trellis optimization is basically taking coefficients and slamming them to zero where the distortion cost of doing that is not worth the rate it would take to have that coefficient. In practice what this does is take individual blocks and makes them very smooth. In some cases that's great, because it lets you put more bits where they're more important (for example on images of just human faces it works great because it takes bits away from the interior patches of skin and gives those bits to the edges and eyes and such).

One of the test images I use is the super high res PNG "moses.png" that I found here . Moses is wearing a herring bone jacket. At low bit rates with R-D Trellis enabled, what happens is the coder just starts tossing out entire blocks in the jacket because they are so expensive in terms of rate. The problem with that is it's not uniform. Perceptually the block that gets killed stands out very strongly and looks awful.

Obviously this could be fixed by using a better measure of "D" in the R-D optimization. This is almost a mantra for me : when you design a very aggressive optimizer and just let it run, you better be damn sure you are rating the target criterion correctly, or it will search off into unexpected spaces and give you bad results (even though they optimize exactly the rating that you told it to optimize).

2. It seems DCT-based coders are better than wavelets on very noisy images (eg. images with film grain, or just images with lots of energy in high frequency, such as images of grasses, etc). This might not be true with fancy shape-adaptive wavelets and such, but with normal wavelets the "prior" model is that the image has most of its energy in the smooth bands, and has important high frequency detail only in isolated areas like edges. When you run a wavelet coder at low bit rate, the result is a very smoothed looking version of the image. That's good in most cases, but on the "noisy" class of images, a good modern wavelet coder will actually look worse than JPEG. The reason (I'm guessing) is that DCT coders have those high frequency pattern basis functions. It might get the detail wrong, but at least there's still detail.

In some cases it makes a big difference to specifically inject noise in the decoder. One way to do this is to do a noisey restore of the quantization buckets. That is, coefficient J with quantizer Q would normally restore to Q*J. Instead we restore to something random in the range [ Q*(J-0.5) , Q*(J+0.5) ]. This ensures that the noisey output would re-encode to the same bit stream the decoder saw. I wound up not using this method for various reasons, instead I optionally inject noise directly in image space, using a simplified model of film grain noise. The noise magnitude can be manually specified by the user, or you can have the encoder measure how noisey the original is and compare to the baseline decoder output and see how much energy we lost, and have the noise injector restore that noise level.

To really do this in a rigorous and sophisticated way you should really have location-variable noise levels, or even context-adaptive noise levels. For example, an image of a smooth sphere on a background of static should detect the local neighborhood and only add noise on the staticy background. Exploring this kind of development is very difficult because any noise injection hurts RMSE a lot, and developing new algorithms without any metric to rate them is a fool's errand. I find that in some cases reintroducing noise clearly looks better to my eye, but there's no good metric that captures that.

3. As I mentioned in the earlier posts, lapping just seems to not be the win. A good post process unblocking filter gives you all the win of lapping without the penalties. Another thing I noticed for the first time is that the JPEG perceptual quantization matrix actually has a built-in bias against blocking artifacts. The key thing is that the AC10 and AC01 (the simplest horizontal and vertical ramps) are quantized *less* than the DC. That guarantees that if you have two adjacent blocks in a smooth gradient area, if the DC's quantize to being one step apart, then you will have at least one step of AC10 linear ramp to bridge between them.

If you don't use the funny JPEG perceptual quantization matrix (which I don't think you should) then a good unblocking filter is crucial at low bit rate. The unblocking filter was probably the single biggest perceptual improvement in the entire codec.

4. I also somewhat randomly found a tiny trick that's a huge improvement. We've long noticed that at high quantization you get this really nasty chroma drift problem. The problem occurs when you have adjacent blocks with very similar colors, but not quite the same, and they sit on different sides of quantization boundary, so one block shifts down and the neighbor shifts up. For example with Quantizer = 100 you might have two neighbors with values {49, 51} and they quantize to {0,1} which restores to {0,100} and becomes a huge step. This is just what quantization does, but when you apply quantization separately to the channels of a color (RGB or YUV or whatever), when one of the channels shifts like that, it causes a hue rotation. So rather than seeing a stair step, what you see is that a neighboring block has become a different color.

Now there are a lot of ideas you might have about how to address this. To really attack it thoroughly, you would need a stronger perceptual error metric, in particular one which can measure non-local patterns, which is something we don't have. The ideal perceptual error metric needs to be able to pick up on things like "this is a smooth gradient patch in the source, and the destination has a block that stands out from the others".

Instead we came up with just a simple hack that works amazingly well. Basically what we do is adaptively resize the quantization of the DC component, so that when you are in a smooth region ("smooth" meaning neighboring block DC's are similar to each other), then we use finer quantization bucket sizes. This lets you more accurately represent smooth gradients and avoid the chroma shift. Obviously this hurts RMSE so it's hard to measure the behavior analytically, but it looks amazingly much better to our eyes.

Of course while this is an exciting discovery it's also terrifying. It reminded me how bad our image quality metrics are, and the fact that we're optimizing for these broken metrics means we are making broken algorithms. There's a whole plethora of possible things you could do along this vein - various types of adaptive quantizer sizes, maybe log quantizers? maybe more coarse quantizers in noisy parts of the image? it's impossible to explore those ideas because we have no way to rate them.

As I mentioned previously, this experiment also convinced me that SSIM is just worthless. I know in the SSIM papers they show examples where it is slightly better than RMSE at telling which image is higher quality, but in practice within the context of a DCT-based image coder I find it almost never differs from RMSE; that is, if you do something like R-D optimized quantization of DCT coefficients with Distortion measured by RMSE, you will produce an image that has almost exactly the same SSIM as if you did R-D with D measured by SSIM. If RMSE and SSIM were significantly different, that would not be the case. I say this within the context of DCT-based image coding because obviously RMSE and SSIM can disagree a lot, but that axis of freedom is not explored by DCT image coders. The main thing is that SSIM is really not measuring anything important visual at all. A real visual metric needs to use global/neighborhood information, and knowledge of shapes and what is important about the image. For example, changing a pixel that's part of a perfect edge is way more important than changing an image that's in some noise. Changing a block from grey to pink is way worse than changing a block from green to blue-green, even if it's a smaller value change. etc. etc.

It seems to me there could very easily be massive improvements possible in perceptual quality without any complexity increase that we just can't get because we can't measure it.


08-05-09 - Relacy License Notes

The lock-free code I posted with Relacy has a clarification to the license agreement added. If you have downloaded this please read and make sure you are in compliance. I've copied the added text here :

ADDENDUM ON RELACY LICENSE : (revised 9-14-09)

Relacy is now released under the BSD license :

    1 /*  Relacy Race Detector
    2  *  Copyright (c) 2009, Dmitry S. Vyukov
    3  *  All rights reserved.
    4  *  Redistribution and use in source and binary forms, with or without modification,
    5  *  are permitted provided that the following conditions are met:
    6  *    - Redistributions of source code must retain the above copyright notice,
    7  *      this list of conditions and the following disclaimer.
    8  *    - Redistributions in binary form must reproduce the above copyright notice, this list of conditions
    9  *      and the following disclaimer in the documentation and/or other materials provided with the distribution.
   10  *    - The name of the owner may not be used to endorse or promote products derived from this software
   11  *      without specific prior written permission.
   19  */

My work with Relacy is 100% free for any use. However, the original Relacy license still applies to all work product made with Relacy, such as my code above.

The version of Relacy that I built my code with was released under a previous less clear and restrictive license. Dmitry says the new BSD license applies.


08-04-09 - CINIT

Two questions I can't find answers to :

1. Is there a way to tell from a piece of code that you are being called from cinit ? eg. in C++ when a constructor causes some code to run, and that calls some function, and then I get called, is there anything I can check to see that I'm currently in cinit, not main?

(obviously a very evil thing I could do is run a stack trace and see what's at the top of the stack). I can't find anything in C that I can check, because the C stdlib is initialized before me, so to my cinit code it looks just like I'm in the app run.

The reason I want this is mainly for asserting & validation - I want to make sure that my own cinit code isn't calling certain things (such as memory allocation) so I want to put in checks like ASSERT( ! in_cinit() );

2. Is there a way to disallow cinit code in certain modules? For example, having cinit stuff in any library is very unsafe because you have to be wary that your OBJ could get dropped from the link and the cinit stuff will not be run, plus you have order of run issues, it's something I can handle fine in my own projects, but not something I want to force on clients. So I want to make sure that my actual deliverable libraries have no cinit stuff - but I do want cinit stuff in my test apps, so I don't want to just break it entirely. I'd really like a compiler error so I know I did a booboo right when I write it.


07-07-09 - Small Image Compression Notes , Part 2

Deblocking survey :

There are a few different ways to come at the problem theoretically.

One is to work on post-decode data in spatial domain. These approaches basically work by explicitly trying to detect block edges and just filter them. This is the approach, for example, of the H264 "in loop deblocking filter" which there is a lot of literature on. See for example "Adaptive Deblocking Filter" by List, Joch, et.al. For an example of the filter-based approach on the 8x8 DCT case see "DCT-Based Image Compression using Wavelet-Based Algorithm with Efficient Deblocking Filter" by Yan and Chen. (BTW the JPEG standard contains a "block smoother" which basically predicts AC1 as a linear function from neighboring block coefficients. This is okay for the specific case of smooth images and very high quantization, but is generally not awesome and is an ancient technique. Ignore.)

A more hardcore version of the filtering approach is "Combined Frequency and Spatial Domain Algorithm for the Removal of Blocking Artifacts" which does adaptively-offset and adaptively-directed gaussian filters ; this is sort of like the image denoising stuff that creates pixel gradient flow vectors - the filters are local gradient adaptive so they don't go across real edges. This appears to perform quite well but is very expensive.

The other general approach is a more abstract maximum-likelihood idea. You received a lossy compressed image I. You know the original image was one of the many which when compressed produces I. You want to output the one that was most likely the true original image. This is a maximum likelihood problem, and requires some a-priori model of what you think "natural" images look like. In particular, for the case of quantized DCT coefficients, you have a quantized DCT coefficient C ; instead of just reproducing Q*C you can reproduce anything in the range { Q*C - Q/2 , Q*C + Q/2 } , and you should choose the thing in that range that makes the "best" image.

"Optimal JPEG Decoding" (1998) by Jung, Antonini, Barlaud takes this approach directly. Their results are not awesome though; presumably because their prior is not good. A more modern version of the same idea is "Block Artifact Reduction Using a Transform-Domain Markov Random Field Model" by Li and Delp which uses a better model for image likelihood, but is in the same vein of doing a brute force search in the allowed coefficient space to find the maximum-likelihood reproduction.

A related method that was popular for a while is "Projection onto Convex Sets". This is basically just a method of satisfying simple convex constraints in an optimization. Here our constraint is that the quantized coefficient stay the same, that is, repro in { Q*C - Q/2 , Q*C + Q/2 } . You then apply some target function, such as you want smoothness or something, and take iterative steps towards that goal and project onto the constraints one by one. There are a lot more details to this, I haven't paid too much attention to it because these are all crazy expensive and I want something realtime.

"Blocking Artifact Detection and Reduction in Compressed Data" by Triantafyllidis etal (2002) is in the same vein but simpler and more analytical. It again worse directly in DCT space on coefficients within their quantization range, but it directly solves for the ideal reconstruction value as a function of neighbors based on minimization of specific simple deblocking metric. You wind up with just some equations for how to modify each coefficient in terms of neighbor coefficients. While the paper is good, I think one of their base assumptions - that the frequencies can be dealt with independently - is not sound, and most other people do not make that assumption.

"Derivation of Prediction Equations for Blocking Effect Reduction" by Gopal Lakhani and Norman Zhong (1999) is an older, simpler still version of the Triantafyllidis paper. They only correct the first few coefficients and solve for optimal reconstruction to minimize MSDS (mean squared difference of slopes). You can actually look at the equations here and they're very intuitively obviously right. For example, the first AC coefficient should be corrected using the difference of the neighboring DC coefficients. In case you don't see that that's obviously right, if you have DC's like [8],[16],[24] after dequantization at Q=8, and your AC's all got quantized to zero, obviously the original image most likely had a smooth slope, so the first AC in the middle block should be predicted to be the linear interpolation.

An interesting one I found that's related to the stuff I tried with smooth reconstruction of the DC band is : "Improvement of DCT-based Compression Algorithms Using Poisson�s Equation" by Yamatani and Saito (2006) .

BTW a related issue that often comes up is the incorrectness of center dequantization of AC coefficients. I've written about this before and lots of these papers mention it; the best full note on it is : "Biased Reconstruction for JPEG Decoding" by Price.

The very modern stuff has gotten quite arcane. People now are doing things like directional overcomplete wavelets on the reproduced image; with this they can detect both block artifacts and also ringing and other quantized transform artifacts. They then use maximum-likelihood markov models to guess what the source image was that produced this output. This stuff is extremely complex and I haven't really followed it because it's nowhere near realtime, but probably the best solution for offline very high quality JPEG decoders.

An interesting outlier is John Costella's Unblock . It's based on a clever simple idea that I've never seen anywhere else. Unblock is based on the assumption that pixels near the block boundaries come from the same model as pixels in the centers of blocks. That sounds obvious but it's quite profound. It means that pixels near the edges of blocks should have the same statistics as pixels in the centers (in the maximum likelihood lingo, this is a prior we can use to choose an optimal output). In particular, it's useful because in the DCT the interior pixels are much more accurate than the edge pixels. What Unblock does is looks at the statistics of the decompressed interior pixels and assumes those are our goal, and then it forces the pixels near the edge to match the statistics of the interior. The corrections are applied as wide smooth filters.


07-06-09 - Small Image Compression Notes

Lapping appears to be a complete red herring. I've wasted a lot of time on it and I'm very angry. I've been trying to work up a lapped block DCT image coder. The idea is that block-DCT-based is good for speed and parallelization for micro-core architectures, good for memory bandwidth, etc. and the lapping theoretically lets you avoid some of the nasty block artifacts by effectively extending your basis functions.

In practice it just doesn't work. I've tried lots of different lapping methods, and in all of them if I make a parameterized lap amount based on a kaiser-bessel-derived window and then tweak the lap amount to maximize SSIM, it tunes to no lapping at all. Basically what's happening is that the extra bit rate cost caused by the forward lap scrambling things up is too great for the win of smoother basis functions on decompress to make up. Obviously in a few contrived cases it does help, such as on very smooth images at very high compression. (of course the large lap basis functions are a form of modeling - they will help any time the image is smooth over the larger area, and hurt when it is not).

The really silly thing about this is that areas where the image is very smooth over a large area are the cases we already handle very well!! Yeah sure naive JPEG looks awful, but even a deblocking filter after decompress can fix that case very easily. In areas that aren't smooth, lapping actually makes artifacts like ringing worse.

The other issue is I'm having a little trouble with lagrange bitstream optimization. Basically my DCT block coder does a form of "trellis quantization" (which I wrote about before) where it can selectively zero coefficients if it decides it gets an R/D win by doing so. Obviously this gives you a nice RMSE win at a given rate (by design it does so - any time it finds a coefficient to zero, it steps up the R/D slope). But what does this actually do?

Think about trying to make the best bit stream for a given rate. Say two bits per pixel. If we don't do any lagrange optimization at all, we might pick some quantizer, say Q = 16. Now we turn on lagrange optimization, it finds some coefficients to zero, that reduces the bit rate, so to get back to the target bit rate, we can use a lower quantizer. It searches for the right lagrange lambda by iterating a few times and we wind up with something like Q = 12 , and some values zeroed, and a better RMSE. What's happened is we got to use a lower quantizer, so we made more, larger, nonzero coefficients, and then we selectively zeroed a few that took the most R/D.

But what does this actually do to the image qualitatively? What it does is increase the quality everywhere (Q =16 goes to Q=12) , but then it stomps on the quality in a few isolated spots (trellis quantization zeros some coefficients). If you compare the two images, the lagrange optimized one looks better everywhere, but then is very smooth and blurred out in a few spots. Normally this is not a big deal and it's just a win, but sometimes I've found it actually looks really awful.

Even if you optimize for some perceptual metric like SSIM it doesn't detect how bad this is, because SSIM is still a local measurement and this is a nonlocal artifact. Your eyes very quickly pick out that part of the image has been blurred way more than the rest of it. (in other cases it does the same thing, but it's actually good; it sort of acts like a bilateral filter actually, it will give bits to the high contrast edges and kill coefficients in the texture part, so for like images of skin it does a nice job of keeping the edges sharp and just smoothing out the interior, as opposed to non-lagrange-optimized JPEG which allocates bits equally and will preserve the skin pore detail and make the edges all ringy and chopped up).

I guess the fix to this is some hacky/heuristic way to just force the lagrange optimization not to be too aggressive.

I guess this is also an example of a computer problem that I've observed many times in various forms : when you let a very aggressive optimizer run wild seeking some path to maximize some metric, it will do so, and if your metric does not perfectly measure exactly the thing that you actually want to optimize, you can get some very strange/bad results.


06-22-09 - Redraw Dilemma

This apartment searching is really annoying me. I can't handle having "many balls in the air" ; when I put something on my todo list, I like to work at it until it's gone. God I fucking hate shit on my todo list (the fucking health care keeps reinserting itself on my todo list and it's pissing me off; they got me again today with some billing fuckup, but I digress...).

Anyway, it's reminding me of a concept I often think about. I'll call it "the redrawer's dilemma" but there must be a better/standard name for this.

The hypothetical game goes something like this :

You are given a bag with 100 numbers in it. You know the numbers are in [0,1000] but don't know how many of each number there are in the bag. You start by drawing a random number from the bag.

At each turn of play, you can either keep your current number (in which case that is your final score), or you can put your current number back in the bag and draw again, but drawing again costs you -1 that will be subtracted from your final score.

How do you play this game optimally?

There are two things that are interesting to me about this game in real life. One is that humans almost always play it incredibly badly, and the second is that when you finally decide to stop redrawing you're almost always unhappy about it (unless you got super lucky and draw a 900+ number).

The two classic human player errors in this game are the "I just started drawing, I shouldn't stand yet" and the "I can't stop now, I already passed on something better than this". The "I just started drawing, I shouldn't stand yet" guy draws something like an 800 on one of his early draws. He thinks dang that's really good, but maybe this bag just has lots of high numbers in it, I just started drawing, I should put some time into it. Now of course that reasoning is based in correct logic - if you have reason to believe that your chance of drawing higher is good enough to merit the cost of continued looking, then yes, do so, but just drawing more because "it's early" makes no sense - the game is totally non temporal, the cost of continuing drawing doesn't go up over time. This often leads into the "I can't stop now, I already passed on something better than this" guy, who's mainly motivated by pride and shame - he doesn't want to admit to himself that he made a big mistake passing early when he got a high number, so he has to keep drawing until he gets something better. He might draw an 800, then a whole mess of single digit numbers and he's thinking "oh fuck I blew it" and then he draws a 400. At that point he should stand and quit redrawing, but he can't, so he draws again.

The thing is, even if he played correctly and just took the 400 after passing on the 800, he would be really unhappy about. And if the early termination guy played correctly and just got an early 800 and didn't draw, he would be unhappy too, because he'd always be wondering if he could've done better.

The other game theory / logical fallacy that plagues me in these kind of things is "I'm already spending X I may as well spend X". First I was looking for places around $1500, then I bumped it to $1700, then $1900. Now I'm looking at places for $2500 cuz fuck it they're nicer and I was looking at places for $2000 so it's only $500 more.

In other news, hotpads is actually a pretty cool apartment search site. It seems they are just scraping craigslist and maybe some other classifieds sites, so it's not like they have anything new, but the map interface and search features and such are solid. One thing is really annoying me about it though - the wheel zooming in the map is totally broken, I keep trying to wheel zoom and it sends the map off the never never land. Urg!

In more random news, I've really enjoyed the "Wallander" series on PBS ; the crime stories are pretty dumb/ridiculous, but I like the muddled contemplative pace of it, and the washed out monochrome color palette.


06-21-09 - Fast Exp & Log

So in an earlier post I wrote about approximation of log2 and Ryg commented with links to Robin Green's great GDC 2003 talk : part1 (pdf) and part2 (pdf) ( main page here ).

It's mostly solid, but in part 2 around page 40 he talks about "fastexp" and "bitlog" and my spidey senses got tingling. Either I don't understand, or he was just smoking crack through that section.

Let's look at "bitlog" first. Robin writes it very strangely. He writes :

A Mathematical Oddity: Bitlog
  A real mathematical oddity
  The integer log2 of a 16 bit integer
  Given an N-bit value, locate the leftmost nonzero bit.
  b = the bitwise position of this bit, where 0 = LSB.
  n = the NEXT three bits (ignoring the highest 1)

    bitlog(x) = 8x(b-1) + n

  Bitlog is exactly 8 times larger than log2(x)-1

Bitlog Example
 For example take the number 88
88 = 1011000
b = 6th bit
n = 011 = 3
bitlog(88) = 8*(6-1)+3
= 43
  (43/8)+1 = 6.375
  Log2(88) = 6.4594
  This relationship holds down to bitlog(8)

Okay, I just don't follow. He says it's "exact" but then shows an example where it's not exact. He also subtracts off 1 and then just adds it back on again. Why would you do this :

    bitlog(x) = 8x(b-1) + n

  Bitlog is exactly 8 times larger than log2(x)-1

When you could just say :

    bitlog(x) = 8xb + n

  Bitlog is exactly 8 times larger than log2(x)

??? Weird.

Furthermore this seems neither "exact" nor an "oddity". Obviously the position of the MSB is the integer part of the log2 of a number. As for the fractional part of the log2, this is not a particular good way to get it. Basically what's happening here is he takes the next 3 bits and uses them for linear interpolation to the next integer.

Written out verbosely :

x = int to get log2 of
b = the bitwise position of top bit, where 0 = LSB.

x >= (1 << b) && x < (2 << b)

fractional part :
f = (x - (1 << b)) / (1 << b)

f >= 0 && f < 1

x = 2^b * (1 + f)

correct log2(x) = b + log2(1+f)

approximate with b + f

note that "f" and "log2(1+f)" both go from 0 to 1, so it's exact at the endpoints
but wrong in the middle

So far as I can tell, Robin's method is actually like this :

uint32 bitlog_x8(uint32 val)
    if ( val <= 8 )
        static const uint32 c_table[9] = { (uint32)-1 , 0, 8, 13, 16, 19, 21, 22, 24 };
        return c_table[val];
        unsigned long index;
        _BitScanReverse(&index,(unsigned long)val);
        ASSERT( index >= 3 );
        uint32 bottom = (val >> (index - 3)) & 0x7;
        uint32 blog = (index << 3) | bottom;

        return blog;

where I've removed the weird offsets of 1 and this just returns log2 times 8. You need the check for val <= 8 because shifting by negative amounts is fucked.

But you might well ask - why only use 3 bits ? And in fact you're right, I see no reason to use only 3 bits. In fact we can do a fixed point up to 27 bits : (we need to save 5 bits at the top to store the max possible integer part of the log2)

float bitlogf(uint32 val)
    unsigned long index;
    _BitScanReverse(&index,(unsigned long)val);

    uint32 vv = (val << (27 - index)) + ((index-1) << 27);

    return vv * (1.f/134217728); // 134217728 = 2^27

what we've done here is find the pos of the MSB, shift val up so the MSB is at bit 27, then we add the index of the MSB (we subtract one because the MSB it self starts the counting at one in the 27th bit pos). This makes a fixed point value with 27 bits of fractional part, the bits below the MSB act as the fractional bits. We scale to return a float, but you could of course do this with any # of fixed point bits and return a fixed point int.

But of course this is exactly the same kind of thing done in an int-to-float so we could use that too :

float bitlogf2(float fval)
    FloatAnd32 fi;
    fi.f = fval;
    float vv = (float) (fi.i - (127 << 23));
    return vv * (1.f/8388608); // 8388608 = 2^23

which is a lot like what I wrote about before. The int-to-float does the exact same thing we did manually above, finding the MSB and making the log2 and fractional part.

One note - all of these versions are exact for the true powers of 2, and they err consistently low for all other values. If you want to minimize the maximum error, you should bias them.

The maximum error of ( log2( 1 + f) - f ) occurs at f = ( 1/ln(2) - 1 ) = 0.442695 ; that error is 0.08607132 , so the correct bias is half that error : 0.04303566

Backing up in Robin's talk we can now talk about "fastexp". "fastexp" is doing "e^x" by using the floating point format again, basically he's just sticking x into the exponent part to get the int-to-float to do the 2^x. To make it e^x instead of 2^x you just scale x by 1/ln(2) , and again we use the same trick as with bitlog : we can do exact integer powers of two, to get the values in between we use the fractional bits for linear interpolation. Robin's method seems sound, it is :

float fastexp(float x)
    int i = ftoi( x * 8.f );
    FloatAnd32 f;
    f.i = i * 1512775 + (127 << 23) - 524288;
    // 1512775 = (2^20)/ln(2)
    // 524288 = 0.5*(2^20)

    return f.f;

for 3 bits of fractional precision. (note that Robin says to bias with 0.7*(2^20) ; I don't know where he got that; I get minimum relative error with 0.5)).

Anyway, that's all fine, but once again we can ask - why just 3 bits? Why not use all the bits of x as fractional bits? And if we put the multiply by 1/ln(2) in the float math before we convert to ints, it would be more accurate.

What we get is :

float fastexp2(float x)
    // 12102203.16156f = (2^23)/ln(2)
    int i = ftoi( x * 12102203.16156f );
    FloatAnd32 f;
    f.i = i + (127 << 23) - 361007;
    // 361007 = (0.08607133/2)*(2^23)

    return f.f;

and indeed this is much much more accurate. (max_rel_err = 0.030280 instead of 0.153897 - about 5X better).

I guess Robin's fastexp is preferrable if you already have your "x" in a fixed point format with very few fractional bits (3 bits in that particular case, but it's good for <= 8 bits). The new method is preferred if you have "x" in floating point or if "x" is in fixed point with a lot of fractional bits (>= 16).


I found the Google Book where bitlog apparently comes from; it's Math toolkit for real-time programming By Jack W. Crenshaw ; so far as I can tell this book is absolute garbage and that section is full of nonsense and crack smoking.


it's obvious that log2 is something like :

x = 2^I * (1+f)

(I is an int, f is the mantissa)

log2(x) = I + log2(1+f)

log2(1+f) = f + f * (1-f) * C

We've been using log2(1+f) ~= f , but we know that's exact at the ends and wrong in the middle
so obvious we should add a term that humps in the middle.

If we solve for C we get :

C = ( log2(1+x) - x ) / x*(1-x)

Integrating on [0,1] gives C = 0.346573583

hence we can obviously do a better bitlog something like :

float bitlogf3(float fval)
    FloatAnd32 fi;
    fi.f = fval;
    float vv = (float) (fi.i - (127<<23));
    vv *= (1.f/8388608);
    //float frac = vv - ftoi(vv);
    fi.i = (fi.i & 0x7FFFFF) | (127<<23);
    float frac = fi.f - 1.f;
    const float C = 0.346573583f;
    return vv + C * frac * (1.f - frac);


06-17-09 - Inverse Box Sampling - Part 1.5

In the previous post we attacked the problem :

If you are given a low res signal L and a known down-sampler D() (in particlar, box down sampling), find an up sampler U() such that :

L = D ( U( L ) )

and U( L ) is as close as possible to the actual high res signal that L was made from (unknown).

I'm also interested in the opposite problem :

If you are given a high res signal H, and a known up-sampler U() (in particular, bilinear filtering), find a down sampler D() such that :

E = ( H - U( D( H ) ) )^2 is minized

This is a much more concrete and tractable problem. In particular in games/3d we know we are forced to use bilinear filtering as our up-sampler. If you use box down-sampling for D() as many people do, that's horrible, because bilinear filtering and box-downsampling are both interpolating and variance reducing. That, they both take noisey signals and force them towards gray. If you know that U() is going to be bilinear filtering, then you should use a D() that compensates for that. It's intuitively obvious that D should be something a bit like a sinc to bring in some neighbors with negative lobes to compensate for the blurring aspect of bilinear upsample, but what exactly I don't know yet.

(note that this is a different problem than making mips - in making mips you are actually going to be viewing the mip at a 1:1 resolution, it will not be upsampled back to the original resolution; you would use this if you were trying to substitute a lower res texture for a higher one).

I haven't tried my hand at solving this yet, maybe it's been done? Much like the previous problem, I'm surprised this isn't something well known and standard, but I haven't found anything on it.

06-17-09 - DXTC More Followup

I finally came back to DXTC and implemented some of the new slightly different techniques. ( summary of my old posts )

See the : NVidia Article or NVTextureTools Wiki for details.

Briefly :

DXT1 = my DXT1 encoder with annealing. (version reported here is newer and has some more small improvements; the RMSE's are slightly better than last time). DXT1 is 4 bits per pixel (bpp)

Humus BC4BC5 = Convert to YCoCg, Put Y in a single-channel BC4 texture (BC4 = the alpha part of DXT5, it's 4 bpp). Put the CoCg in a two-channel BC5 texture - downsampled by 2X. BC5 is two BC4's stuck together; BC5 is 8 bpp, but since it's downsampled 2x, this is 2bpp per original pixel. The net is a 6 bpp format

DXT5 YCoCg = the method described by JMP and Ignacio. This is 8 bpp. I use arbitrary CoCg scale factors, not the limited ones as in the previously published work.

Here are the results in RMSE : (modified 6-19 with new better results for Humus from improved down filter)

name DXT1 Humus DXT5 YCoCg
kodim01.bmp 8.2669 3.9576 3.8355
kodim02.bmp 5.2826 2.7356 2.643
kodim03.bmp 4.644 2.3953 2.2021
kodim04.bmp 5.3889 2.5619 2.4477
kodim05.bmp 9.5739 4.6823 4.5595
kodim06.bmp 7.1053 3.4543 3.2344
kodim07.bmp 5.6257 2.6839 2.6484
kodim08.bmp 10.2165 5.0581 4.8709
kodim09.bmp 5.2142 2.519 2.4175
kodim10.bmp 5.1547 2.5453 2.3435
kodim11.bmp 6.615 3.1246 2.9944
kodim12.bmp 4.7184 2.2811 2.1411
kodim13.bmp 10.8009 5.2525 5.0037
kodim14.bmp 8.2739 3.9859 3.7621
kodim15.bmp 5.5388 2.8415 2.5636
kodim16.bmp 5.0153 2.3028 2.2064
kodim17.bmp 5.4883 2.7981 2.5511
kodim18.bmp 7.9809 4.0273 3.8166
kodim19.bmp 6.5602 3.2919 3.204
kodim20.bmp 5.3534 3.0838 2.6225
kodim21.bmp 7.0691 3.5069 3.2856
kodim22.bmp 6.3877 3.5222 3.0243
kodim23.bmp 4.8559 3.045 2.4027
kodim24.bmp 8.4261 5.046 3.8599
clegg.bmp 14.6539 23.5412 10.4535
FRYMIRE.bmp 6.0933 20.0976 5.806
LENA.bmp 7.0177 5.5442 4.5596
MONARCH.bmp 6.5516 3.2012 3.4715
PEPPERS.bmp 5.8596 4.4064 3.4824
SAIL.bmp 8.3467 3.7514 3.731
SERRANO.bmp 5.944 17.4141 3.9181
TULIPS.bmp 7.602 3.6793 4.119
lena512ggg.bmp 4.8137 2.0857 2.0857
lena512pink.bmp 4.5607 2.6387 2.3724
lena512pink0g.bmp 3.7297 3.8534 3.1756
linear_ramp1.BMP 1.3488 0.8626 1.1199
linear_ramp2.BMP 1.2843 0.7767 1.0679
orange_purple.BMP 2.8841 3.7019 1.9428
pink_green.BMP 3.1817 1.504 2.7461

And here are the results in SSIM :

Note this is an "RGB SSIM" computed by doing :

SSIM_RGB = ( SSIM_R * SSIM_G ^2 * SSIM_B ) ^ (1/4)

That is, G gets 2X the weight of R & B. The SSIM is computed at a scale of 6x6 blocks which I just randomly picked out of my ass.

I also convert the SSIM to a "percent similar". The number you see below is a percent - 100% means perfect, 0% means completely unrelated to the original (eg. random noise gets 0%). This percent is :

SSIM_Percent_Similar = 100.0 * ( 1 - acos( ssim ) * 2 / PI )

I do this because the normal "ssim" is like a dot product, and showing dot products is not a good linear way to show how different things are (this is the same reason I show RMSE instead of PSNR like other silly people). In particular, when two signals are very similar, the "ssim" gets very close to 0.9999 very quickly even though the differences are still pretty big. Almost any time you want to see how close two vectors are using a dot product, you should do an acos() and compare the angle.

name DXT1 Humus DXT5 YCoCg
kodim01.bmp 84.0851 92.6253 92.7779
kodim02.bmp 82.2029 91.7239 90.5396
kodim03.bmp 85.2678 92.9042 93.2512
kodim04.bmp 83.4914 92.5714 92.784
kodim05.bmp 83.6075 92.2779 92.4083
kodim06.bmp 85.0608 92.6674 93.2357
kodim07.bmp 85.3704 93.2551 93.5276
kodim08.bmp 84.5827 92.4303 92.7742
kodim09.bmp 84.7279 92.9912 93.5035
kodim10.bmp 84.6513 92.81 93.3999
kodim11.bmp 84.0329 92.5248 92.9252
kodim12.bmp 84.8558 92.8272 93.4733
kodim13.bmp 83.6149 92.2689 92.505
kodim14.bmp 82.6441 92.1501 92.1635
kodim15.bmp 83.693 92.0028 92.8509
kodim16.bmp 85.1286 93.162 93.6118
kodim17.bmp 85.1786 93.1788 93.623
kodim18.bmp 82.9817 92.1141 92.1309
kodim19.bmp 84.4756 92.7702 93.0441
kodim20.bmp 87.0549 90.5253 93.2088
kodim21.bmp 84.2549 92.2236 92.8971
kodim22.bmp 82.6497 91.0302 91.9512
kodim23.bmp 84.2834 92.4417 92.4611
kodim24.bmp 84.6571 92.3704 93.2055
clegg.bmp 77.4964 70.1533 83.8049
FRYMIRE.bmp 91.3294 72.2527 87.6232
LENA.bmp 77.1556 80.7912 85.2508
MONARCH.bmp 83.9282 92.5106 91.6676
PEPPERS.bmp 81.6011 88.7887 89.0931
SAIL.bmp 83.2359 92.4974 92.4144
SERRANO.bmp 89.095 75.7559 90.7327
TULIPS.bmp 81.5535 90.8302 89.6292
lena512ggg.bmp 86.6836 95.0063 95.0063
lena512pink.bmp 86.3701 92.1843 92.9524
lena512pink0g.bmp 89.9995 79.9461 84.3601
linear_ramp1.BMP 92.1629 94.9231 93.5861
linear_ramp2.BMP 92.8338 96.1397 94.335
orange_purple.BMP 89.0707 91.6372 92.1934
pink_green.BMP 87.4589 93.5702 88.4219

Conclusion :

DXT5 YCoCg and "Humus" are both significantly better than DXT1.

Note that DXT5-YCoCg and "Humus" encode the luma in exactly the same way. For gray images like "lena512ggg.bmp" you can see they produce identical results. The only difference is how the chroma is encoded - either a DXT1 block (+scale) at 4 bpp, or a downsampled 2X BC4 block at 2 bpp.

In RGB RMSE , DXT5-YCoCg is measurably better than Humus-BC4BC5 , but in SSIM they are are nearly identical. This is because almost all of the RMSE loss in Humus comes from the YCoCg lossy color conversion and the CoCg downsampling. The actual BC4BC5 compression is very near lossless. (as much as I hate DXT1, I really like BC4 - it's very easy to produce near optimal output, unlike DXT1 where you have to run a really fancy compressor to get good output). The CoCg loss hurts RMSE a lot, but doesn't hurt actual visual quality or SSIM much in most cases.

In fact on an important class of images, Humus actually does a lot better than DXT5-YCoCg. That class is simple smooth ramp images, which we use very often in the form of lightmaps. The test images at the bottom of the table (linear_ramp and pink_green) show this.

On a few images where the CoCg downsample kills you, Humus does very badly. It's bad on orangle_purple because that image is specifically designed to be primarily in Chroma not Luma ; same for lena512pink0g.bmp ; note that normal chroma downsampling compressors like JPEG have this same problem. You could in theory choose a different color space for these images and use a different reconstruction shader.

Since Humus is only 6 bpp, size is certainly not a reason to prefer DXT1 over it. However, it does require two texture fetches in the shader, which is a pretty big hit. (BTW the other nice thing about Humus is that it's already down-sampled in CoCg, so if you are using something like a custom JPEG in YCoCg space with downsampled CoCg - you can just directly transcode that into Humus BC4BC5, and there's no scaling up or down or color space changes in the realtime recompress). I think this is probably what will be in Oodle because I really can't get behind any other realtime recompress.

I also tried something else, which is DXT1 optimized for SSIM. The idea is to use a little bit of neighbor information. The thing is, in my crazy DXT1 encoder, I'm just trying various end points and measuring the quality of each choice. The normal thing to do it to just take the MSE vs the original, but of course you could do other error metrics.

One such error metric is to decompress the block you're working on into its context - decompress into a chunk of neighbors that have already been DXT1 compressed & decompressed as well. Then compare that block and its neighbors to the original image in that neighborhood. In my case I used 2 pixels around the block I was working on, making a total region of 8x8 pixels (with the 4x4 DXT1 block in the middle).

You then compare the 8x8 block to the original image and try to optimize that. If you just used MSE in this comparison, it would be the same as before, but you can use other things. For example, you could add a term that penalizes not changes in values, but changes in *slope*.

Another approach would be to take the DCT of the 8x8 block and the DCT of the 8x8 original. If you then just take the L2 difference in DCT domain, that's no different than the original method, because the DCT is unitary. But you can apply non-uniform quantizers at this step using the JPEG visual quantization weights.

The approach I used was to use SSIM (using a 4x4 SSIM block) on the 8x8 windows. This means you are checking the error not just on your block, but on how your block fits into the neighborhood.

For example if the original image is all flat color - you want the output to be all flat color. Just using MSE won't give you that, eg. MSE considers 4444 -> 3535 to be just as good as 4444 -> 5555 , but we know the latter is better.

This does in fact produce slightly better looking images - it hurts RMSE of course because you're no longer optimizing for RMSE.

06-17-09 - Inverse Box Sampling - Part 2

Okay, in Part 1.5 I asked about the downsample that was the best inverse of bilinear upsampling. I have a solution that pleases me.

Sean reminded me that he tackled this before; I dunno if he has any notes about it on the public net, he can link them. His basic idea was to do a full solve for the entire down-sampled image. It's quite simple if you think about. Consider the case of 2X up & down sampling. The bilinear filter upsample will make a high res image where each pixel is a simple linear combo of 4 low res. You take the L2 error :

E = Sum[all high res pixel] ( Original - Upsampled ) ^2

For Sean's full solution approach, you set Upsampled = Bilinear_Upsample( X) , and just solve this for X without any assumption of how X is made from Original. For an N-pixel low res image you have 4N error terms, so it's plenty dense (you could also artificially regularize it more by starting with a low res image that's equal to the box down-sample, and then solve for the deltas from that, and add an extra "Tikhonov" regularization term that presumes small deltas - this would fix any degenerate cases).

I didn't do that. Instead I assumed that I want a discrete local linear filter and solved for what it should be.

A discrete local linear filter is just a bunch of coefficients. It must be symmetric, and it must sum to 1.0 to be mean-preserving (flat source should reproduce flat exactly). Hence it has the form {C2,C1,C0,C0,C1,C2} with C0+C1+C2 = 1/2. (this example has two free coefficients). Obviously the 1-wide case must be {0.5,0.5} , then you have {C1,0.5-C1,0.5-C1,C1} etc. as many taps as you want. You apply it horizontally and then vertically. (in general you could consider asymetric filters, but I assume H & V use the same coefficients).

A 1d application of the down-filter is like :

L_n = Sum[k] { C_k * [ H_(2*n-k) + H_(2*n+1+k) ] }

That is : Low pixel n = filter coefficients times High res samples centered at (2*n * 0.5) going out both directions.

Then the bilinear upsample is :

U_(2n) = (3/4) * L_n + (1/4) * L_(n-1)

U_(2n+1) = (3/4) * L_n + (1/4) * L_(n+1)

Again we just make a squared error term like the above :

E = Sum[n] ( H_n - U_n ) ^2

Substitute the form of L_n into U_n and expand so you just have a matrix equation in terms of H_n and C_k. Then do a solve for the C_k. You can do a least-squares solve here, or you can just directly solve it because there are generally few C's (the matrix is # of C's by # of pixels).

Here's how the error varies with number of free coefficients (zero free coefficients means a pure box downsample) :

r:\>bmputil mse lenag.256.bmp bilinear_down_up_0.bmp  rmse : 15.5437 psnr : 24.3339

r:\>bmputil mse lenag.256.bmp bilinear_down_up_1.bmp  rmse : 13.5138 psnr : 25.5494

r:\>bmputil mse lenag.256.bmp bilinear_down_up_2.bmp  rmse : 13.2124 psnr : 25.7454

r:\>bmputil mse lenag.256.bmp bilinear_down_up_3.bmp  rmse : 13.0839 psnr : 25.8302
you can see there's a big jump from 0 to 1 but then only gradually increasing quality after that (though it does keep getting better as it should).

Two or three free terms (which means a 6 or 8 tap filter) seems like the ideal width to me - wider than that and you're getting very nonlocal which means ringing and overfitting. Optimized on all my test images the best coefficients I get are :

// 8 taps :

static double c_downCoef[4] = { 1.31076, 0.02601875, -0.4001217, 0.06334295 };

// 6 taps :

static double c_downCoef[3] = { 1.25 , 0.125, - 0.375 };

(the 6-tap one was obviously so close to those perfect fractions that I just manually rounded it; I assume that if I solved this analytically that's what I would get. The 8-tap one is not so obvious to me what it would be).

Now, how do these static ones compare to doing the lsqr fit to make coefficients per image ? They're 99% of the benefit. For example :

// solve :
lena.512.bmp : doing solve exact on 3 x 524288
{ 1.342242526 , -0.028240414 , -0.456030369 , 0.142028257 }  // rmse : 10.042138

// static fit :
lena.512.bmp :  // rmse : 10.116388


// static fit :
clegg.bmp :  // rgb rmse : 50.168 , gray rmse : 40.506

// solve :
fitting : clegg.bmp : doing lsqr on 3 x 1432640 , c_lsqr_damping = 0.010000
{ 1.321164423 , 0.002458499 , -0.381711250 , 0.058088329 }  // rgb rmse : 50.128 , gray rmse : 40.472

So it seems to me this is in fact a very simple and high quality way to down-sample to make the best reproduction after bilinear upsampling.

I'm not even gonna touch the issue of the [0,255] range clamping or the fact that your low res image should actually be considered discrete, not continuous.

ADDENDUM : it just occured to me that you might do the bilinear 2X upsampling using offset-taps instead of centered taps. That is, centered taps reconstruct like :

+---+    +-+-+
|   |    | | |
|   | -> +-+-+
|   |    | | |
+---+    +-+-+

That is, the area of four high res pixels lies directly on one low res pixel. Offset taps do :

+---+     | |
|   |    -+-+-
|   | ->  | |
|   |    -+-+-
+---+     | |

that is, the center of a low res pixel corresponds directly to a high res pixel.

With centered taps, the bilinear upsample weights in 1d are always (3/4,1/4) then (1/4,3/4) , (so in 2d they are 9/16, etc.)

With offset taps, the weights in 1d are (1) (1/2,1/2) (1) etc... that is, one pixel is just copied and the tweeners are averages.

Offset taps have the advantage that they aren't so severely variance decreasing. Offset taps should use a single-center down-filter of the form :


(instead of {C2,C1,C0,C0,C1,C2} ).

My tests show single-center/offset up/down is usually slightly worse than symmetric/centered , and occasionally much better. On natural/smooth images (such as the entire Kodak set) it's slightly worse. Picking one at random :

symmetric :
kodim05.bmp : { 1.259980122 , 0.100375561 , -0.378468204 , 0.018112521 }   // rmse : 25.526521

offset :
kodim05.bmp : { 0.693510045 , 0.605009745 , -0.214854612 , -0.083665178 }  // rgb rmse : 26.034 

that pattern holds for all. However, on weird images it can be better, for example :

symmetric :
c:\src\testproj>Release\TestProj.exe t:\test_images\color\bragzone\clegg.bmp f
{ 1.321164423 , 0.002458499 , -0.381711250 , 0.058088329 }  // rgb rmse : 50.128 , gray rmse : 40.472

offset :
c:\src\testproj>Release\TestProj.exe t:\test_images\color\bragzone\clegg.bmp f
{ 0.705825115 , 0.561705835 , -0.267530949 }  // rgb rmse : 45.185 , gray rmse : 36.300

so ideally you would choose the best of the two. If you're decompressing in a pixel shader you need another parameter for whether to offset your sampling UV's by 0.5 of a pixel or not.

ADDENDUM : I got Humus working with a KLT color transform. You just do the matrix transform in the shader after fetching "YUV" (not really YUV any more). It helps on the bad cases, but still doesn't make it competitive. It's better just to go with DXT1 or DXT5-YCoCg in those cases. For example :

On a pure red & blue texture :

Humus YCoCg :

rmse : 11.4551 , psnr : 26.9848
ssim : 0.9529 , perc : 80.3841%

Humus KLT with forced Y = grey :

KLT : Singular values : 56.405628,92.022781,33.752548
 KLT : 0.577350,0.577350,0.577350
 KLT : -0.707352,0.000491,0.706861
 KLT : 0.407823,-0.816496,0.408673

rmse : 11.4021 , psnr : 27.0251
ssim : 0.9508 , perc : 79.9545%

Humus KLT  :

KLT : Singular values : 93.250313,63.979282,0.230347
 KLT : -0.550579,0.078413,0.831092
 KLT : -0.834783,-0.051675,-0.548149
 KLT : -0.000035,-0.995581,0.093909

rmse : 5.6564 , psnr : 33.1140
ssim : 0.9796 , perc : 87.1232%

(note the near perfect zero in the last singular value, as it should be)

DXT1 :

rmse : 3.0974 , psnr : 38.3450
ssim : 0.9866 , perc : 89.5777%

DXT5-YCoCg :

rmse : 2.8367 , psnr : 39.1084
ssim : 0.9828 , perc : 88.1917%

So, obviously a big help, but not enough to be competitive. Humus also craps out pretty bad on some images that have single pixel checkerboard patterns. (again, any downsampling format, such as JPEG, will fail on the same cases). Not really worth it to mess with the KLT, better just to support one of the other formats as a fallback.

One thing I'm not sure about is just how bad the two texture fetches is these days.


06-16-09 - Inverse Box Sampling

A while ago I posed this problem to the world :

Say you are given the box-downsampled version of a signal (I may use "image" and "signal" interchangeably cuz I'm sloppy). Box-downsampled means groups of N values in the original have been replaced by the average in that group and then downsampled N:1. You wish to find an image which is the same resolution as the source and if box-downsampled by N, exactly reproduces the low resolution signal you were given. This high resolution image you produce should be "smooth" and close to the expected original signal.

Examples of this are say if you're given a low mip and you wish to create a higher mip such that downsampling again would exactly reproduce the low mip you were given. The particular case I mainly care about is if you are given the DC coefficients of a JPEG, which are the averages on 8x8 blocks, you wish to produce a high res image which has the exact same average on 8x8 blocks.

Obviously this is an under-constrained problem (for N > 1) because I haven't clearly spelled out "smooth" etc. There are an infinity of signals that when downsampled produce the same low resolution version. Ideally I'd like to have a way to upsample with a parameter for smoothness vs. ringing that I could play with. (if you're nitty, I can constrain the problem precisely : The correlation of the output image and the original source image should be maximized over the space of all real world source images (eg. for example over the space of all images that exist on the internet)).

Anyway, after trying a whole bunch of heuristic approaches which all failed (though Sean's iterative approach is actually pretty good), I found the mathemagical solution, and I thought it was interesting, so here we go.

First of all, let's get clear on what "box downsample" means in a form we can use in math.

You have an original signal f(t) . We're going to pretend it's continuous because it's easier.

To make the "box downsample" what you do is apply a convolution with a rectangle that's N wide. Since I'm treating t as continuous I'll just choose coordinates where N = 1. That is, "high res" pixels are 1/N apart in t, and "low res" pixels are 1 apart.

Convolution { f , g } (t) = Integral{ ds * f(s) * g(t - s) }

The convolution with rect gives you a smoothed signal, but it's still continuous. To get the samples of the low res image, you multiply this by "comb". comb is a sum of dirac delta functions at all the integer coordinates.

F(t) = Convolve{ rect , f(t) }

low res = comb * F(t)

low res = Sum[n] L_n * delta_n

Okay ? We now have a series of low res coefficients L_n just at the integers.

This is what is given to us in our problem. We wish to try to guess what "f" was - the original high res signal. Well, now that we've written is this way, it's obvious ! We just have to undo the comb filtering and undo the convolution with rect !

First to undo the comb filter - we know the answer to that. We are given discrete samples L_n and we wish to reproduce the smooth signal F that they came from. That's just Shannon sampling theorem reconstruction. The smooth reconstruction is made by just multiplying each sample by a sinc :

F(t) = Sum[n] L_n * sinc( t - n )

This is using the "normalized sinc" definition : sinc(x) = sin(pi x) / (pi x).

sinc(x) is 1.0 at x = 0 and 0.0 at all other integer x's and it oscillates around a lot.

So this F(t) is our reconstruction of the rect-filtered original - not the original. We need to undo the rect filter. To do that we rely on the Convolution Theorem : Convolution in Fourier domain is just multiplication. That is :

Fou{ Convolution { f , g } } = Fou{ f } * Fou{ g }

So in our case :

Fou{ F } = Fou{ Convolution { f , rect } } = Fou{ f } * Fou{ rect }

Fou{ f } = Fou{ F } / Fou{ rect }

Recall F(t) = sinc( t - n ) , so :

Fou{ f } = Sum[n] L_n * Fou{ sinc( t - n ) } / Fou{ rect }

Now we need some Fourier transform knowledge. The easiest way for me to find this stuff is just to do the integrals myself. Integrals are really fun and easy. I won't copy them here because it sucks in ASCII so I'll leave it as an exercise to the reader. You can easily figure out the Fourier translation principle :

Fou{ sinc( t - n ) } = e^(-2 pi i n v) * Fou{ sinc( t ) }

As well as the Fourier sinc / rect symmetry :

Fou{ rect(t) } = sinc( v )

Fou{ sinc(t) } = rect( v )

All that means for us :

Fou{ f } = Sum[n] L_n * e^(-2 pi i n v) * rect(v) / sinc(v)

So we have the Fourier transform of our signal and all that's left is to do the inverse transform !

f(t) = Sum[n] L_n * Fou^-1{ e^(-2 pi i n v) * rect(v) / sinc(v) }

because of course constants pull out of the integral. Again you can easily prove a Fourier translation principle : the e^(-2 pi i n v) term just acts to translate t by n, so we have :

f(t) = Sum[n] L_n * h(t - n)

h(t) = Fou^-1{ rect(v) / sinc(v) }

First of all, let's stop and see what we have here. h(t) is a function centered on zero and symmetric around zero - it's a reconstruction shape. Our final output signal, f(t), is just the original low res coefficients multiplied by this h(t) shape translated to each integer point n. That should make a lot of sense.

What is h exactly? Well, again we just go ahead and do the Fourier integral. The thing is, "rect" just acts to truncate the infinite range of the integral down to [-1/2, 1/2] , so :

h(t) = Integral[-1/2,1/2] { dv e^(2 pi i t v) / sinc(v) }

Since sinc is symmetric around zero, let's take the two halves of the range around zero and add them together :

h(t) = Integral[0,1/2] { dv ( e^(2 pi i t v) + e^(- 2 pi i t v) ) / sinc(v) }

h(t) = Integral[0,1/2] { dv 2 * cos ( 2 pi t v ) * pi * v / sin( pi v) }

(note we lost the c - sinc is now sin). Let's change variables to w = pi v :

h(t) = (2 / pi ) * Integral[ 0 , pi/2 ] { dw * w * cos( 2 t w ) / sin( w ) }

And.. we're stuck. This is an integral function; it's a pretty neat form, it sure smells like some kind of Bessel function or something like that, but I can't find this exact form in my math books. (if anyone knows what this is, help me out). (actually I think it's a type of elliptic integral).

One thing we can do with h(t) is prove that it is in fact exactly what we want. It has the box-unit property :

Integral[ N - 1/2 , N + 1/2 ] { h(t) dt } = 1.0 if N = 0 and 0.0 for all other integer N

That is, the 1.0 wide window box filter of h(t) centered on integers is exactly 1.0 on its own unit interval, and 0 on others. In other words, h(t) reconstructs its own DC perfectly and doesn't affect any others. (prove this by just going ahead and doing the integral; you should get sin( N * pi ) / (N * pi ) ).

While I can't find a way to simplify h(t) , I can just numerically integrate it. It looks like this :


You can see it sort of looks like sinc, but it isn't. The value at 0 is > 1. The height of the central peak vs. the side peaks is more extreme than sinc, the first negative lobes are deeper than sinc. It actually reminds me of the appearance of a wavelet.

Actually the value h(0) is exactly 4 G / pi = 1.166243... , where "G" is Catalan's constant.

Anyway, this is all very amusing and it actually "works" in the sense that if you blow up a low-res image using this h(t) basis shape, it does in fact make a high res image that is smooth and upon box-down sampling exactly reproduces the low-res original.

It is, however, not actually useful. For one thing, it's computationally ridiculous. Of course you would precompute the h(t) and store it in a table, but even then, the reach of h(t) is infinite, and it doesn't get small until very large t (beyond the edges of any real image), so in practice every output pixel must be a weighted sum from every single DC values in the low res image. Even without that problem, it's useless because it's just too ringy on real data. Looking at the shape above it should be obvious it will ring like crazy.

I believe these problems basically go back to the issue of using the ideal Shannon reconstruction when I did the step of "undoing the comb". By using the sinc to reproduce I doomed myself to non-local effect and ringing. The next obvious question is - can you do something other than sinc there? Why yes you can, though you have to be careful.

Say we go back to the very beginning and make this reconstruction :

F(t) = Sum[n] L_n * B( t - n )

We're making F(t) which is our reconstruction of the smooth box-filter of the original. Now B(t) is some reconstruction basis function (before we used sinc). In order to be a reconstruction, B(t) must be 1.0 at t = 0, and 0.0 at all other integer t. Okay.

If we run through the math with general B, we get :

again :

f(t) = Sum[n] L_n * h(t - n)

but with :

h(t) = Fou^-1{ Fou{ B } / sinc(v) }

For example :

If B(t) = "triangle" , then F(t) is just the linear interpolation of the L_n

Fou{ triangle } = sinc^2 ( v)

h(t) = Fou^-1{ sinc^2 ( v) / sinc(v) } = Fou^-1{ sinc } = rect(t)

Our basis functions are rects ! In fact this is the reconstruction where the L_n is just made a constant over each DC domain. In fact if you think about it that should be obvious. If you take the L_n and make them constant on each domain, then you run a rectangle convolution over that - as you slide the rectangle window along, you get linear interpolation, which is our F(t).

That's not useful, but maybe some other B(t) is. In particular I think the best line of approach is for B(t) to be some kind of windowed sinc. Perhaps a Guassian-windowed sinc. Any real world window I can think of leads to a Fourier transform of B(t) that's too complex to do analytically, which means our only approach to finding h is to do a double-numerical-integration which is rather a disastrous thing to do, even for precomputing a table.

So I guess that's the next step, though I think this whole approach is a practical dead end and is now just a scientific curiosity. I must say it was a lot of fun to actually bust out pencil and paper and do some math and real thinking. I really miss it.

old rants