cbloom rants: 03/2011

3/31/2011

03-31-11 - Some image filter notes

Say you have a filter like F = [1,2,3,2,1] . The normal thing to do is compute the sum and divide so you have pre-normalized values and you just do a bunch of madd's. eg. you make N = [1/9,2/9,3/9,2/9,1/9].

Now there's the question of how you handle the boundaries of the image. The normal thing to do is to take the pre-normalized filter N and apply it all over, and when one of the taps samples off the edge, you have to give it something to sample. You can use various edge modes, such as :


SampleBounded(int i, int w) :

clamp :
  return Sample( Clamp(i,0,w-1) );

wrap :
  return Sample( (i+256*w)%w );

mirror (no duplicated edge pixel) :
  if ( i < 0  ) return SampleBounded( -i, w );
  if ( i >= w ) return SampleBounded( -i + 2*w - 2 , w );
  else return Sample( i );

mirror with duplicated edge pixel :
  if ( i < 0  ) return SampleBounded( - i - 1, w );
  if ( i >= w ) return SampleBounded( -i + 2*w - 1 , w );
  else return Sample( i );

(the correct edge mode depends on the usage of the image, which is one of those little annoying gotchas in games; eg. the mips you should make for tiling textures are not the same as the mips for non-tiling textures). (another reasonable option not implemented here is "extrapolate" , but you have to be a bit careful about how you measure the slope at the edge of the image domain)
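The edge modes above can also be written as pure index remaps, which avoids the recursion and works for any integer i (note that the wrap form above assumes i is never more negative than -256*w). This is only a sketch; the function names are made up for illustration and are not from any library :

```cpp
#include <cassert>

// Each function maps an arbitrary index i to a valid index in [0, w).
// All names here are hypothetical helpers, not an existing API.

inline int ClampIndex(int i, int w)
{
    assert(w > 0);
    return i < 0 ? 0 : (i >= w ? w - 1 : i);
}

inline int WrapIndex(int i, int w)
{
    assert(w > 0);
    int m = i % w;              // in C++ the sign of the remainder follows i
    return m < 0 ? m + w : m;   // so fix up negatives
}

// mirror, no duplicated edge pixel :  ... 2 1 [0 1 2 .. w-1] w-2 w-3 ...
inline int MirrorIndex(int i, int w)
{
    assert(w > 0);
    if (w == 1) return 0;
    const int period = 2 * w - 2;
    const int m = WrapIndex(i, period);
    return m < w ? m : period - m;
}

// mirror with duplicated edge pixel :  ... 1 0 [0 1 .. w-1] w-1 w-2 ...
inline int MirrorDupIndex(int i, int w)
{
    assert(w > 0);
    const int period = 2 * w;
    const int m = WrapIndex(i, period);
    return m < w ? m : period - 1 - m;
}
```

With these, SampleBounded is just Sample( XxxIndex(i,w) ) for whichever mode the image wants.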

The reason we do all this is because we don't want to have to accumulate the sum of filter weights and divide by the weight.

But really, in most cases what you really should be doing is applying the filter only where its domain overlaps the image domain. Then you sum the weights in the area that is valid and renormalize. eg. if our filter F is two pixels off the edges, we just apply [3,2,1] / 6 , we don't clamp the sampler and put an extra [1,2] on the first pixel.
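A minimal 1d sketch of that renormalizing approach, assuming a centered odd-width filter and float pixels (FilterRenormalized is a hypothetical name). Taps that fall off the image are skipped and the result is divided by the sum of the weights that were actually used :

```cpp
#include <cassert>
#include <vector>

// Apply a centered odd-width filter, using only the taps that land inside
// the image, and renormalize by the weight that was actually accumulated.
std::vector<float> FilterRenormalized(const std::vector<float>& src,
                                      const std::vector<float>& filter)
{
    assert(filter.size() % 2 == 1);
    const int w = static_cast<int>(src.size());
    const int radius = static_cast<int>(filter.size()) / 2;
    std::vector<float> dst(src.size());
    for (int x = 0; x < w; ++x)
    {
        float sum = 0.f, weight = 0.f;
        for (int k = -radius; k <= radius; ++k)
        {
            const int i = x + k;
            if (i < 0 || i >= w) continue;  // tap is off the image : skip it
            sum    += filter[k + radius] * src[i];
            weight += filter[k + radius];
        }
        // guard the divide; with negative lobes the partial weight could be tiny
        dst[x] = (weight != 0.f) ? sum / weight : 0.f;
    }
    return dst;
}
```

With F = [1,2,3,2,1] the first pixel ends up filtered by [3,2,1]/6, as described above. The interior pays for an extra divide per pixel, so a real implementation would probably use the pre-normalized filter there and only take this path near the borders.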

ADDENDUM : in video games there's another special case that needs to be handled carefully : a non-tiling texture which you wish to abut seamlessly against another texture. That is, you have two textures T1 and T2 that are different and you wish to line them up beside each other without a seam.

I call this mode "shared", it sort of acts like "clamp" but has to be handled specially in filtering. Let's say T1 and T2 are laid against each other horizontally, so they abut along a column. What the artist should do is make the pixels in that border column identical in both textures (or you could have your program enforce this). Then, the UV mapping on the adjacent rectangles should be inset by half a pixel - that is, it picks the center of the pixels, not the edge of the texture. Thus the duplicated pixel edge only appears to be a single column of pixels.

But that's not the special case handling - the special case is whenever you filter a "shared" image, you must make border column pixels only from other border column pixels. That is, that shared edge can only be vertically filtered, not horizontally filtered. That way it stays identical in both images.

Note that this is not ideal with mipping, what happens is the shared edge gets fatter at higher mip levels - but it never develops a seam, so it is "seamless" in that sense. To do it right without any artifacts (eg. to look as if it was one solid bigger texture) you would have to know what image is on the other side of the shared edge and be able to filter tap into those pixels. Obviously that is impossible if your goal is a set of terrain tiles or something like that where you use the same shared edge in multiple different ways.

(is there a better solution to this issue?)

I did a little look into the difference between resizing an image 8X by either doubling thrice or directly resizing. I was sanity checking my filters and I thought - hey if I use a Gaussian filter, it should be the same thing, because convolution of a Gaussian with a Gaussian is a Gaussian, right?

In the continuous case, you could either use one wide Gaussian directly, or convolve several narrower Gaussians together and get the same thing. Variances add under convolution, so three Gaussians with sdev 2 convolved together give a Gaussian with sdev 2*sqrt(3) ~= 3.46 (these are not actually the right sdevs for 8X mag, but you get the idea).

So I tried it on my filters and I got :


Gaussian for doubling, thrice :

1.0000,0.9724,0.8822,0.7697,0.5059,0.3841,0.2635,0.1964,0.1009,0.0607,0.0281,0.0155,0.0067,0.0034,0.0012,0.0004,...

Gaussian for direct 8x :

1.0000,0.9439,0.8294,0.6641,0.4762,0.3057,0.1784,0.0966,0.0492,0.0235,0.0103,0.0041,0.0014,0.0004,...

and I was like yo, WTF they're way off, I must have a bug. (note : these are scaled to make the max value 1.0 rather than normalizing because it's easier to compare this way, they look more unequal after normalizing)

But then I realized - these are not really proper Gaussians. These are discrete samples of Gaussians. If you like, it's a Gaussian multiplied by a comb. It's not even a Gaussian convolved with a box filter - that is, we are not applying the gaussian over the range of the pixel as if the pixel was a box, but rather just sampling the continuous function at one point on the pixel. Obviously the continuous convolution theorem that Gauss [conv] Gauss = Gauss doesn't apply.

As for the difference between doing a direct 8X and doubling thrice, I can't see a quality difference with my eyes. Certainly the filters are different numerically - particularly filters with negatives, eg. :


sinc double once : 
1.0000,0.6420,0.1984,-0.0626,-0.0974,-0.0348,0.0085,0.0120,
sinc double twice : 
1.0000,0.9041,0.7323,0.5193,0.3042,0.1213,-0.0083,-0.0790,-0.0988,-0.0844,-0.0542,-0.0233,-0.0007,0.0110,0.0135,0.0107,0.0062,0.0025,0.0004,-0.0004,-0.0004,
sinc double thrice : 
1.0000,0.9755,0.9279,0.8596,0.7743,0.6763,0.5704,0.4617,0.3549,0.2542,0.1633,0.0848,0.0203,-0.0293,-0.0645,-0.0861,-0.0960,-0.0962,-0.0891,-0.0769,-0.0619,-0.0459,-0.0306,-0.0169,-0.0057,0.0029,0.0087,0.0120,0.0133,0.0129,0.0116,0.0096,0.0073,0.0052,0.0033,0.0019,0.0008,0.0001,-0.0003,-0.0004,-0.0004,-0.0004,-0.0002,

sinc direct 8x : 
1.0000,0.9553,0.8701,0.7519,0.6111,0.4595,0.3090,0.1706,0.0528,-0.0386,-0.1010,-0.1352,-0.1443,-0.1335,-0.1090,-0.0773,-0.0440,-0.0138,0.0102,0.0265,0.0349,0.0365,0.0328,0.0259,0.0177,0.0097,0.0029,-0.0019,-0.0048,-0.0059,

very different, but visually meh? I don't see much.

The other thing I constantly forget about is "filter inversion". What I mean is, if you're trying to sample between two different grids using some filter, you can either apply the filter to the source points or the dest points, and you get the same results.

More concretely, you have filter shape F(t) and some pixels at regular locations P[i].

You create a continuous function f(t) = Sum_i P[i] * F(i-t) ; so we have placed a filter shape at each pixel center, and we are sampling them all at some position t.

But you can look at the same thing a different way - f(t) = Sum_i F(t-i) * P[i] ; we have a filter shape at position t, and then we are sampling it at each position i around it. (F is symmetric here, so F(i-t) and F(t-i) are the same number.)

So, if you are resampling from one size to another, you can either do :

1. For each source pixel, multiply by filter shape (centered at source) and add shape into dest, or :

2. For each dest pixel, multiply filter shape (centered at dest) by source pixels and put sum into dest.

And the answer is the same. (and usually the 2nd is much more efficient than the first)
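To make the two loop orders concrete, here is a brute-force 1d sketch. The tent filter, the pixel-center mapping in DestToSource, and all the function names are assumptions made for illustration; a real resampler would only visit the taps inside the filter support :

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Tent (linear) filter with support [-1,1], in source-pixel units.
inline float Tent(float t) { return std::fmax(0.f, 1.f - std::fabs(t)); }

// Center of dest pixel j, measured in source pixel coordinates
// (assumes both grids cover the same extent).
inline float DestToSource(int j, int srcW, int dstW)
{
    return (j + 0.5f) * srcW / dstW - 0.5f;
}

// 1. for each source pixel, add its filter shape into the dest ("scatter")
std::vector<float> ResampleScatter(const std::vector<float>& src, int dstW)
{
    assert(dstW > 0);
    const int srcW = static_cast<int>(src.size());
    std::vector<float> dst(dstW, 0.f);
    for (int i = 0; i < srcW; ++i)
        for (int j = 0; j < dstW; ++j)
            dst[j] += src[i] * Tent(DestToSource(j, srcW, dstW) - i);
    return dst;
}

// 2. for each dest pixel, sum the source pixels under the filter ("gather")
std::vector<float> ResampleGather(const std::vector<float>& src, int dstW)
{
    assert(dstW > 0);
    const int srcW = static_cast<int>(src.size());
    std::vector<float> dst(dstW);
    for (int j = 0; j < dstW; ++j)
    {
        float sum = 0.f;
        for (int i = 0; i < srcW; ++i)
            sum += Tent(DestToSource(j, srcW, dstW) - i) * src[i];
        dst[j] = sum;
    }
    return dst;
}
```

Both functions evaluate exactly the same sum, just with the loops swapped. The gather form writes each dest pixel once, which is part of why it tends to be the more efficient one.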

And for your convenience, here are some doubling filters :


box        : const float c_filter[1] = { 1.00000 };
linear     : const float c_filter[2] = { 0.25000, 0.75000 };
quadratic  : const float c_filter[3] = { 0.28125, 0.68750, 0.03125 };
cubic      : const float c_filter[4] = { 0.00260, 0.31510, 0.61198, 0.07031 };
mitchell0  : const float c_filter[4] = { -0.02344, 0.22656, 0.86719, -0.07031 };
mitchell1  : const float c_filter[4] = { -0.01476, 0.25608, 0.78212, -0.02344 };
mitchell2  : const float c_filter[4] = { 0.01563, 0.35938, 0.48438, 0.14063 };
gauss      : const float c_filter[5] = { 0.00020, 0.20596, 0.78008, 0.01375, 0.00000 };
sqrtgauss  : const float c_filter[5] = { 0.00346, 0.28646, 0.65805, 0.05199, 0.00004 };
sinc       : const float c_filter[6] = { 0.00052, -0.02847, 0.23221, 0.87557, -0.08648, 0.00665 };
lanczos4   : const float c_filter[4] = { -0.01773, 0.23300, 0.86861, -0.08388 };
lanczos5   : const float c_filter[5] = { -0.04769, 0.25964, 0.89257, -0.11554, 0.01102 };
lanczos6   : const float c_filter[6] = { 0.00738, -0.06800, 0.27101, 0.89277, -0.13327, 0.03011 };

These are actually pairs of filters to create adjacent pixels in a double-resolution output. The second filter of each pair is simply the above but in reverse order (so the partner for linear is 0.75, 0.25).

To use these, you scan it over the source image and apply centered at each pixel. This produces all the odd pixels in the output. Then you take the filter and reverse the order of the coefficients and scan it again, this produces all the even pixels in the output (you may have to switch even/odd, I forget which is which).

These are created by taking the continuous filter function and sampling at 1/4 offset locations - eg. if 0 is the center (maximum) of the filter, you sample at -0.75,0.25,1.25, etc.
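As a sketch of how one of these pairs gets applied, here is the 2-tap linear pair doubling a 1d signal. The tap alignment (which source pixels the two taps land on, and which output pixel is even vs. odd) and the clamp edge mode are assumptions of this example; DoubleLinear is a hypothetical name :

```cpp
#include <cassert>
#include <vector>

// Double a 1d signal with the linear pair {0.25,0.75} and its reverse {0.75,0.25}.
// Assumed alignment : outputs 2i and 2i+1 sit at i-0.25 and i+0.25 in source coords.
std::vector<float> DoubleLinear(const std::vector<float>& src)
{
    assert(!src.empty());
    const int w = static_cast<int>(src.size());
    // clamp edge mode
    auto at = [&](int i) { return src[i < 0 ? 0 : (i >= w ? w - 1 : i)]; };

    std::vector<float> dst(2 * src.size());
    for (int i = 0; i < w; ++i)
    {
        dst[2 * i]     = 0.25f * at(i - 1) + 0.75f * at(i);      // filter as listed
        dst[2 * i + 1] = 0.75f * at(i)     + 0.25f * at(i + 1);  // reversed partner
    }
    return dst;
}
```

For example, doubling {0,4} gives {0,1,3,4}. The wider filters work the same way, you just have to get the offset of the first tap right for each filter width.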

And here's the same thing with a 1.15 X blur built in :


box        : const float c_filter[1] = { 1.0 };
linear     : const float c_filter[2] = { 0.30769, 0.69231 };
quadratic  : const float c_filter[3] = { 0.00000, 0.33838, 0.66162 };
cubic      : const float c_filter[5] = { 0.01586, 0.33055, 0.54323, 0.11034, 0.00001 };
mitchell0  : const float c_filter[5] = { -0.05174, 0.30589, 0.77806, -0.03143, -0.00078 };
mitchell1  : const float c_filter[5] = { -0.02925, 0.31410, 0.69995, 0.01573, -0.00052 };
mitchell2  : const float c_filter[5] = { 0.04981, 0.34294, 0.42528, 0.18156, 0.00041 };
gauss      : const float c_filter[6] = { 0.00000, 0.00149, 0.25842, 0.70629, 0.03379, 0.00002 };
sqrtgauss  : const float c_filter[6] = { 0.00000, 0.01193, 0.31334, 0.58679, 0.08726, 0.00067 };
sinc       : const float c_filter[7] = { 0.00453, -0.05966, 0.31064, 0.78681, -0.03970, -0.00277, 0.00015 };
lanczos4   : const float c_filter[5] = { -0.05129, 0.31112, 0.78006, -0.03946, -0.00042 };
lanczos5   : const float c_filter[6] = { 0.00499, -0.09023, 0.33911, 0.80082, -0.04970, -0.00499 };
lanczos6   : const float c_filter[7] = { 0.02600, -0.11420, 0.34931, 0.79912, -0.05497, -0.00837, 0.00312 };

The best doubling filters to my eyes are sinc and lanczos5, they have a good blend of sharpness and lack of artifacts. Stuff like gauss and cubic are too blurry, but are very smooth ; lanczos6 is sharper but has more ringing and stair-steps; wider lanczos filters get worse in that way. Sinc and lanczos5 without any blur built in can have a little bit of visible stair-steppiness (there's an inherent tradeoff when linear upsampling of sharpness vs. stair-steps) (by stair steps I mean the ability to see the original pixel blobs).

3/24/2011

03-24-11 - Image filters and Gradients

A friend recently pointed me at John Costella's supposedly superior edge detector . It's a little bit tricky to figure out what's going on there because his writing is quite obtuse, so I thought I'd record it for posterity.

You may recognize Costella's name as the guy who made Unblock which is a rather interesting and outside-the-norm deblocker. He doesn't have an image science background, and in the case of Unblock that led him to some ideas that normal research didn't find. Did he do it again with his edge detector?

Well, no.

First of all, the edge detector is based on what he calls the magic kernel . If you look at that page, something is clearly amiss.

The discrete 1d "magic kernel" for upsampling is [1,3,3,1] (unnormalized). Let's back up a second, we wish to upsample an image without offsetting it. That is, we replace one pixel with four and they cover the same area :


+---+     +-+-+
|   |     | | |
|   |  -> +-+-+
|   |     | | |
+---+     +-+-+

A 1d box upsample would be convolution with [1,1] , where the output discrete taps are half the distance apart of the original taps, and offset by 1/4.

The [1331] filter means you take each original pixel A and add the four values A*[1331] into the output. Or if you prefer, each output pixel is made from (3*A + 1*B)/4 , where A is the original pixel closer to the output and B is the one farther :


+---+---+
| A | B |
+---+---+

+-+-+-+-+
| |P| | |
+-+-+-+-+

P = (3*A + 1*B)/4

but clever readers will already recognize that this is just a bilinear filter. The center of P is 1/4 of an original pixel distance to A, and 3/4 of a pixel distance to B, so the 3,1 taps are just a linear filter.

So the "magic kernel" is just bilinear upsampling.

Costella shows that Lanczos and Bicubic create nasty grid artifacts. This is not true, he simply has a bug in his upsamplers.

The easiest way to write your filters correctly is using only box operations and odd symmetric filters. Let me talk about this for a moment.

In all cases I'm talking about discrete symmetric filters. Filters can be of odd width, in which case they have a single center tap, eg. [ a,b,c,b,a ] , or even width, in which case the center tap is duplicated : [a,b,c,c,b,a].

Any even filter can be made from an odd filter by convolution with the box , [1,1]. (However, it should be noted that an even "Sinc" is not made by taking an odd "Sinc" and convolving with box, it changes the function).

That means all your library needs is odd filters and box resamplers. Odd filters can be done "in place", that is from an image to an image of the same size. Box upsample means replicate a pixel with four identical ones, and box downsample means take four pixels and replace them with their average.

To downsample you just do : odd filter, then box downsample.
To upsample you just do : box upsample, then odd filter.

For example, the "magic kernel" (aka bilinear filter) can be done using an odd filter of [1,2,1]. You just box upsample then convolve with 121, and that's equivalent to upsampling with 1331.
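That equivalence is easy to check in 1d. The sketch below assumes clamped edges and uses made-up helper names; it builds the upsample both ways so the outputs can be compared :

```cpp
#include <cassert>
#include <vector>

// clamp edge mode for a 1d signal
inline float ClampedAt(const std::vector<float>& v, int i)
{
    const int n = static_cast<int>(v.size());
    return v[i < 0 ? 0 : (i >= n ? n - 1 : i)];
}

// Box upsample : replicate every sample twice.
std::vector<float> BoxUpsample(const std::vector<float>& src)
{
    std::vector<float> dst;
    dst.reserve(2 * src.size());
    for (float s : src) { dst.push_back(s); dst.push_back(s); }
    return dst;
}

// Odd filter [1,2,1]/4, same size in and out.
std::vector<float> Filter121(const std::vector<float>& src)
{
    std::vector<float> dst(src.size());
    for (int i = 0; i < static_cast<int>(src.size()); ++i)
        dst[i] = 0.25f * ClampedAt(src, i - 1) + 0.5f * src[i] + 0.25f * ClampedAt(src, i + 1);
    return dst;
}

// Direct [1,3,3,1]/4 upsample : each output is (3*near + 1*far)/4.
std::vector<float> Upsample1331(const std::vector<float>& src)
{
    std::vector<float> dst(2 * src.size());
    for (int i = 0; i < static_cast<int>(src.size()); ++i)
    {
        dst[2 * i]     = 0.75f * src[i] + 0.25f * ClampedAt(src, i - 1);
        dst[2 * i + 1] = 0.75f * src[i] + 0.25f * ClampedAt(src, i + 1);
    }
    return dst;
}
```

Filter121( BoxUpsample(s) ) and Upsample1331(s) produce the same samples, because the two replicated copies of a pixel contribute 2+1 = 3 parts and the neighbor contributes 1.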

Here are some odd filters that work for reference :


Box      : 1.0
Linear   : 0.25,0.50,0.25
Quadratic: 0.128,0.235,0.276,0.235,0.128
Cubic    : 0.058,0.128,0.199,0.231,0.199,0.128,0.058
Gaussian : 0.008,0.036,0.110,0.213,0.267,0.213,0.110,0.036,0.008
Mitchell1: -0.008,-0.011,0.019,0.115,0.237,0.296,0.237,0.115,0.019,-0.011,-0.008
Sinc     : -0.003,-0.013,0.000,0.094,0.253,0.337,0.253,0.094,0.000,-0.013,-0.003
Lanczos4 : -0.008,0.000,0.095,0.249,0.327,0.249,0.095,0.000,-0.008
Lanczos5 : -0.005,-0.022,0.000,0.108,0.256,0.327,0.256,0.108,0.000,-0.022,-0.005

Okay, so now let's get back to edge detection. First of all let's clarify something : edge detectors and gradients are not the same thing. Gradients are slopes in the image; eg. big planar ramps may have large gradients. "edges" are difficult to define things, and different applications may have different ideas of what should constitute an "edge". Sobel kernels and such are *gradient* operators not edge detectors. The goal of the gradient operator is reasonably well defined, in the sense that if our image is a height map, the gradient should be the slope of the terrain. So henceforth we are talking about gradients not edges.

The basic centered difference operator is [-1,0,1] and gives you a gradient at the middle of the filter. The "naive difference" (Costella's terminology) is [-1,1] and gives you a gradient half way between the original pixels.

First of all note that if you take the naive difference at two adjacent pels, you get two gradients at half pel locations; if you want the gradient at the integer pixel location between them you would combine the taps - [-1,1,0] and [0,-1,1] - the sum is just [-1,0,1] , the central difference.

Costella basically proposes using some kind of upsampler and the naive difference. Note that the naive difference operator and the upsampler are both just linear filters. That means you can do them in either order, since convolution commutes, A*B = B*A, and it also means you could just make a single filter that does both.

In particular, if you do "magic upsampler" (bilinear upsampler) , naive difference, and then box downsample the taps that lie within an original pixel, what you get is :


-1  0  1
-6  0  6
-1  0  1

A sort of Sobel-like gradient operator (but a bad one). (this comes from 1331 and the 3's are in the same original pixel).

So upsampling and naive difference is really just another form of linear filter. But of course anybody who's serious about gradient detection knows this already. You don't just use the Sobel operator. For example in the ancient/classic Canny paper, they use a Gaussian filter with the Sobel operator.

One approach to making edge detection operators is to use a Gaussian Derivative, and then find the discrete approximation in a 3x3 or 5x5 window (the Scharr operator is pretty close to the Gaussian Derivative in a 3x3 window, though Kroon finds a slightly better one). Of course even Gaussian Derivatives are not necessarily "optimal" in terms of getting the direction and magnitude of the gradient right, and various people (Kroon, Scharr, etc.) have worked out better filters in recent papers.

Costella does point out something that may not be obvious, so we should appreciate that :

Gradients at the original res of the image do suffer from aliasing. For example, if your original image is [..,0,1,0,1,0,1,..] , where's the gradient? Well, there are gradients between each pair of pixels, but if you only look at original image pixel locations you can't place a gradient anywhere. That is, convolution with [-1,0,1] gives you zero everywhere.

However, to address this we don't need any "magic". We can just double the resolution of our image using whatever filter we want, and then apply any normal gradient detector at the higher resolution. If we did that on the [0,1,0,1] example we would get gradients at all the half taps.
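The aliasing example, and the earlier point that two adjacent naive differences sum to the central difference, can both be checked with a few lines. This is only an illustrative 1d sketch with hypothetical function names; it does not do the upsampling step :

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Central difference [-1,0,1] : one gradient at each interior pixel.
std::vector<float> CentralDifference(const std::vector<float>& p)
{
    std::vector<float> g;
    for (std::size_t i = 1; i + 1 < p.size(); ++i)
        g.push_back(p[i + 1] - p[i - 1]);
    return g;
}

// Naive difference [-1,1] : one gradient half way between pixels i and i+1.
std::vector<float> NaiveDifference(const std::vector<float>& p)
{
    std::vector<float> g;
    for (std::size_t i = 0; i + 1 < p.size(); ++i)
        g.push_back(p[i + 1] - p[i]);
    return g;
}
```

On p = {0,1,0,1,0,1} the central difference is zero at every interior pixel, while the naive difference alternates +1,-1 at the half-pel positions, and naive[i] + naive[i+1] equals central[i] everywhere.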

Now, finally, I should point out that "edge detection" is a whole other can of worms than gradient operators, since you want to do things like suppress noise, connect lines, look for human perceptual effects in edges, etc. There are tons and tons of papers on these topics and if you really care about visual edge detection you should go read them. A good start is to use a bilateral or median filter before the sharpen operator (the bilateral filter suppresses speckle noise and joins up dotted edges), and then sharpen should be some kind of laplacian of gaussian approximation.

3/21/2011

03-21-11 - ClipCD

Copy current dir to clipboard :


c:\bat>type clipcd.bat
@echo off
cechonr "clip " > s:\t.bat
cd >> s:\t.bat
REM type r:\t.bat
s:\t.bat

(cechonr is my variant of "echo" that doesn't put a \n on the end).

I'm sure it could be done easier, but I've always enjoyed this crufty way of making complex batch files by having them write a new batch file. For example I've long done my own savedir/recalldir this way :


c:\bat>type savedir.bat
@echo off
cd > r:\t1.z
cd \
cd > r:\t2.z
zcopy -o c:\bat\echo_off.bat r:\t3.z
attrib -r r:\t3.z
type r:\t2.z >> r:\t3.z
cechonr "cd " >> r:\t3.z
type r:\t1.z >> r:\t3.z
zcopy -o r:\t3.z c:\bat\recalldir.bat
echo cls >> c:\bat\recalldir.bat
call dele r:\t1.z r:\t2.z r:\t3.z
call recalldir.bat

Less useful now that most CLI's have a proper pushdir/popdir. But this is a bit different because it actually makes a file on disk (recalldir.bat), I use it to set my "home" dir and my dos startup bat runs recalldir.

In other utility news, my CLI utils (move,copy,etc) have a new option which everyone should copy - when you have a duplicate name, you can ask it to check for binary identity right there in the prompt :


r:\>zc aikmi.BMP z
 R:\z\aikmi.BMP exists; overwrite? (y/n/A/N/u/U/c/C)?
  (y=yes, n=no, A=all,N=none,u=update newer,U=all,c=check same,C=all)
 R:\z\aikmi.BMP exists; overwrite? (y/n/A/N/u/U/c/C)c
CheckFilesSame : same
 R:\z\aikmi.BMP exists; overwrite? (y/n/A/N/u/U/c/C)y
R:\aikmi.BMP -> R:\z\aikmi.BMP

And of course like all good prompts, for each choice there is a way to say "do this for every prompt".

(BTW if you want a file copier for backing up big dirs, robocopy is quite good. The only problem is that the default number of retries is no good; when you hit files with problems it will just hang forever (well, 30 million seconds anyway, which is essentially forever). You need to use /R:10 and /W:10 or something like that).

3/19/2011

03-19-11 - Fitness Links

I've started working out again recently. I'm trying to do things differently this time, hopefully in a way that leads to more long term good foundational structure for my body problems. Obviously that would have been much easier to do at a young age, but better late than never I guess. I believe that in the past I may have overdeveloped the easy muscles, which is basically the "front" - pecs, abs, biceps, etc. I'm not sure if that contributed to my series of shoulder injuries, but it certainly didn't help.

My intention this time is to try to develop musculature that will help support my unstable shoulders as well as generally help with "programmer's disease". So generally that means strengthening the back, shoulder stabilizers, lots of over-head work, and dynamic work that involves full body moves, flexibility and extension.

The other change is that the gym I'm going to here happens to have no proper weights (aka barbells and racks). Hey dumb gym owners : if you only put ONE thing in your gym, it should be a power rack with barbells. That is the most useful and general purpose single piece of gym equipment. And of course this gym has no power rack, just a bunch of those stupid fucking machines. You could get a full workout with just bodyweight moves for the small muscles and a power rack for the big ones. In fact I would love a gym that's just a big empty room and a bunch of racks and bars, but that's reserved for pro athletes and nutters like crossfit.

Anyway, the one thing they do have is kettlebells, so I'm doing that. It's pretty fun learning the new moves. If you read the forums you'll see a bunch of doofuses talking about how kettlebells "change everything" and are "so much more fun". No, they're not. But they are different. So if you've done normal weights for many years and you're sick of it, it might be a nice change of pace. Learning new moves gives your mind something to do while your body is lugging weight around, it keeps you from dying of boredom.

I'm also trying to avoid all crunch-like movements for abs, that is, all contractions. So far I'm doing a bunch of plank variants, and of course things like overhead farmers walks, but I may have to figure out some more to add to that. One of the best exercises for abs is just heavy deadlifts, but sadly I can't do that in the dumb yuppie gym.


3/14/2011

03-14-11 - cbloom.com-exe BmpUtil update

I put up a new BmpUtil on the cbloom.com/exe page . Release notes :


bmputil built Mar 14 2011 12:49:42
bmp view <file>
bmp info <file>
bmp copy <fm> <to> [bits] [alpha]
bmp jpeg <fm> <to> [quality]
bmp crop <fm> <to> <w> <h> [x] [y]
bmp pad <fm> <to> <w> <h> [x] [y]
bmp cat <h|v> <fm1> <fm2> <to>
bmp size <fm> <to> <w> [h]
bmp mse <im1> <im2>
bmp median <fm> <to> <radius> [selfs]
file extensions : bmp,tga,png,jpg
  jpg gets quality from last # in name


fimutil by cbloom built Mar 14 2011 12:50:56
fim view <file>
fim info <file>
fim copy <fm> <to> [planes]
fim mse <fm> <to>
fim size <fm> <to> <w> [h]
fim make <to> <w> <h> <d> [r,g,b,a]
fim eq <fm> <to> <eq>
fim eq2 <fm1> <fm2> <to> <eq>
fim cmd <fm> <to> <cmd>  (fim cmd ? for more)
fim interp <to> <fm1> <fm2> <fmt>
fim filter <fm> <to> <filter> [repeats] ; (filter=? for more)
fim upfilter/double <fm> <to> <filter> [repeats]
fim downfilter/halve <fm> <to> <filter> [repeats]
fim gaussian <fm> <to> <sdev> [width]
fim bilateral <fm> <to> <spatial_sdev> <value_sdev> [spatial taps]
file extensions : bmp,tga,png,jpg,fim
 use .fim for float images; jpg gets quality from last # in name

fim cmd <fm> <to> <cmd>
 use cmd=? for help
RGBtoYUV
YUVtoRGB
ClampUnit
Normalize
ScaleBiasUnit
ReGamma
DeGamma
normheight
median5

Some notes :

Most of the commands will give more help if you run them, but you may have to give some dummy args to make them think they have enough args. eg. run "fimutil eq ? ? ?"

FimUtil sizers are much better than the BmpUtil ones. TODO : any resizing except doubling/halving is not very good yet.

FimUtil eq & eq2 provide a pretty general equation parser, so you can do any kind of per-sample manipulation you want there.

"bmputil copy" is how you change file formats. Normally you put the desired jpeg quality in the file name when you write jpegs, or you can use "bmputil jpeg" to specify it manually.

Unless otherwise noted, fim pixels are in [0,1] and bmp pixels are in [0,255] (just to be confusing, many of the fimutil commands do a *1/255 for you so that you can pass [0,255] values on the cmd line); most fim ops do NOT enforce clamping automatically, so you may wish to use ClampUnit or ScaleBiasUnit.

Yeah, I know imagemagick does lots of this shit but I can never figure out how to use their commands. All the source code for this is in cblib, so you can examine it, fix it, laugh at it, what have you.

3/12/2011

03-12-11 - C Coroutines with Stack

It's pretty trivial to do the C Coroutine thing and just copy your stack in and out. This lets you have C coroutines with stack - but only in a limited way.

[deleted]

Major crack smoking. This doesn't work in any kind of general way, you would have to find the right hack per compiler, per build setting, etc.

Fortunately, C++ has a mechanism built in that lets you associate some data per function call and make those variable references automatically rebased to that chunk of memory - it's called member variables, just use that!

3/11/2011

03-11-11 - Worklets , IO , and Coroutines

So I'm working on this issue of combining async CPU work with IO events. I have a little async job queue thing, that I call "WorkMgr" and it runs "Worklets". See previous main post on this topic :

cbloom rants 04-06-09 - The Work Dispatcher

And also various semi-related other posts :
cbloom rants 09-21-10 - Waiting on Thread Events
cbloom rants 09-21-10 - Waiting on Thread Events Part 2
cbloom rants 09-12-10 - The deficiency of Windows' multi-processor scheduler
cbloom rants 04-15-09 - Oodle Page Cache

So I'm happy with how my WorkMgr works for pure CPU work items. It has one worker thread per core, the Worklets can be dependent on other Worklets, and it has a dispatcher to farm out Worklets using lock-free queues and all that.

(ASIDE : there is one major problem that ryg describes well , which is that it is possible for worker threads that are doing work to get swapped out for a very long time while workers on another core that could have CPU time can't find anything to do. This is basically a fundamental issue with not being in full control of the OS, and is related to the "deficiency of Windows' multi-processor scheduler" noted above. BTW this problem is much worse if you lock your threads to cores; because of that I advise that in Windows you should *never* lock your threads to cores, you can use affinity to set the preferred core, but don't use the exclusive mask. Anyway, this is an interesting topic that I may come back to in the future, but it's off topic so let's ignore it for now).

So the funny issues start arising when your work items have dependencies on external non-CPU work. For concreteness I'm going to call this "IO" (File, Network, whatever), but it's just anything that takes an unknown amount of time and doesn't use the CPU.

Let's consider a simple concrete example. You wish to do some CPU work (let's call it A), then fire an IO and wait on it, then do some more CPU work B. In pseudocode form :

WorkletLinear
{
    A();
    h = IO();
    Wait(h);
    B();
}

Now obviously you can just give this to the dispatcher and it would work, but while your worklet is waiting on the IO it would be blocking that whole worker thread.

Currently in my system the way you fix this is to split the task. You make two Worklets, the first does work A and fires the IO, the second does work B and is dependent on the first and the IO. Concretely :


Worklet2
{
    B();    
}

Worklet1
{
    A();
    h = IO();
    QueueWorklet( Worklet2, Dependencies{ h } );
}

so Worklet1 finishes and the worker thread can then do other work if there is anything available. If not, the worker thread goes to sleep waiting for one of the dependencies to be done.

This way works fine, it's what I've been using for the past year or so, but as I was writing some example code it occurred to me that it's just a real pain in the ass to write code this way. It's not too bad here, but if you have a bunch of IO's, like do cpu work, IO, do cpu work, more IO, etc. you have to make a whole chain of functions and get the dependencies right and so on. It's just like writing code for IO completion callbacks, which is a real nightmare way to write IO code.

The thing that struck me is that basically what I've done here is create one of the "ghetto coroutine" systems. A coroutine is a function call that can yield, or a manually-scheduled thread if you like. This split up Worklet method could be written as a state machine :


WorkletStatemachine
{
  if ( state == 0 )
  {
    A();
    h = IO();
    state++; enqueue self{ depends on h };
  }
  else if ( state == 1 )
  {
    B();
  }
}

In this form it's obviously the state machine form of a coroutine. What we really want is to yield after the IO and then be able to resume back at that point when some condition is met. Any time you see a state machine, you should prefer a *true* coroutine. For example, game AI written as a state machine is absolutely a nightmare to work with. Game AI written as simple linear coroutines is very nice :


    WalkTo( box )
    obj = Open( box )
    PickUp( obj )

with implicit coroutine Yields taking place in each command that takes some time. In this way you can write linear code, and when some of your actions take undetermined long amounts of time, the code just yields until that's done. (in real game AI you also have to handle interruptions and such things).

So, there's a cute way to implement coroutines in C using switch :

Protothreads - Lightweight, Stackless Threads in C
Coroutines in C

So one option would be to use something like that. You would put the hidden "state" counter into the Worklet work item struct, and use some macros and then you could write :


WorkletCoroutine
{
  crStart   // macro that does a switch on state

    A();
    h = IO();

  crWait(h,1)  // macro that does re-enqueue self with dependency, state = 1; case 1:

    B();

  crEnd
}

that gives us linear-looking code that actually gets swapped out and back in. Unfortunately, it's not practical because this C-coroutine hack doesn't preserve local variables, is creating weird scopes all over, and just is not actually usable for anything but super simple code. (the switch method gives you stackless coroutines; obviously Worklet can be a class and you could use member variables). Implementing a true (stackful) coroutine system doesn't really seem practical for cross-platform (it would be reasonably easy to do for any one platform, you just have to record the stack in crStart and copy it out in crWait, but it's just too much of a low-level hacky mess that would require intimate knowledge of the quirks of each platform and compiler). (you can do coroutines in Windows with fibers, not sure if that would be a viable solution on Windows because I've always heard "fibers are bad mmkay").
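For reference, a minimal sketch of the switch trick with the state kept in member variables. The CR_* macros are stand-ins invented for this example (they are not the crStart/crWait macros of any particular library), the handle value is made up, and the "work" is just appending to a string :

```cpp
#include <cassert>
#include <string>

// Hypothetical macros in the style of the switch-based C coroutine hack.
#define CR_START()  switch (state) { case 0:
#define CR_WAIT(n)  do { state = (n); return false; case (n): ; } while (0)
#define CR_END()    } state = -1; return true

struct WorkletCoroutine
{
    int state = 0;      // resume point; lives in the work item, not on the stack
    int h = -1;         // pending IO handle (a made-up value below)
    std::string log;    // stands in for the real work A() and B()

    // Returns false when it has yielded waiting on h, true when finished.
    bool Run()
    {
        CR_START();
        log += "A";     // A();
        h = 42;         // h = IO();
        CR_WAIT(1);     // yield; the caller re-enqueues us dependent on h
        log += "B";     // B();
        CR_END();
    }
};

// Toy driver : pretends every wait completes immediately, counts the resumes.
int RunToCompletion(WorkletCoroutine& w)
{
    int resumes = 0;
    while (!w.Run())
    {
        assert(w.h >= 0);   // a yielded worklet must have something to wait on
        ++resumes;
    }
    return resumes;
}
```

Anything that has to survive across CR_WAIT must be a member, since a local declared before the wait is gone (or illegally jumped over) when Run is re-entered.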

Aside : some links on coroutines for C++ :

Thinking Asynchronously in C++ Composed operations, coroutines and code makeover
Dr Dobbs Cross-Platform Coroutines in C++
COROUTINE (Keld Helsgaun)
Chapter 1. Boost.Coroutine proposal

The next obvious option is a thread pool. We go ahead and let the work item do IO and put the worker thread to sleep, but when it does that we also fire up a new worker thread so that something can run. Of course to avoid creating new threads all the time you have a pool of possible worker threads that are just sitting asleep until you need them. So you do something like :


WorkletThreadPool
{
  A();
  h = IO();
  ThreadPoolWait(h);
  B();
}

ThreadPoolWait(h)
{
  number of non-waiting workers --;

  CheckThreadPool();

  Wait(h);

  number of non-waiting workers ++;
  CheckThreadPool();
}

CheckThreadPool()
{
  if ( number of non-waiting workers < desired number of workers &&
    is there any work to do )
  {
    start a new worker from the pool
  }

  if ( number of non-waiting workers > desired number of workers )
  {
    sleep worker to the pool
  }
}

// CheckThreadPool also has to be called any time a work item is added to the queue

or something like that. Desired number of workers would be number of cores typically. You have to be very careful of the details of this to avoid races, though races here aren't the worst thing in the world because they just mean you have not quite the ideal number of worker threads running.

This is a reasonably elegant solution, and on Windows is probably a good one. On the consoles I'm concerned about the memory use overhead and other costs associated with having a bunch of threads in a pool.
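A rough sketch of the bookkeeping part of that scheme, with no real threads in it. PoolAction, CheckThreadPool as a pure decision function, and ScopedPoolWait are all hypothetical names for illustration; the actual waking and sleeping of workers (and the race handling) is left out :

```cpp
#include <atomic>
#include <cassert>

enum class PoolAction { None, WakeWorker, SleepWorker };

// Decide what the pool should do, given the current counts.
PoolAction CheckThreadPool(int nonWaitingWorkers, int desiredWorkers, bool workAvailable)
{
    assert(desiredWorkers > 0);
    if (nonWaitingWorkers < desiredWorkers && workAvailable)
        return PoolAction::WakeWorker;      // start a new worker from the pool
    if (nonWaitingWorkers > desiredWorkers)
        return PoolAction::SleepWorker;     // sleep a worker back to the pool
    return PoolAction::None;
}

// Brackets a blocking wait : the worker does not count as running while it
// is inside the scope. The real blocking Wait(h) would go inside the scope.
class ScopedPoolWait
{
public:
    explicit ScopedPoolWait(std::atomic<int>& nonWaiting) : m_count(nonWaiting) { --m_count; }
    ~ScopedPoolWait() { ++m_count; }
    ScopedPoolWait(const ScopedPoolWait&) = delete;
    ScopedPoolWait& operator=(const ScopedPoolWait&) = delete;
private:
    std::atomic<int>& m_count;
};
```

The scope guard is just one way to make sure the decrement and increment stay paired even if the wait path returns early.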

Of course if you were Windows only, you should just use the built-in thread pool system. It's been in Windows forever in the form of IO Completion Port handling. New in Vista is a much simpler, more elegant thread pool that basically just does exactly what you want a thread pool to do, and is managed by the kernel so it's fast and robust and all that. For example, with the custom system you have to be careful to use ThreadPoolWait() instead of the normal OS Wait(), and you can't get nice behavior when you do something that puts you to sleep in other ways (like locking a mutex or whatever).

Some links on Windows thread pools and the old IO completion stuff :

MSDN Pooled Threads Improve Scalability With New Thread Pool APIs (Vista)
MSDN Thread Pools (Windows) (Vista)
MSDN Thread Pooling (Windows) (old)
MSDN Thread Pool API (Windows) (Vista)
So you need a worker thread pool... - Larry Osterman's WebLog
Managed ThreadPool vs Win32 ThreadPool (pre-Vista) - Junfeng Zhang's Windows Programming Notes
Dr Dobbs Multithreaded Asynchronous IO & IO Completion Ports
Concurrent, Multi-Core Programming on Windows and .NET (Part II -- Threading Stephen Toub)
MSDN Asynchronous Procedure Calls (Windows)
Why does Win32 even have Fibers - Larry Osterman's WebLog
When does it make sense to use Win32 Fibers - Eric Eilebrecht's blog
Using fibers to simplify enumerators, part 3 Having it both ways - The Old New Thing

So I've rambled a while and don't really have a point. The end.