cbloom rants

5/24/2013

05-24-13 - Hunger Cues

I'm so exhausted that I've been eating from fatigue, which is terrible, so a note for myself. I very much believe in listening to your body, but you have to know what to listen to, and hunger cues can be confusing.

1. Tiredness is not a hunger cue. Yes, sure popping some sugar will give you a boost, but that is not the right solution. When you are tired you need sleep, not food. This is always a huge problem for me in work crunches, I start jamming candy down my throat to try to keep my energy up, and I'm tempted to do it now for baby, so hey self - no, sleepiness is not solved by eating. Go sleep.

2. Your belly is actually not a great cue. Sure extreme belly ache and rumbling means you need to eat, but a slight hollow empty feeling, which most people take as a "I must eat now" is not actually a good hunger cue. Humans are not meant to feel full in the belly all the time, but in our culture we get used to that feeling and so it feels strange when it's gone and you think something is wrong that you have to fix by putting more food in. It's really not; you need to try to detach the mental association of "belly empty" with "eat now".

3. The actual correct hunger cue is lightheadedness, dizziness, or weakness. That means you really do need to eat something now, but perhaps not a lot. (getting quantities right for yourself takes some experimentation over time to figure out)

I believe that the primary goal of food consumption portioning and scheduling should be to eat as little as possible, without ever going into that red zone of critically low blood sugar. (of course I'm assuming that you want to be near your "ideal" body weight, with "ideal" being the modern standard of trim; if your ideal is to be as large as possible then you would do something different). Note that belly feelings show up nowhere in the "primary goal", you just ignore them. Perhaps even learn to enjoy that slight hollow feeling in your gut, which gives you a bit of a hungry wolf feeling, it's sort of energizing. (if I'm slightly hungry, slightly horny, and slightly angry, good god, get out of the way!)

I'm convinced that the right way to eat is something like 5 small meals throughout the day. Long ago when I was single and quite fit, I ate like that and was quite successful at meeting the "primary goal of food consumption portioning and scheduling" (minimal eating without going into the red zone). It's very hard to keep that up in a relationship, because eating a big meal is such a key part of our social conventions. When I was single I would very rarely eat a proper dinner; I would eat a sandwich at 4 PM or something so that I wouldn't really be too hungry at 8 when Fifth Meal came around, so I might just eat a salad and some canned tuna. It is possible to do in a relationship, and I'm sort of trying it now. You have to just make sure that you eat less at the standard meal times, or eat more low-cal food like cooked veg and salads, and then go ahead and eat the intermediate meals yourself. (it's particularly hard when someone else cooks and you feel compelled to eat a large amount to show that you like it; it's also hard at restaurants where the portions are always huge and you feel like you have to eat it to "get your money's worth"; eating around other men is also a problem, because of the stupid macho "let's stuff our faces" competition)

5/22/2013

05-22-13 - Baby Work

Jesus christ it's a lot of work. I was hoping to get back to doing a little bit of RAD work by Monday (two days ago), but it's just not possible yet. I'm doing all the work for Tasha and baby and it's completely swamping me. I get little moments of free time (that I oddly use to write things like this), but never a solid enough block to actually do programming work, and you can never predict when that solid block of time will be, which is so hard for programming. A few times I tried to get going on some actual coding, and then baby wakes up and I have to stop and run around changing diapers and getting mom snacks. I give up.

(even when I do get a bit of solid time, I'm just too tired to be productive; I stare blankly at the code. Actually I've had a few funny "tired dad" moments already. I went to the grocery store and was shopping along, and all of a sudden I noticed my cart was full of stuff that wasn't mine. Oh crap I took someone else's cart at some point because I was so asleep on my feet. I remember all through my childhood my mom would make shopping mistakes that I found so infuriating (Mom, I said corn chex and you got wheat chex, omg mom how could you be so daft!) and now I finally have some sympathy; you just get so exhausted that you can't even perform the most rudimentary of tasks without spacing out and making mistakes).

If you haven't had your baby yet, get some help, don't try to do it alone. (our relief should arrive tomorrow, and it's getting easier each day anyway; the hardest days were the beginning when we were still exhausted from labor and cleaning up after the home birth, and baby hadn't figured out nursing very well yet, that was some serious crunch time). We thought it would be sweet to have some time for just the three of us, and it has been, actually it's been really nice just being alone together, but it's too much work, I don't recommend it.

(we've been incredibly fortunate so far that our baby is super easy, really well behaved, a good sleeper and nurser; we just have none of the problems that you hear about (*). Of course that's not entirely luck (though perhaps mostly luck), we're both super healthy people and we've done what we believe is right to make a happy newborn (singing to the womb, immediate mommy skin contact after birth, never separating baby from parents, cosleeping, breastfeeding, etc.). But it's been so hard even with a well-behaved baby I now have new respect for the parents that go through a baby with colic or feeding difficulties or one that doesn't sleep, you guys are heroes).

(* = of course we're struggling with some acid reflux problems (what used to be called "colic" back when parents were awful and thought babies just cried because they were a nuisance, rather than because they were in serious pain that should be addressed) and forceful letdown and latching difficulties and etc etc, some other typical minor baby fussiness struggles, but that's all just normal baby stuff that I can't complain about, not a serious hardship like a baby with an actual health problem or disability)

House work is so annoying in the way that you can't just get it all done at once. Even during this time when the house work is so much harder than usual, it's still only something like 4-6 hours total of work, but it's all spread out over the day; you work for a while, then you have a 15 minute break (waiting for laundry to finish or the baby to poop again), then you do more work, then another tiny break, etc. I don't do well with work like that; I'm a work sprinter (actually I'm an everything sprinter), I want to get all my tasks on a list and then I'm just gonna strap on the gusto and knock it out as fast as possible so that I can be done and have a deep rest.

I'm sad that I'm so busy running around doing chores that I can't just lie in bed with mom and baby very much. I've always used doing work for people as a way of showing love, and it's fucking retarded. It's not what they really want, and it's not perceived as love. I'm sure that if I had someone else do all the work and I just spent more time cuddling, I would be a better dad.

At family gatherings, there are always those people who disappear from the group to help out in the kitchen; they might help cook, then help set up, help clean up; they do a lot of work and show their love for the family that way. There are other people who just hang out with the group the whole time and chat and smile and are more overtly interested in being there. Of course the hard worker is subconsciously perceived negatively, as standoffish, while the smiler is a "good family man" that everyone enjoys. In my youth I would rage about the unfairness of it all. I'm past that and can see things more clearly :

1. If you have a choice, then of course it's better to be the smiler who does no work and everyone loves. There is no reward in the real world for being a hard worker; not in social situations nor in the workplace. It's much better to be perceived as nice than to actually do deeply giving nice things.

2. Being a friendly, funny social creature is of course a type of "work" that's contributing value to the social situation. Think of them as an MC if you like; they're providing a service to the group, telling stories or laughing or whatever. That's valuable work as well.

3. The people who are really stealing from the group are the ones that don't do the work, and also don't provide laughs and good vibes. They are energy vampires and you should minimize your contact with them.

4. If you're a "smiler", the hard-working types will give you dirty looks or even drop passive aggressive shitty remarks about how "some people don't contribute" or whatever. Fuck them. They're just morons who have chosen a bad path for themselves and are trying to bring you down. Just laugh at them inside your head. Foolish hard-workers, your judgement has no sway over me, I don't need your approval. If someone else wants to do all the hard work for you, and then make themselves feel all sour about the fact that you didn't do the work, fantastic.

5. If you're a hard-worker, don't despair about the unfair world. You have found your lot in life. Maybe you are simply incapable of being a smiler. That's too bad for you, but we all have our place, and it's better to accept it and be content than to rage about what cannot be yours.

In my youth I used to struggle with trying it both ways. One of the nice things about age is you figure out your lot in life and just accept it; some years ago I concluded that I would never really contribute great social energy to groups, so I should just be a dish-doer in order to avoid being an energy vampire.

Anyway.

I'm a bit worried that I will never be able to really concentrate in my home office the way I have in the past. It's been a wonderful sanctuary of peace and alone time for me, where I can dive into my work and there's nobody making noise or peeking in at me (the way people do in offices). But now even just knowing that my baby is in the house, my mind is partly on the baby; is she okay, should I go check on her now? I'm sure that will decrease over time, but never go away. And of course once the child is running around making noise that will be a new distraction. Oh well, I guess we'll see how it goes.

Programming is such hard work to mix with anything else because you really need that solid block of uninterrupted time. You can't just pause and resume; or I can't anyway, I need to get into this sort of trance, which takes a while to achieve, and is quite draining. I feel a bit like a wizard in a fantasy novel; I can cast this amazing spell ("write code"), but to do so drains a bit of my life force, and if I do it more than once per day I bring myself close to death; if you're interrupted in coding, the spell is cancelled. I can write code without the spell, but then progress is slow and difficult, just like a normal human trying to write code, it's so frustrating for me to write code without the spell that I don't like to do it at all.

Anyway.

The actual point of this post is that I feel this need to get back to doing RAD work right away, and it's making me angry. Why do I have to feel that way? Fuck RAD work, I need to be with family. But my god damn hard-working WASPy martyr upbringing makes me feel like I can't ever take anything for myself, that I need to go and sacrifice and work for other people.

The whole time Tasha was pregnant I was crunching like crazy trying to finish Oodle 1.1 and get the real public release done, and to just get as much work done then as I could so that I would feel better about taking time off after the birth. I neglected Tasha when she needed me and she was really upset at me for it. I wanted to get ahead on my schedule, and I did, but now that I'm here my brain won't let me have that and wants me to go back to work.

One of the things I've really struggled with at RAD is the lack of structure and the self-scheduling. There's never a point where I can get ahead of an official schedule; I can't hit all my milestones for the month and then feel okay about taking it easy for a while. Any time I do take it easy because I just need a break, I feel bad about it. In general, my stupid brain makes me productive as a programmer, but also miserable as a programmer.

People who have a job where they just have a list of things to do and they can actually do them all and then go "I'm done, I'm going home" are very fortunate. In programming, the todo list is always effectively infinite (it's finite, but always longer than what you can ever accomplish). You might make a schedule and set a target set of tasks for a given month, but if you get them done sooner you don't go "great, I'm done for the month, time for a few days off", you go "oh, I went faster than expected, I guess I'll adjust the schedule and start on next month's tasks".

Even in a structured programming work environment, if you do your tasks faster than scheduled, you don't get sent home - you get given more tasks. In the traditional producer/team work system, your reward for being the fastest on the team is not more free time or even more pay, it's more work. Yay. Cynical "realist" programmers learn this at some point and many of them start to sandbag. They might finish their tasks quickly, but don't report it to production until their previously allotted time. Or they will intentionally take "slow work" breaks, like browsing the web or watching videos while working.

I used to use my speed as a way to work on features I wasn't supposed to; mainly in the pre-Oddworld days, I would sprint and do my assigned tasks, and then not tell anyone I had finished much faster than scheduled, so that I could work on VIPM or secretly rewriting the DX render layer or some other task that had been decided by production was "low priority". Oo, what a rebel I was, secretly giving my employer masses of value for free, great way to use your youth cbloom.

Anyway. I guess this post is all just my way of trying to convince myself that it's okay for me to take a few more days off.

5/20/2013

05-20-13 - Thoughts on Data Compression for MMOs

I've been thinking about this a bit and thought I'd scribble some ideas publicly. (MMO = not necessarily just MMOs but any central game server with a very large number of clients, that cares about total bandwidth).

The situation is roughly this : you have a server and many clients (let's say 10k clients per server just to be concrete). Data is mostly sent server->client , not very much is client->server. Let's say 10k bytes per second per channel from server->client, and only 1k bytes per second from client->server. So the total data rate from the server is high (100 MB/s) but the data rate on any one channel is low. The server must send packets more than once per second for latency reasons; let's say 10 times per second, so packets are only 1k on average; server sends 100k packets/second. You don't want the compressor to add any delay by buffering up packets.
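The numbers above are easy to sanity check in a few lines. This is only a back-of-envelope sketch: the struct and function names are made up for illustration, "k" is treated as 1000, and the per-packet time budget assumes (hypothetically) that some number of cores do nothing but compress.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative figures from the scenario, not measurements.
struct ServerLoad
{
    uint64_t clients;                // clients per server
    uint64_t bytesPerSecPerClient;   // server->client rate on one channel
    uint64_t packetsPerSecPerClient; // how often each client gets a packet
};

// Total server->client bytes per second across all channels.
inline uint64_t TotalBytesPerSec(const ServerLoad & s)
{
    return s.clients * s.bytesPerSecPerClient;
}

// Total packets per second, i.e. how many encode calls per second.
inline uint64_t TotalPacketsPerSec(const ServerLoad & s)
{
    return s.clients * s.packetsPerSecPerClient;
}

// Average packet size on one channel.
inline uint64_t AvgPacketBytes(const ServerLoad & s)
{
    return s.bytesPerSecPerClient / s.packetsPerSecPerClient;
}

// Nanoseconds available per encode call if `cores` cores only compress.
// (Assumption for illustration; the scenario does not fix a core count.)
inline uint64_t NanosPerPacket(const ServerLoad & s, uint64_t cores)
{
    return cores * 1000000000ull / TotalPacketsPerSec(s);
}
```

With 10k clients, 10k bytes/sec each and 10 packets/sec each, that gives 100 MB/s total, 100k packets/sec, 1k per packet, and about 10 microseconds per encode call for each core you dedicate to compression.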

I'm going to assume that you're using something like TCP so you have guaranteed packet order and no loss, so that you can use previous packets from any given channel as encoder history on that channel. If you do have an error in a connection you have to reset the encoder.

This is a rather unusual situation for data compression, and the standard LZ77 solutions don't work great. I'm going to talk only about the server->client transmission for now; you might use a completely different algorithm for the other direction. Some properties of this situation :

1. You care more about encode time than decode time. CPU time on the server is one of the primary limiting factors. The client machine has almost no compression work to do, so decode time could be quite slow.

2. Per-call encode time is very important (not just per-byte time). Packets are small and you're doing lots of them (100k packets/sec), so you can't afford long startup/shutdown times for the encoder. This is mostly just an annoyance for coding, it means you have to be really careful about your function setup code and such.

3. Memory use on the server is a bit limited. Say you allow 4 GB for encoder states; that's only 400k per client. (this is assuming that you're doing per-client encoder state, which is certainly not the only choice).

4. Cache misses are much worse than a normal compression encoder scenario. Say you have something like a 256k hash table to accelerate match finding. Normally when you're compressing you get that whole table into L2 so your hash lookups are in cache. In this scenario you're jumping from one state to another all the time, so you must assume that every memory lookup is a cache miss.

5. The standard LZ77 thing of not allowing matches at the beginning or end is rather more of a penalty. In general all those inefficiencies that you normally have on tiny buffers are more important than usual.

6. Because clients can be added at any time and connections can be reset, encoder init/reset time can't be very long. This is another reason aside from memory use that encoder state must be small.

7. The character of the data being sent doesn't really vary much from client to client. This is one way in which this scenario differs from a normal web server type of situation (in which case, different clients might be receiving vastly different types of data). The character of the data can change from packet to packet; there are sort of a few different modes of the data and the stream switches between them, but it's not like one client is usually receiving text and another one is receiving images. They're all generally receiving bit-packed 3d positions and the same type of thing.

And now some rambling about what encoder you might use that suits this scenario :

A. It's not clear that adaptive encoding is a win here. You have to do the comparison with CPU use held constant; if you just compare an encoder running adaptive vs the same encoder with a static model, that's not fair, because the static model can be so much faster that you should pair it with a more sophisticated encoder. The static model can also use vastly more memory. Maybe not a whole 4G, but a lot more than 400k.

B. LZ77 is not great here. The reason we love LZ77 is the fast, simple decoder. We don't really care about that here. An LZP or ROLZ variant would be better, that has a slightly slower and more memory-hungry decoder, but a simpler/faster encoder.

C. There are some semi-static options. Perhaps a static match dictionary with something like LZP, and then an adaptive simple context model per channel. That makes the per-channel adaptive part small in memory, but still allows some local adaptation for packets of different character. Another option would be a switching static-model scheme. Do something like train 4 different static models for different packet types, and send 2 bits to pick the model then encode the packet with that static model.

D. Static context mixing is kind of appealing. You can have static hash tables and a static mixing scheme, which eliminates a lot of the slowness of CM. Perhaps the order-0 model is adaptive per channel, and perhaps the secondary-statistics table is adaptive per channel. Hitting 100 MB/s might still be a challenge, but I think it's possible. One nice thing about CM here is that it can have the idea of packets of different character implicit in the model.

E. For static-dictionary LZ, the normal linear offset encoding doesn't really make a lot of sense. Sure, you could try to optimize a dictionary by laying out the data in it such that more common data is at lower offsets, but that seems like a nasty indirect way of getting at the solution. Off the top of my head, it seems like you could use something like an LZFG encoding. That is, make a Suffix Trie and then send match references as node or leaf indexes; leaves all have equal probability, nodes have a child count which is proportional to their probability (relative to siblings).

F. Surely the ideal solution is a blended static/dynamic coder. That is, you have some large trained static model (like a CM or PPM model, or a big static dictionary for LZ77) and then you also run a local adaptive model in a circular window for each channel. Then you compress using a mix of the two models. There are various options on how to do this. For LZ77 you might send 0-64k offsets in the local adaptive window, and then 64k-4M offsets as indexes into the static dictionary. Or you could more explicitly code a selector bit to pick one of the two and then an offset. For CM it's most natural, you just mix the result of the static model and the local adaptive model.

G. What is not awesome is model preconditioning (and it's what most people do, because it's the only thing available with off-the-shelf compressors like zlib or whatever). Model preconditioning means taking an adaptive coder and initially loading its model (eg. an LZ dictionary) from some static data; then you encode packets adaptively. This might actually offer excellent compression, but it has bad channel startup time, and high memory use per channel, and it doesn't allow you to use more efficient algorithms that are possible with fully static models (such as different types of data structures that provide fast lookup but not fast update).
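To make the LZ77 variant of option F a bit more concrete, here is a minimal sketch of the "one offset space, two sources" idea: offsets below 64k refer to the per-channel adaptive window, offsets from 64k up to 4M index the shared static dictionary. The names (MatchSource, PackOffset, UnpackOffset) are hypothetical and only the offset mapping is shown, not match finding or entropy coding.

```cpp
#include <cassert>
#include <cstdint>

// Sizes from the example in F; both are illustrative.
static const uint32_t kLocalWindowSize = 64u * 1024u;        // per-channel adaptive window
static const uint32_t kOffsetSpace     = 4u * 1024u * 1024u; // total offset range (4M)
static const uint32_t kStaticDictSize  = kOffsetSpace - kLocalWindowSize;

struct MatchSource
{
    bool     inStaticDict; // false = local window, true = static dictionary
    uint32_t position;     // position in the window, or index into the dictionary
};

// Encoder side: fold the two sources into a single offset value.
inline uint32_t PackOffset(const MatchSource & m)
{
    return m.inStaticDict ? (kLocalWindowSize + m.position) : m.position;
}

// Decoder side: split the offset back into its source and position.
inline MatchSource UnpackOffset(uint32_t offset)
{
    MatchSource m;
    m.inStaticDict = (offset >= kLocalWindowSize);
    m.position     = m.inStaticDict ? (offset - kLocalWindowSize) : offset;
    return m;
}
```

The nice property of folding both into one number is that the offset statistics sort out which source is more useful on their own; the explicit selector-bit alternative mentioned in F trades that for a simpler offset coder.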

Obviously if you're doing UDP or some other unreliable packet scheme, then static-model compression is the only way to go. Also if clients are very frequently joining and leaving or moving servers, then they will never build up much channel history and static model is the way to go there as well. If streams vary vastly in size, like if they're usually less than 1k but occasionally you do large 1M+ transfers (like for content serving as opposed to updating game state) you would use totally different schemes.
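For the fully static case, the switching static-model scheme from option C could look roughly like this: price the packet under each of the 4 trained models, send the winner's index in 2 bits, then code the packet with that model. This is a sketch under loud assumptions: the models here are plain order-0 cost tables (a real static model would be richer), the 1/16-bit cost units are arbitrary, and StaticModel, PacketCost and ChooseModel are made-up names. Only the selection step is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static const int kNumModels = 4; // 4 models -> 2-bit selector

// Hypothetical static order-0 model: cost of each byte value in 1/16 bit
// units, computed offline from training packets and never updated.
struct StaticModel
{
    uint16_t cost[256];
};

// Estimated coded size of a packet under one model, in 1/16 bit units.
inline uint64_t PacketCost(const StaticModel & model, const uint8_t * packet, size_t size)
{
    uint64_t total = 0;
    for (size_t i = 0; i < size; i++)
        total += model.cost[packet[i]];
    return total;
}

// Returns the selector (0..3) to send in 2 bits ahead of the packet.
// Ties go to the lowest index.
inline int ChooseModel(const StaticModel models[kNumModels], const uint8_t * packet, size_t size)
{
    int best = 0;
    uint64_t bestCost = PacketCost(models[0], packet, size);
    for (int m = 1; m < kNumModels; m++)
    {
        uint64_t c = PacketCost(models[m], packet, size);
        if (c < bestCost)
        {
            bestCost = c;
            best = m;
        }
    }
    return best;
}
```

Pricing every packet four times is not free given how tight encode time is here; if the game already knows what kind of packet it is sending, picking the model from the packet type would skip the search entirely.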

I'd like to do some tests. If you work on an MMO or similar game situation and can give me some real-world test data, please be in touch.

5/17/2013

05-17-13 - Cloth Diapering

Oh yes indeed, you are in for a spate of baby-related blogging.

I'm pretty sure cloth diapers are bullshit. I'm about to cancel my diaper service. In this first week I've been using a semi-alternating mix of cloth and disposable. I assumed that I would start out with disposables just for ease in the first few days and then switch to cloth because it's "better", but I don't think I will.

(I make all my decisions now based only on 1. personal observations and 2. serious scientific studies where I can read the original papers. I try to avoid and discount 3. journalism 4. hearsay 5. the internet 6. mass-market nonfiction. I think they are garbage and mental poison.)

What I'm seeing is :

Disposable diapers actually work the way they claim to. The seal around the borders is good. The entire diaper itself has a nice low profile so is not too bulky or uncomfortable. But most importantly, they actually do trap and absorb moisture. When baby has a heavy pee in a disposable diaper, the moisture stays right in one little spot and doesn't spread all over. When I remove the diaper I can feel her skin all over the nether regions is pretty dry.

Cloth diapers don't. The worst aspect is that when baby has a heavy pee, the cloth soaks it up, and because it's cloth and wicks moisture, the pee is spread all over her entire lower parts. When I get the diaper off, she's soaking wet all over. (and yes of course I'm changing her almost instantly after peeing because at this point we're watching her constantly). That alone is enough to turn me off cloth diapers, but there's lots more that sucks about them. It's really hard to get the diaper cover on such that it actually makes a water-tight seal, so leakage is much more likely (and if you do try to make it water tight, it's easy to make it too tight and cut off circulation, which I accidentally did once). The cloth diaper alone looks pretty comfortable on her, but the diaper cover is much rougher and more bulky than a disposable; the result is that she has this huge awkward thing on.

When you add the inconvenience of cloth diapers (longer changing times, having to store poop in your house, taking the pail in and out for pickup), it just seems like a massive lose.

The only possible argument pro-cloth that makes sense to me is the reduction of the landfill load. Now, environmental arguments are always complicated; there are arguments for the other side based on the environmental cost of washing (though I think they're bogus). But even assuming that the environmental case is clear, being a hypocritical liberal I wouldn't actually inconvenience myself and discomfort my baby for the benefit of the landfill.

Eh, actually I take back that false self-accusation. That's a retarded Fox News style "gotcha" that's based on misrepresentation and not understanding. I've never advocated the standard liberal martyrdom (and if I once did, I certainly don't now). I don't believe in choosing to undermine yourself because you believe the world would be better if everyone did it. I believe in changing the laws such that they encourage you to make the choice that is better for the world. eg. people who don't drive because they believe it's evil, even if it would be much to their benefit, are just being dumb martyrs. The US government massively subsidizes driving, so if you don't take advantage of that you are essentially paying for other people to drive. I would love it if the government would subsidize *not driving* rather than the other way around, but until they do I'm driving up a storm. (tangent : the massive subsidies for Teslas is a great example of the way that Dems and Reps are in fact both really working for the same cause : creating loop holes and kick backs so that they can give money to rich people).

I'm a big tangent wanderer. My political philosophy in a nutshell :

Government's role is to create a market structure (through laws, regulation, the Fed, direct market action, etc) such that when each actor maximizes their own personal utility, the net result is as good for the entire world (nation) as possible.

(if you're out of high school (or the 18th century) you should know that a free market does not do that on its own)

(And crucially, "good for" must be defined on something like a sum-of-logs scale, or perhaps just maximize the median, or minimize the number in poverty; if you maximize the sum (basically GDP) then giving huge profits to Larry Ellison and fucking everyone else looks like it's "good for the world")

And, uh, oh yeah, cloth diapers suck.

5/15/2013

05-15-13 - Baby

I suppose this is the easiest way to announce to various friends and semi-friends rather than trying to mass-email. I have a new baby, yay! No pictures, you paparazzos. She's adorable and healthy. I love how simple and direct her communication is. Suckling lips = needs to nurse. Squirming = needs a diaper change. Fussing = cold or gassy. Everything else = needs to sleep. I wish all humans communicated so clearly.

I want to write about the wonderful experience of having a home birth (see *2), but don't want to intrude on Tasha's privacy. Suffice it to say it was really good, so good to be home and have everything at hand to make Tasha comfortable, and then be able to take baby in our arms and settle into bed right away. We spent the first 36 hours after birth all in bed together and I think that time was really important.

I've always wanted to have kids, but I'm (mostly) glad that I waited this long. For one thing Tasha is a wonderful mom and I'm glad I found her. But also, I realize now that I wasn't ready in my twenties. I've changed a lot in the last five years and I'm a much better person now. I've learned important lessons that are helping me a lot in this challenging time, like to do hard work correctly you have to not only complete the task but also keep a good attitude and be nice to the people around you while you do it. And that when you are tired and hungry is when you can really show your character; anyone can have a good attitude when they're fresh, but if you get nasty when the going gets tough then you are nasty. etc. standard cbloom slowly figures out things that most people learned in their teens.

Now for some old-style ranting.

1. "We had a baby". No you fucking did not. Your wife had a baby. If you were a really good husband, you held her hand and got her snacks. She squeezed a watermelon out of her vagina. You do not get to take any credit for that act, it was all her. It's a bit like Steve Jobs saying "we invented" anything; no you did not you fucking credit-stealing douchebag, your company didn't even invent it, much less you.

(tangent : I can't stand the political correctness in sport post-game interviews these days; they're all so sanitized and formulaic. They must go to interview coaching classes or something because everyone says exactly the same things. Of course it's not the athlete's fault, they would love to have emotional honest outbursts, it's the god damn stupid public who throw a conniption if anybody says anything remotely true. In particular this post reminds me of how athletes always immediately go "it wasn't just me, it was the team"; no it was not, Kobe, you just had an 80 point game, it was all fucking you, don't give me this bullshit credit to the team stuff. Be a man and say "*I* won this game".)

2. People are busy-body dicks. When we would tell acquaintances about our plans to have a home birth, a good 25% would feel like they had to tell us what a bad idea that was and nag us about the dangers of childbirth. Shut the fuck up you stupid asshole. First of all, don't you think that maybe we've researched that more than you before making our decision, so you don't know WTF you're talking about? Second of all, we're not going to change our mind because of your nagging, so all you're doing is being nasty about something you're not going to change. We didn't ask for your opinion, you can just stay the hell out of it. (The doctors that we would occasionally see for tests were often negative and naggy as well, which only made us more confident in our choice).

It's a bit like if a friend tells you they're marrying someone and you go "her?". Even if the marriage is a questionable choice, they're not going to stop it due to your misgivings, so all you're doing is adding some unpleasantness to their experience.

You always run into these idiots when you do software reviews or brainstorming sessions. You'll call a meeting to discuss revisions to the boss fight sequence, and some asshole will always chime in with "I really think the whole idea of boss fights sucks and we should start over". Umm, great, thanks, very helpful. We're not going to tear up the whole design of the game a few months from shipping, so maybe you could stick to the topic at hand and get some kind of clue about what things are reasonable to change and which need to be taken as a given and worked within as constraints.

Like when I'd ask for reviews of Oodle, a few of the respondents would give me something awesomely unhelpful like "I don't like the entire style of the API, and I'd throw it out and do a new one" , or "actually I think a paging + data compression library is a bad idea and I'd just start over on something else". Great, thanks; I might agree with you but obviously you must know that that is not going to happen and it's not what I was asking for, so if you don't want to say anything helpful then just say "no".

ADDENDUM : a few notes on home birth and midwives.

Even if you are planning to do home birth (without a doctor present), you should get an OB and do a prenatal visit with them to "establish care". That way you are officially their patient, even if you don't see them again. In the US health care system, if you do wind up having a problem, or even just a question, and you have not "established care" with a certain practice, you are just fucked. You would wind up at the ER and that's never good.

While the midwives seemed reasonably competent at popping out a healthy baby from a healthy mother with no complications, I certainly would not do it if there were any major risk factors. They also are less than thorough at the prenatal and postnatal exams, so it's probably worth seeing a regular doc for those at least once (probably only once).

5/08/2013

05-08-13 - A Lock Free Weak Reference Table

It's very easy (almost trivial (*)) to make the table-based {index/guid} style of weak reference lock free.

(* = obviously not trivial if you're trying to minimize the memory ordering constraints, as evidenced by the revisions to this post that were required; it is trivial if you just make everything seq_cst)

Previous writings on this topic :

Smart & Weak Pointers - valuable tools for games - 03-27-04
cbloom rants 03-22-08 - 6
cbloom rants 07-05-10 - Counterpoint 2
cbloom rants 08-01-11 - A game threading model
cbloom rants 03-05-12 - Oodle Handle Table

The primary ops conceptually are :


Add object to table; gives it a WeakRef id

WeakRef -> OwningRef  (might be null)

OwningRef -> naked pointer

OwningRef construct/destruct = ref count inc/dec

The full code is in here : cbliblf.zip , but you can get a taste for how it works from the ref count maintenance code :


    // IncRef looks up the weak reference; returns null if lost
    //   (this is the only way to resolve a weak reference)
    Referable * IncRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];

        handle_type guid = handle_get_guid(h);

        // this is just an atomic inc of state
        //  but checking guid each time to check that we haven't lost this slot
        handle_type state = s->m_state.load(mo_acquire);
        for(;;)
        {
            if ( state_get_guid(state) != guid )
                return NULL;
            // assert refcount isn't hitting max
            LF_OS_ASSERT( state_get_refcount(state) < state_max_refcount );
            handle_type incstate = state+1;
            if ( s->m_state.compare_exchange_weak(state,incstate,mo_acq_rel,mo_acquire) )
            {
                // did the ref inc
                return s->m_ptr;
            }
            // state was reloaded, loop
        }
    }

    // IncRefRelaxed can be used when you know a ref is held
    //  so there's no chance of the object being gone
    void IncRefRelaxed( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        handle_type state_prev = s->m_state.fetch_add(1,mo_relaxed);
        state_prev;
        // make sure we were used correctly :
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 0 );
        LF_OS_ASSERT( state_get_refcount(state_prev) < state_max_refcount );
    }

    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state.fetch_add((handle_type)-1,mo_release);
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 1 );
        if ( state_get_refcount(state_prev) == 1 )
        {
            // I took refcount to 0
            //  slot is not actually freed yet; someone else could IncRef right now
            //  the slot becomes inaccessible to weak refs when I inc guid :
            // try to inc guid with refcount at 0 :
            handle_type old_guid = handle_get_guid(h);
            handle_type old_state = make_state(old_guid,0); // == state_prev-1
            handle_type new_state = make_state(old_guid+1,0); // == old_state + (1<<handle_guid_shift);

            if ( s->m_state.compare_exchange_strong(old_state,new_state,mo_acq_rel,mo_relaxed) )
            {
                // I released the slot
                // cmpx provides the acquire barrier for the free :
                FreeSlot(s);
                return;
            }
            // somebody else mucked with me
        }
    }
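
The code above leans on helpers like make_state, state_get_guid and state_get_refcount that are not shown. Here is a minimal sketch of one possible packing, assuming a 32-bit state word with the refcount in the low 16 bits and the guid in the high 16 bits (so that state+1 is a refcount increment). The bit widths, the name state_guid_shift and the single-slot try_inc_ref are assumptions made for illustration, not the layout used in cbliblf.zip.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical layout : [ guid : 16 | refcount : 16 ]
typedef uint32_t handle_type;

static const int         state_guid_shift   = 16;
static const handle_type state_max_refcount = (1u << state_guid_shift) - 1;

inline handle_type make_state(handle_type guid, handle_type refcount)
{
    return (guid << state_guid_shift) | refcount;
}

inline handle_type state_get_guid(handle_type state)     { return state >> state_guid_shift; }
inline handle_type state_get_refcount(handle_type state) { return state & state_max_refcount; }

// Same shape as IncRef, reduced to one slot :
//  take a ref only if the guid in the state still matches the weak ref's guid
inline bool try_inc_ref(std::atomic<handle_type> & slot_state, handle_type guid)
{
    handle_type state = slot_state.load(std::memory_order_acquire);
    for(;;)
    {
        if ( state_get_guid(state) != guid )
            return false; // slot was released; the weak ref is dead
        assert( state_get_refcount(state) < state_max_refcount );
        if ( slot_state.compare_exchange_weak(state, state+1,
                std::memory_order_acq_rel, std::memory_order_acquire) )
            return true;
        // the failed CAS reloaded state, loop
    }
}
```

With this packing, bumping the guid at refcount 0 is the same as adding (1<<state_guid_shift) to the state, which is why a stale weak ref fails the guid compare without any extra bookkeeping.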

The maintenance of ref counts only requires relaxed atomic increment & release atomic decrement (except when the pointed-at object is initially made and finally destroyed, then some more work is required). Even just the relaxed atomic incs could get expensive if you did a ton of them, but my philosophy for how to use this kind of system is that you inc & dec refs as rarely as possible. The key thing is that you don't write functions that take owning refs as arguments, like :


void bad_function( OwningRefT<Thingy> sptr )
{
    more_bad_funcs(sptr);
}

void Stuff::bad_caller()
{
    OwningRefT<Thingy> sptr( m_weakRef );
    if ( sptr != NULL )
    {
        bad_function(sptr);
    }
}

hence doing lots of inc & decs on refs all over the code. Instead you write all your code with naked pointers, and only use the smart pointers where they are needed to ensure ownership for the lifetime of usage. eg. :

void good_function( Thingy * ptr )
{
    more_good_funcs(ptr);
}

void Stuff::good_caller()
{
    OwningRefT<Thingy> sptr( m_weakRef );
    Thingy * ptr = sptr.GetPtr();
    if ( ptr != NULL )
    {
        good_function(ptr);
    }
}

If you like formal rules, they're something like this :


1. All stored variables are either OwningRef or WeakRef , depending on whether it's
an "I own this" or "I see this" relationship.  Never store a naked pointer.

2. All variables in function call args are naked pointers, as are variables on the
stack and temp work variables, when possible.

3. WeakRef to pointer resolution is only provided as WeakRef -> OwningRef.  Naked pointers
are only retrieved from OwningRefs.

And obviously there are lots of enhancements to the system that are possible. A major one that I recommend is to put more information in the reference table state word. If you use a 32-bit weak reference handle, and a 64-bit state word, then you have 32-bits of extra space that you can check for free with the weak reference resolution. You could put some mutex bits in there (or an rwlock) so that the state contains the lock for the object, but I'm not sure that is a big win (the only advantage of having the lock built into the state is that you could atomically get a lock and inc refcount in a single op). A better usage is to put some object information in there that can be retrieved without chasing the pointer and inc'ing the ref and so on.

For example in Oodle I store the status of the object in the state table. (Oodle status is a progression through Invalid->Pending->Done/Error). That way I can take a weak ref and query status in one atomic load. I also store some lock bits, and you aren't allowed to get back naked pointers unless you have a lock on them.
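
A rough sketch of the "status in the state word" idea, assuming a 64-bit state with the status in the top 32 bits and a 16-bit guid and 16-bit refcount below it. The layout, the enum values and the names make_state64 and query_status are hypothetical; this is not Oodle's actual format.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

enum Status : uint32_t
{
    Status_Invalid = 0,
    Status_Pending = 1,
    Status_Done    = 2,
    Status_Error   = 3
};

// Hypothetical 64-bit state word : [ status : 32 | guid : 16 | refcount : 16 ]
inline uint64_t make_state64(Status status, uint32_t guid, uint32_t refcount)
{
    return ((uint64_t)status << 32) | ((uint64_t)(guid & 0xFFFF) << 16) | (refcount & 0xFFFF);
}

inline uint32_t state64_get_guid(uint64_t s)   { return (uint32_t)(s >> 16) & 0xFFFF; }
inline Status   state64_get_status(uint64_t s) { return (Status)(uint32_t)(s >> 32); }

// Query status through a weak ref with one atomic load :
//  no ref is taken and the object itself is never touched.
// A stale weak ref (guid mismatch) reports Status_Invalid.
inline Status query_status(const std::atomic<uint64_t> & slot_state, uint32_t guid)
{
    uint64_t s = slot_state.load(std::memory_order_acquire);
    if ( state64_get_guid(s) != guid )
        return Status_Invalid;
    return state64_get_status(s);
}
```

Because the guid and the status live in the same word, the check "is this still my object" and the read of its status cannot be torn apart by a concurrent free.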

The code for the weak ref table is now in the cbliblf.zip that I made for the last post. Download : cbliblf.zip

( The old cblib has a non-LF weak reference table that's similar for comparison. It's also more developed with helpers and fancier templates and such that could be ported to this version. Download : cblib.zip )

ADDENDUM : alternative DecRef that uses CAS instead of atomic decrement. Removes the two-atomic free path. Platforms that implement atomic add as a CAS loop should probably just use this form. Platforms that have true atomic add should use the previously posted version.


    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state.load(mo_relaxed);
        
        handle_type old_guid  = handle_get_guid(h);

        for(;;)
        {
            // I haven't done my dec yet, guid must still match :
            LF_OS_ASSERT( state_get_guid(state_prev) == old_guid );
            // check current refcount :
            handle_type state_prev_rc = state_get_refcount(state_prev);
            LF_OS_ASSERT( state_prev_rc >= 1 );
            if ( state_prev_rc == 1 )
            {
                // I'm taking refcount to 0
                // also inc guid, which releases the slot :
                handle_type new_state = make_state(old_guid+1,0);

                if ( s->m_state.compare_exchange_weak(state_prev,new_state,mo_acq_rel,mo_relaxed) )
                {
                    // I released the slot
                    // cmpx provides the acquire barrier for the free :
                    FreeSlot(s);
                    return;
                }
            }
            else
            {
                // this is just a decrement
                // but have to do it as a CAS to ensure state_prev_rc doesn't change on us
                handle_type new_state = state_prev-1;
                LF_OS_ASSERT( new_state == make_state(old_guid,  state_prev_rc-1) );
                
                if ( s->m_state.compare_exchange_weak(state_prev,new_state,mo_release,mo_relaxed) )
                {
                    // I dec'ed a ref
                    return;
                }
            }
        }
    }

5/02/2013

05-02-13 - Simple C++0x style LF structs and adapter for x86-Windows

I've seen a lot of lockfree libraries out there that are total crap. Really weird non-standard ways of doing things, or overly huge and complex.

I thought I'd make a super simple one in the correct modern style. Download : cbliblf.zip

(If you want a big fully functional much-more-complete library, Intel TBB is the best I've seen. The problem with TBB is that it's huge and entangled, and the license is not clearly free for all use).

There are two pieces here :

"cblibCpp0x.h" provides atomic and such in C++0x style for MSVC/Windows/x86 compilers that don't have real C++0x yet. I have made zero attempt to make this header syntactically identical to C++0x, there are various intentional and unintentional differences.

"cblibLF.h" provides some simple lockfree utilities (mostly queues) built on C++0x atomics.

"cblibCpp0x.h" is kind of by design not at all portable. "cblibLF.h" should be portable to any C++0x platform.

WARNING : this stuff is not super well tested because it's not what I use in Oodle. I've mostly copy-pasted this from my Relacy test code, so it should be pretty strong but there may have been some copy-paste errors.

ADDENDUM : In case it's not clear, you do not *want* to use "cblibCpp0x.h". You want to use real Cpp0x atomics provided by your compiler. This is a temporary band-aid so that people like me who use old compilers can get a cpp0x stand-in, so that they can do work using the modern syntax. If you're on a gcc platform that has the __atomic extensions but not C1X, use that.

You should be able to take any of the C++0x-style lockfree code I've posted over the years and use it with "cblibCpp0x.h" , perhaps with some minor syntactic fixes. eg. you could take the fastsemaphore wrapper and put the "semaphore" from "cblibCpp0x.h" in there as the base semaphore.

Here's an example of what the objects in "cblibLF.h" look like :


//=================================================================         
// spsc fifo
//  lock free for single producer, single consumer
//  requires an allocator
//  and a dummy node so the fifo is never empty
template <typename t_data>
struct lf_spsc_fifo_t
{
public:

    lf_spsc_fifo_t()
    {
        // initialize with one dummy node :
        node * dummy = new node;
        m_head = dummy;
        m_tail = dummy;
    }

    ~lf_spsc_fifo_t()
    {
        // should be one node left :
        LF_OS_ASSERT( m_head == m_tail );
        delete m_head;
    }

    void push(const t_data & data)
    {
        node * n = new node(data);
        // n->next == NULL from constructor
        m_head->next.store(n, memory_order_release); 
        m_head = n;
    }

    // returns true if a node was popped
    //  fills *pdata only if the return value is true
    bool pop(t_data * pdata)
    {
        // we're going to take the data from m_tail->next
        //  and free m_tail
        node* t = m_tail;
        node* n = t->next.load(memory_order_acquire);
        if ( n == NULL )
            return false;
        *pdata = n->data; // could be a swap
        m_tail = n;
        delete t;
        return true;
    }

private:

    struct node
    {
        atomic<node *>      next;
        nonatomic<t_data>   data;
        
        node() : next(NULL) { }
        node(const t_data & d) : next(NULL), data(d) { }
    };

    // head and tail are owned by separate threads,
    //  make sure there's no false sharing :
    nonatomic<node *>   m_head;
    char                m_pad[LF_OS_CACHE_LINE_SIZE];
    nonatomic<node *>   m_tail;
};
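
For anyone on a compiler with real C++11 atomics, the same queue can be written directly against std::atomic. This is a sketch under assumptions, not the cbliblf.zip code: the name spsc_fifo_std is made up, the nonatomic wrapper is replaced with plain members, and 64 is an assumed cache line size.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

template <typename t_data>
struct spsc_fifo_std
{
    spsc_fifo_std()
    {
        node * dummy = new node;   // dummy node so the fifo is never empty
        m_head = dummy;
        m_tail = dummy;
    }

    ~spsc_fifo_std()
    {
        assert( m_head == m_tail ); // caller must drain before destroying
        delete m_head;
    }

    // call from the producer thread only
    void push(const t_data & data)
    {
        node * n = new node(data);
        // release : the node contents are visible before the link is
        m_head->next.store(n, std::memory_order_release);
        m_head = n;
    }

    // call from the consumer thread only
    //  fills *pdata only if the return value is true
    bool pop(t_data * pdata)
    {
        node * t = m_tail;
        node * n = t->next.load(std::memory_order_acquire);
        if ( n == nullptr )
            return false;
        *pdata = n->data;
        m_tail = n;   // n becomes the new dummy
        delete t;
        return true;
    }

private:
    struct node
    {
        std::atomic<node *> next;
        t_data              data;
        node() : next(nullptr), data() { }
        explicit node(const t_data & d) : next(nullptr), data(d) { }
    };

    // producer-owned and consumer-owned pointers on separate cache lines
    alignas(64) node * m_head;
    alignas(64) node * m_tail;
};
```

Usage is one thread calling push and one other thread calling pop in a loop; any more producers or consumers than that and the plain m_head / m_tail accesses become data races.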

Download : cbliblf.zip

4/30/2013

04-30-13 - Packing Values in Bits - Flat Codes

One of the very simplest forms of packing values in bits is simply to store a value with non-power-of-2 range and all values of equal probability.

You have a value that's in [0,N). Ideally all code lengths would be the same ( log2(N) ) which is fractional for N not a power of 2. With just bit output, we can't write fractional bits, so we will lose some efficiency. But how much exactly?

You can of course trivially write a symbol in [0,N) by using log2ceil(N) bits. That's just going up to the next integer bit count. But you're wasting values in there, so you can take each wasted value and use it to reduce the length of a code that you need. eg. for N = 5 , start with log2ceil(N) bits :

0 : 000
1 : 001
2 : 010
3 : 011
4 : 100
x : 101
x : 110
x : 111
The first five codes are used for our values, and the last three are wasted. Rearrange to interleave the wasted codewords :
0 : 000
x : 001
1 : 010
x : 011
2 : 100
x : 101
3 : 110
4 : 111
now since we have adjacent codes where one is used and one is not used, we can reduce the length of those codes and still have a prefix code. That is, if we see the two bits "00" we know that it must always be a value of 0, because "001" is wasted. So simply don't send the third bit in that case :
0 : 00
1 : 01
2 : 10
3 : 110
4 : 111

(this is a general way of constructing shorter prefix codes when you have wasted values). You can see that the number of wasted values we had at the top is the number of codes that can be shortened by one bit.

A flat code is written thusly :


void OutputFlat(int sym, int N)
{
    ASSERT( N >= 2 && sym >= 0 && sym < N );

    int B = intlog2ceil(N);
    int T = (1<<B) - N;
    // T is the number of "wasted values"
    if ( sym < T )
    {
        // write in B-1 bits
        PutBits(sym, B-1);
    }
    else
    {
        // write in B bits
        // push value up by T
        PutBits(sym+T, B);
    }
}

int InputFlat(int N)
{
    ASSERT( N >= 2 );

    int B = intlog2ceil(N);
    int T = (1<<B) - N;

    int sym = GetBits(B-1);
    if ( sym < T )
    {
        return sym;
    }
    else
    {
        // need one more bit :
        int ret = (sym<<1) - T + GetBits(1);        
        return ret;
    }
}

That is, we write (T) values in (B-1) bits, and (N-T) in (B) bits. The intlog2ceil can be slow, so in practice you would want to precompute that or pass it in as a parameter.
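
To check the encoder and decoder against each other, here is a self-contained round-trip sketch. The BitBuffer type (one int per bit, MSB first), the simple loop version of intlog2ceil and FlatRoundTripBits are stand-ins invented for this example; a real coder would use its own PutBits/GetBits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy bit io : stores one bit per vector entry, MSB first
struct BitBuffer
{
    std::vector<int> bits;
    size_t           readPos = 0;

    void PutBits(int value, int count)
    {
        for (int i = count-1; i >= 0; i--)
            bits.push_back( (value >> i) & 1 );
    }
    int GetBits(int count)
    {
        int value = 0;
        for (int i = 0; i < count; i++)
            value = (value << 1) | bits[readPos++];
        return value;
    }
};

inline int intlog2ceil(int N)
{
    int B = 0;
    while ( (1 << B) < N ) B++;
    return B;
}

inline void OutputFlat(BitBuffer & bb, int sym, int N)
{
    assert( N >= 2 && sym >= 0 && sym < N );
    int B = intlog2ceil(N);
    int T = (1 << B) - N;   // number of "wasted values"
    if ( sym < T ) bb.PutBits(sym, B-1);
    else           bb.PutBits(sym+T, B);
}

inline int InputFlat(BitBuffer & bb, int N)
{
    assert( N >= 2 );
    int B = intlog2ceil(N);
    int T = (1 << B) - N;
    int sym = bb.GetBits(B-1);
    if ( sym < T ) return sym;
    return (sym << 1) - T + bb.GetBits(1); // need one more bit
}

// Writes every symbol in [0,N) once, reads them back.
// Returns total bits written, or -1 if the round trip failed.
inline int FlatRoundTripBits(int N)
{
    BitBuffer bb;
    for (int sym = 0; sym < N; sym++) OutputFlat(bb, sym, N);
    for (int sym = 0; sym < N; sym++)
        if ( InputFlat(bb, N) != sym ) return -1;
    return (int) bb.bits.size();
}
```

For N = 5 this writes 2+2+2+3+3 = 12 bits, matching the code table above; in general the total is N*B - T.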

So, what is the loss vs. ideal, and where does it occur? Let's work it out :


H = log2(N)  is the ideal (fractional) entropy

N is in (2^(B-1),2^B]
so H is in (B-1,B]

The number of bits written by the flat code is :

L = ( T * (B-1) + (N-T) * B ) / N

with T = 2^B - N

Let's set

N = f * 2^B

with f in (0.5,1] our fractional position in the range.

so T = 2^B * (1 - f)

At f = 0.5 and 1.0 there's no loss, so there must be a maximum in that interval.

Doing some simplifying :

L = (T * (B-1) + (N-T) * B)/N
L = (T * B - T + N*B - T * B)/N
L = ( N*B - T)/N = B - T/N
T/N = (1-f)/f = (1/f) - 1
L = B - (1/f) + 1

The excess bits is :

E = L - H

H = log2(N) = log2( f * 2^B ) = B + log2(f)

E = (B - (1/f) + 1) - (B + log2(f))
E = 1 - (1/f) - log2(f)

so find the maximum of E by taking a derivative :

d/df(E) = 0
d/df(E) = 1/f^2 - (1/f)/ln2
1/f^2 = (1/f)/ln2
1/f = 1/ln(2)
f = ln(2)
f = 0.6931472...

and at that spot the excess is :

E = 1 - (1/ln2) - ln(ln2)/ln2
E = 0.08607133...

The worst case is 8.6% of a bit per symbol excess. The worst case appears periodically, once for each power of two.

[Chart : the actual excess bits output for some low N's]

The worst case actually occurs as N->large, because at higher N you can get f closer to that worst case fraction (ln(2)). At lower N, the integer steps mean you miss the worst case and so waste less. This is perhaps a bit surprising, you might think that the worst case would be at something like N = 3.

In fact for N = 3 :


H = log2(3) = 1.584962 ...

L = average length written by OutputFlat

L = (1+2+2)/3 = 1.66666...

E = L - H = 0.08170416...

(obviously if you measure the loss as a percentage of the ideal length rather than in absolute bits, the worst case is at N=3, and there it's 5.155% of the entropy).
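
The numbers above are easy to check numerically. This sketch evaluates the excess both from the integer definition (L = B - T/N) and from the closed form E(f); the function names are made up for the example.

```cpp
#include <cassert>
#include <cmath>

// Excess bits per symbol of the flat code for an alphabet of size N :
//  E = L - H with L = B - T/N and H = log2(N)
inline double FlatExcessBits(int N)
{
    int B = 0;
    while ( (1 << B) < N ) B++;
    int T = (1 << B) - N;
    double L = B - (double)T / N;
    double H = std::log2((double)N);
    return L - H;
}

// Closed form in terms of the fractional position f = N / 2^B , f in (0.5,1]
inline double FlatExcessAtFraction(double f)
{
    return 1.0 - 1.0/f - std::log2(f);
}

// Largest excess over all N in [2,maxN]
inline double WorstExcessUpTo(int maxN)
{
    double worst = 0.0;
    for (int N = 2; N <= maxN; N++)
    {
        double e = FlatExcessBits(N);
        if ( e > worst ) worst = e;
    }
    return worst;
}
```

FlatExcessAtFraction(ln 2) gives the 0.08607 bound, FlatExcessBits(3) gives the N = 3 value, and scanning N up to 2^16 with WorstExcessUpTo gets within rounding of the bound without exceeding it.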

4/21/2013

04-21-13 - How to grow Linux disk under VMWare

There's a lot of these guides around the net, but I found them all a bit confusing to follow, so my experience :
  • 1. Power off the VM.

  • 2. Make a backup of the whole VM in case something goes wrong. Just find the dir containing it and copy the whole thing.

  • 3. VMWare Settings -> Hardware -> Hard Disk -> Utilities -> Expand, then change the size to whatever you want.

  • 4. This has just expanded the virtual disk, now you must grow the partition on the disk. Linux does not have good tools to grow a partition that is running the OS, so don't boot your VM back into Linux.

    (there is some LVM stuff that lets you make multiple partitions and then treat them as a single one, but for a Unix newb like myself that looks too scary).

  • 5. Download GParted ISO. VMWare Settings -> Hardware -> CD/DVD -> Use ISO -> point it at GParted.

    Also make sure "Connect at Power On" is checked.

  • 6. Now you have to get the virtual machine to boot from CD. Getting into the BIOS interactively was impossible for me. Fortunately VMWare has a solution. Find your VM files and edit the .vmx config file. Add this line :
    bios.forceSetupOnce = "TRUE"
    

  • 7. Power on the VM and you should enter the BIOS. Go to "Boot" and put the CD first. Save and Exit and you should now boot into GParted.

  • 8. Using GParted is pretty self explanatory. It's a good tool. When you're done, shut down and Power Off the VM.

    My VM was set up with a swap partition, so I had to move that to the end before I could grow the primary partition. I hear that you can set up Linux with a swap file instead of a swap partition; that would be preferable. A swap partition makes zero sense in a VM where the disks are virtualized anyway (so the advantage of keeping the swap thrashing off your main disk doesn't exist). Not something I want to change though.

  • 9. VMWare Settings -> Hardware -> CD/DVD -> turn off "Connect at Power On"
It seems to me it's a good idea to leave it this way - BIOS is set to boot first from CD, but the VM is set with no CD hardware enabled. This makes it easy to change the ISO and just check that box any time you want to boot from an ISO, rather than having to go into that BIOS nightmare again.


More generally, what have I learned about multi-platform development from working at RAD ?

That it's horrible, really horrible, and I pray that I never have to do it again in my life. Ugh.

Just writing cross-platform code is not the issue (though that's horrible enough, solely due to stupid non-fundamental issues like the fact that struct packing isn't standardized, adding signed ints isn't standardized, restrict/noalias isn't standardized, inline linkage varies greatly, etc. urg wtf etc etc). If you're just releasing some code on the net and offering it for many platforms (leaving it up to the downloaders to actually build it and test it), your life is easy. The horrible part is if you actually have to maintain machines and build systems for all those platforms, test them, be able to debug on them, keep all the sdk's up to date, etc. etc.

(in general coding is easy when you don't actually test your code and make sure it works well, which a surprising number of people think is "done"; hey it compiles, I'm done! umm, no...)

(I guess that's a more general life thing; I observe a lot of people who just do things and don't actually measure whether the "doing" was successful or done well, but they just move on and are generally happy. People who stress over whether what they're doing is actually a good job or not are massively less happy but also actually do good work.)

I feel like I spend 90% of my time on stupid fucking non-algorithmic issues like this Linux partition resizing shit (probably more like 20%, but that's still frustratingly high). The regression tests are failing on Linux, okay have to figure out why, oh it's because the VM disk is too small, okay how do I fix that; or the PS4 compiler has a bug I have to work around, or the system software on this system has a bug, or the Mac clang wants to spew pointless warnings about anonymous namespaces, or my tests aren't working on Xenon .. spend some time digging .. oh the console is just turned off, or the IP changed or it got reflashed and my SDK doesn't work anymore, and blah blah fucking blah. God dammit I just want to be able to write algorithms. I miss coding, I miss thinking about hard problems. Le sigh.

I've written before about how in my imagination I could hire some kid for $50k to do all this shit work for me and it would be a huge win for me overall. But I'm afraid it's not that easy in reality.

What really should exist is a "coder cloud" service. There should be a bunch of VMs of different OS'es with various compilers and SDKs installed, so I can just say "build my shit for X with Y". Of course you need to be able to run tests on that system as well, and if something goes wrong you need remote desktop for interactive debugging. It's got to have every platform, including things like game consoles where you need license agreements, which is probably a no-go in reality because corporations are jerks. There's got to be superb customer service, because if I can't rely on it for builds at every moment of every day then it's a no-go. Unfortunately, programmers are almost uniformly moronic about this kind of thing (in that they massively overestimate their own ability to manage these things quickly) so wouldn't want to pay what it costs to run that service.

4/10/2013

04-10-13 - Waitset Resignal Notes

I've been redoing my low level waitset and want to make some notes. Some previous discussion of the same issues here :

cbloom rants 11-28-11 - Some lock-free rambling
cbloom rants 11-30-11 - Some more Waitset notes
cbloom rants 12-08-11 - Some Semaphores

In particular, two big things occurred to me :

1. I talked before about the "passing on the signal" issue. See the above posts for more in depth details, but in brief the issue is if you are trying to do NotifyOne (instead of NotifyAll), and you have a double-check waitset like this :


{
waiter = waitset.prepare_wait(condition);

if ( double check )
{
    waiter.cancel();
}
else
{
    waiter.wait();
    // possibly loop and re-check condition
}

}

then if you get a signal between prepare_wait and cancel, you didn't need that signal, so a wakeup of another thread that did need that signal can be missed.

Now, I talked about this before as an "ugly hack", but over time thinking about it, it doesn't seem so bad. In particular, if you put the resignal inside the cancel() , so that the client code looks just like the above, it doesn't need to know about the fact that the resignal mechanism is happening at all.

So, the new concept is that cancel atomically removes the waiter from the waitset and sees if it got a signal that it didn't consume. If so, it just passes on that signal. The fact that this is okay and not a hack came to me when I thought about under what conditions this actually happens. If you recall from the earlier posts, the need for resignal comes from situations like :


T0 posts sem , and signals no one
T1 posts sem , and signals T3
T2 tries to dec count and sees none, goes into wait()
T3 tries to dec count and gets one, goes into cancel(), but also got the signal - must resignal T2

the thing is this can only happen if all the threads are awake and racing against each other (it requires a very specific interleaving); that is, the T3 in particular that decs count and does the resignal had to be awake anyway (because its first check saw no count, but its double check did dec count, so it must have raced with the sem post). It's not like you wake up a thread you shouldn't have and then pass it on. The thread wakeup scheme is just changed from :

T0 sem.post --wakes--> T2 sem.wait
T1 sem.post --wakes--> T3 sem.wait

to :

T0 sem.post
T1 sem.post --wakes--> T3 sem.wait --wakes--> T2 sem.wait

that is, one of the consumer threads wakes its peer. This is a tiny performance loss, but it's a pretty rare race, so really not a bad thing.

The whole "double check" pathway in waitset only happens in a race case. It occurs when one thread sets the condition you want right at the same time that you check it, so your first check fails and after you prepare_wait, your second check passes. The resignal only occurs if you are in that race path, and also the setting thread sent you a signal between your prepare_wait and cancel, *and* there's another thread waiting on that same signal that should have gotten it. Basically this case is quite rare, we don't care too much about it being fast or elegant (as long as it's not disastrously slow), we just need behavior to be correct when it does happen - and the "pass on the signal" mechanism gives you that.

The advantage of being able to do just a NotifyOne instead of a NotifyAll is so huge that it's worth adopting this as standard practice in waitset.

2. It then occurred to me that the waitset PrepareWait and Cancel could be made lock-free pretty trivially.

Conceptually, they are made lock free by turning them into messages. "Notify" is now the receiver of messages and the scheme is now :


{
waiter w;
waitset : send message { prepare_wait , &w, condition };

if ( double check )
{
    waitset : send message { cancel , &w };
    return;
}

w.wait();
}

-------

{
waitset Notify(condition) :
first consume all messages
do prepare_wait and cancel actions
then do the normal notify
eg. see if there are any waiters that want to know about "condition"
}

The result is that the entire wait-side operation is lock free. The notify-side still uses a lock to ensure the consistency of the wait list.
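
A minimal sketch of that message scheme, under heavy assumptions: the wait side pushes onto an intrusive lock-free stack, and Notify drains the whole stack with one exchange while holding the waitlist lock. The names (Message, MessageWaitset, Send) are hypothetical, waiters are identified by a plain int instead of an OS handle, NotifyOne returns the id to wake instead of signalling it, and the resignal-on-cancel logic is not modeled.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

enum MessageType { Msg_PrepareWait, Msg_Cancel };

// Hypothetical message record; must stay alive until a Notify has consumed it
struct Message
{
    MessageType  type;
    int          waiter_id;   // stands in for a pointer to the per-thread waiter
    const void * condition;
    Message *    next;
};

class MessageWaitset
{
public:
    MessageWaitset() : m_inbox(nullptr) { }

    // wait side : lock-free push, never takes m_lock
    void Send(Message * msg)
    {
        Message * head = m_inbox.load(std::memory_order_relaxed);
        do { msg->next = head; }
        while ( ! m_inbox.compare_exchange_weak(head, msg,
                    std::memory_order_release, std::memory_order_relaxed) );
    }

    // notify side : consume all messages, then do the normal notify
    //  returns the id of the waiter to wake, or -1 if nobody wants "condition"
    int NotifyOne(const void * condition)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        Message * list = m_inbox.exchange(nullptr, std::memory_order_acquire);
        // the stack is LIFO; reverse it so messages apply in the order sent
        Message * ordered = nullptr;
        while ( list ) { Message * next = list->next; list->next = ordered; ordered = list; list = next; }
        for (Message * m = ordered; m; m = m->next) Apply(*m);

        for (size_t i = 0; i < m_waiters.size(); i++)
            if ( m_waiters[i].condition == condition )
            {
                int id = m_waiters[i].waiter_id;
                m_waiters.erase(m_waiters.begin() + i);
                return id;
            }
        return -1;
    }

private:
    struct Entry { int waiter_id; const void * condition; };

    void Apply(const Message & msg)
    {
        if ( msg.type == Msg_PrepareWait )
        {
            Entry e = { msg.waiter_id, msg.condition };
            m_waiters.push_back(e);
            return;
        }
        // Msg_Cancel : remove this waiter if it is still in the list
        for (size_t i = 0; i < m_waiters.size(); i++)
            if ( m_waiters[i].waiter_id == msg.waiter_id )
            {
                m_waiters.erase(m_waiters.begin() + i);
                return;
            }
    }

    std::atomic<Message *> m_inbox;    // lock-free inbox of pending messages
    std::mutex             m_lock;     // protects m_waiters, notify side only
    std::vector<Entry>     m_waiters;
};
```

A push-only stack that is drained with a single exchange has no ABA problem, which is a large part of why this form of lock-free wait side is cheap to get right.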

This greatly reduces contention in the most common usage patterns :


Mutex :

only the mutex owner does Notify
 - so contention of the waitset lock is non-existent
many threads may try to lock a mutex
 - they do not have any waitset-lock contention

Semaphore :

the common case of one producer and many consumers (lots of threads do wait() )
 - zero contention of the waitset lock

the less common case of many producers and few consumers is slow

Another way to look at it is instead of doing little bits of waitlist maintenance in three different places (prepare_wait,notify,cancel) which each have to take a lock on the list, all the maintenance is moved to one spot.

Now there are some subtleties.

If you used a fresh "waiter" every time, things would be simple. But for efficiency you don't want to do that. In fact I use one unique waiter per thread. There's only one OS waitable handle needed per thread and you can use that to implement every threading primitive. But now you have to be able to recycle the waiter. Note that you don't have to worry about other threads using your waiter; the waiter is per-thread so you just have to worry about when you come around and use it again yourself.

If you didn't try to do the lock-free wait-side, recycling would be easy. But with the lock-free wait side there are some issues.

First is that when you do a prepare-then-cancel , your cancel might not actually be done for a long time (it was just a request). So if you come back around on the same thread and call prepare() again, prepare has to check if that earlier cancel has been processed or not. If it has not, then you just have to force the Notify-side list maintenance to be done immediately.

The second related issue is that the lock-free wait-side can give you spurious signals to your waiter. Normally prepare_wait could clear the OS handle, so that when you wait on it you know that you got the signal you wanted. But because prepare_wait is just a message and doesn't take the lock on the waitlist, you might actually still be in the waitlist from the previous time you used your waiter. Thus you can get a signal that you didn't want. There are a few solutions to this; one is to allow spurious signals (I don't love that); another is to detect that the signal is spurious and wait again (I do this). Another is to always just grab the waitlist lock (and do nothing) in either cancel or prepare_wait.


Ok, so we now have a clean waitset that can do NotifyOne and guarantee no spurious signals. Let's use it.

You may recall we've looked at a simple waitset-based mutex before :


U32 thinlock;

Lock :
{
    // first check :
    while( Exchange(&thinlock,1) != 0 )
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel();
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 )
    {
        waitset.NotifyAll( &thinlock );
    }
}
This mutex is non-recursive, and of course you should spin doing some TryLocks before going into the wait loop for efficiency.

This was an okay way to build a mutex on waitset when all you had was NotifyAll. It only does the notify if there are waiters, but the big problem with it is if you have multiple waiters, it wakes them all and then they all run in to try to grab the mutex, and all but one fail and go back to sleep. This is a common type of unnecessary-wakeup thread-thrashing pattern that sucks really bad.

(any time you write threading code where the wakeup means "hey wakeup and see if you can grab an atomic" (as opposed to "wakeup you got it"), you should be concerned (particularly when the wake is a broadcast))

Now that we have NotifyOne we can fix that mutex :


U32 thinlock;

Lock :
{
    // first check :
    while( Exchange(&thinlock,2) != 0 ) // (*1)
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel(waitset_resignal_no); // (*2)
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 ) // (*3)
    {
        waitset.NotifyOne( &thinlock );
    }
}
We changed the NotifyAll to NotifyOne, but three funny bits are worth noting :

(*1) - we must now immediately exchange in the waiter-flag here; in the NotifyAll case it worked to put a 1 in there for funny reasons (see cbloom rants 07-15-11 - Review of many Mutex implementations, where this type of mutex is discussed as "Three-state mutex using Event"), but it doesn't work with the NotifyOne.

(*2) - with a mutex you do not need to pass on the signal when you stole it and cancelled. The reason is just that there can't possibly be any more mutex for another thread to consume. A mutex is a lot like a semaphore with a maximum count of 1 (actually it's exactly like it for non-recursive mutexes); you only need to pass on the signal when it's possible that some other thread needs to know about it.

(*3) - you might think the check for == 2 here is dumb because we always put in a 2, but there's code you're not seeing. TryLock should still put in a 1, so in the uncontended cases the thinlock will have a value of 1 and no Notify is done. The thinlock only goes to a 2 if there is some contention, and then the value stays at 2 until the last unlock of that contended sequence.
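The TryLock mentioned in (*3) is not shown in the post. A minimal sketch of what it might look like with std::atomic, assuming a compare-exchange from 0 to 1 (a plain Exchange would clobber a waiter flag of 2). The names TryLock and UnlockNeedsNotify are hypothetical, and the actual NotifyOne call is left out :

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// 0 = free, 1 = locked with no waiters, 2 = locked and maybe waiters
std::atomic<uint32_t> thinlock{0};

// Uncontended acquire : only succeeds on a free lock, and puts in a 1
// (not a 2), so an uncontended Unlock will skip the notify.
bool TryLock()
{
    uint32_t expected = 0;
    return thinlock.compare_exchange_strong(expected, 1);
}

// The Unlock gate from the post, with the notify replaced by a return
// value : true means the caller would have to call NotifyOne.
bool UnlockNeedsNotify()
{
    return thinlock.exchange(0) == 2;
}
```

With this, a lock/unlock pair that never hits contention only ever sees the values 0 and 1, which is why the == 2 check in Unlock is not redundant.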

Okay, so that works, but it's kind of silly. With the mechanism we have now we can do a much neater mutex :


U32 thinlock; // = 0 initializes thinlock

Lock :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinlock );

    if ( Fetch_Add(&thinlock,1) == 0 )
    {
        // got the lock (no need to resignal)
        w.Cancel(waitset_resignal_no);
        return;
    }

    w.Wait();
    // woke up - I have the lock !
}

Unlock :
{
    if ( Fetch_Add(&thinlock,-1) > 1 )
    {
        // there were waiters
        waitset.NotifyOne( &thinlock );
    }
}
The mutex is just a wait-count now. (as usual you should TryLock a few times before jumping in to the PrepareWait). This mutex is more elegant; it also has a small performance advantage in that it only calls NotifyOne when it really needs to; because its gate is also a wait-count it knows if it needs to Notify or not. The previous Mutex posted will always Notify on the last unlock whether or not it needs to (ie. it will always do one Notify too many).
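To play with the wait-count idea without the waitset, here is a rough runnable sketch in standard C++. It swaps the waitset for a tiny counting semaphore built on std::mutex and std::condition_variable (that swap is my assumption, not the post's mechanism); because the semaphore remembers posts, the PrepareWait/Cancel steps disappear. ToySem, WaitCountMutex and hammer are hypothetical names :

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for the waitset sleep/wake : a minimal counting semaphore.
class ToySem
{
    std::mutex m;
    std::condition_variable cv;
    int count = 0;
public:
    void post()
    {
        { std::lock_guard<std::mutex> g(m); ++count; }
        cv.notify_one();
    }
    void wait()
    {
        std::unique_lock<std::mutex> g(m);
        cv.wait(g, [&]{ return count > 0; });
        --count;
    }
};

// The gate is a wait-count : 0 = free, N = one owner plus N-1 waiters.
class WaitCountMutex
{
    std::atomic<int> waitcount{0};
    ToySem sem;
public:
    void lock()
    {
        if ( waitcount.fetch_add(1) != 0 )
            sem.wait(); // woke up - I have the lock
    }
    void unlock()
    {
        if ( waitcount.fetch_sub(1) > 1 )
            sem.post(); // there were waiters, hand the lock to one
    }
};

// Increment a plain int under the mutex from several threads.
int hammer(int threads, int iters)
{
    WaitCountMutex mtx;
    int counter = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < threads; ++t)
        ts.emplace_back([&]{
            for (int i = 0; i < iters; ++i) { mtx.lock(); ++counter; mtx.unlock(); }
        });
    for (auto & t : ts) t.join();
    return counter;
}
```

Calling hammer(4, 10000) should return 40000 if the mutex really excludes; a lost increment would show up as a smaller number. The wake is a hand-off ("you got it"), so a woken thread never has to retry the atomic.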

This last mutex is also really just a semaphore. We can see it by writing a semaphore with our waitset :


S32 thinsem; // = 0 initializes thinsem ; signed, because the count goes negative when there are waiters

Wait :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinsem );

    if ( Fetch_Add(&thinsem,-1) > 0 )
    {
        // got a dec on count
        w.Cancel(waitset_resignal_yes); // (*1)
        return;
    }

    w.Wait();
    // woke up - I got the sem !
}

Post :
{
    if ( Fetch_Add(&thinsem,1) < 0 )
    {
        waitset.NotifyOne( &thinsem );
    }
}

which is obviously the same. The only subtle change is at (*1) - with a semaphore we must do the resignal, because there may have been several posts to the sem (contrast with mutex where there can only be one Unlock at a time; and the mutex itself serializes the unlocks).
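The count arithmetic is easy to check in isolation. A toy model of just the thinsem counter, with the sleeping and waking left out; ThinSemCount and its method names are hypothetical. The count goes negative when there are waiters, so the sketch uses a signed type :

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Models only the gate of the semaphore, not the waitset.
struct ThinSemCount
{
    std::atomic<int32_t> count{0};

    // true  : Wait got a dec on the count and would return right away
    // false : Wait would go to sleep on the waitset
    bool WaitTakesCount() { return count.fetch_sub(1) > 0; }

    // true  : the count was negative, so somebody is waiting and
    //         Post would have to call NotifyOne
    bool PostNeedsNotify() { return count.fetch_add(1) < 0; }
};
```

Walking it by hand from zero : a Post takes the count to 1 with no notify, a Wait takes it back to 0 and returns, a second Wait takes it to -1 and sleeps, and the next Post sees the -1 and notifies.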


Oh, one very subtle issue that I only discovered due to relacy :

waitset.Notify requires a #StoreLoad between the condition check and the notify call. That is, the standard pattern for any kind of "Publish" is something like :


Publish
{
    change shared variable
    if ( any waiters )
    {
        #StoreLoad
        waitset.Notify()
    }
}

Now, in most cases, such as the Sem and Mutex posted above, the Publish uses an atomic RMW op. If that is the case, then you don't need to add any more barriers - the RMW synchronizes for you. But if you do some kind of more weakly ordered primitive, then you must force a barrier there.
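For the weakly ordered case, one way to get the #StoreLoad in C++11 is a seq_cst fence between the store and the notify. A minimal sketch, assuming a toy waitset that is nothing but a count of registered waiters (ToyWaitset, g_ready, PublishWeak and WaiterMustSleep are all hypothetical names, and the real waitset does much more than this) :

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the waitset : only a count of waiters.
struct ToyWaitset
{
    std::atomic<int> waiters{0};

    void PrepareWait() { waiters.fetch_add(1); } // seq_cst RMW
    void Cancel()      { waiters.fetch_sub(1); }

    // Notify has to load the waiter list; returns true if there was
    // somebody to wake.
    bool Notify() { return waiters.load(std::memory_order_relaxed) > 0; }
};

std::atomic<int> g_ready{0}; // the "shared variable"

// Publisher that uses a plain store instead of an atomic RMW.
bool PublishWeak(ToyWaitset & ws)
{
    g_ready.store(1, std::memory_order_release);
    // #StoreLoad : keep the load inside Notify from passing the store
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return ws.Notify();
}

// Waiter side : register first, then double check the condition.
bool WaiterMustSleep(ToyWaitset & ws)
{
    ws.PrepareWait();
    if ( g_ready.load() != 0 ) // seq_cst load
    {
        ws.Cancel();
        return false;
    }
    return true;
}
```

Without the fence, the publisher could miss the waiter while the waiter misses the store, and the waiter would sleep forever. With it, at least one side sees the other.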

This is the exact same issue that I've run into before and forgot about again :

cbloom rants 07-31-11 - An example that needs seq_cst
cbloom rants 08-09-11 - Threading Links (see threads on eventcount)
cbloom rants 06-01-12 - On C++ Atomic Fences Part 3
