7/14/2011

07-14-11 - Some obscure threading APIs

Futex :

Futex from a Win32 programmer's perspective is an enhanced version of the windows Event. It has absolutely nothing to do with a "fast user space mutex". It's a much lower level primitive - it's a "waitset" if you like (more about "waitset" in a future post). Basically it lets you put a thread to sleep with a Wait(), or wake up one or more threads. It has several advantages over Win32 Event which make it very nice.

Futex is associated with the address of an int. This means you don't have to actually create a Futex object, any int in your system can be used as the handle for a waitable event. This is nice.

Futex Wait atomically checks the int vs. a certain value. The basic futex Wait op is :

atomically {
  if ( *address == check_value ) Wait();
  else return;
}
this atomic check before waiting is exactly what you want for implementing lots of threading primitives (mutex, conditionvar, etc.) so that's very handy. The normal thing you would write is something like :

thread 1:

if ( queue empty / mutex locked / whatever )
{
    Wait();
}

thread 2:

push queue / unlock mutex / whatever
Signal();

but that contains a race. You can fix it very easily with futex by passing in the condition to check to the Wait, like :

thread 1 :
if ( num_items == 0 )
    futex_wait( &num_items, 0 );

thread 2 :
num_items++;
Signal();

(note that the alternative to this is a prepare_wait/wait pair with a double-check of the condition after the prepare_wait, and signal applies to anyone who has prepared, not anyone who has waited)

Futex Wake can wake up N threads (sadly Win32 event only provides "wake 1" in the robust auto-reset mode). That's nice.

Some reference :
Futexes are tricky (PDF) (no they're not, BTW)
Thin Lock vs. Futex � ��Bartosz Milewski's Programming Cafe
Mutexes and Condition Variables using Futexes
futex(2) - Linux manual page

Now some Win32 :

SignalObjectAndWait :

SignalObjectAndWait (Win2k+).

This seems pretty hot, because you would think it was atomic (signal and wait in one kernel op), but it is *NOT* :

Note that the "signal" and "wait" are not guaranteed to be performed as an atomic operation. Threads executing on other processors can observe the signaled state of the first object before the thread calling SignalObjectAndWait begins its wait on the second object.

which means it's useless. It's just the same as calling Signal() then Wait(). Maybe it's less likely that you get swapped out between the two calls, which would reduce thread thrashing and also reduce the number of system calls, but it does not help you with correctness issues (an actually atomic SignalAndWait would let you implement things differently and solve some lost-wakeup problems).


thread1 :
  Signal(event1);
  /* (!1) */
  Wait(event2);

thread 2:
  Wait(event1);
  Signal(event2); // (!2)
  Wait(event2); // ! stolen wakeup

So first of all, the separate Signal & Wait is a performance bug because if thread1 loses the CPU at (!1) (or immediately upon signalling), then in thread 2 before (!2) you transfer execution back to thread1, it immediately goes to sleep. That's lame, and it's why you prefer to do these things atomically. But, it's even worse in this case, because at (!2) we intended that Signal to go to thread1 and wake him up, but he isn't in his sleep yet. Because SignalAndWait is not atomic we have a race. If it was atomic, code like this could actually be correct.

Windows NT Keyed Events (NtWaitForKeyedEvent/NtReleaseKeyedEvent) :

This is an NT internal API (it's in NTDLL since Win XP, so despite being internal-only it's actually more usable than the Win2k+ or Vista+ stuff (revision : duh, brain fart, not true, 2k+ is fine, only Vista+ stuff is problematic)). It's a very nice extension to the Win32 Event. Basically it lets you associate a value with the Wait and Signal.

NtWaitForKeyedEvent : analogous to Wait() on an Event, but also takes a value (the value is PVOID but it is just a value, it doesn't deref). Execution is stopped until a signal is received with that particular value.

NtReleaseKeyedEvent : analogous to SetEvent() (on an auto-reset event) - eg. it wakes exactly 1 one thread - except that it actually blocks if no thread is waiting - that is, this always wakes exactly 1 thread (never zero, which SetEvent can do). Release also takes a value and only wakes threads waiting on that value.

So for example if you want you can use this to make a primitive implementation of futex. You have one global event (futex_event), then FutexWait(address) does WaitForKeyedEvent(futex_event,address) , using address as the value to wait for, and FutexWake(address) does ReleaseKeyedEvent(futex_event,address). (though obviously this is not a proper futex because it can't broadcast and can't check a value and so on).

More usefully, you can create a mutex which applies to an array! Something like :


template<typename T>
lockable_vector
{
  Event  vector_event;
  vector<T>   items;
  vector<int> contention;

  T * lock(int i)
  {
    if ( contention[i] ) NtWaitForKeyedEvent(vector_event, i);
    contention[i]++;
    return &items[i];
  }

  void unlock(int i)
  {
    contention[i]--;
    if ( contention[i] ) NtReleaseKeyedEvent(vector_event, i);
  }
}

(obviously this just is a rough sketch; you have to update "contention" properly atomically so that you only do the Wait and Release in the right cases) (I'll post a working version of this sometime soon).

(small issue with this : the lowest bit of the key must be unset ; apparently they use it as a flag bit, so you need a 2* or something to use it this way, or just give it aligned pointer addresses as the key)

The point is - you only have one kernel event, and you can mutex on any item in your vector; you can resize the vector and don't have to create/destroy mutexes. Cool! In the typical scenario that you have maybe 2-4 threads accessing an array of 4k items, this is a big win. (in fact this is something that's important in Oodle so I have a few implementations of how to do this without WaitForKeyedEvent).

KeyedEvent is implemented in the obvious way - when a thread in Win32 waits on a handle it gets stuck in that handle's data structure on a linked list. They just added a "wait_value" field to that linked list. So now when you do ReleaseKeyedEvent instead of just popping off the first thread in the list, it walks through the list in the kernel and tries to find a thread whose "wait_value" matches your signal value.

Some reference :
Slim Reader Writer Locks - Under The Hood - Matt Pietrek - Site Home - MSDN Blogs
NtCreateKeyedEvent
Keyed Events (lockless inc)
Concurrent programming on Windows - Google Books

4 comments:

ryg said...

"it's in NTDLL since Win XP, so despite being internal-only it's actually more usable than the Win2k+ or Vista+ stuff"
What's wrong with Win2k+ stuff? Win2k < XP. The only Win versions you lose with Win2k+ are NT <=4.0 (who gives a damn?) and 95/98/ME (which can't even use more than 160MB physical memory!)

cbloom said...

Oh yeah you're right, I brain-farted and swapped XP/ME.

Anything 2k+ is probably fine
XP+ is fine

It's only the Vista+ stuff that's problematic.

It confuses me too much to try to think about targetting each windows so I have to stick my simplification of

32-bit pre-Vista
64-bit post-Vista

I'll revise...

Anthony Williams said...

You've got the futex the wrong way round: it is equivalent to:

atomically {
if ( *address != check_value ) return;
else Wait();
}

i.e. it only waits if the value hasn't changed.

cbloom said...

Oops, thanks, fixed!

old rants