7/26/2010

07-26-10 - Virtual Functions

Previous post on x86/PPC made me think about virtual functions.

First of all, let's be clear than any serious game code base must have *some* type of dynamic dispatch (that is, data-dependent function calls). When people say "avoid virtual functions" it just makes me roll my eyes. Okay, assume I'm not just over-abstracting and I'm doing the dynamic dispatch for a good reason, because I actually need to do different things on different objects. The issue is just how you implement it.

How C++ virtual functions are implemented on most compilers :

    There's a "vtable" of function pointers associated with each class type. Individual objects have a pointer to their vtable. The advantage of this is that vtables are shared, but the disadvantage is that you get an extra cache miss. What actually happens when you make a virtual call? (multiple and virtual inheritance ignored here for now).
    
    vtable pointer = obj->vtable;
    
    func_pointer = vtable->func
    
    jump func_pointer
    
    

Why does this hurt ?

    vtable pointer = obj->vtable;
    
    
    The load of vtable pointer may be a cache miss, but that doesn't count against us since you are working on obj anyway you have to have one cache miss there.
    func_pointer = vtable->func
    
    
    Then you fetch the func pointer, which is maybe a cache miss (if this type of object has not been used recently).
    jump func_pointer
    
    
    Then you jump to variable, which may or may not be able to use branch prediction or fuck up your CPU's pipelining.

How can virtual calls be removed ?

  • A good practice is to never expose virtuals directly. In the base class you should have something like :
    
    void DoStuff(int x);
    
    
    as the public interface, and then hidden lower down, something like :
    
    virtual void v_DoStuff(int x) = 0;
    void DoStuff(int x) { v_DoStuff(x); }
    
    
    this gives you a single controlled call-through point which you can then make non-virtual or whatever at some point. (though, this screws up the next issue:)

  • The compiler can do it better than you (in theory anyway). Virtual functions that aren't actually overriden anywhere can obviously replaced with direct calls. But it's better than that - if you hold a pointer to some type, as long as nothing *derived* from that type override the virtual, it can be called direct. Even if sibling or cousin or parent classes override that virtual, by holding it by a derived type you can know that you are in a leaf part of the virtual call.

    No C++ compiler that I know of actually does this, because it requires knowledge of the class heirarchy in the whole program. But Java and C# are doing this now, so hopefully we will get it in C++ soon.

    When/if we get this, it is a very powerful optimization, because you can make then make functions that take concrete types and get joint-dispatch where you turn many virtual calls into just one. An example to be clear :

    
    class Base { virtual int Func() { return 0; } };
    
    class A : public Base { virtual int Func() { return 1; } };
    
    class B : public A { int m_data; };
    
    class C : public Base { virtual int Func() { return 2; } };
    
    void Test(A * obj)
    {
      obj->Func();
    }
    
    
    in this case the virtual call to Func() can be replaced with the direct call A::Func() by the compiler because no child of A overrides Func.

  • Don't use the method of virtual calls to hide implementation from interface. Some C++ coders will make a naked pure virtual interface class for their D3DHardware or ResourceManager or whatever, and then make a concrete class that impelements those virtuals. This is pretty nice for code cleanliness, but gives you virtual calls all over that are unnecessary. Note that if we had compiler-virtual-elimination this method would be fine since all those virtuals could be eliminated as there is only one implementation of that interface in the program.

  • Large/rare classes could use in-object vtables. In some cases, the space savings of sharing the vtable for all instances of a given class is not a huge win, and in that case you'd rather have the vtable directly in line in the object, because it gives you one less cache miss. There's no way to make the compiler do this for you, but you can do it reasonably easily with function pointers.

  • They can be converted into normal branches. In the DoStuff() example, instead of calling v_DoStuff, you could just branch on object type and call some concrete functions. Like :
    
    void DoStuff(int x)
    {
        if ( this is Actor ) Actor::DoStuff(x);
        else Base::DoStuff(x);
    }
    
    
    the advantage of this is that you avoid a cache miss for the vtable, and you use a normal branch which can be predicted even on shitty chips.

    This is only practical if you have only a few overrides for the function. The shitty thing about this (and many of these patterns) is that it pushes knowledge of the derived class up to the base. One nice variant of this is :

  • Rare overrides can use a test. If DoStuff() usually calls the default implementation, but once in a rare while needs an override, you can do that more efficiently with a pattern like :
    
    void DoStuff(int x)
    {
        if ( base is okay ) Base::DoStuff(x);
        else v_DoStuff(x);
    }
    
    
    so we check a bool (could be in a bitfield) and if so we use direct call, else we call the virtual.

  • Pushing data up to parent instead of calling down to child. A good method is simply to eliminate the virtual through code flow redesign.

    For example, a lot of bad game engines might have something like "virtual GetPosition()" on the base object. This at first seems like a good idea - not all objects have positions, and they might want to implement different ways of reporting it. In fact it is a bad idea, and you should just store m_position as public data, then have the different object types push their different position into m_position. (For example silly people implement attachments by making GetPosition() query the parent position then add some offset; instead you should do that computation only once in your object update and store it into the shared m_position).

    Most good games have a chunk of the most common data stored directly in the base object type so it can be accessed without virtuals.

  • Specialize for common cases. I sort of noted about this before, but when you're calling a bunch of virtuals from something, you can change many virtuals into one by dispatching to a func that calls them concretely. eg. Say you have a function like :
    
    void func(obj * o)
    {
      o->v1();
      o->v2();
      o->v3();
    }
    
    
    that does a bunch of virtual calls. Instead make a concrete call version t_func which does no virtuals :
    
    template < typename T >
    void t_func(T * o)
    {
      o->T::v1();
      o->T::v2();
      o->T::v3();
    }
    
    void func(obj * o)
    {
        dispatch to actual type of obj :
           t_func < typeof(o) >( o );
    }
    
    
    the difficulty is the dispatching from func to t_func. C++ has no mechanism to do type-dependent dispatch other than the vtable mechanism, which is annoying when you want to write a helper function and not add it to your class definition. There are general solutions to this (see for example in cblib the PrefDispatcher which does this by creating a separate but parallel class heirarchy to do dispatch) but they are all a bit ugly. A better solution for most games is to either add func() to the vtable or to just to know what concrete types you have and do manual dispatch.
  • Note that just flattening your hierarchy is not generally a great solution. For example, rather than have a bunch of different types of Crate (ExplodingCrate, FlammableCreate) you might decide to just make one single UberCrate that can be any type of crate. This eliminates virtual calls when you are working on UberCrate since it is just one concrete type, but it adds a lot of branches (if (crate.isExploding)) , and often makes the classes fatter in memory, etc. Making objects data-polymorphic instead of type-polymorphic may or may not be a win depending on the details.
  • In general it's good practice to make queries as fast and non-virtual as possible, and push the virtuals to the updates.

8 comments:

won3d said...

I wonder how much space is taken up by vtables, and how often they incur cache misses? That being said, increasing your load chain dependency might be annoying, even if you are hitting the cache when you are in-order.

Folks (like Herb Sutter) recommend non-virtual forwarding method approach, going so far as to recommend making your virtual methods private. It has the nice benefit of having distinct "caller" and "inheriter" interfaces. One thing that would help the later compiler issue might be to force inline the forwarding methods. Also, it is an interesting idea to use hungarian notation for the virtual methods.

Do you use profile-guided or link-time optimizations? I've actually found that recent GCC will inline virtual methods (maybe these are only leaf methods). In the near future, though, I would expect that they will be able to do things like partial inlining. The compiler should figure out that one virtual method is more likely than others and emit the normal-branch/if()-style polymorphism for you. Unfortunately, it doesn't sound like you can wait 1-2 versions of GCC.

Re: implementation hiding

I suppose opaque pointers (which may be wrapped in a class to be more C++ idiomatic) might be better than vtables for implementation hiding/dependency minimization. Or were you suggesting something else?

ryg said...

A very important element missing from your coarse cost model is cache footprint, both I$ and D$. Let's go with the Xenon/PS3 again for a minute. There's 32k of I$ and 32k of D$, with 128-byte cache lines - i.e. 256 cache lines each (and suddenly it feels quite a bit smaller). It's not unrealistic to assume that no two different vtables end up sharing the same cache line (that's virtually guaranteed if the two types are implemented in different source files). So in a loop that iterates over objects of different classes, each class costs you one cache line (or more, depending on what you do). Even with a low number of distinct classes and careful coding style, you can quickly lose 15-16 cache lines to this: about 6% of your L1 cache, to buffer less than a cache line's worth of data you're actually interested in right now! It's not a big hit, but a little bit of friction everywhere.

I$ is worse. 1k of code = 256 instructions = not very much. How many unique instructions do you need per object in a typical update loop? It's very easy to run out of I$. Not all-out thrashing, but a bit of extra work on every call.

These two things have a notable effect on performance in typical virtual-heavy code. And it's often trivial to fix, without removing the virtuals or any big redesign efforts: if possible, just keep your entity list sorted by type. Easy fix as long as there's no particular order that things have to happen in.

"First of all, let's be clear than any serious game code base must have *some* type of dynamic dispatch (that is, data-dependent function calls). When people say "avoid virtual functions" it just makes me roll my eyes."
The big shift in perspective is not in changing the mechanism of dynamic dispatch, but the granularity. Instead of dispatching per object, you group everything into coherent batches and do one dispatch per batch (that's an incremental change from the "sort by type" version). In fact, you can do it all incrementally:

- Sort by type (reduce D$, I$ misses)
- Group into batches, dispatch per batch (reduce function call overhead, both virtual and non-virtual)
- Allocate as batches, not individual objects (D$ again)
- Turn inheritance into composition and segregate storage. Adds extra ptrs and indirections, but transforms code from "one loop that does everything" into "10 loops that each do 1/10th of the work". D$, I$ benefits. Code that only cares about base class fields never needs to access cache lines
containing derived class data.

You can't always do this. But it's definitely worth trying. As a bonus, if you get through it all, your code is in a form where it's relatively easy to efficiently use SIMD (or move work to a SPU, or both) if it's still too slow.

  said...

@ryg, sorting your game entities is a fantastic idea!

jeskola said...

"No C++ compiler that I know of actually does this, because it requires knowledge of the class heirarchy in the whole program"

Intel compiler does it.

cbloom said...

"The big shift in perspective is not in changing the mechanism of dynamic dispatch, but the granularity. Instead of dispatching per object, you group everything into coherent batches and do one dispatch per batch (that's an incremental change from the "sort by type" version). In fact, you can do it all incrementally:

- Sort by type (reduce D$, I$ misses)
- Group into batches, dispatch per batch (reduce function call overhead, both virtual and non-virtual)"

Yeah, this is a very extreme example of exactly what I'm talking about.

First of all you're putting the virtual on Update rather than Query. You're assuming that your Update() on object type A doesn't go off and call a bunch of Query calls on objects of various types. That's what really kills you.

If A::Update() calls
B->Query() , C->Query(), etc.

the sort by type doesn't really do the magic you want.

But you're mainly talking about components that only work on themselves, eg. Skeleton::Animate() or something like that works very well with the method you describe, but NPC::Think() doesn't work so well because it has to touch a bunch of other objects.

cbloom said...

"I suppose opaque pointers (which may be wrapped in a class to be more C++ idiomatic) might be better than vtables for implementation hiding/dependency minimization. Or were you suggesting something else?"

Yeah, you have to use something like opaque pointers.

But the main point is to not use a type of dynamic dispatch when it could be static. This doesn't just apply to virtual calls either. eg. don't use function pointers for pluggable malloc or stdio or whatever when in fact they are always set to the same thing.

cbloom said...

"Let's go with the Xenon/PS3 again for a minute. There's 32k of I$ and 32k of D$, with 128-byte cache lines - i.e. 256 cache lines each (and suddenly it feels quite a bit smaller)."

BTW when you actually have two hyperthreads doing different things, this is pretty catastrophic. And to make it worse the time to load from L2 to L1 is actually quite long (40 clocks or so). I'm not sure how most people deal with this.

ryg said...

"First of all you're putting the virtual on Update rather than Query. You're assuming that your Update() on object type A doesn't go off and call a bunch of Query calls on objects of various types. That's what really kills you.

If A::Update() calls
B->Query() , C->Query(), etc.

the sort by type doesn't really do the magic you want."
The purpose is to reduce effective cache pressure, and it still does. A typical "which class calls into which other classes at runtime" matrix is relatively sparse. So you still get way better I$/D$ usage out of it than if the order is just random.

Then there's another thing: if possible, you should split "Update"-style functions into two phases nowadays. First phase just gathers data from other entities and caches it (but leaves visible object state alone), second phase updates visible object state only using existing object state and the cached values. That's purely to make it safe to distribute the update step over multiple cores. But it also ensures that the second-phase function is in the form you need for later transforms.

You can need arbitrarily many phases if you have expensive queries (e.g. collision tests) that you want to start based on the results of other queries. If you really need this in a lot of places, this design is a bad idea. If you can tolerate some latency, "pipeline" the work over multiple frames: results of certain queries get reported one frame late.

"BTW when you actually have two hyperthreads doing different things, this is pretty catastrophic."
Hardware threads are generally a different beast to design for than "real" threads: on real threads it can be a win to do work multiple times if it decreases comm/sync overhead and increases concurrency. On HW threads it's a loss, always. HW threads can also stall each other easily (e.g. microcoded instructions on PPU/Xenon stall both threads until they're done). Small cache size means you pretty much have to work on similar data or you're screwed. On the plus side, there's no false sharing between two HW threads, so none of the associated slowdowns. And communication between two threads on the same core is fast since you can avoid most (all?) of the processor read/write barriers, though that's a very risky game to play.

"And to make it worse the time to load from L2 to L1 is actually quite long (40 clocks or so). I'm not sure how most people deal with this."
Shake fist at the sky, yell "CURSE YOU, IBM!", and get on with their lives.

old rants