
08-12-13 - Cuckoo Cache Tables - Failure Report

This is a report on a dead end, the kind of write-up I wish people would do more often.

Ever since I read about Cuckoo Hashing I thought, hmm even if it's not the win for hash tables, maybe it's a win for "cache tables" ?

(A cache table is like a hash table, but it never changes size, and inserts may overwrite previous entries (or fail to insert the new entry, though that's unusual). Lookup may use a single probe or multiple probes.)

Let me introduce it as a progression :

1. Single hash cache table with no hash check :

This is the simplest. You hash a key and just look it up in a table to get the data. There is no check to ensure that you get the right data for your key - if you have collisions you may just get the wrong data back from lookup, and you will just stomp other people's data when you write.


Data table[HASH_SIZE];

lookup :

hash = hash_func(key);
Data & found = table[hash];

insert :

table[hash] = data;

This variant was used in LZP1 ; it's a good choice in very memory-limited situations where collisions are either unlikely or not that big a deal (eg. in data compression, a collision just means you code from the wrong statistics, it doesn't break your algorithm).

2. Single hash cache table with check :

We add some kind of hash-check value to our hash table to try to ensure that the entry we get actually was from our key :


Data table[HASH_SIZE];
int table_check[HASH_SIZE]; // obviously not actually a separate array in practice

lookup :

hash = hash_func(key);
check = alt_hash_func(key);
if ( table_check[hash] == check )
{
  Data & found = table[hash];
}

insert :

table_check[hash] = check;
table[hash] = data;

In practice, hash_func and alt_hash_func are usually the same hash function, and you just grab different bit ranges from it. eg. you might do a 64-bit hash func and take the top and bottom 32 bits.

In data compression, the check hash value can be quite small (8 bits is common), because as noted above collisions are not catastrophic, so just reducing the probability of an undetected collision to 1/256 is good enough.
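For concreteness, here is a minimal sketch of that bit-splitting, assuming a hypothetical 64-bit hash64() function and a power-of-two HASH_SIZE ; the small check used in compression just keeps fewer of the leftover bits :


uint64_t h = hash64(key);                        // hypothetical 64-bit hash of the key
int      hash   = (int)( h & (HASH_SIZE-1) );    // low bits -> table index (HASH_SIZE is a power of two)
uint32_t check  = (uint32_t)( h >> 32 );         // high bits -> full-size check value
uint8_t  check8 = (uint8_t)( h >> 56 );          // or keep just 8 bits of check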

3. Double hash cache table with check :

Of course since you are now making two hashes, you could look up two spots in your table. We're basically running the primary hash and alt_hash above, but instead of unconditionally using only one of them as the lookup hash and one as the check, we can use either one.


Data table[HASH_SIZE];
int table_check[HASH_SIZE]; // obviously not actually a separate array in practice

lookup :

hash1 = hash_func(key);
hash2 = alt_hash_func(key);
if ( table_check[hash1] == hash2 )
{
  Data & found = table[hash1];
}
else if ( table_check[hash2] == hash1 )
{
  Data & found = table[hash2];
}

insert :

if ( quality(table[hash1]) <= quality(table[hash2]) )
{
    table_check[hash1] = hash2;
    table[hash1] = data;
}
else
{
    table_check[hash2] = hash1;
    table[hash2] = data;
}

Where we now need some kind of quality function to decide which of our two possible insertion locations to use. The simplest form of "quality" just checks if one of the slots is unused. More complex would be some kind of recency measure, or whatever is appropriate for your data. Without any quality rating you could still just use a random bool there or a round-robin, and you essentially have a hash with two ways, but where the ways are overlapping in a single table.
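For example, a minimal sketch of such a quality() rating, assuming each Data slot carries an is_empty flag and a small use counter (hypothetical fields, not part of the snippets above) :


int quality(const Data & slot)
{
    if ( slot.is_empty )  return -1;    // empty slots rate lowest, so they get filled first
    return slot.use_count;              // eg. bumped on each lookup hit, as a crude recency/importance measure
}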

Note that here I'm showing the check as using the same number of bits as the primary hash, but it's not required for this type of usage, it could be fewer bits.

(also note that it's probably better just to use hash1 and hash1+1 as your two hash check locations, since it's so much better for speed, but we'll use hash1 and hash2 here as it leads more directly to the next -)

4. Double hash cache table with Cuckoo :

Once you get to #3 the possibility of running a Cuckoo is obvious.

That is, every entry has two possible hash table indices. You can move an entry to its alternate index and it will still be found. So when you go to insert a new entry, instead of overwriting, you can push what's already there to its alternate location. Lookup is as above, but insert is something like :


insert :

PushCuckoo(table,hash1);
table_check[hash1] = hash2;
table[hash1] = data;



PushCuckoo(table,hash1)
{
    // I want to write at hash1; kick out whatever is there

    if ( table[hash1] is empty ) return;

    // the current occupant's alternate location is its check value :
    hash2 = table_check[hash1];

    // recursively make room at hash2, then move the occupant there :
    PushCuckoo(table,hash2);

    table[hash2] = table[hash1];
    table_check[hash2] = hash1;
}

Now of course that's not quite right, because this is a cache table, not a hash table. As written above you have a guaranteed infinite loop : cache tables are usually run with more unique insertions than slots, so PushCuckoo will keep trying to push things and never find an empty slot.

For cache tables you just want to do a small limited number of pushes (maybe 4?). Hopefully you find an empty slot in that search ; if not, you lose the entry that had the lowest "quality" among the slots you visited. That is, remember the slot with the lowest quality, and do all the cuckoo-pushes that precede that slot in the walk.

For example, if you have a sequence like :


I want to fill index A

hash2[A] = B
hash2[B] = C
hash2[C] = D
hash2[D] = E

none are empty

entry C has the lowest quality of A-E

Then push :

B -> C
A -> B
insert at A

That is,

table[C] = table[B]
hash2[C] = B
table[B] = table[A]
hash2[B] = A
table[A],hash2[A] = new entry

The idea is that if you have some very "high quality" entries in your cache table, they won't be destroyed by bad luck (some unimportant event which happens to have the same hash value and thus overwrites your high quality entry).
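In code, a bounded cuckoo insert along those lines might look like the sketch below ; MAX_PUSHES, the walk[] bookkeeping, the hypothetical is_empty flag, and the reuse of quality() from the earlier sketch are illustrative, and for brevity it doesn't guard against the walk revisiting a slot :


void CuckooInsert(int hash1, int hash2, const Data & data)
{
    int walk[MAX_PUSHES+1];   // chain of slots visited, starting at hash1
    walk[0] = hash1;
    int count = 1;

    // follow each occupant to its alternate slot, up to MAX_PUSHES steps,
    // stopping early if we hit an empty slot :
    while ( count <= MAX_PUSHES && ! table[walk[count-1]].is_empty )
    {
        walk[count] = table_check[walk[count-1]];   // occupant's alternate location
        count++;
    }

    // find the lowest-quality slot in the walk (quality() rates empty slots lowest) :
    int worst = 0;
    for ( int i = 1; i < count; i++ )
        if ( quality(table[walk[i]]) < quality(table[walk[worst]]) ) worst = i;

    // shift entries one step down the chain, up to the worst slot,
    // so the lowest-quality entry is the one that gets dropped :
    for ( int i = worst; i > 0; i-- )
    {
        table[walk[i]] = table[walk[i-1]];
        table_check[walk[i]] = walk[i-1];   // the moved entry's alternate is now its old slot
    }

    table[hash1] = data;
    table_check[hash1] = hash2;
}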

So, I have tried this and in my experiments it's not a win.

To test it I wrote a simple symbol-rank compressor. My SR is order-5-4-3-2-1 with only 4 symbols ranked in each context. (I chose an SR just because I've been working on SR for RAD recently; otherwise there's not much reason to pursue SR, it's generally not Pareto). Contexts are hashed and looked up in a cache table. I compressed enwik8. I tweaked the compressor just enough so that it's vaguely competitive with state of the art (for example, I use a very simple secondary statistics table for coding the SR rank), because otherwise it's not a legitimate test.

For Cuckoo Caching, the hash check value must be the same size as the hash table index (so that an entry can always be found at, and moved to, its alternate slot), so that's what I've done for most of the testing. (in all the other variants you are allowed to set the size of the check value freely). I also tested an 8-bit check value for the single lookup case.

I'm interested in low memory use and in really stressing the cache table, so most of the testing was at 18 bits of hash table index. Even at 20 bits the difference between Cuckoo and no-Cuckoo disappears.

The results :


18 bit hash :

Single hash ; no confirmation :
Compress : 100000000 -> 29409370 : 2.353

Single hash ; 8 bit confirmation :
Compress : 100000000 -> 25169283 : 2.014

Single hash ; hash_bits size confirmation :
Compress : 100000000 -> 25146207 : 2.012

Dual Hash ; hash_bits size confirmation :
Compress : 100000000 -> 24933453 : 1.995

Cuckoo : (max of 10 pushes)
Compress : 100000000 -> 24881931 : 1.991

Conclusion : Cuckoo Caching is not compelling for data compression. Having some confirmation hash is critical, but even 8 bits is plenty. Dual hashing is a good win over single hashing (and surprisingly there's very little speed penalty (with small cache tables, anyway, where you are less likely to pay bad cache miss penalties)).

For the record :


variation of compression with hash table size :

two hashes, no cuckoo :

24 bit o5 hash : (24,22,20,16,8)
Compress : 100000000 -> 24532038 : 1.963
20 bit o5 hash : (20,19,18,16,8)
Compress : 100000000 -> 24622742 : 1.970
18 bit o5 hash : (18,17,17,16,8)
Compress : 100000000 -> 24933453 : 1.995

Also, unpublished result : noncuckoo-dual-hashing is almost as good with the 2nd hash kept within cache page range of the 1st hash. That is, the good thing to do is lookup at [hash1] and [hash1 + 1 + (hash2&0xF)] , or some other form of semi-random nearby probe (as opposed to doing [hash1] and [hash2] which can be quite far apart). Just doing [hash1] and [hash1+1] is not as good.
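A minimal sketch of that nearby second probe (the constants and the wrap with HASH_SIZE are illustrative) :


int index1 = hash1;
int index2 = (hash1 + 1 + (hash2 & 0xF)) & (HASH_SIZE-1);   // second probe stays within 16 slots of the first
// then probe at index1 and index2 instead of at hash1 and hash2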

1 comment:

Paul W. said...

A couple of hardware cache designs seem relevant: skewed associative caches and direct-mapped caches with small victim caches.

A skewed-associative cache is a multiway cache (maybe just 2-way) with a different hash function for each way, so that things which compete for the same cache row in one way generally don't collide in the other.

A victim cache is a small cache of stuff that's been recently evicted from the main cache. Even a very small cache can usually hang onto the stuff that's been evicted from the main cache solely due to random hash collisions. (Unless you have a hash function that's pathological for your data.) It can be tiny in comparison to the main cache, and still get much of the benefit---if you only have a few random hot spots in the main table, you only need a few entries in your victim cache.

I'm wondering if a good cache table design would be to combine these ideas and use a large direct-mapped main table with a much smaller direct-mapped victim table. (E.g., 64K-entry main table with a 1K-entry victim table, which can be mostly hardware cache-resident but too small to matter much to hardware cache performance.)
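A minimal sketch of that arrangement, in the style of the snippets above (the table sizes and the spill-every-eviction policy are illustrative) :


Data main_table[1<<16];     int main_check[1<<16];     // large direct-mapped main table
Data victim_table[1<<10];   int victim_check[1<<10];   // small direct-mapped victim table

lookup :

hash  = hash_func(key) & ((1<<16)-1);
check = alt_hash_func(key);
vh    = hash & ((1<<10)-1);
if ( main_check[hash] == check )       { Data & found = main_table[hash]; }
else if ( victim_check[vh] == check )  { Data & found = victim_table[vh]; }
// (on a victim hit you would typically swap that entry back into the main table)

insert :

vh = hash & ((1<<10)-1);
victim_table[vh] = main_table[hash];    // the evicted entry gets a second chance
victim_check[vh] = main_check[hash];
main_table[hash] = data;
main_check[hash] = check;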

For normal hardware caches, victim caches can be a nice win during phases (at the right timescale) where most of your data fit in memory, and most of your misses in a direct-mapped cache are due to collisions. They don't help as much during transients where you're evicting lots of old data and bringing in fresh data, but they still help---e.g., the top of the activation stack will typically stay cached because you keep hitting it while you're hitting other (new) stuff---it may get bounced to the victim cache when you hit some new data with the same hash, but it will likely get bounced back to the main cache before it gets evicted. (And in the case of almost-all-new data, most of your stuff that's evicted would be evicted anyhow, because you're just touching too much data, and keeping a few things that stay hot is a little better.)

For text compression, I'd think the same thing more or less applies. For example, if you're compressing English text, you get transients where the subject terms and mode-of-speech phrases change, but very common and dispersed English phrases (" if the", " and a", " to the", etc.) usually keep appearing. You don't want to simply evict those from the cache table, and giving them a brief second chance with a smallish victim table seems good.

The fact that small dictionaries of 1000 common words and phrases generally work pretty well for text compression suggests that a victim cache table with a comparable number of entries would help with hashing---it would be enough to keep most of the stably common things cached.

For binary data, the situation seems less clear. With binary data you often have a whole lot more unique hashes because e.g., the low bits of numbers keep changing and generating distinct hashes. You're often seeing a flood of "new" data, putting much more pressure on your victim cache table. (But then, that's going to put pressure on whatever mechanism you use to deal with collisions.)
