cbloom rants

3/02/2015

03-02-15 - Oodle LZ Pareto Frontier

I made a chart.

I'm showing the "total time to load" (time to load off disk at a simulated disk speed + time to decompress). You always want lower total time to load - smaller files make the simulated load time less, faster decompression makes the decompress time less.


total_time_to_load = compressed_size / disk_speed + decompress_time

It looks neatest in the form of "speedup". "speedup" is the ratio of the effective speed vs. the speed of the disk :


effective_speed = raw_size / total_time_to_load

speedup = effective_speed / disk_speed

By varying disk speed you can see the tradeoff of compression ratio vs. cpu usage that makes different compressors better in different application domains.

If we write out what speedup is :


speedup = raw_size / (compressed_size + decompress_time * disk_speed)

speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)

speedup ~= harmonic_mean( compression_ratio , decompress_speed / disk_speed )

we can see that it's a "harmonic lerp" between compression ratio on one end and decompress speed on the other end, with the simulated disk speed as lerp factor.
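
Concretely, the curves in the charts below are just this function swept over disk speeds (a minimal sketch of the computation; the function name is mine, not Oodle's) :

double speedup( double raw_size, double compressed_size,
                double decompress_time, double disk_speed )
{
    // sizes in the same units, disk_speed in size-units per second, decompress_time in seconds
    double total_time_to_load = compressed_size / disk_speed + decompress_time;
    double effective_speed = raw_size / total_time_to_load;
    return effective_speed / disk_speed;
}

// sweep the X axis : for x = 0..13 , plot speedup at disk_speed = 2^x mb/s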

These charts show "speedup" vs. log of disk_speed :

(the log is log2, so 13 is a disk speed of 8192 mb/s).

On the left side, the curves go flat. At the far left (x -> -infinity, disk speed -> 0) the height of each curve is proportional to the compression ratio. So just looking at how they stack up on the far left tells you the compression ratio performance of each compressor. As you move right, decompression speed becomes more and more important and compressed size less so.

With ham-fisted-shading of the regions where each compressor is best :

The thing I thought was interesting is that there's a very obvious Pareto frontier. If I draw a tangent across the best compressors :

Note that at the high end (right), the tangent goes from LZB to "memcpy" - not to "raw". (raw is just the time to load the raw file from disk, and we really have to compare to memcpy because all the compressors fill a buffer that's different from the IO buffer). (actually the gray line I drew on the right is not great, it should be going tangent to memcpy; it should be shaped just like each of the compressors' curves, flat on the left (compressed size dominant) and flat on the right (decompress time dominant))

You can see there are gaps where these compressors don't make a complete Pareto set; the biggest gap is between LZA and LZH which is something I will address soon. (something like LZHAM goes there) And you can predict what the performance of the missing compressor should be.

It's also a neat way to test that all the compressors are good tradeoffs. If the gray line didn't make a nice smooth curve, it would mean that some compressor was not doing a good job of hitting the maximum speed for its compression level. (of course I could still have a systematic inefficiency; like quite possibly they're all 10% worse than they should be)


ADDENDUM :

If instead of doing "speedup" vs. loading raw you do speedup vs. memcpy, you get this :

The nice thing is the right-hand asymptotes are now constants, instead of decaying like 1/disk_speed.

So the left hand side (disk speed -> 0) shows the compression ratio, the right hand side (disk speed -> inf) shows the decompression speed, and in between shows the tradeoff.

03-02-15 - LZ Rep-Match after Match Strategies

Tiny duh note.

When I did the Oodle LZH I made a mistake. I used a zip-style combined codeword. Values 0-255 are a literal, and 256+ contain the log2ish of length and offset. The advantage of this is that you only have one Huff table and just do one decode, then if it's a match you also fetch some raw bits. It also models length-offset correlation by putting them in the codeword together. (this scheme is missing a lot of things that you would want in a more modern high compression LZ, like pos-state patterns and so on).

Then I added "rep matches" and just stuck them in the combined codeword as special offset values. So the codeword was :

{
256 : literal
4*L : 4 rep matches * L length slots (L=8 or whatever = 32 codes)
O*L : O offset slots * L length slots (O=14 and L = 6 = 84 codes or whatevs)
= 256+32+84 = 372
}

The problem is that rep-match-0 can never occur after a match. (assuming you write matches of maximum length). Rep-match-0 is quite important, on binary/structured files it can have very high probability. By using a single codeword which contains rep-match-0 for all coding events, you are incorrectly mixing the statistics of the after match state (where rep-match-0 has zero probability) and after literal state (where rep-match-0 has high probability).

A quick look at the strategies for fixing this :

1. Just use separate statistics. Keep the same combined codeword structure, but have two entropy tables, one for after-match and one for after-literal. This would also let you code the literal-after-match as an xor literal with separate statistics for that.

Whether you do xor-lit or not, there will be a lot of shared probability information between the two entropy tables, so if you do static Huffman or ANS probability transmission, you may need to use the cross-two-tables-similarity in that transmission.

In a static Huffman or ANS entropy scheme if rep-match-0 never occurs in the after-match code table, it will be given a codelen of 0 (or impossible) and won't take any code space at all. (I guess it does take a little code space unless you also explicitly special case the knowledge that it must have codelen 0 in your codelen transmitter)
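
For example, a sketch of option 1 in an LZH like the one described above (the table and function names here are made up for illustration, not actual Oodle code) :

// two entropy tables over the same 372-symbol alphabet,
// selected by what the previous decoded event was :

int decode_event( huff_decoder * dec, bool prev_was_match )
{
    const huff_table * t = prev_was_match ? &table_after_match : &table_after_literal;
    return huff_decode( dec, t );
}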

This is the simplest version of the more general case :

2. Context-code the rep-match event using match history. As noted just using "after match" or "after literal" as the context is the simplest version of this, but more detailed history will also affect the rep match event. This is the natural way to fix it in an adaptive arithmetic/ANS coder which uses context coding anyway. eg. this is what LZMA does.

Here we aren't forbidding rep-match after match, we're just using the fact that it never occurs to make its probability go to 0 (adaptively) and thus it winds up taking nearly zero code space. In LZMA you actually can have a rep match after match because the matches have a max length of 273, so longer matches will be written as rep matches. Ryg pointed out that after a match that's been limited by max-length, LZMA should really consider the context for the rep-match coding to be like after-literal, not after-match.

In Oodle's LZA I write infinite match lengths, so this is simplified. I also preload the probability of rep-match in the after-match contexts to be near zero. (I actually can't load exactly zero because I do sometimes write a rep-match after match due to annoying end-of-buffer and circular-window edge cases). Preloading the probability saves the cost of learning that it's near zero, which saves 10-100 bytes.

3. Use different code words.

Rather than relying on statistics, you can explicitly use different code words for the after-match and after-literal case. For example in something like an LZHuf as described above, just use a codeword in the after-match case that omits the rep0 codes, and thus has a smaller alphabet.

This is most clear in something like LZNib. LZNib has three events :

LRL
Match
Rep Match
So naively it looks like you need to write a trinary decision at each coding (L,M,R). But in fact only two of them are ever possible :

After L - M or R
  cannot write another L because we would've just made LRL longer

After M or R - L or M
  cannot write an R because we would've just made the match longer

So LZNib writes the binary choices (M/R) after L and (L/M) after M or R. Because they're always binary choices, this allows LZNib to use the simple single-divider method of encoding values in bytes.
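
In code the state machine is trivial (a sketch; decode_binary_choice and the names are made up) :

enum Event { eLRL, eMatch, eRep };

Event decode_next_event( Event prev )
{
    if ( prev == eLRL )
    {
        // after a literal run : another LRL is impossible
        return decode_binary_choice() ? eRep : eMatch;
    }
    else
    {
        // after a match or rep : cannot write an R, it would've just made the match longer
        return decode_binary_choice() ? eMatch : eLRL;
    }
}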

4. Use a combined code word that includes the conditioning state.

Instead of context modeling, you can always take previous events that you need context from and make a combined codeword. (eg. if you do Huffman on the 16-bit bigram literals, you get order-1 context coding on half the literals).

So we can make a combined codeword like :

{
(LRL 0,1,2,3+) * (rep matches , normal matches) * (lengths slots)
= 4 * (4 + 14) * 8 = 576
}

Which is a pretty big alphabet, but also combined length slots so you get lrl-offset-length correlation modeling as well.

In a combined codeword like this you are always writing a match, and any literals that precede it are written with an LRL (may be 0). The forbidden codes are LRL=0 and match =rep0 , so you can either just let those get zero probabilities, or explicitly remove them from the codeword to reduce the alphabet. (there are also other forbidden codes in normal LZ parses, such as low-length high-offset codes, so you would similarly remove or adjust those)
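
For concreteness, a hypothetical packing of that 576-symbol codeword (not actual Oodle code) :

// lrl_slot in [0,3] , match_slot in [0,3] for reps or [0,13] for normal matches ,
// len_slot in [0,7]

int make_codeword( int lrl_slot, bool is_rep, int match_slot, int len_slot )
{
    int match_index = is_rep ? match_slot : 4 + match_slot;  // 4 rep slots + 14 normal slots = 18
    return ( lrl_slot * 18 + match_index ) * 8 + len_slot;   // 4*18*8 = 576 codes
}

// forbidden : lrl_slot == 0 && is_rep && match_slot == 0  (rep0 immediately after a match)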

A more minimal codeword is just

{
(LRL 0,1+) * (rep-match-0, any other match)
= 2 * 2 = 4
}

which is enough to get the rep-match-0 can't occur after LRL 0 modeling. Or you can do anything between those two extremes to choose an alphabet size.

2/20/2015

02-20-15 - Shoes All Suck

Shoes are fucking bullshit. This is another in the long series of "everyone else is a fucking moron and I should be in charge of everything".

Skippable background : {

My feet are wide in the toebox area (forefoot). This has caused me enormous problems over the years. Basically no shoes fit. As a result of wearing shoes that are too tight in the forefoot, I developed a neuroma. My neuroma is in the usual place, between the metatarsal heads. There's basically no treatment for neuromas. I've tried orthotics and cortisone shots and all the usual stuff. There are surgical options, but the success rate is extremely low and the problem often recurs worse after surgery. The neuroma goes from being a minor annoyance to being incredibly painful when inflamed. It's affected what I can do in my life; I never run on concrete, I try to avoid long walks or standing on concrete. I still do things like hiking and backpacking, but I know I'll be in great pain afterward and just take that as part of the deal.

Recently it's gotten much worse. The problem is I've subconsciously developed a gait that avoids putting pressure on the neuroma. For the past I don't know how many years, I've been walking really crooked. I didn't really notice until it started having bad effects up the chain. First I had an SI slip in my left hip. I got it popped back in, then a month later had another SI slip. (I now know how to pop it back in myself, it's pretty easy). I started going to physical therapy for the left hip, and that immediately set off a groin strain on my left pelvis, which has been lingering now for several months. Basically short version is that the left foot neuroma has now caused bad effects all the way up the chain due to subconscious favoring.

(I had the same kind of issue with my shoulder separation ; if I could just force my body to keep moving normally and ignore the pain, it actually doesn't have any bad mechanical consequences. In fact the surgical treatment for a neuroma is just to try to kill the nerve. The pain is not indicative of damage occurring, you want to just power through it. The problem is that subconsciously the body just takes over and modifies your movements to avoid the pain, and those modifications are actually much worse for you than the original injury. I'm still doing PT almost every day to try to convince my shoulder that it's allowed to move in a normal way and doesn't have to do janky things to avoid pops.)

So I decided that the neuroma pain that I've been just ignoring for a long time was something that I need to try to address if it's going to fuck up my hips and low back and so on. Part of that is trying to find shoes that give me enough toe room.

}

So. I need 4E wide shoes, which basically just don't exist. (New Balance is the only significant maker of wide shoes, and they're bullshit. They just aren't actually wide. NB makes its wide shoes by putting a larger upper on the same sole, which means the ball of my foot is hanging over the side of the sole. Fail New Balance.) Even the rare 4E shoes that do exist are wrong, which we will now discuss.

Anyway. That's just background. The point is shoes fucking suck. I'm going to go by type of shoes.

First, oxfords and traditional dress shoes and such. I don't know WTF the point of these shoes is. Raised heel = terrible for gait. No. Tapered toe box = neuroma pain. No. Hard leather sole = uncomfortable. Most have terrible grip soles. Overly rigid soles that don't bend with the step. Just non-functional all around.

It's a shame because I'm tempted to get custom made shoes with a last cut for my foot. The problem is that the custom shoe maker guys only make these fucking non-functional antiquated bullshit shoes.

Second, traditional running shoes.

The stupid minimalist community is all down on these, and I agree with some of the complaints, but there are good things about them.

Good : cushioning to protect against pain from concrete and pebbles. flexible but structured upper that you can tighten to really fit snug on the foot so it doesn't slide around (thus allowing you to make a hard athletic "cut". If you can't make a cut in a shoe without the shoe sliding around, then the shoe does not work and goes in the fucking trash). generally grippy soles that aren't made of something retarded like vibram or leather.

Bad : too much heel-to-toe drop. Too much heel height in general causing early heel strike and strange gait. Too little feel of the ground. Too much taper in the toe box.

Let's talk about the last one : almost every shoe in the world has too much taper in the toe box. Feet are not pointed at the front. It's like the shoe makers all think we have feet that are shaped like boats that come to a point. They're trying to sell us jester shoes.

In fact real feet (especially freak feet like mine) are pretty flat in front. The foot is widest at the ball of the foot and then does *NOT* taper ahead of the ball - the toes go straight ahead, and then there's a pretty straight line across the front edge. Not a pointy taper.

I will illustrate the problem on a shoe I actually like a lot. The Asics Gel Unifire TR 4E is one of the best shoes of the 100 I have tried in the past few months. It's a true 4E sole, has good structure in the heel so my foot doesn't slide all over, and is wide enough in the ball of the foot. *However* it's pointed, which makes it still a bit painful for me.

I believe these images are self explanatory :

... and everyone who makes shoes is fired.

(the other things I hate about the Unifire are : 1. they're ugly as piss, and 2. the tread chunks stick out past the edges of the upper, which makes them feel bigger than they are. A shoe should feel as small as possible, like it's not there. 3. the heel is way too high; I do not want or need high-heel sneakers.)

cliff note version :
These are better shoes than the stupid shit that mainstream shoe manufacturers make :

Now, you may object - "cbloom, you're taking your own personal weird foot issue and projecting on the masses; blaming shoe makers for making crap shoes when in fact their shoes work fine for most people". Mmm, yes, but I don't think so. The thing is, I don't think your shoes actually *do* work for you. You may wear something ridiculously narrow and pointed like a typical Nike shoe, and you've been lucky enough to not get any obvious painful problems like a bunion or hammertoe or neuroma, so you think your shoes are great. But I don't think they are. Most people who wear western shoes have got their pinky & big toes jammed inward. This is fucking up your feet. So you've endured the toe-smushing. It doesn't mean it's actually right, and I believe your feet would be better off in shoes that don't crush the toes.

Aside from the shape of the toe box, traditional running shoes would be easy to fix (as opposed to like, the average "casual shoe" or dress shoe which are completely hopeless). For me the perfect sole is something like what is on my beloved Onitsuka Tigers. I don't need a fucking one inch thick insane high-heel rubber sole like you get on New Balance or the Unifire. I want one centimeter like you get on the Tigers. (Tigers do have a tiny bit too much drop, I'd like a little more padding under the toes and a little less in the heel). I think that retro sneakers are as close as you can get to a perfect shoe. They're generally light weight, you can actually tie them tight to your foot, they have pretty good ground feel. They have the plusses of running shoes but without too much sole. Unfortunately for me there's not a single one of them that's made wide, and they're all too tapered in the front.

Summarizing the mainstream shoe varieties :

And the retro/fashion sneaker is the closest; it's almost there. Like, if feet were shaped like fucking torpedos it would be right. It would be right if the fucking jester shoes were right. If you only had two toes and your foot tapered down to those two toes like a fucking sloth it would be right. If you were a 19th century Japanese girl with bound feet it would be great. Not for an actual human being. But, hey it's the best that the shoe industry has done, so, we'll give it a thumbs up and an encouraging pat on the back with "almost".

So, let's talk about "minimal" shoes.

Minimal shoes are fucking bullshit. Shoes in the real world need some padding. Maybe if we only walked on grass and dirt and rubber they would be fine. But we don't, we walk on concrete and gravel and other painful shit. Like the fucking moron "paleo" eaters they love to talk about how the "foot evolved to be barefoot" blah blah. Okay, first of all, a minimal shoe is not at all like actually being barefoot. That's just a lie. You don't move in the same way at all. You don't feel the ground, you can't grip with your toes, you don't even step the same way. Second of all, when the foot evolved we didn't walk on fucking concrete all day, we didn't have broken glass and nails, etc. Third of all, when apes evolved the ability to make things one of the very first fucking things they make in any primitive society is shoes.

(the only minimal shoe that I sort of believe in is the VFF, which at least allows your toes to grip the ground and so on, in theory. The "minimal" shoe I'm primarily objecting to here is something like the vivobarefoot, which is basically a leather sock. It provides no padding. It slips around on your foot if you try to do anything athletic in it. It removes your toes' ability to actually feel the ground. It's total bullshit.)

(I do actually believe that truly barefoot is the best way to be, when possible. That's very different than "barefoot" shoes. It also only makes sense to me on surfaces that are sufficiently soft. eg. maybe if you have a nice rubber running track that you know to be free of nails and glass - run on it truly barefoot, and yeah I totally support that)

The distressing thing is that in some ways the minimal shoes are doing things right. Some of them actually have quite wide toe boxes (*) that let your feet spread nicely. I like zero drop. Actually for me I think a tiny bit of drop is better, but zero drop is better than the huge drop of traditional shoes. I like light weight. So that's all good. What I don't like is zero padding. I also don't like unstructured; if a shoe slips around it's not right.

(* = unfortunately almost all of the minimal shoes CLAIM to have "wide toe boxes" but very few of them actually do. Altra, innov8, not at all wide. VivoBarefoot is the best I've found in terms of shoe shape - actually wide and squared off in the front (they look like clown shoes), but they have literally zero padding and are unwearable. Reebok Crossfit Nanos and Lems shoes are probably good for someone with a normal width foot; they feel like a 2E to me, which is not wide enough for me but probably okay for others.)

Crossfit Nanos and Lems are what I would call "semi-minimal". They have a little more structure and actually have about 1 cm of sole thickness for some padding. Lems are the best designed of all the shoes I tried. If I fit Lems, that's what I would wear. Unfortunately, they only make whole Euro sizes (**), and they're also a tiny bit too narrow in the toes for me. The Nanos should be pretty great, but for some moronic reason they've made the bottom of the shoe out of hard plastic instead of soft rubber, so even though it's thick enough it's painful to walk on and not flexible enough. (yeah yeah weight lifting, bullshit, whatever).

(** = whole Euro sizes, WTF. The step between sizes is way too big.)

The best shaped shoe that I've found is the Patagonia Loulu. It's a hybrid shoe; semi-minimal, semi-sneaker. Unfortunately it's almost ruined by a fucking retarded synthetic sole. It's some recycled plastic garbage and it just doesn't work. It's widely known to wear away insanely quickly, and despite that it also has zero grip in the wet. It's like ice skating. Fucking awful moronic sole material. JUST USE NATURAL RUBBER you fucking morons, it was perfect a hundred years ago and you're only making it worse.

While I'm at it let me complain about shoe shopping in general :

1. How hard is it to fucking standardize the sizes? I should be able to measure my foot and buy a shoe that will fit. Instead a fucking "size 10" is slightly different in every damn brand.

2. Widths are even worse. Half the 4E shoes I bought were just not actually wide at all. Even the ones that were wide suffer from a fundamental problem with unclarity of the specification. There's an ambiguity in width. Is it just the toe box that's wide, and the heel is narrow? Are they both wide? Is it just the upper (bullshit) or the sole also? Furthermore, in a lot of brands "wide" actually means "fat fuck" and the shoe is just bigger all over.

3. How fucking terrible is the internet. I have to search on a whole mess of different web sites. They all have different searches, most of which are broken (like width is not an option, or I can't filter by size until I select a type of shoe). Lots of shoes are only for sale on the manufacturer's web site which I have to go digging around to find and then use their own custom terrible interface. But most of all -

- none of them actually take advantage of the internet. Shopping sites could easily be wikis where customers can mark up entries with extra information. Some people would be nerdy enough to actually measure shoes and add that information so I could search by true measurements instead of the stupid official sizes. I should be able to cross-index all the different shopping sites with one global index. But no. None of that. It's just so fucking primitive.

4. Why can't I get custom made sneakers? I should be able to measure my foot and provide specs and get shoes made to order for reasonable fees.

It's all so antiquated. It requires actually trying on 100's of shoes to find one that fits. There will of course always be some aspect of personal feel, but right now it's not even that. It's like I put on a shoe and immediately go "nope, this 4E is not actually wide at all" or "nope, this supposed size 10.5 is actually a 10 or 11". I'm just wasting all this time weeding out candidates that could have been done systematically.

2/14/2015

02-14-15 - Template Arithmetic Coder Generation

I did something for Oodle LZA that I thought worked out very neatly. I used templates to generate efficient & arbitrary variants of arithmetic coder decompositions.

First you define a pattern. Mine was :


"coder" pattern implements :

void reset();
int decode( arithmetic_decoder * dec );
void encode( arithmetic_encoder * enc, int val );
int cost(int val) const;

where encode & decode also both adapt the model (if any).

You can then implement that pattern in various ways.

Perhaps the first "coder" pattern is just a single bit :


template <int t_tot_shift, int t_upd_shift>
struct arithbit_updshift
{
    U16  m_p0;

    void reset()
    {
        m_p0 = 1<<(t_tot_shift-1);
    }
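
    // (a guess at the elided bit coder, for illustration only - "decode_bit" is
    //  a hypothetical arithmetic_decoder call that decodes one binary event with
    //  P(bit==0) = m_p0 / 2^t_tot_shift ; encode is symmetric)
    int decode( arithmetic_decoder * dec )
    {
        int bit = dec->decode_bit( m_p0, t_tot_shift );

        // shift m_p0 towards the observed bit :
        if ( bit ) m_p0 -= m_p0 >> t_upd_shift;
        else       m_p0 += ( (1<<t_tot_shift) - m_p0 ) >> t_upd_shift;

        return bit;
    }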

    .. etc ..

};

And then you do some N-ary coders. The standard ones being :


template <int t_alphabet_size>
struct counting_coder;

template <int t_numbits, typename t_arithbit>
struct bitwise_topdown_coder;

template <int t_numbits, typename t_arithbit>
struct bitwise_bottomup_coder;

template <int t_maxval, typename t_arithbit>
struct unary_coder;

I'm not going to show implementations of all of these in this post, but I'll do a few here and there for clarity. The point is more about the architecture than the implementation.

For example :


template <int t_numbits, typename t_arithbit>
struct bitwise_topdown_coder
{

    enum { c_numvals = 1<<t_numbits };
    //t_arithbit m_bins[c_numvals-1]; // <- what's needed (use ctx-1 index)
    t_arithbit m_bins[c_numvals]; // padded for simpler addressing (use ctx index)
    
    int decode( arithmetic_decoder * dec )
    {
        int ctx = 1;

        for(int i=0;i<t_numbits;i++)
        {
            int bit = m_bins[ctx].decode(dec);
            ctx <<= 1; ctx |= bit;
        }

        return ctx & (c_numvals-1);
    }
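
    // (a guess at the matching encode, for illustration : walk the bits top-down,
    //  using the bits sent so far as the context index, mirroring decode above)
    void encode( arithmetic_encoder * enc, int val )
    {
        int ctx = 1;

        for(int i=t_numbits-1;i>=0;i--)
        {
            int bit = (val>>i)&1;
            m_bins[ctx].encode(enc,bit);
            ctx <<= 1; ctx |= bit;
        }
    }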

    etc ...
};

and "t_arithbit" is your base coder pattern for doing a single modeled bit. By making that a template parameter it can be something like

arithbit_countn0n1

or a template like :

arithbit_updshift<12,5>

etc.

The fun thing is you can start combining them to make new coders which also fit the pattern.

For example, say you want to do a bitwise decomposition coder, but you don't have an even power of 2 worth of values? And maybe you don't want to put your split points right in the middle?


// t_fracmul256 is the split fraction times 256
// eg. t_fracmul256 = 128 is middle splits (binary subdivision)
// t_fracmul256 = 0 is a unary coder

template <int t_numvals, int t_fracmul256, typename t_arithbit>
struct bitwise_split_coder
{
    enum { c_numvals = t_numvals };
    enum { c_numlo = RR_CLAMP( ((t_numvals*t_fracmul256)/256) , 1, (t_numvals-1) ) };
    enum { c_numhi = t_numvals - c_numlo };

    t_arithbit  m_bin;
    bitwise_split_coder<c_numlo,t_fracmul256,t_arithbit> m_lo;
    bitwise_split_coder<c_numhi,t_fracmul256,t_arithbit> m_hi;

    void reset()
    {
        m_bin.reset();
        m_lo.reset();
        m_hi.reset();
    }

    void encode(arithmetic_encoder * enc,int val)
    {
        if ( val < c_numlo )
        {
            m_bin.encode(enc,0);
            m_lo.encode(enc,val);
        }
        else
        {
            m_bin.encode(enc,1);
            m_hi.encode(enc, val - c_numlo );
        }
    }

    .. etc ..

};

+ explicit template specialize for t_numvals=1,2
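
For concreteness, the base-case specializations might look something like this (my guess, not the actual Oodle code) :

template <int t_fracmul256, typename t_arithbit>
struct bitwise_split_coder<1,t_fracmul256,t_arithbit>
{
    void reset() { }
    void encode(arithmetic_encoder * enc,int val) { } // only one value : nothing to send
};

template <int t_fracmul256, typename t_arithbit>
struct bitwise_split_coder<2,t_fracmul256,t_arithbit>
{
    t_arithbit m_bin;

    void reset() { m_bin.reset(); }
    void encode(arithmetic_encoder * enc,int val) { m_bin.encode(enc,val); }
};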

This lets you compile-time generate funny branching trees to be able to handle something like "my alphabet has 37 symbols, and I want to code each interval as a binary flag at 1/3 of the range, so the first event is [0-12][13-37]" and so on.

And then you can make yourself some generic tools for plugging coders together. The main ones I use are :


// val_split_coder :
// use t_low_coder below t_low_count
// then t_high_coder above t_low_count

template <int t_low_count, typename t_low_coder, typename t_high_coder , typename t_arithbit>
struct val_split_coder
{
    t_arithbit  m_bin;
    t_low_coder m_lo;
    t_high_coder m_hi;

    void encode(arithmetic_encoder * enc,int val)
    {
        if ( val < t_low_count )
        {
            m_bin.encode(enc,0);
            m_lo.encode(enc,val);
        }
        else
        {
            m_bin.encode(enc,1);
            m_hi.encode(enc, val - t_low_count );
        }
    }
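
    // (a guess at the matching decode, for illustration)
    int decode( arithmetic_decoder * dec )
    {
        if ( m_bin.decode(dec) == 0 )
            return m_lo.decode(dec);
        else
            return t_low_count + m_hi.decode(dec);
    }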

    .. etc .. 

};

// bit_split_coder :
// use t_low_coder for t_low_bits
// use t_high_coder for higher bits
// (high and low bits are independent)

template <int t_low_bit_count, typename t_low_coder, typename t_high_coder >
struct bit_split_coder
{
    t_low_coder m_lo;
    t_high_coder m_hi;

    void encode(arithmetic_encoder * enc,int val)
    {
        int low = val & ((1<<t_low_bit_count)-1);
        int high = val >> t_low_bit_count;
        m_lo.encode(enc,low);
        m_hi.encode(enc,high);
    }

    .. etc .. 
};

// bit_split_coder_contexted :
// split bits, code hi then low with hi as context
// (used for decomposition of a value where the bits are dependent)

template <int t_low_bit_count, typename t_low_coder, int t_high_bit_count, typename t_high_coder >
struct bit_split_coder_contexted
{
    t_high_coder m_hi;
    t_low_coder m_lo[(1<<t_high_bit_count)];

    void encode(arithmetic_encoder * enc,int val)
    {
        int high = val >> t_low_bit_count;
        int low = val & ((1<<t_low_bit_count)-1);

        m_hi.encode(enc,high);
        m_lo[high].encode(enc,low);
    }

    .. etc .. 
};

So that gives us a bunch of tools. Then you put them together and make complicated things.

For example, my LZ match length coder is this :


val_split_coder< 8 , 
    bitwise_topdown_coder< 3 , arithbit_updshift<12,5> > ,  // low val coder
    numsigbit_coder< unary_coder<16, arithbit_updshift<12,5> > > ,  // high val coder
    arithbit_updshift<12,5> >

which says : first code a binary event for length < 8 or not. If length < 8 then code it as a 3-bit binary value. If >= 8 then code it using a num-sig-bits decomposition where the # of sig bits are written using unary (arithmetic coded) and the remainder is sent raw.
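
(numsigbit_coder isn't shown in this post; to make the example concrete, here's a guess at its shape. "put_raw_bits" is a hypothetical encoder call for unmodeled bits, and the real Oodle version surely differs.)

template <typename t_count_coder>
struct numsigbit_coder
{
    t_count_coder m_numsigbits;

    void reset() { m_numsigbits.reset(); }

    void encode(arithmetic_encoder * enc,int val)
    {
        // send the # of significant bits modeled, then the bits under the top bit raw
        int nbits = ( val > 0 ) ? bit_scan_top_bit_pos(val) + 1 : 0;
        m_numsigbits.encode(enc,nbits);
        if ( nbits > 1 )
            enc->put_raw_bits( val & ((1<<(nbits-1))-1) , nbits-1 );
    }

    .. etc ..
};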

And the standard LZ offset coder that I described in the last post is something like this :


val_split_coder< 64 , 
    bitwise_topdown_coder< 6 , arithbit_updshift<12,5> > , // low vals
    bit_split_coder< 5 ,  // high vals
        bitwise_bottomup_coder< 5 , arithbit_updshift<12,5> > ,  // low bits of high vals
        numsigbit_coder< unary_coder<30, arithbit_updshift<12,5> > >  // high bits of high vals
    > ,
    arithbit_updshift<12,5> 
    >

To be clear : the advantage of this approach is that you can easily play with different variations and decompositions, plug in different coders for each portion of the operation, etc.

The code generated by this is very good, but once you fully lock down your coders you probably want to copy out the exact cases you used and hand code them, since the human eye can do things the optimizer can't.

2/10/2015

02-10-15 - LZ Offset Modeling Rambles

Canonical LZ offset coding :

The standard way to send LZ offsets in a modern LZ coder is like this :

1. Remove the bottom bits. The standard number is 4-7 bits.


low = offset & ((1<<lowbits)-1);
offset >>= lowbits;

then you send "low" entropy coded, using a Huffman table or arithmetic or whatever. (offsets less than the mask are treated separately).

The point of this is that offset bottom bits have useful patterns in structured data. On text this does nothing for you and could be skipped. On structured data you get probability peaks for the low bits of offset at 4,8,12 for typical dword/qword data.

You also get a big peak at the natural structure size of the data. eg. if it's a D3DVertex or whatever and it's 36 or 72 bytes there will be big probability peaks at 36,72,108,144.

2. Send the remaining high part of offset in a kind of "# sig bits" (Elias Gamma) coding.

Count the number of bits on. Send the # of bits on using an entropy coder.

Then send the bits below the top bit raw, non-entropy coded. Like this :


topbitpos = bit_scan_top_bit_pos( offset );
ASSERT( offset >= (1<<topbitpos) && offset < (2<<topbitpos) );

rawbits = offset & ((1<<topbitpos)-1);

send topbitpos entropy coded
send rawbits in topbitpos bits

Optionally you might also entropy-code the bit under the top bit. You could just arithmetic code that bit (conditioned on the position of the top bit as context). Or you can make an expanded top-bit-position code :


slot = 2*topbitpos + ((offset>>(topbitpos-1))&1)

send "slot" entropy coded
send only topbitpos-1 of raw bits

More generally, you can define slots by the number of raw bits being sent in each level. We've done :

Straight #SB slots :

{0,1,2,3,4,5,...}

Slot from #SB plus one lower bit :

{0,1,1,2,2,3,3,4,4...}

More generally it helps a little to put more slots near the bottom :

{0,0,1,1,1,2,2,2,3,3,4,4,5,6,7,8...}

but in the more general method you can't use a simple bitscan to find your slot.

The intermediate bits that are sent raw do have a slight probability decline for larger values. But there's very little to win there, and a lot of noise in the modeling. In something like a Huffman-coded-LZ, sending the code lengths for extra slots there costs more than the win you get from modeling it.

With this we're just modeling the general decrease in probability for larger offsets. Something that might be interesting that I've never heard of anyone doing is to fit a parametric probability function, laplacian or something else, to offset. Instead of counting symbols to model, fit the parametric function, either adaptively or statically.


So, a whole offset is sent something like this :


offset bits (MSB on left) :

  SSRRRRRRRRLLLLL

S = bit sent in slot index, entropy coded
R = raw bit
L = low bit, entropy coded
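
Putting the pieces together, sending one offset looks something like this (a sketch using 4 low bits and the plain #SB slots ; "encode_symbol" and "put_raw_bits" stand in for whatever entropy coder back end you use) :

void encode_offset( U32 offset )
{
    // offsets below the low mask are handled separately, so offset >= 16 here

    U32 low = offset & 0xF;   // lowbits = 4
    offset >>= 4;

    int topbitpos = bit_scan_top_bit_pos( offset );
    U32 rawbits = offset & ((1U<<topbitpos)-1);

    encode_symbol( slot_model, topbitpos );   // S : slot, entropy coded
    put_raw_bits( rawbits, topbitpos );       // R : middle bits, raw
    encode_symbol( low_model, low );          // L : low bits, entropy coded
}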


Now some issues and followup.

I. The low bits.

The low-bits-mask method actually doesn't handle the structure size very well. It does well for the 4-8 dword/qword stuff, because those generally divide evenly into the low bit mask. (a file that had trigram structure, like an RGB bitmap, would have problems).

The problem is the structure size is rarely an exact multiple of the power of two "lowbits" mask. For example :


 36&0xF = 0x04 = 0100
 72&0xF = 0x08 = 1000
108&0xF = 0x0C = 1100
144&0xF = 0x00 = 0000

A file with structure size of 36 will make four strong peaks if the lower 4 bits are modeled.

If instead of doing (% 16) to get low bits, you did (% 36), you would get a perfect single peak at zero.

Any time the structure size doesn't divide your low bits, you're using extra bits that you don't need to.

But this issue also gets hidden in a funny way. If you have "repeat match" codes then on strongly structured data your repeat offsets will likely contain {36,72,...} which means those offsets don't even go into the normal offset coder. (the redundancy between low-bits modeling and repeat matches as a way of capturing structure is another issue that's not quite right).

II. Low bit structure is not independent of high bits.

Sometimes you get a file that just has the exact same struct format repeated throughout the whole file. But not usually.

It's unlikely that the same structure that occurs in your local region (4-8-36 for example) occurs back at offset one million. It might be totally different structure there. Or, it might even be the same structure, but shifted by 1 because there's an extra byte somewhere, which totally mucks up the low bits.

The simplest way to model this is to make the lowbits entropy coder be conditioned by the high bits slot. That is, different models for low offsets vs higher offsets. The higher offsets will usually wind up with low bits that are nearly random (equiprobable) except in the special case that your whole file is the same structure.

More generally, you could remember the local structure that you detect as you scan through the file. Then when you match back to some prior region, you look at what the structure was there.


An obvious idea is to use this canonical coding for offsets up to 32768 (or something), and then for higher offsets use LZSA.

So essentially you have a small sliding LZ77 window immediately preceding your current pointer, and as strings fall out of the LZ77 window they get picked up in a large LZSA dictionary behind.

2/09/2015

02-09-15 - LZSA - Some Results

So I re-ran some Oodle Network tests to generate some LZSA results so there's some concreteness to this series.

"Oodle Network" is a UDP packet compressor that works by training a model/dictionary on captured packets.

The shipping "OodleNetwork1 UDP" is a variant of LZP. "OodleStaticLZ" is LZSA-Basic and obviously HC is HC.

Testing on one capture with dictionaries from 2 - 64 MB :


test1   380,289,015 bytes

OodleNetwork1 UDP [1|17] : 1388.3 -> 568.4 average = 2.442:1 = 59.06% reduction
OodleNetwork1 UDP [2|18] : 1388.3 -> 558.8 average = 2.484:1 = 59.75% reduction
OodleNetwork1 UDP [4|19] : 1388.3 -> 544.3 average = 2.550:1 = 60.79% reduction
OodleNetwork1 UDP [8|20] : 1388.3 -> 524.0 average = 2.649:1 = 62.26% reduction
OodleNetwork1 UDP [16|21] : 1388.3 -> 493.7 average = 2.812:1 = 64.44% reduction
OodleNetwork1 UDP [32|22] : 1388.3 -> 450.4 average = 3.082:1 = 67.55% reduction
OodleNetwork1 UDP [64|23] : 1388.3 -> 390.9 average = 3.552:1 = 71.84% reduction

OodleStaticLZ [2] : 1388.3 -> 593.1 average = 2.341:1 = 57.28% reduction
OodleStaticLZ [4] : 1388.3 -> 575.2 average = 2.414:1 = 58.57% reduction
OodleStaticLZ [8] : 1388.3 -> 546.1 average = 2.542:1 = 60.66% reduction
OodleStaticLZ [16] : 1388.3 -> 506.9 average = 2.739:1 = 63.48% reduction
OodleStaticLZ [32] : 1388.3 -> 445.8 average = 3.114:1 = 67.89% reduction
OodleStaticLZ [64] : 1388.3 -> 347.8 average = 3.992:1 = 74.95% reduction

OodleStaticLZHC [2] : 1388.3 -> 581.6 average = 2.387:1 = 58.10% reduction
OodleStaticLZHC [4] : 1388.3 -> 561.4 average = 2.473:1 = 59.56% reduction
OodleStaticLZHC [8] : 1388.3 -> 529.9 average = 2.620:1 = 61.83% reduction
OodleStaticLZHC [16] : 1388.3 -> 488.6 average = 2.841:1 = 64.81% reduction
OodleStaticLZHC [32] : 1388.3 -> 429.4 average = 3.233:1 = 69.07% reduction
OodleStaticLZHC [64] : 1388.3 -> 332.9 average = 4.170:1 = 76.02% reduction

--------------

test2   423,029,291 bytes

OodleNetwork1 UDP [1|17] : 1406.4 -> 585.4 average = 2.402:1 = 58.37% reduction
OodleNetwork1 UDP [2|18] : 1406.4 -> 575.7 average = 2.443:1 = 59.06% reduction
OodleNetwork1 UDP [4|19] : 1406.4 -> 562.0 average = 2.503:1 = 60.04% reduction
OodleNetwork1 UDP [8|20] : 1406.4 -> 542.4 average = 2.593:1 = 61.44% reduction
OodleNetwork1 UDP [16|21] : 1406.4 -> 515.6 average = 2.728:1 = 63.34% reduction
OodleNetwork1 UDP [32|22] : 1406.4 -> 472.8 average = 2.975:1 = 66.38% reduction
OodleNetwork1 UDP [64|23] : 1406.4 -> 410.3 average = 3.428:1 = 70.83% reduction

OodleStaticLZ [2] : 1406.4 -> 611.6 average = 2.300:1 = 56.52% reduction
OodleStaticLZ [4] : 1406.4 -> 593.0 average = 2.372:1 = 57.83% reduction
OodleStaticLZ [8] : 1406.4 -> 568.2 average = 2.475:1 = 59.60% reduction
OodleStaticLZ [16] : 1406.4 -> 528.6 average = 2.661:1 = 62.42% reduction
OodleStaticLZ [32] : 1406.4 -> 471.1 average = 2.986:1 = 66.50% reduction
OodleStaticLZ [64] : 1406.4 -> 374.2 average = 3.758:1 = 73.39% reduction

OodleStaticLZHC [2] : 1406.4 -> 600.4 average = 2.342:1 = 57.31% reduction
OodleStaticLZHC [4] : 1406.4 -> 579.9 average = 2.425:1 = 58.77% reduction
OodleStaticLZHC [8] : 1406.4 -> 552.8 average = 2.544:1 = 60.70% reduction
OodleStaticLZHC [16] : 1406.4 -> 511.8 average = 2.748:1 = 63.61% reduction
OodleStaticLZHC [32] : 1406.4 -> 453.8 average = 3.099:1 = 67.73% reduction
OodleStaticLZHC [64] : 1406.4 -> 358.3 average = 3.925:1 = 74.52% reduction

Here's a plot of the compression on test1 ; LZP vs. LZSA-HC :

Y axis is comp/raw and X axis is log2(dic mb)

What you should see is :

OodleNetwork1 (LZP) is better at small dictionary sizes. I think this is just because it's a lot more tweaked out; it's an actual shipping quality codec, whereas the LZSA implementation is pretty straightforward. Things like the way you context model, how literals & lengths are coded, etc. need tweakage.

At around 8 MB LZSA catches up, and then as dictionary increases it rapidly passes LZP.

This is the cool thing about LZSA. You can just throw more data at the dictionary and it just gets better. With normal LZ77 encoding you have to worry about your offsets taking more bits. With LZ77 or LZP you have to make sure the data's not redundant or doesn't replace other more useful data. (OodleNetwork1 benefits from a rather careful and slow process of optimizing the dictionary for LZP so that it gets the most useful strings)

Memory use of LZSA is quite a bit higher per byte of dictionary, so it's not really a fair comparison in that sense. It's a comparison at equal dictionary size, not a comparison at equal memory use.

2/07/2015

02-07-15 - LZSA - Index Post

LZSA series :

cbloom rants 02-04-15 - LZSA - Part 1
cbloom rants 02-04-15 - LZSA - Part 2
cbloom rants 02-05-15 - LZSA - Part 3
cbloom rants 02-05-15 - LZSA - Part 4
cbloom rants 02-06-15 - LZSA - Part 5
cbloom rants 02-06-15 - LZSA - Part 6
cbloom rants 02-09-15 - LZSA - Some Results

And other suffix related posts :

cbloom rants 09-27-08 - 2 (On LZ and ACB)
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
09-26-11 - Tiny Suffix Note
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression
cbloom rants 06-21-14 - Suffix Trie Note
cbloom rants 07-14-14 - Suffix-Trie Coded LZ
cbloom rants 08-27-14 - LZ Match Length Redundancy
cbloom rants 09-10-14 - Suffix Trie EOF handling

02-07-15 - LZMA note

Originally posted at encode.ru

I just had a look in the LZMA code and want to write down what I saw. (I could be wrong; it's hard to follow!)

The way execution flows in LZMA is pretty weird, it jumps around : The outer loop is for the encoding position. At each encoding position, he asks for the result of the optimal parse ahead. The optimal parse ahead caches results found in some sliding window ahead. If the current pos is in the window, return it, else fill a new window.

To fill a window :

find matches at current position.
set window_len = longest match length
fill costs going forward as described by Bulat above
as you go forward, if a match length crosses past window_len, extend the window
if window reaches "fast bytes" then break out
when you reach the end of the window, reverse the parse and fill the cache
now you can return the parse result at the base position of the window

So, some notes :

1. LZMA in fact does do the things that I proposed earlier - it updates statistics as soon as it hits a choke point in the parse. The optimal-parse-ahead window is not always "fast bytes". That's a maximum. It can also stop any time there are no matches crossing a place, just as I proposed earlier.

2. LZMA caches the code costs ("prices") of match lengths and offsets, and updates them periodically. The match flags and literals are priced on the fly using the latest encoded statistics, so they are quite fresh. (from the start of the current parse window)

3. LZMA only stores 1 arrival. As I noted previously this has the drawback of missing out on some non-greedy steps. *However* and this is something I finally appreciate - LZMA also considers some multi-operation steps.

LZMA considers the price to go forward using :

literal
rep matches
normal matches

If you stored all arrivals, that's all you need to consider and it would find the true optimal parse. But LZMA also considers :

literal + rep0 match
rep + literal + rep0
normal match + literal + rep0

These are sort of redundant in that they can be made from the things we already considered, but they help with the 1 arrivals problem. In particular they let you carry forward a useful offset that might not be the cheapest arrival otherwise. ( literal + rep0 match is only tested if pos+1 's cheapest arrival is not a literal from the current pos )

That is, it's explicitly including "gap matches" in the code cost as a way of finding slightly-non-local-minima.

2/06/2015

02-06-15 - LZSA - Part 6

Some wrapping up.

I've only talked about LZSA for the case of static dictionaries. What about the more common case of scanning through a file, the dictionary is dynamic from previously transmitted data?

Well, in theory you could use LZSA there. You need a streaming/online suffix trie construction. That does exist, and it's O(N) just like offline suffix array construction. So in theory you could do that.

But it's really no good. The problem is that LZSA is transmitting substrings with probability based on their character counts. That's ideal for data that is generated by a truly static source (that also has no finite state patterns). Real world data is not like that. Real world sources are always locally changing (*). This means that low-offset recent data is much better correlated than old data. That is easy to model in LZ77 but very hard to model in LZSA; you would have to somehow work the offsets of the strings into their code space allocation. Data that has finite state patterns also benefits from numeric modeling of the offsets (eg. binary 4-8 patterns and so on).

(* = there's also a kind of funny "Flatland" type thing that happens in data compression. If you're in a 2d Flatland, then 2d rigid bodies keep their shape, even under translation. However, a 3d rigid body that's translating through your slice will appear to change shape. The same thing happens with data compression models. Even if the source comes from an actual static model, if your view of that model is only a partial view (eg. your model is simpler than the real model) then it will appear to be a dynamic model to you. Say you have a simple model with a few parameters, the best value of those parameters will change over time, appearing to be dynamic.)

Furthermore, at the point that you're doing LZSA-dynamic, you've got the complexity of PPM and you may as well just do PPM.

The whole advantage of LZSA is that it's an incredibly simple way to get a PPM-like static model. You just take your dictionary and suffix sort it and you're done. You don't have to go training a model, you don't have to find a way to compact your PPM nodes. You just have suffix sorted strings.


Some papers for further reading :

"A note on the Ziv - Lempel model for compressing individual sequences" - Langdon's probability model equivalent of LZ

"Unbounded Length Contexts for PPM" - the PPM* paper

"Dictionary Selection using Partial Matching" - LZ-PM , an early ROLZ

"PPM Performance with BWT Complexity" - just a modification of PPM* with some notes.

"Data compression with finite windows" - LZFG ; the "C2" is the Suffix Trie LZFG that I usually think of.

Mark Nelson's Suffix Tree writeup ; see also Larsson, Senft, and Ukkonen.

There's obviously some relation to ACB, particularly LZSA*, just in the general idea of finding the longest backwards context and then encoding the longest forward match from there, but the details of the coding are totally unrelated.

02-06-15 - LZSA - Part 5

LZSA is not really an LZ. This is kind of what fascinates me about LZSA and why I think it's so interesting (*). Ryg called it "lz-ish" because it's not really LZ. It's actually much closer to PPM.

(* = it's definitely not interesting because it's practical. I haven't really found a use of LZSA in the real world yet. It appears to be a purely academic exercise.)

The fundamental thing is that the code space used by LZSA to encode a match is proportional to the probability of that substring :


P(abc) = P(a) * P(b|a) * P(c|ab)

where P is the probability as estimated by observed counts. That kind of code-space allocation is very fundamentally PPM-ish.

This is in contrast to things like LZFG and LZW that also refer directly to substrings without offsets, but do not have the PPM-ish allocation of code space.

The funny thing about LZSA is that naively it *looks* very much like an LZ. The decoder of LZSA-Basic is :


LZSA-Basic decoder rough sketch :

{

handle literal case
decode match_length

arithmetic_fetch suffix_index in dictionary_size

match_ptr = suffix_array[suffix_index];

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove get_suffix_count( suffix_index, match_length ) , dictionary_size

}

whereas a similar LZ77 on a static dictionary with a flat offset model is :

LZ77-static-flat decoder rough sketch :

{

handle literal case
decode match_length

arithmetic_fetch offset_index in dictionary_size

match_ptr = dictionary_array + offset_index;

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove {offset_index,offset_index+1} , dictionary_size

}

they are almost identical. The only two changes are : 1. an indirection table for the match index, and 2. the arithmetic_remove can have a range bigger than one, eg. it can remove fewer than log2(dictionary_size) bits.
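
The interesting call in the LZSA sketch is get_suffix_count : the number of suffixes that share their first match_length characters with suffix_array[suffix_index]. A naive version using a precomputed LCP array (lcp[i] = longest common prefix of sorted suffixes i and i+1) would be something like this; a real implementation would precompute or binary search these intervals :

int get_suffix_count( int suffix_index, int match_length )
{
    // widen the interval while neighboring suffixes still share match_length chars
    int lo = suffix_index;
    int hi = suffix_index;
    while ( lo > 0 && lcp[lo-1] >= match_length ) lo--;
    while ( hi < dictionary_size-1 && lcp[hi] >= match_length ) hi++;
    return hi - lo + 1;
}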

We're going to have a further look at LZSA as a PPM by examining some more variants :


LZSA-ROLZ :

ROLZ = "reduced offset LZ" uses some previous bytes of context to reduce the set of strings that can be matched.

This is trivial to do in LZSA, because we can use the same suffix array that we used for string matching to do the context lookup as well. All we have to do is start the suffix array lookup at an earlier position than the current pointer.

eg. instead of looking up matches for "ptr" and requiring ML >= 3 , we instead look up matches for "ptr-2" and require ML >= 5. (that's assuming you want to keep the same MML at 3. In fact with ROLZ you might want to decrease MML because you're sending offsets in fewer bits, so you could use an MML of 2 instead, which translates to a required suffix lookup ML of 4).

That is, say my string to encode is "abracket" and I've done "ab" so I'm at "racket". My static dictionary is "abraabracadabra". The suffix sort is :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra
bra
braabracadabra
bracadabra
cadabra
dabra
ra
raabracadabra
racadabra
The dictionary size is 15.

With LZSA-Basic I would look up "racket", find a match of length 3, send ML=3, and index = 14 in a range of 15.

With LZSA-ROLZ-o2 I would do :


string = "ab|racket"

look up context "ab" and find the low & high suffix indexes for that substring
 (low index = 2, count = 3)

look up "abracket" ; found "abracadabra" match of length 5
 at index = 4

send ML=3

arithmetic_encode( suffix_index - context_low_index , context_count )
  (that's 2 in a range of 3)

You could precompute and tabulate the suffix ranges for the contexts, and then the complexity is identical to LZSA-Basic.
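
eg. for the order-2 case, one pass over the suffix sort gives you the (low index, count) range for every 2-byte context (a sketch ; names made up, dictionary is a U8 array) :

struct SuffixRange { int low; int count; };
SuffixRange ctx2[65536] = { };   // one range per 2-byte context

for(int i=0;i<dictionary_size;i++)
{
    int pos = suffix_array[i];
    if ( pos+1 >= dictionary_size ) continue;  // suffix too short to have an order-2 context

    int c = (dictionary[pos] << 8) | dictionary[pos+1];
    if ( ctx2[c].count == 0 ) ctx2[c].low = i; // equal prefixes are contiguous in the suffix sort
    ctx2[c].count++;
}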

In LZSA-ROLZ you cannot encode any possible string, so MML down to 1 like LZSA-HC is not possible. You must always be able to escape out of your context to be able to encode characters that aren't found within that context.

LZSA-Basic had the property of coding from order-0,order-1,2,3,.. ML, jumping back to order-0. In LZSA-ROLZ, instead of jumping down to order 0 after the end of a match, you jump down to order-2 (or whatever order you chose for the ROLZ). You might then have to jump again to order-0 to encode a literal. So you still have this pattern of the context order going up in waves and jumping back down, you just don't jump all the way to order-0.


LZSA-ROLZ* :

(pronounced "ROLZ star" ; named by analogy to "PPM*" (PPM star) which starts from the longest possible context)

This idea can be taken further, which turns out to be interesting.

Instead of doing ROLZ with a fixed order, do ROLZ from the highest order possible. That is, take the current context (preceding characters) and find the longest match in the dictionary. In order to do that efficiently you need another lookup structure, such as a suffix trie on the reversed dictionary (a prefix tree). The prefix tree should have pointers to the same string in the suffix tree.

eg. say you're coding "..abcd|efgh.."

You look up "dcba..." in the prefix tree (context backwards). The longest match you find is "dcbx..". So you're at order-3. You take the pointer over to the suffix tree to find "bcd" in the suffix tree. You then try to code "efgh..." from the strings that followed "bcd" in the dictionary.

You pick a match, send a match length, send an offset.

Say the deepest order you found is order-N. Then if you code a match, you're coding at order-N+1,order-N+2,etc.. as you go through the match length.

The funny thing is : those are the same orders you would code if you just did PPM* and only coded symbols, not string matches.

Say you're doing PPM* and you find an order-N context (say "abcd"). You successfully code a symbol (say "x"). You move to the next position. Your context now is "abcdx" - well, that context must occur in the dictionary because you successfully coded an x following abcd. Therefore you will find an order-N+1 context. Furthermore there can be no longer context or you would have found a longer one at the previous location as well. Therefore with PPM* as you successfully code symbols you will always code order-N, order-N+1, order-N+2, just like LZSA!

If LZSA-ROLZ* can't encode a symbol at order-N it must escape down to lower orders. You want to escape down to the next lower order that has new following characters. You need to use exclusion.

Remember that in LZSA the character counts are used just like PPM, because of the way the suffix ranges form a probability distribution. If the following strings are "and..,at..,boobs.." then character a is coded with a 2/3 probability.

There are a few subtle differences :

In real PPM* they don't use just the longest context. They use the *shortest* deterministic context (if there is any). If there is any deterministic context, then the longest context is deterministic, predicting the same thing. But there may be other mismatching contexts that predict the same thing and thus affect counts. That is :


..abcdefg|x    abcdefg is the longest context, and only x follows it
.....xefg|y    context "efg" sees both x and y.  So "defg" is the shortest deterministic context
....ydefg|x    but "ydefg" also predicts x

So the longest deterministic context only sees "x" with a count of 1

But if you use the shortest deterministic context, you see "x" with a count of 2

And that affects your escape estimation.

The big difference is in the coding of escapes vs. match lengths.

And if you code matches in LZSA* from orders lower than the deepest, that results in different order selection than PPM*.
