09-12-10 - Context Weighting vs Escaping

The defining characteristic of PPM (by modern understanding) is that you select a context, try to code from it, and if you can't you escape to another context. By contrast, context weighting selects multiple contexts and blends the probabilities. These are actually not as different as they seem, because an escape is really just a multiplicative weight applied to the parent context's probabilities. In particular :

    context 0 gives probabilities P0 (0 is the deeper context)
    context 1 gives probabilities P1 (1 is the parent)
    how do I combine them ?

    escaping :

    P = (1 - E) * P0 + E * (P1 - 0)

    P1 - 0 = Probabilities from 1 with chars from 0 excluded

    weighting :

    P = (1 - W) * P0 + W * P1

    with no exclusion in P1
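
To make this concrete, here's a minimal sketch of the two blends in C++ over explicit symbol counts. The count arrays, the helper names, and the externally supplied E and W are all my own scaffolding, just for illustration :

    // minimal sketch, assuming each context stores raw symbol counts;
    // E and W come from elsewhere (escape estimator / weight estimator)
    #include <vector>

    // normalize raw counts into probabilities
    static std::vector<double> to_probs(const std::vector<int> & counts)
    {
        double total = 0;
        for (int c : counts) total += c;
        std::vector<double> p(counts.size(), 0.0);
        if (total == 0) return p;
        for (size_t i = 0; i < counts.size(); i++) p[i] = counts[i] / total;
        return p;
    }

    // escaping : P = (1-E)*P0 + E*(P1 - 0)
    std::vector<double> blend_escape(const std::vector<int> & counts0,
                                     const std::vector<int> & counts1,
                                     double E)
    {
        std::vector<double> P0 = to_probs(counts0);

        // exclusion : remove from the parent any symbol seen in context 0,
        // then renormalize what's left
        std::vector<int> excl = counts1;
        for (size_t i = 0; i < excl.size(); i++)
            if (counts0[i] > 0) excl[i] = 0;
        std::vector<double> P1x = to_probs(excl);

        std::vector<double> P(P0.size());
        for (size_t i = 0; i < P.size(); i++)
            P[i] = (1 - E) * P0[i] + E * P1x[i];
        return P;
    }

    // weighting : P = (1-W)*P0 + W*P1 , no exclusion
    std::vector<double> blend_weight(const std::vector<int> & counts0,
                                     const std::vector<int> & counts1,
                                     double W)
    {
        std::vector<double> P0 = to_probs(counts0);
        std::vector<double> P1 = to_probs(counts1);
        std::vector<double> P(P0.size());
        for (size_t i = 0; i < P.size(); i++)
            P[i] = (1 - W) * P0[i] + W * P1[i];
        return P;
    }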

So the only difference is the exclusion. The probabilities of non-shared symbols are the same, but the probabilities of symbols that occur in both contexts are different. Specifically, the flaw with escaping is probably that it gives too low a weight to symbols that occur in both contexts, since they only ever get a contribution from the deeper context. More generally you should probably be considering something like :

    I = intersection of contexts 0 and 1

    P = a * (P0 - I) + b * (P1 - I) + c * PI

that is, some weighting for the unique parts of each context and some weighting for the intersection.
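
Here's a rough sketch of that three-way split, reusing to_probs() from the sketch above. Note that a, b, c should sum to one, and how exactly to form PI from the two contexts is left open; summing the two contexts' counts over the intersection is just one guess of mine :

    // rough sketch of P = a*(P0 - I) + b*(P1 - I) + c*PI ;
    // reuses to_probs() from the earlier sketch
    std::vector<double> blend_intersect(const std::vector<int> & counts0,
                                        const std::vector<int> & counts1,
                                        double a, double b, double c)
    {
        size_t n = counts0.size();
        std::vector<int> u0(n, 0), u1(n, 0), both(n, 0);
        for (size_t i = 0; i < n; i++)
        {
            if (counts0[i] > 0 && counts1[i] > 0)
                both[i] = counts0[i] + counts1[i]; // PI source : just a guess
            else
            {
                u0[i] = counts0[i]; // unique to context 0
                u1[i] = counts1[i]; // unique to context 1
            }
        }
        std::vector<double> P0x = to_probs(u0);  // P0 - I
        std::vector<double> P1x = to_probs(u1);  // P1 - I
        std::vector<double> PI  = to_probs(both);

        std::vector<double> P(n);
        for (size_t i = 0; i < n; i++)
            P[i] = a * P0x[i] + b * P1x[i] + c * PI[i];
        return P;
    }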

Note that PPMii is sort of trying to compensate for this because when it creates context 0, it seeds it with counts from context 1 in the overlap.

Which obviously raises the question : rather than the context initialization of PPMii, why not just look in your parent and take some of the counts for shared symbols?
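
That is, something like this at coding time, again reusing to_probs(); the "borrow" fraction here is a hypothetical tuning knob, not anything from PPMii :

    // at coding time, borrow some parent counts for symbols that occur in
    // both contexts, instead of seeding at context creation like PPMii
    std::vector<double> probs_with_parent_borrow(const std::vector<int> & child,
                                                 const std::vector<int> & parent,
                                                 double borrow) // e.g. 0.5
    {
        std::vector<int> adj = child;
        for (size_t i = 0; i < adj.size(); i++)
            if (child[i] > 0 && parent[i] > 0) // shared symbol
                adj[i] += (int)(borrow * parent[i]);
        return to_probs(adj);
    }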

(note that there are practical issues about how you compute your weight amount or escape probability, and how exactly you mix, but those methods could be applied to either PPM or CW, so they aren't a fundamental difference).

2 comments:

Matt Mahoney said...

Actually the context mixing in PAQ8 is done in the logarithmic domain.

P = squash(W0 * stretch(P0) + W1 * stretch(P1))

where

stretch(P) = log(P/(1-P))

squash(x) = 1/(1 + exp(-x)) = stretch^-1(x)

This has the effect of favoring high confidence predictions near 0 or 1. So instead of combining 0.9 with 0.999 to get 0.95, you get something like 0.99 instead.
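
A quick numeric check of that, with equal weights W0 = W1 = 0.5 picked just to reproduce the example :

    #include <cmath>
    #include <cstdio>

    double stretch(double p) { return std::log(p / (1 - p)); }
    double squash(double x)  { return 1 / (1 + std::exp(-x)); }

    int main()
    {
        double mixed = squash(0.5 * stretch(0.9) + 0.5 * stretch(0.999));
        std::printf("%.4f\n", mixed); // 0.9896 , vs 0.9495 for a linear mix
        return 0;
    }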

cbloom said...

Yeah, I thought someone might point that out, but I didn't mention it because you can apply arbitrary warps to the probabilities in the escaping case as well. While it may be important in practice for CM to do well, it's not a fundamental difference, since the same trick could be applied back to PPM-like methods. In fact, distorting the probability before escaping is a classic thing to do, for example exaggerating deterministic probabilities.

What you never get in the escaping case is contribution to the probabilities of the characters in the overlapping set, which seems to me to be the biggest difference.

The other huge difference is one of convention :

In the mixing case you are effectively making a mixing weight for each context which depends on all the contexts (after you normalize the weights to sum to one). That is, the final normalized weight for c0 depends on all the other contexts.

In the escaping case, the normal operation is to compute w0 only from context 0, ignoring all other contexts. This obviously is a big difference.
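
To sketch that difference : a PPMC-style escape is estimated from the context's own stats alone, while mixing weights get normalized across all the contexts :

    #include <vector>

    // escaping : weight for context 0 comes from context 0's stats alone
    // (PPMC-style estimator : escape = #distinct / (total + #distinct))
    double escape_prob(int num_distinct, int total_count)
    {
        return (double)num_distinct / (total_count + num_distinct);
    }

    // mixing : after normalization, each final weight depends on all contexts
    void normalize_weights(std::vector<double> & w)
    {
        double sum = 0;
        for (double x : w) sum += x;
        for (double & x : w) x /= sum;
    }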

Also, in escaping you have to have a clear order of "I prefer this context, then this one, then this one", while in mixing you can have various contexts which are equally important.
