I got the new Firefox (2.something) and WTF it still blows chunks. Pages take much longer to load & format than in IE, and they pop all over in evil ways during formatting. I had to get "SetBrowser" and "FireTune" just to get it working right. I wish I could fucking use it, because I want the better security and ad blocking and such, but I'm not sure I can handle the craptacularity. Also, you can't drag & drop URLs between IE and Firefox, which makes a hybrid browsing approach impossible. Blah blah, I know about the initial paint delay.
12/31/2006
12/30/2006
123006  2
123006  1
12/29/2006
122906  1
12/25/2006
122506  1
I'm not around any really little kids this year, which sucks.
12/22/2006
122206  3
122206  2
122206  1
Meh, most of my logic is flawed here. I'm a condescending ignorant ass.
12/21/2006
122106  3
Say you have some function F(A,B) where A and B are vectors. You wish to solve F(A,B) = 0 (or perhaps minimize an error E(A,B); in that case take the derivative to get F's of this form). Do some algebra and substitutions until you have two separate functions, each of which depends only on the other variable :
A = G(B)
B = H(A)
Where G and H may be arbitrary nonlinear functions, so you can't just plug one into the other and solve directly. If B were the right answer, then A = G(B) would give you the exact solution for A, and vice versa.
The hacky solution is basically just to set A from G, then B from H, and repeat. There are some simple hacks to make it better.
Imagine the solutions converging as a time series, so we have A_t and B_t. It's important to have a pretty good initial guess for A_0 and B_0 (actually just for one or the other), because this method doesn't work if you're very far from the right answer.
The next hacky trick is to take less than the full step for the first few steps. A full step would be :
A_t_1 = G(B_t)
B_t_1 = H(A_t)
( A_t_1 should be read as A_(t+1) ; it's A at the next time step )
Instead take a fractional step of size f (f in (0,1]) :
A_t_1 = (1-f) * A_t + f * G(B_t)
B_t_1 = (1-f) * B_t + f * H(A_t)
This avoids oscillations and helps you settle in. You can start with f = 0.5 and quickly step it up : 0.6, 0.7, etc. For some problems you might want "f" to max out at 0.99 or something rather than 1.0
The next hack which helps a lot is to use A_t_1 instead of A_t when solving for B_t_1 :
A_t_1 = G(B_t)
B_t_1 = H(A_t_1)
This sort of lets you take a more exact step. If you want you can swap the order of A & B on every other step, or do some sort of fancier thing like :
A_int = 0.5 * A_t + 0.5 * G(B_t)
B_t_1 = H(A_int)
A_t_1 = G(B_t_1)
which is sort of like a centered-time integrator or something, though I have no idea if that's actually better. Actually this whole thing sort of reminds me of implicit integrators.
Anyway, in practice this type of iteration converges incredibly fast. In problems where I've used Newton's method or Conjugate Gradient or something and it's taken like 500 iterations to converge, this iteration took 20 steps. Obviously it only works on certain types of problems and they have to have well behaved local minima, but when it works it works great.
To be a bit more concrete I'll give you a specific example. I've been examining the "weighted SVD", and one step of that is the approximation of a matrix by a rank-1 tensor (that's just an outer product of two vectors). The problem is :
Given matrix A_ij
and weights w_ij
Find vectors U_i and V_j such that the error
E = Sum_ij { w_ij * (A_ij - U_i*V_j)^2 }
is minimized
If you take derivatives of E with respect to U and V you can solve for two equations :
U_i = Sum_j { A_ij * w_ij * V_j } / Sum_j { w_ij * (V_j^2) } = H(V)
V_j = Sum_i { A_ij * w_ij * U_i } / Sum_i { w_ij * (U_i^2) } = G(U)
Directly solving these is impossible (I think), but they're exactly in the form I need for my hacky iteration, and in fact it works super well.
I guess the reason no one likes this approach is that on general problems it can very badly fail to converge.
Addendum : this approach is also great for equations in one variable by substitution. For example say you have some evil equation to solve like :
3 + 7*x + sin(x) + sin(x)^2 = 0
Just set y = sin(x) and pretend you have equations in two variables. Now :
x = (-3 - y - y*y)/7
y = sin(x)
are the two equations, and you just use the alternating stepping approach. The half-step iteration I described here converges to within FLT_EPSILON in 5 steps !!
122106  2
122106  1
I hear a lot of women say that they admire or respect Condi Rice for her achievements and for the fact that she's a woman with a top role in the government. That sentiment disgusts me. I find her of the same cloth as Colin Powell, who is also despicable. Someone who is clearly intelligent and rational is even more accountable for their actions, and by making themselves totally subservient to a madman they have given up all their integrity to further their careers. They know that they're lying, they know they're trying to cover up gross misdeeds and incompetence, they know their actions are leading to the death of innocents; sometimes you can even see it on their faces in an interview, where they take a tough question and almost wince before they shake it off and spout the awkward lies again. Someone like Rumsfeld I can almost excuse, because he's clearly just loony tunes or drunk on his own Kool-Aid.
12/19/2006
121906  1
First of all, generally the output should be "linear" in A and B. That is, sqrt(A*B) is good but A*B is not, because it gives you two powers of A or B. Think like physics and assume these things have units : the output should have the same units as the inputs.
Secondly, we must be aware of scales. If A & B are scaled in exactly the same way, so their numeric values have the same significance, then (A+B) is a good combiner. If they are not scaled the same, then any form which adds A & B is not okay. In that case you only really have one option :
sqrt(A*B) * G(A/B)
Often A & B have the same significance, which means the output should be symmetric under swapping A <-> B. In that case the G function has limited forms. I haven't thought about exactly what they can be, but they have to be things like (A/B) + (B/A). In fact if A & B aren't on a similar scale even that form is not okay.
If we assume A & B are on the same scale, then additive forms are okay and it opens up some other options. (A+B)/2 is obviously your first guess.
2AB / (A+B) is an interesting combiner. If A & B are in [0,1] then this takes (0,x) -> 0 , (.5,.5) -> .5 and (1,1) -> 1, and it sort of penalizes when they're not equal. It takes (x,x) -> x, which is a nice property for any combiner that's trying to stand in for (A+B)/2.
sqrt(A^2 + B^2) is another, and then you can take simple multiples of these, which gives you forms like (A^2 + B^2)/(2AB).
Anyway the point is that there really aren't very many forms to choose from which satisfy the basic properties and you can easily try them all.
12/16/2006
121606  3
Say you have two experts. They make predictions which have a Gaussian error. Each expert provides both the prediction (the center of the Gaussian) and his best estimate of the accuracy of the prediction (the sdev of the Gaussian, which is the sqrt of the variance); call these P & S, so you have {P1,S1} and {P2,S2}.
We're going to make a combined prediction by guessing a weight for each expert (here W1 and W2). The combined prediction is then :
Pred = (W1 * P1 + W2 * P2) / (W1 + W2)
(or) Pred = P1 + (W2 / (W1+W2)) * (P2 - P1)
Pred = P1 + F * (P2 - P1)
F = 1 / (1 + W1/W2)
This latter form is more useful in practice because it allows you to apply arbitrary scaling to each term so you can run them through an lsqr or nnet or whatever. (assuming W1 and W2 never go to zero; if they do then just choose the other pred)
The ideal combination of experts in general is a context-dependent problem, but a general good way to weight them is via the entropy of the prediction. If you knew the instantaneous entropy of each prediction, H, the ideal weight would be :
W = e^( - Beta * H )
for some Beta which in general depends on the problem. In practice, Beta = 2 is usually very good.
The entropy of a Gaussian can be analytically integrated. The answer is :
H(Gauss) = (1/2) * ln(2*pi*e) + ln(S)
Where S = the sdev (sigma), recall. Since we assumed our predictors were Gaussian we can use this. But we can simplify :
W = e^( - Beta * H ) = C * e^( - Beta * ln(S) ) = C * ( S^(-Beta) )
Where we put a bunch of constants in C, and they don't matter because it's an overall scaling of W which divides out.
F = 1 / (1 + W1/W2)
F = 1 / ( 1 + (S1/S2)^(-Beta) )
F = 1 / ( 1 + (S2/S1)^Beta )
If we use V = S^2 (V is the "variance" of the Gaussian, or the MSE), and choose the common case of Beta=2, then we have a very simple expression :
F = 1 / ( 1 + V2/V1 )
or F = V1 / (V1 + V2)
and this is the same as W1 = V2 and W2 = V1
Which is in fact a very good way to combine Gaussian predictions.
121606  2
121606  1
12/15/2006
121506  2
Related to my perpetual search for an improved ranting method : the new Google Blogger (now in beta) will have an SDK for code access, so I imagine it will be easy to write an app to throw raw text at it via Blogger GData.
121506  1
12/13/2006
121306  1
1. The FANN Neural Net library is actually pretty decent. It's missing a lot of functions you would want, but it's reasonably easy to write them yourself.
2. SVDLIBC is a nice little implementation of the old "Lanczos" SVD algorithm which is fast and good. It comes from the old FORTRAN math goodness. One amazing little routine I found in there is "Machar" which will deduce the properties of your FPU by doing a few simple math operations!
3. If you search Google for things like "SVD learning unknown missing" you will find about 100 papers; this is a major field of research, a lot of it driven by computer vision.
4. Simon's spilled the beans on his Netflix approach. His description of the GHA (Generalized Hebbian Algorithm) is way, way clearer than the paper. The GHA is actually a type of neural network, and I find this approach much easier to understand if you just think of it as doing a gradient descent to optimize the SVD vectors. I also found that Gorrell has released source code for her own GHA Lab.
12/12/2006
121206  1
12/10/2006
121006  1
12/09/2006
120906  1
12/06/2006
120606  1
12/03/2006
120306  1
A lot of people in Sports Betting use the motivation of the two teams as a basis for choosing the side to bet. The theory is that the team that needs a win is more likely to play to full effort and beat the spread. I've always been skeptical of these theories and thought I'd try to statistically examine how important this theory actually is.
I examined all the NFL games from 1999 to 2006. For each game I made a rough evaluation of which of the teams "needed a win". Only games in weeks 8-14 were examined, which is the heart of the playoff run. Teams that "need a win" were any teams with a record of >= 50% wins but worse than 8-2. Generally these are the teams on the bubble for their division or a wildcard spot. Note that this isn't a totally precise measure of "need a win", but it should be good enough to see some statistical correlation if any exists.
Here are the results. There are four categories :
cat 0 = both teams don't need a win
cat 1 = road team needs a win
cat 2 = home team needs a win
cat 3 = both teams need a win

The results are (home wins)-(home losses)-(pushes) ; ATS = Against The Spread

32 teams, 1992 games
cat 0 WLP : 827-578-0
cat 0 ATS : 678-683-44
cat 1 WLP : 84-125-0
cat 1 ATS : 103-100-6
cat 2 WLP : 138-65-0
cat 2 ATS : 94-102-7
cat 3 WLP : 105-69-1
cat 3 ATS : 91-76-8
In particular look at categories 1 and 2. The team that needs a win is 263-149 vs. teams that don't need a win. However, they are only 194-205 ATS !!
Now, is that W-L record even significant? The fact is, most of the time when you have a "need a win" vs. "don't need a win" matchup it's a pretty good team against a very bad team, so it's no surprise that they win a lot more.
For comparison I did the same study, but just grouped teams with a record >= 50% against teams with a record < 50%, i.e. winning teams vs. losing teams.
32 teams, 1992 games
cat 0 WLP : 219-136-0
cat 0 ATS : 177-169-9
cat 1 WLP : 186-268-0
cat 1 ATS : 224-213-17
cat 2 WLP : 328-111-0
cat 2 ATS : 216-211-12
cat 3 WLP : 421-322-1
cat 3 ATS : 349-368-27
Winning teams are 596-297 against losing teams, which is actually a better ratio than "need a win" vs. "don't need a win".
It's possible that a better measure of "need a win" would reveal a slight statistical significance, but the absence of any here indicates the effect is not strong at all.
12/02/2006
120206  1
Also Radiohead Giglink Dump is full of great live shows. Check out Thom Yorke's solo at the Bridge School.
12/01/2006
120106  2
I couldn't find this info easily anywhere on the web, so here you go. Lossless color transforms :
Color transform notes :

JPEG-LS (LOCO-I) uses :

{RGB} -> { (R-G), G, (B-G) }

and the inverse is obvious. BTW this is very nice in that you can go 8-bit -> 8-bit using modulo arithmetic; many of the others require 9-bit YUV coefficients.

J2K (aka RCT aka CREW) uses :

{RGB} ->

Y = (R + 2G + B)/4
U = R - G
V = B - G

G = Y - (U+V)/4
R = U + G
B = V + G

(the divides are floored)

YCoCg is : (similar to "RCT" but decorrelates better)

lossy :

Y  = (R + 2G + B)/4
Co = (R - B)/2
Cg = (2G - R - B)/4

G = Y + Cg
R = Y - Cg + Co
B = Y - Cg - Co

lossless : (Co and Cg are effectively scaled up by 2)

Co = R - B
t  = B + (Co/2)
Cg = G - t
Y  = t + (Cg/2)

s = Y - (Cg/2)
G = Cg + s
B = s - (Co/2)
R = B + Co
In general with the lossless YUV-type transforms, the chromatic coefficients are roughly 2X bigger than they should be in a perceptual space, in order to have enough bits to get back to RGB exactly.
Note that YUV was created for perceptual uniformity, NOT for decorrelation, so it's no surprise that you can beat it at decorrelation. In fact, doing the 3x3 KLT and sending the coefficients is totally worth it if you're doing compression. (Actually YUV was created for some weird use in televisions.)
120106  1
NOTE ON COLOR CONVERSIONS :

int colors are in [0,255]
int pel "x" is considered to represent the span {x-0.5,x+0.5}/255
float colors are in [0,1]

int 0   -> float 0
int 255 -> float 1

Note that this way is nice in that it keeps the maximums the same, but it's bad in that it wastes color space. That is, there's only 1/255 precision instead of the 1/256 which is totally possible. Another way of seeing it is that the int range [0,255] is actually being mapped to the float range { -0.5/255 , 255.5/255 }, that is, {0,1} is not the full range. One nice thing about this though is that it gives you some slop in your float -> int : as long as the float is in { -0.5/255 , 255.5/255 } you don't need to clamp.

/*
// more pure quantizer way :
// int value [i] represents the bucket (i,i+1)/256
inline int   ColorFTOI(float f) { return ftoi(f * 256.f); }
inline float ColorITOF(int i)   { return (i + 0.5f) * (1.f/256.f); }
*/

// way that maps 0 <-> 0 and 255 <-> 1.0
inline int   ColorFTOI(float f) { return ftoi(f * 255.f + 0.5f); }
inline float ColorITOF(int i)   { return i * (1.f/255.f); }