My aim is to compare the results of two randomized algorithms to prove that one does better than the other. Normally you'd use a hypothesis test like the Mann Whitney U or T Test. These tests have a null hypothesis that both sets of results are from the same distribution, in this context this means neither is higher or lower. These tests produce a test statistic and if this test statistic is below a threshold you set, e.g. 0.05 then you can reject the null hypothesis and state that the results are different with 95% confidence (1- 0.05 = 0.95 = 95%).

As I'm trying to produce a framework which enables minimal runs to be carried out when testing on expensive problems, I was told that it is possible to create a system like this:

1) You run both algorithms n times, where n is small and test for a significant difference.

2) If there is a significant difference you stop testing, if there isn't you run another test and carry out your hypothesis test on your now n+1 samples.

3) Repeat step 2 until a reasonable maximum. So that if you only need a small number of runs to obtain significance you will only run that small number and no more.

You can't just do this and run multiple Mann Whitney tests, for example, as the probability of getting a significant result at some point just by chance approaches 1. But I was told about one solution: Sequential sampling techniques of which the only variety that I could find much info on is the Sequential Probability Ratio Test.

This is introduced here:

http://www.stats.ox.ac.uk/~steffen/teaching/bs2siMT04/si14c.pdf

So you obtain a log likelihood ratio L from your data. If L is below A then you reject your null hypothesis, if it is above B then you accept your null hypothesis and stop testing. If it is between the two then you run more tests and carry on. A is calculated from your threshold, e.g. 0.05 and B is calculated from a similar threshold for accepting the null hypothesis.

The question is how to calculate the log likelihood ratio. It seems to be the case that - given a set of samples X, a null hypothesis θ_0 and an alternative hypothesis θ_1 that it is simply

log (probability of X occurring given θ_1 = p(X | θ_1) ) / p(X | θ_0).

My question is: How do I calculate this likelihood for my problem? Is it possible or am I barking up the wrong tree altogether? I have data from two algorithms - is X in my case the data for both algorithms combined? Is it the treatment algorithm? Is it the control? p(X | θ_0) in my case should be the probability of our results given that both datasets are the same but how is this calculated? Probability of treatment results occurring given the distribution calculated from the control? Probability of treatment and control results together occurring given we know, for example, that our data follows a normal distribution? Most confusing though is p(X | θ_1) as a "difference of some sort" is not really a solid statement that can give us a distribution to use to compute likelihood - do you need to decide on a threshold for the difference? I should stress I'm working with the assumption that you know the rough distribution of your data. I know this isn't non parametric like Mann Whitney U but is more parametric like T Test.

I'm beginning to believe I was told this was a solution to my problem in error and I'm trying to solve the wrong problem with the wrong technique so if anyone knows this is the case or can think of an alternative then please tell me. Any advice of any sort would be welcome - I've spent ours googling for some sort of solution.

Many thanks, sorry for the long confused post,

Geoff