The Scientific Approach to Measuring Startup Progress (Part 1/2)

Let us recall the toy example from our previous post:
We want to find out if introducing a new feature in our product or web page will improve a given metric, like the number of clicks on a certain button.
Let's call N1 and N2 the number of button clicks per unit of time before and after deploying the new feature. The naive approach to measuring progress in this example is to directly compare m1 and m2, the number of clicks measured during periods of equal length T before and after deploying the new feature. We argued in our previous post that this approach is essentially wrong, since it ignores the fact that we are measuring quantities subject to randomness.

The fallacy of the naive approach is to assume that m1 and m2 are equal to N1 and N2, respectively. Fortunately for us, Probability Theory teaches us exactly what to do in this situation. Variables N1 and N2 should be treated as random variables (think of two dice) and m1 and m2 as their samples (think of the outcome of throwing each die once). We have to be careful before concluding that N1 > N2 just because m1 > m2, as the next section explains.
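To see this concretely, here is a minimal Python sketch (not part of the original argument) that simulates the dice analogy: two Poisson "click counters" with exactly the same true rate. The rate of 50 clicks per period is made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng()

# Two observation periods with the SAME true rate: 50 clicks each
# (a hypothetical number chosen only for this example).
true_rate = 50
m1 = rng.poisson(true_rate)  # single measurement "before" the release
m2 = rng.poisson(true_rate)  # single measurement "after" the release
print(f"m1 = {m1}, m2 = {m2}")

# How often does pure chance make the "after" period look better?
samples1 = rng.poisson(true_rate, size=100_000)
samples2 = rng.poisson(true_rate, size=100_000)
print("P(m2 > m1) when nothing changed:", np.mean(samples2 > samples1))
```

Run it a few times: even though the expected values are identical, m2 beats m1 in roughly half of the runs, so a raw comparison of two single measurements tells you very little.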



The Scientific Approach to Measuring Startup Progress

 The Lean Startup is about applying the scientific method to startup management. So, let's do science and acknowledge the random nature underlying the number of clicks registered before and after deploying the new feature. This will allow us to say something awesome like this:
The improvement registered in our test metric after deploying the new feature is statistically significant at a significance level of 5%.
To get to this sentence we have to set up a statistical hypothesis test, which is a tool for making decisions when your data is subject to randomness. In statistical hypothesis testing you state a default assumption about the outcome, the so-called null hypothesis. The test then accepts or rejects that null hypothesis based on your data.

Statistical hypothesis tests can fail. They can accept your null hypothesis when they should reject it, or they can reject it when they should accept it. In statistics, tests are designed to favor your null hypothesis, and incorrectly rejecting it is considered to be the worst mistake ever. The probability of making this mistake is what is called the significance level.

Expect the best but plan for the worst: In our toy example we define our null hypothesis like this:
E[N2] ≤ E[N1], where E[N1] and E[N2] are the average number of button clicks per unit of time before and after deploying the new feature, respectively.
Note that E[N1] and E[N2] are unknown, all we can do is obtain estimates of these values from samples.
So, our hypothesis by default is that after deploying the new feature we are doing the same or worse than before. If the null hypothesis is rejected and we conclude that the new feature rocks, the significance level is the probability of this conclusion being wrong. A significance level of 5% is stringent enough in most practical applications, including startup progress measurement. We'll use this value from now on.

Once we have defined our null hypothesis we proceed with the experiment and perform the test:
  1. Divide and Conquer
    We divide the observation period T1 before releasing the new feature into n1 subintervals of length t1=T1/n1. For instance, if T1 is a week, we could divide it into 7 subintervals of one day each. We do the same with T2, the observation period after releasing the new feature, and obtain n2 subintervals of length t2=T2/n2. This step is not strictly mandatory: you can set n1=n2=1 and forget about dividing T1 and T2 into subintervals. However, the larger n1 and n2 are, the better the quality of your decisions, so I recommend that you divide and conquer.
  2. Measure:
    We measure our metric during each one of these subintervals. In our example, we get n1 samples of N1 and n2 samples of N2. Let k1 and k2 be the sum of the values of all n1 and n2 samples, respectively.
  3. Model
    We choose an appropriate random variable to model the target metric. In the vast majority of cases the startup metric will be discrete, like the number of clicks or the number of conversions of a goal. This means we can use a discrete random variable. I suggest using a Poisson distributed random variable, since, like most startup metrics, it can take values from 0 to infinity, it is easy to manipulate, and it is usually the distribution of choice when modeling human-related behavior (read for instance this).
  4. Test
    Apply the statistical hypothesis test for the chosen random variable and accept or reject the null hypothesis. For the Poisson random variable the well-known conditional test rejects our null hypothesis if
    P(X ≥ k2) ≤ 0.05, where X is a binomial random variable with k1 + k2 trials and success probability p = n2·t2 / (n1·t1 + n2·t2).
    In words: if the new feature changed nothing, each of the k1 + k2 recorded clicks is equally likely to fall anywhere in the total observation time, so we reject the null hypothesis when period 2's share of the clicks, k2, is improbably large. A code sketch of this test follows right after this list.
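If you prefer code to equations, here is a minimal Python sketch of the conditional test described in step 4. It assumes SciPy is available and that your metric is reasonably Poisson-like (step 3); the function name and signature are our own choices for this post, not an established API.

```python
from scipy.stats import binom

def conditional_poisson_test(k1, n1, t1, k2, n2, t2, alpha=0.05):
    """One-sided conditional test for H0: E[N2] <= E[N1].

    k1, k2 -- total counts summed over the n1 and n2 subintervals
    t1, t2 -- subinterval lengths, expressed in the same time unit
    Returns (reject_null, p_value).
    """
    # Under the null hypothesis (equal click rates), conditioned on the
    # total count k1 + k2, the count k2 is binomially distributed with
    # success probability equal to period 2's share of the observation time.
    p = (n2 * t2) / (n1 * t1 + n2 * t2)
    # p-value: probability of seeing k2 or more of the clicks in period 2
    # when the new feature changed nothing.
    p_value = binom.sf(k2 - 1, k1 + k2, p)
    return p_value <= alpha, p_value
```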

If the test rejects our null hypothesis we are finally entitled to say:
I know, with a 5% probability of being wrong (my significance level), that after deploying the new feature we are doing better on our test metric.
We have prepared a spreadsheet for you to run this test, in case you don't want to mess around with the equations. It's the one we are using to monitor the progress of browseye.
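For illustration only, here is how you could run the sketch above on one week of daily click counts before and after a release. The numbers are invented, not real browseye data.

```python
# Hypothetical example: 7 daily click counts before and after the release.
before = [41, 38, 45, 50, 39, 44, 43]   # made-up numbers
after  = [52, 47, 55, 49, 58, 51, 50]   # made-up numbers

reject, p_value = conditional_poisson_test(
    k1=sum(before), n1=7, t1=1,   # t1 = 1 day per subinterval
    k2=sum(after),  n2=7, t2=1,
)
print(f"p-value = {p_value:.4f}, reject the null hypothesis at 5%: {reject}")
```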


Don't Miss Part 2 of this Post

In our next post we'll discuss important things to take into account when applying the scientific approach to measuring startup progress. More specifically, we'll tell you how to:
  • interpret the test results
  • define your metrics
  • know when you should be using the scientific approach or the naive approach
  • find out how much better (or worse) you are doing
