The Scientific Approach to Measuring Startup Progress (Part 2/2)

Summary of the Problem

Let us recall the toy example from our previous post:
We want to find out if introducing a new feature in our product or web page will improve a given metric, like the number of clicks on a certain button.
  • N1 and N2 are the number of button clicks per unit of time before and after deploying the new feature, respectively
  • n1 and n2 are the number of metric measurement samples before and after deploying the new feature, respectively
  • k1 and k2 are the sum of the values of all n1 and n2 samples, respectively
  • The null hypothesis of our experiment is E[N2] ≤ E[N1], where E[N1] and E[N2] are the average values of N1 and N2, respectively
  • a is the significance level, which is the probability of rejecting the null hypothesis when it is actually true. We assume that a=0.05

Interpreting Results

Let's assume that we run the test on real data following the process described in our previous post.

If the null hypothesis is rejected, we can conclude that on average we have more button clicks after deploying the new feature than before. The risk of reaching this conclusion by mistake is low: if the new feature had brought no improvement, the test would reject the null hypothesis at most 5% of the time (since a=0.05). We can safely congratulate our team and ourselves; the new feature rocks.

If the null hypothesis is not rejected, we could be tempted to increase the significance level and run the test again. We'd then say something like:
We know with a significance level of 35% that we are doing better
But that's the equivalent of saying:
We are not very sure that we are doing better
So I strongly recommend that you stick to the 5% significance level and leave that number alone. If the test does not reject your null hypothesis, accept the news.

The Independence Assumption

You should always define metrics based on the behavior of new users. For instance, in our case our metric would be the number of clicks on a business-sensitive button made by new users joining our app during the observation period. Only then can we model our metric with a random variable. Otherwise, dependencies among subintervals arise and the metric has to be modeled with a stochastic process. In this case, further assumptions need to be made about the nature of the dependencies and the statistical test becomes more involved and/or analytically intractable.

When should I use the Scientific Approach?

In practice, the naive approach will work fairly well only when you have many users generating events relevant to the target metric. When this is not the case, the naive approach can lead to wrong conclusions without giving you any clue about how wrong they are. The situation is particularly delicate for young startups, which have to make important decisions based on the evidence left by a small number of users.
We briefly explain the mathematical reasoning supporting the above conclusions. The number of times a user clicks on the button during an interval of time T can be modeled with a random variable. Therefore, the number of button clicks during T is equal to the sum of u independent and identically distributed random variables, where u is the number of users. According to the Classical Central Limit Theorem, this sum, suitably normalized, converges to a Normal distribution. As the number u of users increases, the spread of the sum relative to its mean shrinks: the variance of the per-user average decreases by a linear factor of 1/u, so the standard deviation of the total divided by its mean decreases as 1/√u. Thus, when u is large enough the number of button clicks resembles a deterministic (i.e., non-random) variable and a single sample m1 is a good representative of its mean. In this case comparing N1 and N2 is almost equivalent to comparing two of their samples m1 and m2.
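To make this argument concrete, here is a minimal Python sketch. It only uses the standard facts that the mean and the variance of a sum of u i.i.d. variables are both u times the per-user values, so the standard deviation of the total divided by its mean falls as 1/√u (equivalently, the variance of the per-user average falls as 1/u). The function name and the per-user figures (mean 0.5, variance 0.5) are hypothetical and chosen only for illustration.

```python
import math


def relative_spread(u, mean_per_user, var_per_user):
    """Standard deviation of the total click count divided by its mean,
    for u independent users sharing the same per-user mean and variance."""
    total_mean = u * mean_per_user           # the mean of a sum adds up
    total_std = math.sqrt(u * var_per_user)  # the variance of an i.i.d. sum adds up
    return total_std / total_mean


# Hypothetical per-user behaviour: 0.5 clicks per interval T on average,
# with variance 0.5 (what a Poisson-distributed count would have).
for u in (10, 100, 10_000, 1_000_000):
    print(u, relative_spread(u, 0.5, 0.5))
```

With a handful of users the total fluctuates by a large fraction of its mean, so a single sample says little; with a very large u the fluctuation becomes negligible, which is the regime where the naive comparison of m1 and m2 is close to comparing E[N1] and E[N2].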
For young startups my advice is to always use the Scientific Approach. I know that using the Scientific Approach requires an extra effort, but without it you simply can't trust your interpretation of the data. And if you can't trust your interpretation of the data, you can't make the right decisions that will take you to the top.

Startups with many users can eventually fall back to the Naive Approach. If you bothered to divide the observation period into subintervals and take several measurements (i.e., samples), you can use the maximum likelihood estimates (MLE) of the means E[N1] and E[N2] as representatives for N1 and N2, respectively. For the Poisson distribution, the MLEs of E[N1] and E[N2] are simply k1 / n1 and k2 / n2, respectively. This is slightly better than the pure Naive Approach of comparing just two samples of N1 and N2. Still, you'll never be sure that the conclusions you get by using this trick are right until you run the test with the Scientific Approach. So, what's the point of taking the risk? If you have that many users, you probably have the resources needed to do things properly. I would therefore suggest using the Scientific Approach here as well.
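As a small illustration of the estimates k1 / n1 and k2 / n2, the following Python sketch computes them from two lists of per-subinterval click counts. The function name and the sample values are made up for the example; only the formula k / n comes from the text above.

```python
def poisson_mle_mean(samples):
    """MLE of the mean of a Poisson variable: k / n, where n is the
    number of samples and k is the sum of their values."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is needed")
    k = sum(samples)
    return k / n


# Hypothetical click counts per subinterval, before and after the new feature.
clicks_before = [3, 5, 4, 6, 2]      # n1 = 5, k1 = 20
clicks_after = [6, 7, 5, 8, 9, 7]    # n2 = 6, k2 = 42

print(poisson_mle_mean(clicks_before))
print(poisson_mle_mean(clicks_after))
```

Comparing these two numbers directly is still the Naive Approach: it says nothing about how likely such a difference is to appear by chance.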

How much better are we doing?

This section is for those who are not satisfied with knowing that they are doing better, but want to know exactly how much better they are doing. The idea is to generalize the null hypothesis like this:
E[N2] / E[N1] ≤ c, for c ≥ 0.
Notice that with c=1 we are back to the question of whether we are doing better with the new feature. But other values of c tell us exactly how much better (c>1) or worse (c<1) we are doing now compared to when we didn't have the new feature.

In this case, the Divide and Conquer, Measure and Model steps are the same as before, but the Test step depends on the chosen value of c and is slightly different:

Always keeping a=0.05, the way to proceed now may be something like this. Perform the test for c=1. If the null hypothesis is rejected, we know that we are doing better after deploying the new feature. Repeat the test for c=2. If the null hypothesis is rejected, we know that we are doing more than twice as well as before. In general, if the test rejects the null hypothesis for c, we know that we are doing more than c times as well as before. The problem then reduces to finding the largest c for which the null hypothesis is rejected.
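The search for the largest c can be sketched in Python as follows. This is not necessarily the test used in the spreadsheet or in the previous post: as a stand-in it uses the standard conditional binomial test for the ratio of two Poisson means, which assumes that N1 and N2 are Poisson distributed. Under that assumption, and given the total k1 + k2, the count k2 follows a Binomial distribution with success probability n2·c / (n1 + n2·c) when E[N2] / E[N1] = c. All function names, the search bound c_max and the sample figures are hypothetical.

```python
import math


def binom_upper_tail(k, x, p):
    """P(X >= x) for X ~ Binomial(k, p), with terms computed in log space."""
    if x <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(x, k + 1):
        log_term = (math.lgamma(k + 1) - math.lgamma(i + 1)
                    - math.lgamma(k - i + 1)
                    + i * math.log(p) + (k - i) * math.log(1.0 - p))
        total += math.exp(log_term)
    return min(total, 1.0)


def ratio_test_p_value(k1, n1, k2, n2, c):
    """One-sided p-value for the null hypothesis E[N2] / E[N1] <= c
    (conditional binomial test, Poisson assumption)."""
    p0 = n2 * c / (n1 + n2 * c)
    return binom_upper_tail(k1 + k2, k2, p0)


def largest_rejected_c(k1, n1, k2, n2, a=0.05, c_max=1000.0, tol=1e-6):
    """Bisect for the largest c whose null hypothesis is rejected at level a.
    Returns None if no c >= 0 is rejected, and c_max if every c up to
    c_max is rejected. Relies on the p-value growing with c."""
    if ratio_test_p_value(k1, n1, k2, n2, 0.0) >= a:
        return None
    if ratio_test_p_value(k1, n1, k2, n2, c_max) < a:
        return c_max
    lo, hi = 0.0, c_max          # rejected at lo, not rejected at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if ratio_test_p_value(k1, n1, k2, n2, mid) < a:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical data: 20 clicks over 5 samples before, 60 clicks over 6 after.
k1, n1, k2, n2 = 20, 5, 60, 6
print("p-value for c=1:", ratio_test_p_value(k1, n1, k2, n2, 1.0))
print("p-value for c=2:", ratio_test_p_value(k1, n1, k2, n2, 2.0))
print("largest rejected c:", largest_rejected_c(k1, n1, k2, n2))
```

Bisection works here because the p-value can only increase as c increases, so the values of c that are rejected form a single interval starting at 0. Note that the result is a conservative lower bound on the improvement, not an estimate of it: with the figures above the ratio of the sample means is 2.5, while the largest rejected c lies between 1 and 2.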

We have prepared a spreadsheet for you to run this test (the same one from our previous post), in case you don't want to mess around with the equations. It's the one we are using to monitor the progress of browseye.

