## Summary of the Problem

Let us recall the toy example from our previous post: we want to find out if introducing a new feature in our product or web page will improve a given metric, like the number of clicks on a certain button.

**Notation**:

- **N1** and **N2** are the number of button clicks per unit of time before and after deploying the new feature, respectively
- **n1** and **n2** are the number of samples taken of **N1** and **N2**, respectively
- **k1** and **k2** are the sum of the values of all **n1** and **n2** samples, respectively
- The null hypothesis of our experiment is **E[N2] ≤ E[N1]**, where **E[N1]** and **E[N2]** are the average values of **N1** and **N2**, respectively
- *a* is the significance level, which is the probability of making a mistake when rejecting the null hypothesis. We assume that *a* = 0.05
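The exact test procedure is the one described in our previous post. As a minimal sketch only (the normal-approximation z-test below is an assumption on my part, not necessarily the procedure from that post), a one-sided comparison of two Poisson rates can look like this:

```python
import math

def poisson_rate_test(k1, n1, k2, n2):
    """One-sided test of H0: E[N2] <= E[N1] for Poisson click counts.

    k1, k2: total clicks summed over the n1 and n2 subintervals.
    Uses a normal (Wald) approximation, reasonable for large counts.
    """
    lam1, lam2 = k1 / n1, k2 / n2          # estimated rates per subinterval
    se = math.sqrt(lam1 / n1 + lam2 / n2)  # std. error of the rate difference
    z = (lam2 - lam1) / se
    z_crit = 1.6449                        # one-sided critical value for a = 0.05
    return z > z_crit                      # True => reject H0: the feature helps

# Hypothetical numbers: 40 subintervals each, 1200 clicks before vs 1500 after.
print(poisson_rate_test(1200, 40, 1500, 40))  # True
```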

## Interpreting Results

Let's assume that we run the test on real data following the process described in our previous post. If the null hypothesis is rejected, we can conclude that on average we have more button clicks after deploying the new feature than before. The probability of this conclusion being right is very high (95%, since *a* = 0.05). We can safely congratulate our team and ourselves; the new feature rocks.

If the null hypothesis is not rejected we could be tempted to increase the significance level and run the test again. Then we'd say something like:

> We know with a significance level of 35% that we are doing better

But that's the equivalent of saying:

> We are not very sure that we are doing better

So, I strongly recommend you stick to the 5% significance level and **don't touch that number**. If the test does not reject your null hypothesis, accept the news.

## The Independence Assumption

You should always define metrics based on the behavior of **new users**. For instance, in our case the metric would be the number of clicks on a business-sensitive button made by new users joining our app during the observation period. Only then can we model our metric with a random variable. Otherwise, dependencies among subintervals arise and the metric has to be modeled with a stochastic process. In this case, further assumptions need to be made about the nature of the dependencies, and the statistical test becomes more involved and/or analytically intractable.

## When should I use the Scientific Approach?

In practice, the naive approach will work fairly well only when you have **many users** generating events relevant to the target metric. When this is not the case the naive approach can lead to the wrong conclusions without leaving you a clue about how wrong they are. The situation is **particularly delicate for young startups**, which have to make important decisions based on the evidence left by a reduced number of users.

We briefly explain the mathematical reasoning supporting the above conclusions. The probability that a user clicks on the button during an interval of time *T* can be modeled with a random variable. Therefore, the number of button clicks during *T* is equal to the sum of *u* independent and identically distributed random variables, where *u* is the number of users. According to the Classical Central Limit Theorem, this sum converges to a Normal distribution. As the number *u* of users increases, the variance of the per-user average converges to zero linearly in *1/u*. Thus, when *u* is large enough the number of button clicks resembles a deterministic (i.e., non-random) variable and a single sample *m1* is a good representative of its mean. In this case comparing **N1** and **N2** is almost equivalent to comparing two of their samples *m1* and *m2*.

For young startups my advice is to **always use the Scientific Approach**. I know that using the Scientific Approach requires an extra effort, but without it you simply can't trust your interpretation of the data. And if you can't trust your interpretation of the data, you can't make the right decisions that will take you to the top.
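The *1/u* shrinkage of the per-user average can be checked with a quick simulation (illustrative only: the click probability and trial count below are made-up numbers):

```python
import random
import statistics

random.seed(0)  # reproducible runs

def var_of_average(u, p=0.3, trials=1000):
    """Variance (across trials) of the per-user average number of clicks
    when each of u users clicks with probability p (made-up numbers)."""
    means = []
    for _ in range(trials):
        clicks = sum(1 for _ in range(u) if random.random() < p)
        means.append(clicks / u)
    return statistics.variance(means)

# The variance of the average shrinks roughly linearly in 1/u:
for u in (10, 100, 1000):
    print(u, var_of_average(u))
```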

If your startup has many users, you can eventually fall back to the Naive Approach. If you bothered to divide into subintervals and take different measures (i.e., samples), you can use the maximum likelihood estimates (MLE) of the means **E[N1]** and **E[N2]** as representatives for **N1** and **N2**, respectively. For the Poisson distribution, the MLEs of **E[N1]** and **E[N2]** are simply **k1 / n1** and **k2 / n2**, respectively. This is slightly better than the pure Naive Approach of comparing just two samples of **N1** and **N2**. But still, you'll never be sure that the conclusions you get by using this trick are right until you run the test with the Scientific Approach. So, what's the point of taking the risk? If you have that many users, you probably have the resources needed to do things properly. I would therefore suggest using the Scientific Approach here as well.

## How much better are we doing?

This section is for those who are not satisfied with knowing that they are doing better, but want to know **exactly how much better they are doing**. The idea is to generalize the null hypothesis like this: **E[N2] / E[N1] ≤ c**, for **c** ≥ 0.

Notice that with **c** = 1 we are back to the question of whether we are doing better with the new feature. But other values of **c** tell us exactly how much better (**c** > 1) or worse (**c** < 1) we are doing now compared to when we didn't have the new feature. In this case, the Divide and Conquer, Measure and Model steps are the same as before, but the Test step depends on the chosen value of **c** and is slightly different.

Always setting *a* = 0.05, the way to proceed now may be something like this. Perform the test for **c** = 1. If the null hypothesis is rejected, we know that we are doing better after deploying the new feature. Repeat the test for **c** = 2. If the null hypothesis is rejected, we know that we are doing twice as well as before. In general, **if the test rejects the null hypothesis for c, we know that we are doing c times as well as before**. The problem then reduces to finding the largest **c** that rejects the null hypothesis.

We have prepared a spreadsheet for you to run this test (the same one from our previous post), in case you don't want to mess around with the equations. It's the one we are using to monitor the progress of browseye.
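The search for the largest rejecting **c** can be sketched like this (again assuming a normal-approximation test for Poisson rates; the spreadsheet may implement the test differently, and all numbers below are hypothetical):

```python
import math

def rejects(k1, n1, k2, n2, c, z_crit=1.6449):
    """One-sided Wald z-test of H0: E[N2] <= c * E[N1] at a = 0.05."""
    lam1, lam2 = k1 / n1, k2 / n2
    # Standard error of lam2 - c*lam1 under the Poisson model.
    se = math.sqrt(c * c * lam1 / n1 + lam2 / n2)
    return (lam2 - c * lam1) / se > z_crit

def largest_c(k1, n1, k2, n2, step=0.01):
    """Scan c upward and return the largest value whose H0 is still rejected."""
    c, best = 0.0, None
    while c <= 10.0:
        if rejects(k1, n1, k2, n2, c):
            best = c
        c += step
    return best

# Hypothetical counts: 1200 clicks before, 1500 after, 40 subintervals each.
print(largest_c(1200, 40, 1500, 40))
```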
