Statistics for Programmers
As a programmer you want to know how your programs affect the world. When you change the code and deploy it, you want to know how the change affects your users. If you have many users and use canary or gradual deployments, you want to know whether the 1% of users who received your feature were happy with it. That 1% is a sample. Is the sample good enough? Can you infer from it to the broad population of your users? To answer this we need to know how to sample well and how to check whether a sample is good. Yes, as programmers we also need to know about confidence intervals, the central limit theorem and hypothesis testing. But hold on, we are not mathematicians, we are programmers, and that is why this is for you: this is statistics for programmers.
From reality to numbers, from numbers to representative numbers, and from samples to inference and testing: what is the true reality?
We are especially interested in the relation between samples and the actual population, so that we can, for example, take a sample of our users and draw conclusions about the whole population.
We collect evidence from samples. The more evidence we have, the more confident we are in our estimate of the actual population mean.
We use numbers to describe our world. In statistics we try to organize and understand those numbers, draw conclusions from them and ask questions about them.
We usually cannot see the numbers for the whole population, so we use sampling.
Looking at all the numbers or charts is not enough, so we try to come up with very few numbers, like the mean and the median, each of which is a single number describing our whole dataset.
Another number tells us how all the numbers we have relate to that single number: the standard deviation is, roughly, the typical distance between each data point and the mean of the dataset (precisely, the square root of the average squared distance). It tells us how much variation we have in our data.
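To make these summary numbers concrete, here is a minimal sketch using Python's standard `statistics` module. The response times are made-up data for illustration only, and the variable names are assumptions, not part of any real system.

```python
import statistics

# Hypothetical response times in milliseconds (made-up data, one slow outlier).
response_times_ms = [120, 130, 125, 110, 135, 128, 900]

mean_ms = statistics.mean(response_times_ms)      # pulled upwards by the 900 ms outlier
median_ms = statistics.median(response_times_ms)  # middle value, barely affected by the outlier
stddev_ms = statistics.pstdev(response_times_ms)  # standard deviation, treating the list as the whole population

print(f"mean={mean_ms:.1f} median={median_ms} stddev={stddev_ms:.1f}")
```

Notice how a single outlier moves the mean far away from the median. That is one reason to look at more than one summary number.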
A normal distribution means our data is distributed symmetrically around the mean in a bell shape, so we can look at a picture of it and estimate how many of the numbers we expect in a certain part of the distribution.
We expect about 68% of our data to be within 1 stddev of the mean, and about 95% of it to be within 2 stddev of the mean.
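The 68% and 95% figures can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a small sketch, and the helper name `share_within` is just an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = share_within(1)  # roughly 0.68
within_2 = share_within(2)  # roughly 0.95

print(f"within 1 stddev: {within_1:.4f}, within 2 stddev: {within_2:.4f}")
```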
def zscore(x) = (x - mean) / stdDev // how many stddev x is from the mean
A z-score tells us, for a specific data point, how many stddev it is from the mean.
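A runnable version of the z-score could look like the sketch below. It assumes we treat the given list as the whole dataset (so it uses `pstdev`), and the latency values are hypothetical.

```python
import statistics

def zscore(x, data):
    """How many standard deviations x lies from the mean of data."""
    mean = statistics.mean(data)
    std_dev = statistics.pstdev(data)  # assumes data is the whole dataset, not a sample
    return (x - mean) / std_dev

latencies_ms = [100, 110, 90, 105, 95]  # hypothetical measurements
z = zscore(120, latencies_ms)

print(f"z-score of 120 ms: {z:.2f}")
```

A positive z-score means the point is above the mean, a negative one means it is below.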
Probability is a ratio
It's just the ratio of the outcomes we care about to all possible outcomes. For example, the probability of rolling a 6 with a fair die is 1/6.
In some cases we need to know the probability of multiple events happening together, or the probability of event B given that event A has happened. The latter is conditional probability.
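As a sketch of probability as a ratio, we can enumerate all outcomes of rolling two fair dice and count. The events chosen here (first die is 6, sum is at least 10) are illustrative assumptions, and conditional probability is computed as P(B | A) = P(A and B) / P(A).

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def probability(event):
    """Ratio of outcomes where the event holds to all possible outcomes."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

def conditional(event_b, event_a):
    """P(B | A) = P(A and B) / P(A)."""
    p_a = probability(event_a)
    p_a_and_b = probability(lambda o: event_a(o) and event_b(o))
    return p_a_and_b / p_a

def first_is_six(o):
    return o[0] == 6

def sum_at_least_ten(o):
    return o[0] + o[1] >= 10

p_b = probability(sum_at_least_ten)                        # 6 of 36 outcomes
p_b_given_a = conditional(sum_at_least_ten, first_is_six)  # 3 of the 6 outcomes where the first die is 6

print(p_b, p_b_given_a)
```

Knowing that the first die is a 6 raises the probability of a high sum from 1/6 to 1/2.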
Random experiment - an experiment whose outcome we cannot know in advance, like rolling a die.
Random variable - the numerical outcome of a random experiment, like 5 or 6.
Discrete outcomes - countable values like 1, 2, 3, 4, ...
Continuous outcomes - any value in a range, like the temperature.
Binomial experiments - each trial has only 2 possible outcomes, like the stock price would rise or the stock price would fall, and we repeat the trial a fixed number of times.
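For a binomial experiment, the probability of seeing exactly k "successes" in n independent trials can be sketched with the standard binomial formula. The coin-flip numbers are an illustrative assumption.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials, each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Example: 10 fair coin flips, exactly 5 heads.
p_five_heads = binomial_pmf(5, 10, 0.5)

# The probabilities of all possible counts (0..10 heads) add up to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

print(f"P(5 heads in 10 flips) = {p_five_heads:.4f}")
```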
How many users should we sample
The larger the sample size, the more accurate the sample is and the more confident we are that it gives us a hint about the population. The standard deviation of the sample measures how far the points are from the sample mean, and it does not shrink as n grows; it settles near the population standard deviation. What shrinks as n grows is the standard deviation of the sample mean (the standard error).
n is the sample size: the bigger the sample size, the smaller the standard error of the sample mean.
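One way to turn "how many users should we sample" into a number is to pick the standard error of the sample mean we can tolerate and solve standard error = stddev / sqrt(n) for n. This is only a sketch: it assumes we already have a reasonable guess of the population standard deviation, and the function name and figures are hypothetical.

```python
import math

def required_sample_size(population_stddev, target_standard_error):
    """Smallest n for which stddev / sqrt(n) is at most the target standard error."""
    return math.ceil((population_stddev / target_standard_error) ** 2)

# Hypothetical: metric with stddev 20, and we want a standard error of at most 2.
n_needed = required_sample_size(population_stddev=20.0, target_standard_error=2.0)

# Halving the tolerated error needs four times as many users.
n_needed_tighter = required_sample_size(population_stddev=20.0, target_standard_error=1.0)

print(n_needed, n_needed_tighter)
```

Because n sits under a square root, precision gets expensive quickly: each halving of the error costs four times the sample.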
Central Limit Theorem
It assures you that sample means cluster around the population mean in a normal shape, so you get a good estimate of the population mean from samples, given that you take enough of them.
Could we use samples to direct us to the population mean?
Say we take 3 sample groups and compute the average of each group. In each sample group we take a few data points, for example sample size n=4.
k is the number of sample groups - here k=3.
The larger each sample, the closer its sample mean tends to be to the population mean. The distribution of the sample group means starts to look like a normal distribution, even if the original population is not normally distributed.
If you increase the sample size, the curve of the sample means becomes more normal, and it becomes taller and narrower, so its stddev is smaller.
The larger the sample size, the more confidence we have in our results.
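The central limit theorem can be tried out with a small simulation. The sketch below assumes a deliberately skewed population (exponential with mean 1.0), draws many sample groups, and compares the sample means for n=4 and n=100. The seed, group count and names are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def draw_sample_means(sample_size, groups=2000):
    """Draw `groups` samples from a skewed population and return the mean of each one."""
    means = []
    for _ in range(groups):
        # Exponential population with mean 1.0: strongly skewed, not normal.
        sample = [random.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means

small = draw_sample_means(sample_size=4)
large = draw_sample_means(sample_size=100)

center_small, spread_small = statistics.mean(small), statistics.stdev(small)
center_large, spread_large = statistics.mean(large), statistics.stdev(large)

# Share of the n=100 sample means within 1 stddev of their center;
# for a normal-looking curve this should be near 68%.
share_within_1 = sum(abs(m - center_large) <= spread_large for m in large) / len(large)

print(f"n=4:   center={center_small:.3f} spread={spread_small:.3f}")
print(f"n=100: center={center_large:.3f} spread={spread_large:.3f} within 1 stddev={share_within_1:.2f}")
```

Both sets of sample means center near the population mean of 1.0, but the n=100 means are packed much more tightly around it.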
Standard deviation of the sample mean
Connect the sample to the population
Sample and population relation: the standard deviation of our sample means (the standard error) is the standard deviation of the actual population divided by sqrt(n), where n is the sample size.
SE = s / sqrt(n)
SE = standard error of the sample
s = sample standard deviation
n = number of samples (the sample size)
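In practice we rarely know the population standard deviation, so the standard error is usually estimated from the sample itself as the sample standard deviation divided by sqrt(n). A minimal sketch, with made-up session lengths:

```python
import math
import statistics

def standard_error(sample):
    """Estimated standard error of the sample mean: sample stddev / sqrt(n)."""
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return s / math.sqrt(len(sample))

# Hypothetical session lengths in minutes for 9 sampled users.
session_minutes = [12, 15, 11, 14, 13, 16, 10, 13, 14]
se = standard_error(session_minutes)

print(f"mean={statistics.mean(session_minutes):.2f} standard error={se:.2f}")
```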
Significance - did this just happen by chance, or is there something to it: something really happened here and it was not mere chance.
If the event you see has less than a 5% chance of happening when nothing has changed, then something is new here, we have news, so we reject the null hypothesis.
If the outcome we saw as evidence has less than a 5% chance under the null hypothesis H0 (5% being the significance level alpha we choose), then we favor the alternative hypothesis Ha: something is strange here.
P-value - the probability that an outcome at least as extreme as the one we saw would occur by chance if the null hypothesis were true. Let's say we find it's 2%.
The p-value of 2% is less than 5%, so an outcome like this would be very uncommon by chance alone. We therefore reject the null hypothesis: we think something is smelly here, and we have real news. This does not prove anything, it just means we reject the null hypothesis.
And yet again we look at the normal distribution: the z-score tells us where the event we saw sits, in standard deviations from the mean, and the p-value is the area of the distribution at least that far out, which is the chance the event would happen when nothing new is going on.
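Putting the z-score and the p-value together, here is a sketch of a two-sided one-sample z-test. It assumes the population mean and standard deviation under the null hypothesis are known and that the sample mean is roughly normal; all numbers and names are hypothetical.

```python
import math
from statistics import NormalDist

ALPHA = 0.05  # significance level we choose up front

def z_test(sample_mean, population_mean, population_stddev, n):
    """Return (z, p_value) for a two-sided one-sample z-test."""
    standard_error = population_stddev / math.sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    # Chance of landing at least |z| standard errors away from the mean, in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical canary: the metric normally averages 100 with stddev 15,
# and 36 canary users averaged 105.
z, p_value = z_test(sample_mean=105.0, population_mean=100.0, population_stddev=15.0, n=36)
reject_null = p_value < ALPHA

print(f"z={z:.2f} p-value={p_value:.4f} reject null hypothesis: {reject_null}")
```

Here the sample mean sits 2 standard errors above the expected mean and the p-value comes out just under 5%, so with alpha at 5% we would reject the null hypothesis. With a smaller sample the same difference would not be enough.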