Introduction
Probability
 Definition : the likelihood that an event will occur
 Numerical representation
 A number from 0 to 1: 0 = never, 1 = certainty
 Multiply by 100 to express as a percent: 0% = never, 100% = certainty
 How to estimate a probability
 Intuitive guessing. e.g. The probability of passing the statistical module is 99.9%
 By observation and data collection
 There are 12 midwives and 13 Reproductive Medicine Students
 The probability of being a midwife in this class = 12 /25 = 0.48 (48%)
 Based on a known theoretical model
 Toss of a coin, probability of either side = 1 /2 = 0.5 or 50%
 Roll of a die, probability of any number = 1/6 = 0.17 or 17%
 Normal distribution and its variants : basis of this module
 So What
 We use probability models to represent reality
 We use the mathematics of probability to calculate how likely we are to be right or wrong
 This is what statistics is all about
Normal Distribution
95% confidence interval, one tail and two tail model
 95% confidence interval : The one tail model
 Really a percentile definition
 95% of observations on one side of that interval
 5% other side of this interval
 Probability of a value exceeding mean + 1.65 SD (z = 1.65) is 0.05 (5%)
 95th percentile = mean + 1.65*SD
 5th percentile = mean - 1.65*SD
 95% confidence interval (one tail) = <mean + 1.65SD or >mean - 1.65SD
 95% confidence interval : The two tail model
 95% confidence interval means that 95% of observations fall within that interval
 This means 5% is outside of this interval, 2.5% on each side (tail)
 Probability of a value exceeding mean + 1.96 SD is 0.025 (2.5%)
 Therefore 95% confidence interval = mean ±1.96SD
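The z values quoted above can be checked numerically. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library. The mean of 100 and SD of 15 are hypothetical values chosen only for illustration, and the one tail z of 1.645 is the figure commonly rounded to 1.65.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1

# One tail model: 5% of values lie beyond z on one side.
z_one_tail = std_normal.inv_cdf(0.95)
# Two tail model: 2.5% of values lie beyond z on each side.
z_two_tail = std_normal.inv_cdf(0.975)

# Hypothetical measurement, for illustration only.
mean, sd = 100.0, 15.0

percentile_95 = mean + z_one_tail * sd
percentile_5 = mean - z_one_tail * sd
two_tail_interval = (mean - z_two_tail * sd, mean + z_two_tail * sd)

print(f"z, one tail (p = 0.05):  {z_one_tail:.3f}")
print(f"z, two tail (p = 0.025): {z_two_tail:.3f}")
print(f"5th and 95th percentile: {percentile_5:.1f}, {percentile_95:.1f}")
print(f"95% interval (two tail): {two_tail_interval[0]:.1f} to {two_tail_interval[1]:.1f}")
```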
The t distribution
Normal Distribution
 Applicable to populations or very large samples
 Increasingly imprecise as sample size decreases
 Student's t corrects for this
 A widening of the distribution with smaller sample size (degrees of freedom)
 Otherwise maintains precision
 Allows research using small samples to infer results to population
 With smaller sample sizes, therefore
 Calculate t according to probability and degrees of freedom
 t is available for 1 or 2 tail models
 95% confidence interval is calculated the same way, using t instead of z
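To show how t widens as the degrees of freedom fall, here is a rough sketch that computes t using only the Python standard library. It integrates the t density numerically (Simpson's rule) and searches for the critical value by bisection. The function names are hypothetical, and the numerical method is an illustrative choice rather than the one used to prepare these notes. In practice a statistics library (for example `scipy.stats.t.ppf`) would normally be used.

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """P(T > t) for t >= 0, using Simpson's rule on the interval [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(p_one_tail: float, df: int) -> float:
    """t value with probability p_one_tail above it, found by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > p_one_tail:
            lo = mid  # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

# One tail uses p = 0.05; two tail splits 0.05 into 0.025 per tail.
for df in (5, 9, 19):
    print(df, round(t_critical(0.05, df), 4), round(t_critical(0.025, df), 4))
```

The printed values should agree closely with published t tables; small differences in the last decimal place are possible because of the numerical integration.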
Sampling, Statistical Effect, Standard Error
 Sampling
 For truth : we need to observe the whole population, usually impractical.
 In research : we sample, observe some and infer the results as applicable to population
 Provides an estimate of the truth (a statistical effect)
 Repeated sampling using the same method produces similar but not identical results
 The measurement of this variation is called the Standard Error of the statistical effect
 When we sample to obtain a mean
 Statistical Effect : mean
 Standard Error of effect : Standard Error of mean
 Theoretically, Standard Error of the mean is the Standard Deviation of the mean value, if
samples of the same size are taken repeatedly (see diagram)
 In practice, Standard Error of the mean is calculated as SE = SD / sqrt(n), where n = sample size
 If the sample size is 1 case, then SE=SD
 If sample size is infinitely large, SE = 0
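The relationship SE = SD / sqrt(n) can be illustrated with a short sketch. The SD of 450g is the birth weight value used in the exercises; the function name is hypothetical.

```python
import math

def standard_error_of_mean(sd: float, n: int) -> float:
    """SE = SD / sqrt(n) for a sample of n observations."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return sd / math.sqrt(n)

sd = 450.0  # SD of birth weight in grams, as used in the exercises
for n in (1, 10, 100, 10_000):
    print(f"n = {n:>6}: SE = {standard_error_of_mean(sd, n):.1f}")
```

With n = 1 the SE equals the SD, and the SE shrinks towards 0 as n grows.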
95% Confidence Interval of the Difference
 Effect and Standard Error can be translated to 95% confidence interval of the effect
 5% (0.05) outside of the interval is excluded
 For infinitely large sample or population
 For the 1 tail model
 z for 0.05 = 1.65
 95% confidence interval is <effect + 1.65 * SE, or >effect - 1.65 * SE
 For the 2 tail model
 5% to be excluded is divided between both tails, so 2.5% (0.025) in each tail
 z for 0.025 = 1.96
 95% confidence interval = effect ± 1.96 * SE
 For small samples
 For the 1 tail model
 One tail t for p = 0.05 and the degrees of freedom is calculated
 95% confidence interval = <effect + t_{1 tail} * SE or >effect - t_{1 tail} * SE
 For the 2 tail model
 Two tail t for p = 0.05 and the degrees of freedom is calculated
 95% confidence interval = effect ± t_{2 tail} * SE
 If the 95% CI does not overlap the null value(0), then the Null hypothesis is rejected
and the effect is statistically significant
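As a sketch of the large sample, two tail case described above, the following computes effect ± z * SE and checks whether the interval includes the null value. The effect and SE values are hypothetical, as are the function names; for small samples z would be replaced by t.

```python
from statistics import NormalDist

def confidence_interval(effect: float, se: float, confidence: float = 0.95):
    """Two tail confidence interval for a large sample (uses z)."""
    tail = (1 - confidence) / 2          # e.g. 0.025 in each tail for 95%
    z = NormalDist().inv_cdf(1 - tail)   # e.g. 1.96 for 95%
    return effect - z * se, effect + z * se

def excludes_null(interval, null_value: float = 0.0) -> bool:
    """True if the null value lies outside the interval."""
    low, high = interval
    return not (low <= null_value <= high)

# Hypothetical effect (difference) and Standard Error, for illustration only.
effect, se = 120.0, 50.0
ci = confidence_interval(effect, se)
print(f"95% CI = {ci[0]:.1f} to {ci[1]:.1f}")
print("Statistically significant:", excludes_null(ci))
```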
Statistical Significance 1. Probability of Type I Error
 Effect and Standard Error : In parametric data (normally distributed)
 The measurements are normally distributed (mean, Standard Deviation)
 The sample mean is normally distributed (mean, Standard Error of the mean)
 The difference between two means is normally distributed (difference, Standard Error of the difference)
 It follows that (Fisher argued):
 We propose that, in truth, the difference is null (0): the null hypothesis
 We then reject that proposition, proposing instead that a difference does exist: rejecting the null hypothesis
 We then use the data we collected to test whether we have made an error in rejecting the null hypothesis (Type I Error). Please note that error here means mistake and not variation
 We assume the data we have is normally distributed
 We examine our data and obtain its difference and Standard Error, assuming these are also normally distributed (z or t)
 We calculate the distance of our difference from null (0) in units of its Standard Error, so that z or t = (diff - 0) / SE = diff / SE
 We can then obtain the probability of a z or t at least this far from null (0) if the null hypothesis were true
 This is taken as the probability that we are wrong to reject the null hypothesis (Probability of Type I Error, p, alpha, α)
 If this probability (p, α) is small enough (usually p<0.05), we conclude that we are not wrong in rejecting the null hypothesis, that null is not true, and so decide that the difference is real and meaningful
 In Summary
 The difference between 2 means can be generalized to any statistical effect that is normally distributed
 (z or t) = statistical effect / Standard Error of the effect
 (p, alpha, α) = probability of obtaining a z or t this far from null if the null hypothesis were true
 If p<0.05, this probability is too small, and we reject the null hypothesis
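The summary can be sketched in a few lines for the large sample case, where the effect divided by its Standard Error is treated as z. The effect and SE values are hypothetical and the function name is illustrative; for small samples the probability would come from the t distribution instead.

```python
from statistics import NormalDist

def z_test(effect: float, se: float):
    """Return z, the one tail p and the two tail p for an effect and its SE."""
    z = effect / se                       # distance from null (0) in SE units
    p_one_tail = 1 - NormalDist().cdf(abs(z))
    return z, p_one_tail, 2 * p_one_tail

# Hypothetical difference between two means and its Standard Error.
z, p_one, p_two = z_test(effect=120.0, se=50.0)
print(f"z = {z:.2f}, p (one tail) = {p_one:.4f}, p (two tail) = {p_two:.4f}")
if p_two < 0.05:
    print("Reject the null hypothesis")
else:
    print("No statistical decision")
```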
Statistical significance 2. Type II Error
Problem with Type I Error : no statistical decision if α is large (p>0.05)
Pearson's model
 The alternative hypothesis: that the difference is not null
 The error of rejecting the alternative hypothesis when the difference is real is the Type II Error; its probability is β
 If we can predetermine the value of α to reject the null hypothesis, e.g. α=0.05
 If we can predetermine the value of β to reject the alternative hypothesis, e.g. β=0.2 (power, 1-β = 0.8)
 If we know what the Standard Deviation of the measurement is
 If we can nominate the difference that matters to us, the critical difference, the clinically significant difference
 We can determine the sample size required to make a confident interpretation of data
 After collecting the data
 If the difference is greater than the critical difference
 We reject the null hypothesis and accept the alternative hypothesis
 We conclude that the difference is Statistically Significant
 If the difference is less than the critical difference
 We reject the alternative hypothesis and accept the null hypothesis
 We conclude that the difference is Not Statistically Significant
Power and Sample Size
 Forest Plot shows relationship between power and sample size
 All bars in the plot have the same effect size and Standard Error
 The 95% confidence interval narrows as sample size increases
 The first 3 bars show studies with insufficient sample size, which are under powered. They fail to identify the difference as statistically significant, even if the difference is real
 The last 2 bars show studies with excessive sample size. They are unnecessarily large, waste resources, inconvenience everyone, and place research subjects at unnecessary risk and discomfort
 Research without appropriate sample size considerations is therefore considered poorly designed and conducted
 During the planning stage of a research project
 The effect that is clinically meaningful must be defined
 The background population variation of the measurement concerned must be estimated
 The probability of Type I Error, or the % confidence interval to be used to establish statistical significance, must be decided
 Following common usage, the default value for the module is p<0.05, or 95% confidence interval
 The power, the ability to detect the difference, is decided
 Following common usage, the default value for the module is power=0.8 (80%)
 From these parameters, the sample size required can be estimated.
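These notes do not state a formula for the estimate. As an illustration only, the sketch below assumes the widely used normal approximation for comparing the means of two independent groups, n per group = 2 * ((z_α + z_β) * SD / difference)², with a two tail α. The critical difference of 200g is a hypothetical value; the SD of 450g is the birth weight value used in the exercises.

```python
import math
from statistics import NormalDist

def sample_size_per_group(sd: float, critical_difference: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate n per group for comparing two means (normal approximation).

    Assumption: two independent groups of equal size and a two tail alpha.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # Type I error, two tail
    z_beta = std_normal.inv_cdf(power)           # power = 1 - beta
    n = 2 * ((z_alpha + z_beta) * sd / critical_difference) ** 2
    return math.ceil(n)

# SD from the exercises; the critical difference is hypothetical.
print(sample_size_per_group(sd=450.0, critical_difference=200.0))
# A smaller critical difference or a higher power requires a larger sample.
print(sample_size_per_group(sd=450.0, critical_difference=100.0))
print(sample_size_per_group(sd=450.0, critical_difference=200.0, power=0.9))
```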
Exercises
Probability is so much part of statistical analysis that there will be plenty of opportunities to
exercise the concept in relationship to practical analysis of data.
The exercises in this workshop are more theoretical in nature, taking into account that, at this stage,
students are still unfamiliar with the many research models that will be discussed later.
The purpose of these exercises is therefore to make the students familiar with the concepts and some
of the frequently encountered numbers, in particular the numerical relationship between probability, z, t,
one and two tail, and confidence interval.
Q 1. Calculate z value to 4 decimal places, for probabilities of 0.1, 0.05, 0.025, and 0.01
A 1.
z values for probability p
| p | z |
|---|---|
| 0.1 | 1.2815 |
| 0.05 | 1.6449 |
| 0.025 | 1.9600 |
| 0.01 | 2.3263 |
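One possible way to obtain these z values is with `statistics.NormalDist` from the Python standard library, sketched below. Here p is taken as the probability in one tail, so z is the point with 1 - p of the distribution below it. The computed values may differ from the table in the last decimal place, depending on rounding.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

# z for a one tail probability p: the point with 1 - p of the area below it.
z_values = {p: std_normal.inv_cdf(1 - p) for p in (0.1, 0.05, 0.025, 0.01)}

for p, z in z_values.items():
    # Reverse check: recover p from z.
    p_back = 1 - std_normal.cdf(z)
    print(f"p = {p:<6} z = {z:.4f}  (p from z = {p_back:.4f})")
```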
Q 2. Calculate the one and two tail t value to 4 decimal places, for the probability of 0.05,
for degrees of freedom from 1 to 20
A 2.
t values for probability p=0.05
| df | t(1 tail) | t(2 tail) | df | t(1 tail) | t(2 tail) | df | t(1 tail) | t(2 tail) | df | t(1 tail) | t(2 tail) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6.3138 | 12.7065 | 6 | 1.9432 | 2.4469 | 11 | 1.7959 | 2.2010 | 16 | 1.7459 | 2.1199 |
| 2 | 2.9200 | 4.3026 | 7 | 1.8946 | 2.3646 | 12 | 1.7823 | 2.1788 | 17 | 1.7396 | 2.1098 |
| 3 | 2.3534 | 3.1824 | 8 | 1.8595 | 2.3060 | 13 | 1.7709 | 2.1604 | 18 | 1.7341 | 2.1009 |
| 4 | 2.1319 | 2.7764 | 9 | 1.8331 | 2.2621 | 14 | 1.7613 | 2.1448 | 19 | 1.7291 | 2.0930 |
| 5 | 2.0150 | 2.5706 | 10 | 1.8124 | 2.2282 | 15 | 1.7530 | 2.1314 | 20 | 1.7247 | 2.0860 |
Q 3. The following parameters and questions are asked for different sample sizes
 The birth weight has a mean of 3500g and a Standard Deviation of 450g
 Assuming birth weight is normally distributed, calculate the following
 the 0.5, 1, 2.5, 5, 10, 90, 95, 97.5, 99, and 99.5 percentile in birth weight
 What is the percent of babies that weigh
 less than 2000g, 2500g, and 3000g
 more than 4000g, and 4500g
 What is the 80%, 90%, 95%, and 99% confidence interval of birth weight if we measure a baby
 What is the 80%, 90%, 95%, and 99% confidence interval of the mean value of birth weight
 In the results, percent can be rounded to 1 decimal place, and birth weight to the nearest gram.
Q 3a. Perform the calculations assuming the mean and SD were from a sample of 10 babies
A 3a.
 Sample size =10, df = 9
 Table 3a.1 birthweights based on t values calculated from p and df=9
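| Percentile | Probability p | t from p, df (one tail) | 3500 + 450t |
|---|---|---|---|
| 0.5 | 0.005 | -3.2498 | 2038 |
| 1 | 0.01 | -2.8214 | 2230 |
| 2.5 | 0.025 | -2.2621 | 2482 |
| 5 | 0.05 | -1.8331 | 2675 |
| 10 | 0.1 | -1.383 | 2878 |
| 90 | 0.1 | 1.383 | 4122 |
| 95 | 0.05 | 1.8331 | 4325 |
| 97.5 | 0.025 | 2.2621 | 4518 |
| 99 | 0.01 | 2.8214 | 4770 |
| 99.5 | 0.005 | 3.2498 | 4962 |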
 Table 3a.2 percent outside of birth weight according to p calculated from t and df=9
| BWt | t = (BWt - 3500)/450 | p from t, df (one tail) | % outside |
|---|---|---|---|
| 2000 | -3.3333 | 0.0044 | 0.44 |
| 2500 | -2.2222 | 0.0267 | 2.67 |
| 3000 | -1.1111 | 0.14765 | 14.765 |
| 4000 | 1.1111 | 0.14765 | 14.765 |
| 4500 | 2.2222 | 0.0267 | 2.67 |
 Table 3a.3 percent confidence interval of birth weight from t calculated from p and df=9
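| % confidence interval | p (total) | p (each tail) | t from p, df (each tail) | 3500 - 450t | 3500 + 450t |
|---|---|---|---|---|---|
| 80 | 0.2 | 0.1 | 1.383 | 2878 | 4122 |
| 90 | 0.1 | 0.05 | 1.8331 | 2675 | 4325 |
| 95 | 0.05 | 0.025 | 2.2621 | 2482 | 4518 |
| 99 | 0.01 | 0.005 | 3.2498 | 2038 | 4962 |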
 Table 3a.4 percent confidence interval of mean from t calculated from p and df=9
SD is replaced with SE where SE=SD/sqrt(n) = 450/sqrt(10) = 142
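| % confidence interval | p (total) | p (each tail) | t from p, df (each tail) | 3500 - 142t | 3500 + 142t |
|---|---|---|---|---|---|
| 80 | 0.2 | 0.1 | 1.383 | 3304 | 3696 |
| 90 | 0.1 | 0.05 | 1.8331 | 3240 | 3760 |
| 95 | 0.05 | 0.025 | 2.2621 | 3179 | 3821 |
| 99 | 0.01 | 0.005 | 3.2498 | 3039 | 3961 |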
Q 3b. Perform the calculations assuming the mean and SD were from a sample of 20 babies
A 3b.
 Sample size =20, df = 19
 Table 3b.1 birthweights based on t values calculated from p and df=19
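| Percentile | Probability p | t from p, df (one tail) | 3500 + 450t |
|---|---|---|---|
| 0.5 | 0.005 | -2.8609 | 2213 |
| 1 | 0.01 | -2.5395 | 2357 |
| 2.5 | 0.025 | -2.093 | 2558 |
| 5 | 0.05 | -1.7291 | 2722 |
| 10 | 0.1 | -1.3277 | 2903 |
| 90 | 0.1 | 1.3277 | 4097 |
| 95 | 0.05 | 1.7291 | 4278 |
| 97.5 | 0.025 | 2.093 | 4442 |
| 99 | 0.01 | 2.5395 | 4643 |
| 99.5 | 0.005 | 2.8609 | 4787 |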
 Table 3b.2 percent outside of birth weight according to p calculated from t and df=19
| BWt | t = (BWt - 3500)/450 | p from t, df (one tail) | % outside |
|---|---|---|---|
| 2000 | -3.3333 | 0.00175 | 0.175 |
| 2500 | -2.2222 | 0.0193 | 1.93 |
| 3000 | -1.1111 | 0.1402 | 14.02 |
| 4000 | 1.1111 | 0.1402 | 14.02 |
| 4500 | 2.2222 | 0.0193 | 1.93 |
 Table 3b.3 percent confidence interval of birth weight from t calculated from p and df=19
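| % confidence interval | p (total) | p (each tail) | t from p, df (each tail) | 3500 - 450t | 3500 + 450t |
|---|---|---|---|---|---|
| 80 | 0.2 | 0.1 | 1.3277 | 2903 | 4097 |
| 90 | 0.1 | 0.05 | 1.7291 | 2722 | 4278 |
| 95 | 0.05 | 0.025 | 2.093 | 2558 | 4442 |
| 99 | 0.01 | 0.005 | 2.8609 | 2213 | 4787 |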
 Table 3b.4 percent confidence interval of mean from t calculated from p and df=19
SD is replaced with SE where SE=SD/sqrt(n) = 450/sqrt(20) = 101
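| % confidence interval | p (total) | p (each tail) | t from p, df (each tail) | 3500 - 101t | 3500 + 101t |
|---|---|---|---|---|---|
| 80 | 0.2 | 0.1 | 1.3277 | 3366 | 3634 |
| 90 | 0.1 | 0.05 | 1.7291 | 3325 | 3675 |
| 95 | 0.05 | 0.025 | 2.093 | 3289 | 3711 |
| 99 | 0.01 | 0.005 | 2.8609 | 3211 | 3789 |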
Q 3c. Perform the calculations assuming the mean and SD were from a sample of 50 babies
A 3c.
 Sample size =50, df = 49
 Table 3c.1 birthweights based on t values calculated from p and df=49
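| Percentile | Probability p | t from p, df (one tail) | 3500 + 450t |
|---|---|---|---|
| 0.5 | 0.005 | -2.68 | 2294 |
| 1 | 0.01 | -2.4049 | 2418 |
| 2.5 | 0.025 | -2.0096 | 2596 |
| 5 | 0.05 | -1.6766 | 2746 |
| 10 | 0.1 | -1.2991 | 2915 |
| 90 | 0.1 | 1.2991 | 4085 |
| 95 | 0.05 | 1.6766 | 4254 |
| 97.5 | 0.025 | 2.0096 | 4404 |
| 99 | 0.01 | 2.4049 | 4582 |
| 99.5 | 0.005 | 2.68 | 4706 |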
 Table 3c.2 percent outside of birth weight according to p calculated from t and df=49
| BWt | t = (BWt - 3500)/450 | p from t, df (one tail) | % outside |
|---|---|---|---|
| 2000 | -3.3333 | 0.0008 | 0.08 |
| 2500 | -2.2222 | 0.01545 | 1.545 |
| 3000 | -1.1111 | 0.13595 | 13.595 |
| 4000 | 1.1111 | 0.13595 | 13.595 |
| 4500 | 2.2222 | 0.01545 | 1.545 |
 Table 3c.3 percent confidence interval of birth weight from t calculated from p and df=49
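| % confidence interval | p (total) | p (each tail) | t from p, df (each tail) | 3500 - 450t | 3500 + 450t |
|---|---|---|---|---|---|
| 80 | 0.2 | 0.1 | 1.2991 | 2915 | 4085 |
| 90 | 0.1 | 0.05 | 1.6766 | 2746 | 4254 |
| 95 | 0.05 | 0.025 | 2.0096 | 2596 | 4404 |
| 99 | 0.01 | 0.005 | 2.68 | 2294 | 4706 |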
 Table 3c.4 percent confidence interval of mean from t calculated from p and df=49
SD is replaced with SE where SE=SD/sqrt(n) = 450/sqrt(50) = 64
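| % confidence interval | p (total) | p (each tail) | t from p, df (each tail) | 3500 - 64t | 3500 + 64t |
|---|---|---|---|---|---|
| 80 | 0.2 | 0.1 | 1.2991 | 3417 | 3583 |
| 90 | 0.1 | 0.05 | 1.6766 | 3393 | 3607 |
| 95 | 0.05 | 0.025 | 2.0096 | 3371 | 3629 |
| 99 | 0.01 | 0.005 | 2.68 | 3328 | 3672 |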
Q 3d. Perform the calculations assuming the mean and SD were from a sample of 100 babies
A 3d.
 Sample size =100, df = 99
 Table 3d.1 birthweights based on t values calculated from p and df=99
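| Percentile | Probability p | t from p, df (one tail) | 3500 + 450t |
|---|---|---|---|
| 0.5 | 0.005 | -2.6264 | 2318 |
| 1 | 0.01 | -2.3646 | 2436 |
| 2.5 | 0.025 | -1.9842 | 2607 |
| 5 | 0.05 | -1.6604 | 2753 |
| 10 | 0.1 | -1.2902 | 2919 |
| 90 | 0.1 | 1.2902 | 4081 |
| 95 | 0.05 | 1.6604 | 4247 |
| 97.5 | 0.025 | 1.9842 | 4393 |
| 99 | 0.01 | 2.3646 | 4564 |
| 99.5 | 0.005 | 2.6264 | 4682 |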
 Table 3d.2 percent outside of birth weight according to p calculated from t and df=99
| BWt | t = (BWt - 3500)/450 | p from t, df (one tail) | % outside |
|---|---|---|---|
| 2000 | -3.3333 | 0.0006 | 0.06 |
| 2500 | -2.2222 | 0.01425 | 1.425 |
| 3000 | -1.1111 | 0.1346 | 13.46 |
| 4000 | 1.1111 | 0.1346 | 13.46 |
| 4500 | 2.2222 | 0.01425 | 1.425 |
 Table 3d.3 percent confidence interval of birth weight from t calculated from p and df=99
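| % confidence interval | p (total) | p (each tail) | t from p, df (each tail) | 3500 - 450t | 3500 + 450t |
|---|---|---|---|---|---|
| 80 | 0.2 | 0.1 | 1.2902 | 2919 | 4081 |
| 90 | 0.1 | 0.05 | 1.6604 | 2753 | 4247 |
| 95 | 0.05 | 0.025 | 1.9842 | 2607 | 4393 |
| 99 | 0.01 | 0.005 | 2.6264 | 2318 | 4682 |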
 Table 3d.4 percent confidence interval of mean from t calculated from p and df=99
SD is replaced with SE where SE=SD/sqrt(n) = 450/sqrt(100) = 45
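| % confidence interval | p (total) | p (each tail) | t from p, df (each tail) | 3500 - 45t | 3500 + 45t |
|---|---|---|---|---|---|
| 80 | 0.2 | 0.1 | 1.2902 | 3442 | 3558 |
| 90 | 0.1 | 0.05 | 1.6604 | 3425 | 3575 |
| 95 | 0.05 | 0.025 | 1.9842 | 3411 | 3589 |
| 99 | 0.01 | 0.005 | 2.6264 | 3382 | 3618 |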
Q 3e. Perform the calculations assuming the mean and SD were from a sample infinitely large (population)
A 3e.
 Sample size = very large, so use z
 Table 3e.1 birthweights based on z values
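| Percentile | Probability p | z from p | 3500 + 450z |
|---|---|---|---|
| 0.5 | 0.005 | -2.5758 | 2341 |
| 1 | 0.01 | -2.3263 | 2453 |
| 2.5 | 0.025 | -1.96 | 2618 |
| 5 | 0.05 | -1.6449 | 2760 |
| 10 | 0.1 | -1.2815 | 2923 |
| 90 | 0.1 | 1.2815 | 4077 |
| 95 | 0.05 | 1.6449 | 4240 |
| 97.5 | 0.025 | 1.96 | 4382 |
| 99 | 0.01 | 2.3263 | 4547 |
| 99.5 | 0.005 | 2.5758 | 4659 |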
 Table 3e.2 percent outside of birth weight according to p calculated from z
| BWt | z = (BWt - 3500)/450 | p from z | % outside |
|---|---|---|---|
| 2000 | -3.3333 | 0.0004 | <0.1 |
| 2500 | -2.2222 | 0.0131 | 1.3 |
| 3000 | -1.1111 | 0.1333 | 13.3 |
| 4000 | 1.1111 | 0.1333 | 13.3 |
| 4500 | 2.2222 | 0.0131 | 1.3 |
 Table 3e.3 percent confidence interval of birth weight from z calculated from p
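| % confidence interval | p (total) | p (each tail) | z from p (each tail) | 3500 - 450z | 3500 + 450z |
|---|---|---|---|---|---|
| 80 | 0.2 | 0.1 | 1.2815 | 2923 | 4077 |
| 90 | 0.1 | 0.05 | 1.6449 | 2760 | 4240 |
| 95 | 0.05 | 0.025 | 1.96 | 2618 | 4382 |
| 99 | 0.01 | 0.005 | 2.5758 | 2341 | 4659 |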
 Table 3e.4 percent confidence interval of mean from z calculated from p
SD is replaced with SE where SE=SD/sqrt(n) = 450/sqrt(∞) = 0
| % confidence interval | p (total) | p (each tail) | z from p (each tail) | 3500 - 0z | 3500 + 0z |
|---|---|---|---|---|---|
| 80 | 0.2 | 0.1 | 1.2815 | 3500 | 3500 |
| 90 | 0.1 | 0.05 | 1.6449 | 3500 | 3500 |
| 95 | 0.05 | 0.025 | 1.96 | 3500 | 3500 |
| 99 | 0.01 | 0.005 | 2.5758 | 3500 | 3500 |
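The population case (Q 3e) can be sketched with `statistics.NormalDist` from the Python standard library, using the mean of 3500g and SD of 450g given in the question. The variable names are illustrative. The smaller samples in Q 3a to Q 3d would need t values in place of z, which the standard library does not provide directly.

```python
from statistics import NormalDist

birth_weight = NormalDist(mu=3500, sigma=450)  # mean and SD from Q 3

# Percentiles of birth weight, rounded to the nearest gram.
percentiles = {
    pct: round(birth_weight.inv_cdf(pct / 100))
    for pct in (0.5, 1, 2.5, 5, 10, 90, 95, 97.5, 99, 99.5)
}

# Percent of babies below or above given weights.
percent_below = {w: 100 * birth_weight.cdf(w) for w in (2000, 2500, 3000)}
percent_above = {w: 100 * (1 - birth_weight.cdf(w)) for w in (4000, 4500)}

# Two tail confidence intervals for a single baby's birth weight.
intervals = {}
for confidence in (80, 90, 95, 99):
    tail = (1 - confidence / 100) / 2
    intervals[confidence] = (
        round(birth_weight.inv_cdf(tail)),
        round(birth_weight.inv_cdf(1 - tail)),
    )

print(percentiles)
print({w: round(p, 1) for w, p in percent_below.items()})
print({w: round(p, 1) for w, p in percent_above.items()})
print(intervals)
```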
