*This post, on confidence intervals, is by frequent commenter, John Friend. It is the second of our guest posts; the first, by Anthony Harradine, is here. A version of John’s post is also available as a PDF, here.*

## A Lack of Confidence

The *Wald Interval*, often called the ‘Standard Interval’, is the confidence interval most commonly presented in introductory statistics textbooks for the population proportion $p$. It is the confidence interval prescribed in the VCE Study Design for Mathematical Methods (Word), and takes the form

\[\left(\hat p - k\sqrt{\frac{\hat p(1-\hat p)}{n}}\,,\ \hat p + k\sqrt{\frac{\hat p(1-\hat p)}{n}}\right).\]

Here, $n$ is the sample size, $\hat p$ is the observed sample proportion and $k$ is defined by $\Pr(-k < Z < k) = \frac{C}{100}$, where $C\%$ is the confidence level and $Z$ is the standard normal random variable.

**Problems with the Wald interval**

The Wald interval possesses a number of defects:

**1.** In the special cases $\hat p = 0$ or $\hat p = 1$, the Wald Interval has zero width, and thus disappears. The Wald Interval also performs very poorly when $\hat p \approx 0$ or $\hat p \approx 1$.

**2.** Intervals can have ‘overshoot’. For example, when $n = 30$ and $\hat p = 0.9$, the approximate 95% Wald interval is (0.793, 1.007).

**3.** The Wald Interval often performs very poorly in practical scenarios, in that the *coverage probability* (see below) is often less than the nominal confidence level (e.g. the coverage of the 95% Wald Interval is often less than 95%). That is not good since we hope to have a reasonable ‘coverage’ when constructing a confidence interval.
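The first two defects are easy to check numerically. The following is a minimal sketch, not part of the Study Design or any CAS: the function name `wald_interval` and the choice of a confidence level given as a probability are assumptions made for illustration. The values $n = 30$ and $\hat p = 0.9$ are the ones that reproduce the interval (0.793, 1.007) quoted above.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(n, p_hat, conf=0.95):
    """Wald ('Standard') interval; conf is a probability in (0, 1)."""
    # k satisfies Pr(-k < Z < k) = conf, e.g. k is about 1.96 for 95%.
    k = NormalDist().inv_cdf(0.5 + conf / 2)
    half_width = k * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Defect 2 (overshoot): the upper endpoint exceeds 1.
low, high = wald_interval(30, 0.9)
print(f"n=30, p_hat=0.9: ({low:.3f}, {high:.3f})")

# Defect 1 (zero width): the interval collapses to a single point.
print(wald_interval(30, 0.0))
print(wald_interval(30, 1.0))
```

The same function can be reused with any other $n$ and $\hat p$ to see how quickly the overshoot appears as $\hat p$ approaches 0 or 1.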

**Note:** The probability that an interval contains or *covers* the true value of an unknown parameter is called the *coverage probability*. It is a property of the procedure that produces the interval. The interval produced for a particular sample, using a procedure with coverage probability $\frac{C}{100}$, is said to have a confidence level of $C\%$. The coverage probability of intervals such as the Wald Interval can be investigated by simulating random sampling from a population with a known value of $p$. A confidence interval is constructed for each of the random samples to see how many such confidence intervals actually ‘cover’ (include) $p$.

(From *Five Confidence Intervals For Proportions You Should Know About* – Dr Dennis Robert Aug 2020.)
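The simulation described in the note can be sketched in a few lines. This is an illustrative sketch only: the helper names `wald_interval` and `simulated_coverage`, the seed, the number of trials, and the choice $n = 30$, $p = 0.1$ are all assumptions, not values taken from the cited article.

```python
import random
from math import sqrt
from statistics import NormalDist


def wald_interval(n, p_hat, conf=0.95):
    """Wald interval; conf is a probability in (0, 1)."""
    k = NormalDist().inv_cdf(0.5 + conf / 2)
    half_width = k * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def simulated_coverage(interval, n, p, trials=10_000, seed=1):
    """Fraction of simulated samples whose interval contains the true p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # One random sample of size n from a population with proportion p.
        successes = sum(rng.random() < p for _ in range(n))
        low, high = interval(n, successes / n)
        hits += low < p < high
    return hits / trials


# Nominal level is 0.95; the simulated coverage is what the procedure delivers.
print(simulated_coverage(wald_interval, n=30, p=0.1))
```

For a population proportion this far from 0.5, the printed value should come out noticeably below the nominal 0.95, which is the third defect listed above.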

**Alternatives to the Wald Interval**

Given the imposition of statistical inference onto Mathematical Methods by VCAA, the ubiquity of CAS technology in VCE mathematics and the ‘black box’ (or should that be ‘black CAS’) approach to calculating confidence intervals, it is puzzling that the Wald Interval is the only confidence interval mentioned in the VCE Study Design for Mathematical Methods, and even more puzzling that its defects are not mentioned.

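Elsewhere, it has been strongly recommended that instructors present the *Wilson Interval* (see the Appendix) as a better alternative:

\[\left(\tilde p - \frac{k}{1+\frac{k^2}{n}}\sqrt{\frac{\hat p(1-\hat p)}{n}+\frac{k^2}{4n^2}}\,,\ \tilde p + \frac{k}{1+\frac{k^2}{n}}\sqrt{\frac{\hat p(1-\hat p)}{n}+\frac{k^2}{4n^2}}\right),\]

where the midpoint, $\tilde p$, of the Wilson Interval is given by

\[\tilde p = \frac{\hat p + \frac{k^2}{2n}}{1+\frac{k^2}{n}}\,.\]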

See, for example, *Approximate is Better Than “Exact” for Interval Estimation of Binomial Proportions*, by Alan Agresti and Brent Coull (*The American Statistician*, **52: 2**, 119-126, 1998).
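Although the Wilson formula looks complicated, it is only a few lines of code. The sketch below assumes the usual closed form of the Wilson Interval (midpoint $(\hat p + k^2/2n)/(1 + k^2/n)$, with half-width written out in the code); the function name `wilson_interval` and the sample values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(n, p_hat, conf=0.95):
    """Wilson interval; conf is a probability in (0, 1)."""
    k = NormalDist().inv_cdf(0.5 + conf / 2)
    denom = 1 + k**2 / n
    # Midpoint: the sample proportion pulled towards 1/2.
    midpoint = (p_hat + k**2 / (2 * n)) / denom
    half_width = (k / denom) * sqrt(p_hat * (1 - p_hat) / n + k**2 / (4 * n**2))
    return midpoint - half_width, midpoint + half_width


# Same sample as the Wald overshoot example: n = 30, p_hat = 0.9.
low, high = wilson_interval(30, 0.9)
print(f"n=30, p_hat=0.9: ({low:.3f}, {high:.3f})")

# Even with p_hat = 1 the interval keeps a positive width.
low, high = wilson_interval(30, 1.0)
print(f"n=30, p_hat=1.0: ({low:.3f}, {high:.3f})")
```

For the sample with $\hat p = 0.9$, both endpoints stay strictly between 0 and 1, in contrast to the Wald Interval for the same data.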

The Wilson Interval does not suffer from any of the above-stated defects of the Wald Interval, and in particular its coverage probability is superior:

(From *Five Confidence Intervals For Proportions You Should Know About* – Dr Dennis Robert Aug 2020.)
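The coverage comparison can also be done exactly, without simulation, by summing binomial probabilities over every sample count whose interval contains $p$. The following is a sketch under stated assumptions: the function names, the sample size $n = 30$ and the grid of $p$ values are chosen for illustration and are not taken from the cited article.

```python
from math import comb, sqrt
from statistics import NormalDist

K = NormalDist().inv_cdf(0.975)  # about 1.96, for nominal 95% intervals


def wald(n, x):
    """Wald interval from x successes in n trials."""
    p_hat = x / n
    h = K * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - h, p_hat + h


def wilson(n, x):
    """Wilson interval from x successes in n trials."""
    p_hat = x / n
    denom = 1 + K**2 / n
    mid = (p_hat + K**2 / (2 * n)) / denom
    h = (K / denom) * sqrt(p_hat * (1 - p_hat) / n + K**2 / (4 * n**2))
    return mid - h, mid + h


def exact_coverage(interval, n, p):
    """Sum Pr(X = x) over every count x whose interval contains p."""
    total = 0.0
    for x in range(n + 1):
        low, high = interval(n, x)
        if low < p < high:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


for p in (0.05, 0.1, 0.3, 0.5):
    print(p, round(exact_coverage(wald, 30, p), 3),
          round(exact_coverage(wilson, 30, p), 3))
```

For $n = 30$ and $p = 0.1$, this calculation should give a Wald coverage of about 0.81 against about 0.97 for the Wilson Interval, at a nominal level of 0.95.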

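Nonetheless, many instructors might hesitate to present such a complicated formula in an elementary statistics course. A simpler alternative is the *Agresti-Coull Interval*:

\[\left(\tilde p - k\sqrt{\frac{\tilde p(1-\tilde p)}{n+k^2}}\,,\ \tilde p + k\sqrt{\frac{\tilde p(1-\tilde p)}{n+k^2}}\right),\]

where $\tilde p$ is again the midpoint of the Wilson Interval, given above.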

The Agresti-Coull Interval also has none of the above-stated defects of the Wald Interval. In particular, the coverage probability of the Agresti-Coull Interval is superior to that of the Wald Interval, although not as good as the Wilson Interval:

(From *Five Confidence Intervals For Proportions You Should Know About* – Dr Dennis Robert Aug 2020.)
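For comparison with the other two intervals, here is a minimal sketch of the Agresti-Coull Interval. It assumes the standard form from Agresti and Coull, with the Wilson midpoint as centre and the adjusted sample size $n + k^2$ in the standard error; the function name and sample values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def agresti_coull_interval(n, p_hat, conf=0.95):
    """Agresti-Coull interval; conf is a probability in (0, 1)."""
    k = NormalDist().inv_cdf(0.5 + conf / 2)
    # Centre: the midpoint of the Wilson Interval.
    p_mid = (p_hat + k**2 / (2 * n)) / (1 + k**2 / n)
    # Assumed standard form: adjusted sample size n + k^2 in the standard error.
    n_adj = n + k**2
    half_width = k * sqrt(p_mid * (1 - p_mid) / n_adj)
    return p_mid - half_width, p_mid + half_width


# Same sample as the Wald overshoot example: n = 30, p_hat = 0.9.
low, high = agresti_coull_interval(30, 0.9)
print(f"n=30, p_hat=0.9: ({low:.3f}, {high:.3f})")
```

The structure is the same as the Wald formula, with $\hat p$ and $n$ replaced by adjusted values, which is what makes it easier to present than the Wilson formula.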

**Conclusion**

Although VCAA has imposed statistical inference onto Mathematical Methods, the Wald Interval is the only confidence interval mentioned in the Study Design. Given the obvious defects of the Wald Interval and the ubiquity of CAS technology, it is bewildering that superior intervals, such as the Wilson Interval and the Agresti-Coull Interval, are not also considered.

**Appendix: Derivation of the Confidence Interval Formulae**

Here, we derive the confidence intervals for the population proportion $p$. The following two assumptions on the sample size $n$ are made:

1) The sample size is ‘small’ *relative to the population size*. Under this assumption, the distribution of the sample proportion $\widehat P$, which is a random variable, can be approximated by the binomial distribution (more precisely, the number of successes $n\widehat P$ is approximately binomial with parameters $n$ and $p$).

2) The sample size is ‘large’ enough (see below) that the Normal approximation to the Binomial distribution can be used:

\[\widehat P \sim \mathrm{N}\left(p,\ \frac{p(1-p)}{n}\right) \quad \text{(approximately)}.\]

**Note:** The standard conditions for the Normal distribution to be a good approximation to the Binomial distribution are $np \geq 5$ and $n(1-p) \geq 5$ (or, even better, $np \geq 10$ and $n(1-p) \geq 10$). It will not be known if these conditions are met, however, because the population proportion $p$ is not known.

In summary, the sample size is assumed to be simultaneously small enough that the binomial approximation can be used, and large enough so that the normal approximation to the binomial distribution can be used.

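Now, let

\[\Pr(-k < Z < k) = \frac{C}{100}\,,\]

where $Z$ is the standard normal random variable. Then, with the assumptions above, we can substitute

\[Z = \frac{\widehat P - p}{\sqrt{\frac{p(1-p)}{n}}}\,,\]

giving

\[\Pr\left(-k < \frac{\widehat P - p}{\sqrt{\frac{p(1-p)}{n}}} < k\right) \approx \frac{C}{100}\,.\]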

The idea now is to somehow ‘invert’ the inequalities

\[-k < \frac{\widehat P - p}{\sqrt{\frac{p(1-p)}{n}}} < k \qquad (1)\]

in order to ‘trap’ the population proportion $p$ between lower and upper values, $L$ and $U$:

\[\Pr(L < p < U) = \frac{C}{100}\,. \qquad (2)\]

(2) is not a standard probability statement, because the population proportion $p$ is not a random variable. Rather, (2) defines a *random interval*

\[(L,\ U)\,,\]

which contains the fixed but unknown population proportion $p$ with probability $\frac{C}{100}$.

The substitution into this random interval of an observed value $\hat p$ of $\widehat P$ (calculated from a sample) gives the $C\%$ confidence interval for $p$. This constitutes the *realisation* of this random interval. The differing methods of inverting (1), which underlie this realisation, result in differing confidence interval formulas.

**The Wald Interval**

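Approximate (the unknown) $p$ by $\widehat P$ inside the square root in (1):

\[-k < \frac{\widehat P - p}{\sqrt{\frac{\widehat P(1-\widehat P)}{n}}} < k\,.\]

**Note:** This is *not* a realisation of a random interval. It is an approximation that is used solely to avoid the cumbersome algebra in solving exactly for $p$, and is unnecessary when CAS technology is so ubiquitous.

Solving the inequalities for $p$ gives the random interval

\[\left(\widehat P - k\sqrt{\frac{\widehat P(1-\widehat P)}{n}}\,,\ \widehat P + k\sqrt{\frac{\widehat P(1-\widehat P)}{n}}\right).\]

The realisation of this random interval, by substituting the observed value $\hat p$ for $\widehat P$, gives the Wald or ‘Standard’ Interval.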

**The Wilson Interval**

To realise the Wilson Interval, we exactly invert the inequalities (1) for $p$:

\[\begin{aligned}
& && -k<\frac{\widehat P - p}{\sqrt{\frac{p(1-p)}{n}}}<k\\[3\jot]
&\Longleftrightarrow && -k\sqrt{\frac{p(1-p)}{n}}< \widehat P - p <k\sqrt{\frac{p(1-p)}{n}}\\[3\jot]
&\Longleftrightarrow && \left(\widehat P - p\right)^2 <k^2\,\frac{p(1-p)}{n}\\[3\jot]
&\Longleftrightarrow && {\widehat P}^2 -2p\widehat P + p^2 < \frac{k^2}{n}p - \frac{k^2}{n}p^2\\[3\jot]
&\Longleftrightarrow && \left(1+ \frac{k^2}{n}\right)p^2 - \left(2\widehat P+\frac{k^2}{n}\right)p + {\widehat P}^2 < 0\,.
\end{aligned}\]

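This is a standard quadratic inequality for $p$, with an interval of solutions. The endpoints of the interval are obtained by solving the corresponding quadratic equation for $p$:

\[p = \frac{\left(2\widehat P+\frac{k^2}{n}\right) \pm \sqrt{\left(2\widehat P+\frac{k^2}{n}\right)^2 - 4\left(1+\frac{k^2}{n}\right){\widehat P}^2}}{2\left(1+\frac{k^2}{n}\right)}\,.\]

Expanding, cancelling and factorising, the quantity within the root simplifies to $4k^2\left(\frac{\widehat P(1-\widehat P)}{n} + \frac{k^2}{4n^2}\right)$. Then, taking the $2k$ out of the root, and dividing top and bottom by $2$, we obtain the endpoints

\[p = \frac{\widehat P + \frac{k^2}{2n} \pm k\sqrt{\frac{\widehat P(1-\widehat P)}{n} + \frac{k^2}{4n^2}}}{1+\frac{k^2}{n}}\,.\]

The realisation of this random interval, by substituting the observed value $\hat p$ for $\widehat P$, produces the Wilson Interval, as given above.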
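The algebra can be checked numerically: the roots of the quadratic in $p$ should agree with the closed-form endpoints. The sketch below assumes the quadratic $\left(1+\frac{k^2}{n}\right)p^2 - \left(2\hat p+\frac{k^2}{n}\right)p + \hat p^2 = 0$ from the derivation and the usual closed form for the Wilson endpoints; the variable names and the sample values $n = 30$, $\hat p = 0.9$ are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

n, p_hat = 30, 0.9
k = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

# Coefficients of (1 + k^2/n) p^2 - (2 p_hat + k^2/n) p + p_hat^2 = 0.
a = 1 + k**2 / n
b = -(2 * p_hat + k**2 / n)
c = p_hat**2

# Endpoints from the quadratic formula.
root = sqrt(b**2 - 4 * a * c)
quadratic_endpoints = ((-b - root) / (2 * a), (-b + root) / (2 * a))

# Endpoints from the simplified closed form (midpoint plus or minus half-width).
midpoint = (p_hat + k**2 / (2 * n)) / a
half_width = (k / a) * sqrt(p_hat * (1 - p_hat) / n + k**2 / (4 * n**2))
closed_form_endpoints = (midpoint - half_width, midpoint + half_width)

print(quadratic_endpoints)
print(closed_form_endpoints)
```

The two printed pairs should agree up to floating-point rounding, which is the same check a CAS performs symbolically when it solves the inequality exactly.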

**The Agresti-Coull Interval**

Substitute the midpoint of the Wilson Interval into (1) before inverting. The *realisation* of the resulting random interval produces the Agresti-Coull Interval, as given above.

Thank you for this post, JF.

“The interval produced for a particular sample, using a procedure with coverage probability $\frac{C}{100}$, is said to have a confidence level of $C\%$.” – To my knowledge, it is standard to take confidence levels to lie between 0 and 1, not 0 and 100. Perhaps one should also clarify in this context that the confidence level of a procedure for a confidence interval is always supposed to be given a priori; as is implied by the more complete term “nominal confidence level” used elsewhere in the text.

“(2) is not a standard probability statement, because the population proportion is not a random variable.” – I find this statement a little confusing. Why should a statement of the sort $\Pr(L < p < U)$, with random $L$ and $U$ (which is the situation we are in here, as the subsequent sentence makes clear), not be "standard" in probability?

One should be aware – even if this may not be suitable for a secondary-level classroom situation – that a big part of the problem that motivates those interval corrections comes not from the different ways of inverting an equation, but from the (non-)closeness of the standardized/normalized sums obtained from the binomial distribution to the normal. It is just more unstable than one may think when used to docile Galton boards.

The post uses sometimes $\widehat P$ and sometimes $\hat p$. Perhaps this is an issue introduced in transcription onto this page. I tend to slightly favor the lower-case version, if only because this has become standard.

A final note: I once encountered the topic of this post (as a novice!) when, together with others, I was contemplating a revised textbook for introductory statistics teaching at university (in Australia). The corrected intervals found their way into the revision. I recall clearly that the author told me personally that time that they included the "more complicated" (non-Wald) interval so that students who use whatever statistical software, could make sense of why the results from those may differ from what the Wald procedure gives. I found this argument convincing. (We did not adopt the revision though, for other reasons.)

Thanks for your comments, Christian. I was hoping this blog might stimulate some discussion. You probably know a lot more stats than I do. Nevertheless I will try to address the points you’ve raised and am happy to be corrected in the process.

1) From my reading, confidence levels are typically referred to as 95% CI, 99% CI etc. So a 95% CI corresponds to a coverage probability of 95/100. And when the 95% is nominal (and my understanding of this is that it is – roughly – a claim that 95 out of 100 intervals constructed from 100 different samples will contain the parameter), for the Wald interval the coverage probability is often less than 95%.

2) My understanding is that, at the introductory level, a ‘standard’ probability statement has the form Pr(a < X < b). As opposed to the probability of a parameter being trapped between two random variables. Perhaps I should have chosen a different phrasing to indicate that it's not the sort of probability statement that one might meet in, for example, secondary school mathematics.

3) I agree that interval corrections mainly arise from "the (non-)closeness of the standardized/normalized sums obtained from the binomial distribution to the normal". With the Wald Interval you are using an approximation of an approximation of an approximation. This is certainly not made clear in the VCE Study Design or textbooks, where one is led to believe that the interval is infallible.

4) P-hat is a random variable, p-hat is a number. I'm not sure what favouring p-hat means … What happens to the random variable P-hat?

But … the main point of this particular blog is that there are better confidence intervals than the standard Wald interval. If a syllabus such as that of VCE Mathematical Methods is going to include/impose confidence intervals, then I think it should either include/mention the better intervals or, at the very least, mention the shortcomings of the Wald interval. The Maths Methods syllabus could easily do this but it doesn't. I wonder if the people who wrote the syllabus are even aware of any of this.

Some questions: Have Maths Methods teachers come across questions where the, say, 95% CI for a proportion has an endpoint either less than 0 or greater than 1? Are teachers aware this can happen? Have teachers had students ask about this? Have teachers contemplated what implications a sample proportion p-hat = 0 or 1 has for the Wald Interval? Or had students ask?

Statistical inference is a much more complicated business than the VCE Study Design would have students and teachers believe. I think it's ridiculous to include statistical inference in VCE mathematics and again I wonder whether VCAA has the faintest idea how totally dumbed down and appalling its syllabus is.

And I didn't even mention the Clopper-Pearson interval, which is certainly accessible within the scope of the course and using a CAS.

Thank you for your reply, John. Your intent of the post was clear and my focus on those relatively minor points was perhaps rather tangential; yet I hope that I was staying on the right side of pedantry! My final paragraph in my last post will have at least given “some” hint of how the discussion about those corrected intervals made its way into statistical education in large classes (such as the one I was teaching back then). While I am not familiar with what happened on that front in Australia in the past decade, I am sure that what you write is at least a pretty good approximation to the (lamentable) truth.

Quickly on your responses to the points:

1) I think that 95% means 0.95. Once one accepts this, our issue seems to disappear.

2) I understand that you used “standard” probability in a loose way, without a definition of “standard”. There is of course nothing wrong with that. (Slightly off-topic, in this sense, even a probability such as $\Pr(X < Y)$, which is definitely useful, should probably be termed "non-standard" – and requires the consideration of bivariate distributions.)

4) I seem to have seen $\widehat P$ very rarely in print, if at all. It is perhaps because of (i) tradition (it is different with $\overline X$ and $\overline x$ in the case of continuous random variables $X$) and (ii) the fact that the sloppiness in using the same notation for a random confidence interval and its realisation, with fixed numbers as its limits, is not bothering people too much.

Thanks, Christian.

1) I’ve rarely seen, say, a 95% CI referred to as a 0.95 CI … Even Wikipedia (admittedly not always totally reliable) talks about 95% CI etc. Maybe it’s a ‘generational’ thing …? But I would agree it’s a less ‘misleading’ name.

2) Re: “$\Pr(X < Y)$.” Yes, I must admit that these statements do occur in VCE, but they quickly (and correctly) get converted into statements such as $\Pr(X - Y < 0)$. What's not seen in VCE is the probability that a parameter is trapped between two random variables. I think this is a pity because in the context of confidence intervals it shows how the CI should be interpreted (via where the interval actually comes from). As I remarked earlier, the inclusion of statistical inference into Mathematical Methods, as set out in the Study Design, is diabolical and does more harm than good if the goal is for students to have an understanding of the concept. Such things should be learnt in a specialised statistics subject.

4) Re: "(ii) the sloppiness in using the same notation for a random confidence interval and its realisation, with fixed numbers as its limits, is not bothering people too much."

This bothered/confused me for a long time. It was only when I first read the textbook Into Statistics by Peter J Smith (an Australian textbook, Smith was a lecturer at RMIT) that I saw the clear distinction made and the idea of the interval. Things I'd read in other textbooks made sense after this because I could see what was missing.

Thanks, John.

Regarding 1), I tend to think that this may be a difference between a theoretical setup, which ultimately draws on probability – and probabilists tend to prefer to give probabilities as numbers between 0 and 1, not percentages – and practical language. I agree that a statement like “0.95 CI” would sound odd. It would be much less odd to say “CI [or confidence region in higher dimensions] at level p = 0.95” in a numerical study in some statistics paper. Others may disagree with me here.

Regarding 2), perhaps we have found another example of why a good grounding in probability does help with statistics – an issue that Marty highlighted on some occasion (or several) in this blog. Unless one has that (or deals with independent normally distributed random variables, say), the gain obtained by writing $\Pr(X < Y)$ as $\Pr(X - Y < 0)$ is IMHO negligible. (An idea for a post by Marty?)

Sorry that the overbars in $\overline X$ and $\overline x$ didn't render correctly in my previous post.

Thanks for your indulgence in all this side-tracking. I hope some core issues of John’s post will be discussed by others.

Hi Christian.

Not side-tracking at all!

Re: “It would be much less odd to say “CI [or confidence region in higher dimensions] at level p = 0.95” in a numerical study in some statistics paper.”

Can you give an example of where you’ve seen this language used (I’m not disagreeing with you, I’ve just never seen it said like this anywhere).

Re: “the gain obtained by writing $\Pr(X < Y)$ as $\Pr(X - Y < 0)$ is IMHO negligible."

Yes that will often be so in undergraduate courses. But in VCE (Specialist Maths), there is significant gain. X and Y will typically be independent normal random variables, in which case X – Y is normal with a readily calculated mean and standard deviation. Then evaluating $\Pr(X < Y)$ as $\Pr(X - Y < 0)$ is a standard calculation. Although the calculation will be done using a CAS, the background behind the calculation is clear. (Unlike many questions where buttons get pressed with little understanding as to why). Of course, in undergraduate courses X and Y will often NOT be independent and often not normal, in which case I agree that there's no advantage. But that's due to other techniques – not learnt in VCE – being able to be used. Which of course is yet another argument for why this stuff should be taught as a separate subject rather than an ad hoc add-on to Methods and Specialist. When it comes to these sorts of things, what's taught in both those subjects is NOT mathematics. Statistical inference is completely misrepresented in VCE mathematics.

Hi John,

here is an example of a, well, approximate usage of “confidence level given as a number in the interval [0,1]” as you requested; see Figure 2 in that paper. (It uses the erroneous spelling “significant level” instead of “significance level”; the duality, or equivalence, of testing and confidence intervals/regions is used and, I hope, not confusing. At least, with “confidence” being mentioned in the figure caption, it does seem to me that we are within the parlance that we discussed here.)

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3852254/figure/pone-0081179-g002/

I agree with what you say about probabilities such as $\Pr(X < Y)$. Life in the "independence" world is much easier and going beyond it in secondary-level mathematics is, for the very most of the students, probably a Pandora's box. I guess some of the characters severely criticized in this blog would say that "convolution" (which is what we really have here, at least if we think of the replacement of Y by (-Y) as minor) cannot be marketed due to its negative connotations… leading them to do the right thing for the wrong reasons.

PS: A clarification to my post: convolution is of course what’s behind the “independent” case, and the “non-independent” case is often far worse (unless we are in the normally-distributed vectors world) or even intractable.

Thanks for this example, Christian.

I – kind of – understand “the confidence region is constructed by setting the significant [sic] level for each test at …”

I’ve never seen this language used when it’s referring to a confidence interval. What would we say for confidence intervals? Perhaps

The confidence interval is constructed by setting the significance level for the test at 0.05.

But problems with this would include:

i) What test? A test is not used to construct a confidence interval. A sample is collected, a proportion calculated and substitution into a formula occurs. (I’d have to read more of the article to get a sense of the “tests” it’s using and how this relates back to the region).

ii) It seems a complicated way of simply saying 95% confidence interval.

iii) Is the 95% Wald interval significant at the 0.05 level of significance? In reality, I’m not sure it always is …

Maybe

The confidence interval is constructed at the 0.05 level of significance. As I said, a complicated way of saying 95% confidence interval …

Anyway, I suppose the semantics is small beer when it’s all said and done. The point is that statistical inference in VCE mathematics is grossly misrepresented. I have no confidence in it or VCAA.

The only meaning convolution can have when juxtaposed with VCAA is complicated and muddled. VCAA is expert at convolution.

Thank you, John. Sorry for my delay in replying.

From a mathematical point of view, a confidence interval (at least a bounded, i.e. two-sided one) is just a confidence region in one dimension. Therefore it would seem to me that “in principle”, any convention in higher dimensions should also be OK for confidence intervals (the latter being of course the only situation where notions that invoke the ordering of the real line, i.e. “left/right”, “smaller/larger”, make sense – though there have been attempts to get around that problem). But it may be well possible that the dimension (1 or higher) does make a difference in the language used by practitioners – you may well be right here!

On your i): “A test is not used to construct a confidence interval.”

Here is a link for the other way round:

And one can also get a confidence interval from a two-tailed test:

lecture2.pdf

See page 9, paragraph 2.

Therefore I would dispute your above statement from a mathematical point of view; it is, however, true that tests and confidence intervals do have two different purposes (and sometimes it makes sense to perform a test as well as to construct a CI).

In fact, the foregoing discussion is why the figure caption of my previous reply is mixing the “statistical testing” and “confidence region” language as it did, and legally so.

Thanks, Christian.

There’s a difference between a rejection interval (or maybe I should say an acceptance interval) and a confidence interval. I’m not completely sure from the link (but I haven’t read thoroughly), but I think the link might be dealing with a rejection interval rather than a confidence interval …?

Hi John,

the term “rejection interval” is not familiar to me (this may be my fault); but I believe “rejection region” is a better term anyway because, for two-sided statistical tests, that region is the union of *two* (semi-infinite) intervals, rather than just one. (I only think of one-dimensional situations here, as in secondary school.) And that region is precisely the complement of a confidence interval – that is what I believe motivated your first parenthetical remark. While CI’s and statistical tests are certainly two different things, the point of the references I sent (or others which I didn’t find – I think that many introductory stats texts mention this) is that one can be constructed from the other.

Another, and somewhat tangential, point: it is important to note that one never “accepts” the null value / null hypothesis in a statistical test; one merely fails to reject it. (Sorry for this reminder if that had already been clear; but I am writing this also for third parties.)

Hi Christian.

I’ve attached an example given to me that I use in Specialist Maths (hence the context is population mean rather than proportion). I think this might clarify my earlier comment. I think the example you attached earlier might be using critical values to define their region. An interval/region calculated from critical values is not the same as a confidence interval/region.

Hypothesis testing

Hi John,

I fear that my formulation “And that region is precisely the complement of a confidence interval …” was confusing. Sorry. I definitely agree that a confidence interval requires more than just taking the complement of a rejection region. A big part of the problem at hand, at least in my opinion, comes from the question which scale you are on – that of the original data, or that of the test statistic. With this in mind, I think that the “Note” on page 3 of your file does support my point that one can “construct” one of CI/test from the other, although this is a bit hidden in the concrete computation.

The difference between what your file calls “Option 1” and “Option 2” is likewise in the scales. The advantage of “Option 2” over “Option 1” is that $Z$ always has the same distribution under the null hypothesis, regardless of what the value of the parameter under that hypothesis, that is the null value $\mu_0$, is (and given that the denominator also gets rid of the dependency on scale, that is the variance). This is visible in the two function graphs shown in your notes: the student would have to graph the first one again if the null value was different from 13.7, while for the second one, this is not so; or there is just one table in a book to look up, or CAS-computed value to get. (In fact, the very use of the symbol $Z$ has something to do with the fact that the normal distribution with mean 0 and variance 1 is called the “z distribution” by many applied scientists.)

The options for the $p$-value approach are similar to the above paragraph. I know that applied scientists love $p$-values, and thus they must be taught. I personally have a little less fondness for this concept, although it is not exactly a fifth wheel on a car, perhaps.

A final comment: Notation is not ideal in my view. It could suggest that is even a random variable (reading the vertical bar as a conditional probability symbol – the student knows only what , with subscript, is!), and we are of course not in a Bayesian setting. This, I think, makes it sensible to use instead a formulation such as, "Under ", as many papers do.

Sorry, “Anonymous” is the same commentator as before: Christian R

Fixed.

See attached for an approach to choosing the level of confidence .

2014-StatsinMed

Thank you, Terry. Most interesting.