Comparing Dispersions
Phillip Good
205 W. Utica Ave.
Huntington Beach CA 92648
Permutation methods provide both exact and more powerful tests for comparing dispersions from two or many populations. The resultant tests can be extended to obtain a more powerful test for treatment effects when non-responders are present.
1. MOTIVATION
Precision is essential in a manufacturing
process. Items that are too far out of tolerance must be discarded and an
entire production line brought to a halt if too many items exceed (or fall
below) designated specifications. With some testing equipment, such as that
used in hospitals, precision can be more important than accuracy. For accuracy
(closeness to the correct value) can always be achieved through the use of
standards with known values, while a lack of precision may render an
entire sequence of tests invalid. Thus
tests for consistency in dispersion are essential.
2. THE ASYMPTOTIC APPROACH
There is no shortage of methods to test
the hypothesis that two samples come from populations with the same inherent
variability. Sukhatme [1958] lists four alternative approaches and adds a fifth
of his own; Miller [1968] lists ten alternatives and compares four of these
with a new test of his own; Conover, Johnson and Johnson [1981] list and
compare 56 tests; and Balakrishnan and Ma [1990] list and compare nine tests
with one of their own.
None
of these tests can be relied on. Many promise an error rate or significance
level of 5% but in reality make errors as frequently as 8% to 20% of the
time. For example, with the test
proposed by Miller [1968], a 10% test with samples of size 12 and 8 taken from
normal populations yielded Type I errors 14% of the time.
This
is because even the so-called non-parametric methods make use of asymptotic
parametric approximations to derive their cut-off values. Consequently, the claimed power for small
and mid-sized samples is excessive as it is based on cut-off values that are
accurate only for very large samples.
To reassess the value of these methods, our own simulations used a
two-stage procedure:
· First, the necessary cut-off values were determined by simulation under the null hypothesis.
· Then, these values were employed in subsequent power calculations under the alternative(s) of interest.
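The two-stage procedure can be sketched as follows. This is only an illustration under stated assumptions, not the simulation code used for the results reported here: the ratio of sample variances stands in for whichever test statistic is being assessed, both populations are taken to be normal, and the function names, seed, and replication counts are hypothetical.

```python
import random
import statistics


def variance_ratio(x, y):
    """Ratio of sample variances, second sample over first (a stand-in statistic)."""
    return statistics.variance(y) / statistics.variance(x)


def draw(rng, n, sd):
    """Draw n independent normal observations with mean 0 and the given sd."""
    return [rng.gauss(0.0, sd) for _ in range(n)]


def simulated_cutoff(rng, n1, n2, alpha, reps):
    """Stage 1: estimate the upper-alpha cut-off under the null (equal sd)."""
    null_stats = sorted(
        variance_ratio(draw(rng, n1, 1.0), draw(rng, n2, 1.0)) for _ in range(reps)
    )
    return null_stats[int((1.0 - alpha) * reps)]


def rejection_rate(rng, n1, n2, sd2, cutoff, reps):
    """Stage 2: fraction of simulated data sets whose statistic exceeds the cut-off."""
    hits = sum(
        variance_ratio(draw(rng, n1, 1.0), draw(rng, n2, sd2)) > cutoff
        for _ in range(reps)
    )
    return hits / reps


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
cutoff = simulated_cutoff(rng, n1=12, n2=8, alpha=0.10, reps=4000)
size = rejection_rate(rng, 12, 8, sd2=1.0, cutoff=cutoff, reps=4000)   # null
power = rejection_rate(rng, 12, 8, sd2=2.0, cutoff=cutoff, reps=4000)  # alternative
print(f"cut-off {cutoff:.2f}, size {size:.3f}, power {power:.3f}")
```

Because the cut-off is calibrated by simulation rather than taken from an asymptotic table, the estimated size should sit close to the nominal 10%, and the power figure is then comparable across statistics.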
None of the methods cited above proved at all powerful when the correct significance
levels were employed. Indeed, the
actual power was only 70% to 85% of the claimed power when testing against
normal and double exponential distributions, for sample sizes of 6 to 12
and a claimed power of 80% or less. In other words, the power claimed for these
tests had been obtained at the expense of Type I errors in excess of that
specified and desired.
All
the tests including those proposed here require that the observations be
independent or, at least, exchangeable. The F-ratio test (Fisher, 1925) is
exact only if the observations come from a normal distribution, and unlike the
t-test, is very sensitive to deviations from normality.
The other previously proposed tests have more restrictions. Each requires that two or more of the
following four conditions be satisfied:
1. The observations be normally distributed.
2. The location parameters of the two distributions be the same or differ by a known quantity (Duran, 1976).
3. The two samples be equal in size.
4. The samples be large enough that asymptotic approximations to the distribution of the test statistic are valid.
As
an example, the first published solution to this classic testing problem is the
z-test proposed by Welch [1937] based on the ratio of the two sample variances. If the observations are normally
distributed, this ratio has the F-distribution, and the test whose critical
values are determined by the F-distribution is uniformly most powerful among
all unbiased tests (Lehmann, 1986, section 5.3). But with even small deviations from normality, significance
levels based on the F-distribution are grossly in error (Lehmann, 1986, page
207); the magnitude of the error depends on the fourth moment of
the distribution from which the samples are drawn.
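The role of the fourth moment can be made explicit with the standard expression for the sampling variance of a sample variance. The symbols below are introduced here for illustration and do not appear in the original text: $S^2$ is the variance of a sample of $n$ independent observations, $\sigma^2$ the population variance, and $\mu_4$ the fourth central moment.

```latex
\operatorname{Var}(S^2) = \frac{1}{n}\left(\mu_4 - \frac{n-3}{n-1}\,\sigma^4\right)
```

For normal data $\mu_4 = 3\sigma^4$ and the expression reduces to $2\sigma^4/(n-1)$, the value implicit in the F-distribution. A heavier-tailed population has a larger $\mu_4$, so the variance ratio is more variable than the F-distribution allows and the nominal significance level is understated.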
Box
and Anderson [1955] propose a correction to the F-distribution for
"almost" normal data, based on an asymptotic approximation to the
permutation distribution of the F-ratio.
Not surprisingly, their approximation is close to correct only for
normally distributed data or for very large samples. The Box-Anderson statistic results in a Type I error rate of
21%, about twice the claimed and desired value of 10%, when two samples of size 15
are drawn from a gamma distribution with four degrees of freedom.
3. THE PERMUTATION APPROACH
At first glance, the permutation test for
comparing the variances of two populations would appear to be an immediate
extension of the test used for comparing location parameters, the distinction
being that we make use of the squares of the observations rather than the
observations themselves. But these
squares are actually the sum of two components, one of which depends upon the
unknown variance, the other upon the unknown location parameter. That is,
$$E[X^2] = E[(X-\mu+\mu)^2] = E[(X-\mu)^2] + 2\mu E[X-\mu] + \mu^2 = \sigma^2 + 0 + \mu^2.$$
A permutation test based upon
the squares of the observations is appropriate only if the location parameters
of the two populations are known or are known to be equal (Hayes, 1997).
We
cannot eliminate the effects of the location parameters by working with the
deviations about each sample mean as these deviations are interdependent
(Maritz, 1981).
3.1. Aly’s Test Statistic
Good [2000] proposed a test based on the
permutation distribution of the statistic described by Aly [1990],
$$S_A = \sum_{i=1}^{m-1} i\,(m-i)\,\bigl(X_{(i+1)} - X_{(i)}\bigr),$$
where $X_{(1)} < X_{(2)} < \dots < X_{(m)}$
are the order statistics of the first sample. That is,
$X_{(1)}$ is the smallest of the observations in the first sample
(the minimum), $X_{(2)}$ is the second smallest and so forth, up
to $X_{(m)}$, the maximum.
As $S_A$ puts its greatest weight on differences in the center of the
distribution, the effect of outliers is minimized.
Aly [1990] makes use of a far-from-exact asymptotic parametric approximation to the
distribution of $S_A$. The test based upon the
permutation distribution is exact and is unbiased when testing within the
family of distributions which differ only in their location and scale
parameters, $\mathcal{F} = \{F[(x-\mu)/\sigma]\}$.
To illustrate the application of Aly’s
statistic, suppose the first sample consists of the measurements 121, 123, 126,
128.5, 129 and the second sample of the measurements 153, 154, 155, 156, 158. Then $X_{(1)} = 121$,
$X_{(2)} = 123$ and so forth.
Set $\{z_{1i}\}$ equal to the differences between successive
order statistics in the first sample, $z_{1i} = X_{(i+1)} - X_{(i)}$ for $i = 1, \dots, 4$. In this instance, $z_{11} = 123 - 121 = 2$, $z_{12} = 3$, $z_{13} = 2.5$, $z_{14} = 0.5$.
With five observations the four differences receive the weights 4, 6, 6 and 4, so the
original value of Aly’s test statistic is 8 + 18 + 15 + 2 = 43. To compute this test statistic for other
arrangements of the labels on the observations, we also need to know the differences
$z_{2i} = Y_{(i+1)} - Y_{(i)}$ for the second sample: $z_{21} = 1$, $z_{22} = 1$, $z_{23} = 1$, $z_{24} = 2$.
Only certain exchanges are possible.
Rearrangements are formed by first choosing either $z_{11}$ or $z_{21}$,
next either $z_{12}$ or $z_{22}$, and so forth until
we have a set of four differences.
One possible rearrangement is {2, 1, 1, 2}, which
yields a value of $S_A = 28$. There are $2^4 = 16$
rearrangements in all, of which only two, {2, 3, 2.5, 2} with $S_A = 49$ and {1, 3, 2.5, 2} with $S_A = 45$, yield a more extreme
value of the test statistic than our original observations. With three out of 16
rearrangements yielding values of the statistic as or more extreme than the
original, we should accept the null hypothesis. (Better still, given the
limited number of possible rearrangements, we should gather more data before we make a decision.)
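The enumeration in this example is small enough to check by machine. The sketch below assumes Aly's statistic weights the i-th of the m - 1 differences by i(m - i), which matches the products 8, 18, 15 and 2 in the example; the function names are hypothetical.

```python
from itertools import product


def aly_statistic(gaps):
    """Weighted sum of gaps between successive order statistics.

    With m observations there are m - 1 gaps; gap i (1-based) gets weight i * (m - i).
    """
    m = len(gaps) + 1
    return sum(i * (m - i) * g for i, g in enumerate(gaps, start=1))


def gaps_of(sample):
    """Differences between successive order statistics of a sample."""
    ordered = sorted(sample)
    return [b - a for a, b in zip(ordered, ordered[1:])]


first = [121, 123, 126, 128.5, 129]
second = [153, 154, 155, 156, 158]
z1, z2 = gaps_of(first), gaps_of(second)

observed = aly_statistic(z1)
# Each rearrangement picks, position by position, the gap from one sample or the other.
values = [aly_statistic(choice) for choice in product(*zip(z1, z2))]
as_extreme = sum(v >= observed for v in values)
print(observed, len(values), as_extreme / len(values))
```

Listing `sorted(values, reverse=True)` shows every rearrangement that is as or more extreme than the original labelling, which is a useful check on a hand count.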
But the test we’ve described is restricted to two equal-sized
samples and, with missing data nearly inevitable, may not always be applicable.
If our second sample is larger than the
first, we may still resample in two
stages. First, we select a subset of m
values $\{Y^*_i, i = 1, \dots, m\}$ without replacement from the n
observations in the second sample, and compute the order statistics $Y^*_{(1)} < Y^*_{(2)} < \dots < Y^*_{(m)}$ and their differences $\{z^*_{2i}\}$. Next, we
examine all possible values of Aly's measure of dispersion for permutations of the combined sample $\{\{z_{1i}\}, \{z^*_{2i}\}\}$ and compare Aly's measure for
the original observations with this distribution. Repeating the two steps for several hundred
random subsets, we obtain a bootstrap confidence interval for the p-value.
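A minimal sketch of this two-stage resampling follows. Several details are assumptions rather than part of the method as stated: the interval is taken as a simple percentile interval of the subsampled p-values, the extra observations in the second sample are invented for illustration, and the weights i(m - i) and all function names are hypothetical.

```python
import random
from itertools import product


def aly_statistic(gaps):
    """Weighted sum of gaps; gap i of m - 1 gets weight i * (m - i) (assumed form)."""
    m = len(gaps) + 1
    return sum(i * (m - i) * g for i, g in enumerate(gaps, start=1))


def gaps_of(sample):
    """Differences between successive order statistics of a sample."""
    ordered = sorted(sample)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def aly_p_value(first, second):
    """Exact one-sided permutation p-value for two equal-sized samples."""
    z1, z2 = gaps_of(first), gaps_of(second)
    observed = aly_statistic(z1)
    values = [aly_statistic(c) for c in product(*zip(z1, z2))]
    return sum(v >= observed for v in values) / len(values)


def subsampled_p_values(first, second, reps, rng):
    """Stage 1 draws m of the n second-sample values without replacement;
    stage 2 computes the permutation p-value for that subset."""
    m = len(first)
    return sorted(aly_p_value(first, rng.sample(second, m)) for _ in range(reps))


rng = random.Random(7)  # arbitrary seed
first = [121, 123, 126, 128.5, 129]
second = [153, 154, 155, 156, 158, 159.5, 161]  # hypothetical larger sample
reps = 200
p_values = subsampled_p_values(first, second, reps, rng)
lower = p_values[reps // 20]             # roughly the 5th percentile
upper = p_values[reps - reps // 20 - 1]  # roughly the 95th percentile
print(f"90% percentile interval for the p-value: ({lower:.3f}, {upper:.3f})")
```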
3.2. Deviations About The Median
Good [1994] proposed a permutation test
based on the sum of the absolute values of the deviations. First, we compute the median for each
sample; next, we replace each of the remaining observations by the absolute value of its
deviation about its sample median; last, in contrast to the test proposed by
Brown and Forsythe [1974], we discard the redundant linearly-dependent value
from each sample.
Suppose the first sample contains the observations $x_{11}, \dots, x_{1n_1}$
whose median is $M_1$; we begin by forming the deviates $x'_{1j} = |x_{1j} - M_1|$ for $j = 1, \dots, n_1$. Similarly, we
form the set of deviates $\{x'_{2j}\}$ using the observations in the
second sample and their median.
If
there are an odd number of observations in the sample, then one of these deviates
must be zero. We can't get any information out of a zero, so we throw it away.
In the event of ties, should there be more than one zero, we still throw only
one away. If there is an even number of observations in
the sample, then two of these deviates (the two smallest ones) must be equal.
We can't get any information out of the second one that we didn't already get
from the first, so we throw it away.
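Both cases reduce to the same operation, dropping the single smallest deviate, as the following sketch shows. The function name is hypothetical and the even-sized sample is invented for illustration.

```python
import statistics


def median_deviates(sample):
    """Absolute deviations about the sample median, with one redundant value dropped.

    Odd n: the median is itself an observation, so one deviate is zero.
    Even n: the two middle observations are equidistant from the median,
    so the two smallest deviates are equal.
    In both cases dropping the single smallest deviate removes the redundancy.
    """
    centre = statistics.median(sample)
    deviates = sorted(abs(x - centre) for x in sample)
    return deviates[1:]


print(median_deviates([153, 154, 155, 156, 158]))  # odd sample size
print(median_deviates([153, 154, 156, 158]))       # even sample size (hypothetical)
```

When ties produce more than one zero, only one is removed, as the text requires.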
Our new test statistic $S_G$ is the sum of the remaining $n_1 - 1$
deviations in the first sample, that is, $S_G = \sum_{j=1}^{n_1-1} x'_{1j}$.
We
obtain the permutation distribution for SG and the cut-off point for
the test by considering all possible rearrangements of the remaining deviations
between the first and second samples.
To illustrate the application of this method, suppose the first sample consists of
the measurements 121, 123, 126, 128.5,
129.1 and the second sample of the measurements 153, 154, 155, 156, 158. Thus,
after eliminating the zero value, $x'_{11} = 5$, $x'_{12} = 3$, $x'_{13} = 2.5$,
$x'_{14} = 3.1$, and $S_G = 13.6$.
For the second sample, $x'_{21} = 2$, $x'_{22} = 1$, $x'_{23} = 1$,
$x'_{24} = 3$.
There are $\binom{8}{4} = 70$
arrangements in all, of which only three yield
values of the test statistic as or more extreme than our original value. As 3/70 = 0.043,
we conclude that the difference between the dispersions of the two
manufacturing processes is statistically significant at the 5% level.
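The count of three arrangements out of 70 can be reproduced by enumerating every way of assigning four of the eight pooled deviations to the first sample. The sketch below is illustrative; the function name and the rounding tolerance are choices made here, not part of the method.

```python
import statistics
from itertools import combinations


def median_deviates(sample):
    """Absolute deviations about the median, smallest (redundant) one dropped."""
    centre = statistics.median(sample)
    return sorted(abs(x - centre) for x in sample)[1:]


first = [121, 123, 126, 128.5, 129.1]
second = [153, 154, 155, 156, 158]
d1, d2 = median_deviates(first), median_deviates(second)

observed = sum(d1)
pooled = d1 + d2
# Every way of relabelling len(d1) of the pooled deviates as "first sample".
sums = [sum(subset) for subset in combinations(pooled, len(d1))]
tolerance = 1e-9  # guards against floating-point rounding in the sums
as_extreme = sum(s >= observed - tolerance for s in sums)
print(len(sums), as_extreme, round(as_extreme / len(sums), 3))
```

Two of the three arrangements tie with the original value because the deviation 3 occurs once in each sample, so swapping them leaves the sum unchanged.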
As there is still a weak dependency among the remaining deviates within each
sample, they are only asymptotically exchangeable. According to Baker [1995], tests based on $S_G$ are alternately conservative and
liberal, in part because of the discrete nature of the
permutation distribution, unless
a. The ratio of the sample sizes n, m is close to 1;
b. The only other difference between the two populations from which the samples are drawn
is that they might have different means, that is, F2[x] = F1[(x-δ)/σ].
We
were unable to confirm her results in our own simulations (the R code for which
may be obtained from the author at pigood@verizon.net). We found tests based on SG to be
uniformly conservative for samples of sizes 4 and above. Our simulations employed either
normally-distributed data or mixed normal data. To avoid randomizing on the
boundary in our simulations, we set the alpha level in each instance to
correspond to one of the discrete levels available for the permutation
distribution. Chernick and Liu [2002] describe the necessity of such a procedure.
3.3. Paired Deviations
When we have the same number of
observations in each sample, an alternate method of rearrangement suggests
itself. Suppose we pair the deviations
according to their magnitude within each sample, form a rearrangement by
selecting one member of each pair, and again compute the sum.
For
example, using the measurement data, we would form the pairs (5,3), (3.1,2),
(3,1) and (2.5,1). Now there are just
16 possible rearrangements with the sum of the observations as originally
labeled being the largest possible value.
We obtain a p-value of 1/16 = 0.0625, not because this
method is less powerful than the preceding one, but because we have severely
restricted the number of possible
rearrangements. With two samples of size n, there are only
n-1 pairs and $2^{n-1}$ possible rearrangements.
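The paired scheme is easy to enumerate directly. The sketch below starts from the deviations about the sample medians in the measurement example, pairs them by rank, and forms every rearrangement by keeping one member of each pair; the variable names and the rounding tolerance are choices made here.

```python
from itertools import product

# Deviations about the sample medians from the measurement example,
# after the redundant value has been dropped from each sample.
first = [5, 3, 2.5, 3.1]
second = [2, 1, 1, 3]

# Pair the deviations by rank within each sample: largest with largest, and so on.
pairs = list(zip(sorted(first, reverse=True), sorted(second, reverse=True)))

observed = sum(first)
# A rearrangement keeps one member of each pair for the first sample.
sums = [sum(choice) for choice in product(*pairs)]
tolerance = 1e-9  # guards against floating-point rounding in the sums
p_value = sum(s >= observed - tolerance for s in sums) / len(sums)
print(pairs)
print(len(sums), p_value)
```

Because every pair has its larger member in the first sample, no rearrangement can exceed the original sum, and the smallest attainable p-value for four pairs is reached.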
If we draw six observations from a N(0,1) population and six from a N(0,2)
population, and test at the 2/32 = 0.0625 significance level, the first method using deviations about the sample medians
has a power of 33% and the second of 60%.
The first test is conservative, with an actual significance level of less than
6%. The second test is exact. Aly’s permutation method, which is also
exact at this significance level, has a power of only 24%. Similar results for the three tests were obtained in a comparison
of Gamma(4) distributions with two samples of size 6.
Of course, if we insist on using a
significance level of 5%, then either we must sacrifice power due to the
discrete nature of the permutation distribution, or worse, leave decision
making on the boundary up to a chance device (see, for example, Lehmann, 1986,
p. 75).
4. K-SAMPLE PERMUTATION TEST
The preceding tests based upon the
absolute deviations about the sample medians are easily generalized to the case
of K samples from K populations. First,
we create K sets of deviations about the sample medians and make use of the
test statistic
$$S = \sum_{k=1}^{K} \Bigl( \sum_{j=1}^{n_k - 1} x'_{kj} \Bigr)^2.$$
The choice of the square of the inner sum ensures that this statistic takes its largest value when the largest deviations are all together in one sample after relabeling.
To generate the permutation distribution of S, we again have two choices. We may consider all possible rearrangements of the sample labels over the K sets of deviations. Or, if the samples are equal in size, we may first order the deviations within each sample, group them according to rank, and then rearrange the labels within each ranking.
Again, this latter method is directly applicable only if the K samples are equal in size, and, again, this is unlikely to occur in practice. We will have to determine a confidence interval for the p-value for the second method via a bootstrap in which we first select samples from samples (without replacement) so that all samples are equal in size. While I wouldn’t recommend doing this test by hand, once programmed, it still takes less than a second on last year’s desktop.
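The first of the two choices, rearranging the sample labels over all K sets of deviations, might be sketched as below. Two points are assumptions: the statistic is taken to be the sum over samples of the squared inner sum of deviations, as the description of the statistic suggests, and a random sample of relabellings (a Monte Carlo approximation) replaces the full enumeration. The third sample and all function names are hypothetical.

```python
import random
import statistics


def median_deviates(sample):
    """Absolute deviations about the median, smallest (redundant) one dropped."""
    centre = statistics.median(sample)
    return sorted(abs(x - centre) for x in sample)[1:]


def k_sample_statistic(groups):
    """Assumed form: sum over samples of the squared sum of deviations."""
    return sum(sum(g) ** 2 for g in groups)


def k_sample_p_value(samples, reps, rng):
    """Monte Carlo p-value from random relabellings of the pooled deviations."""
    groups = [median_deviates(s) for s in samples]
    sizes = [len(g) for g in groups]
    pooled = [d for g in groups for d in g]
    observed = k_sample_statistic(groups)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        start, relabelled = 0, []
        for size in sizes:
            relabelled.append(pooled[start:start + size])
            start += size
        # Small tolerance guards against floating-point rounding.
        hits += k_sample_statistic(relabelled) >= observed - 1e-9
    return (hits + 1) / (reps + 1)  # count the observed labelling itself


samples = [
    [121, 123, 126, 128.5, 129.1],
    [153, 154, 155, 156, 158],
    [140, 140.5, 141, 141.5, 142],  # hypothetical third sample
]
print(k_sample_p_value(samples, reps=2000, rng=random.Random(11)))
```

Sampling relabellings rather than enumerating them keeps the running time modest as K and the sample sizes grow, at the cost of a p-value that is itself an estimate.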
5. TESTING WHEN NON-RESPONDERS ARE PRESENT
In testing for a response to drug
treatment, it is common to encounter a response threshold peculiar to each
individual, such that some individuals respond to drug treatment and some do
not. If the treatment is effective, one
expects both the mean and the variance of the treated population to be larger
than those of the control population.
This suggests that a test for simultaneous changes in expectation and
variance would be more powerful than one that tests for changes in expectation
alone.
We wish to test the hypothesis H: F2[x]=F1[x]
against the alternative
K: F2[x]=pF1[x] +
(1-p) F1[(x-d)/s]; 0 < p < 1; d>0 ; s ≥ 1.
Good[1979]
proposed the test statistic
![]()
where the first term is proportional to the difference in means of
the two samples and the second to the variance of the treatment sample. Rearranging the labels between the two sets
of observations generates its permutation distribution. But, alas, its power is only marginally
better than the t-test.
We suggest instead the test statistic
where S is the sum of
the deviations about the median as defined in Section 3, and the subscripts O
and π refer to the original data and the data after rearranging sample
labels.
Care
must be taken in generating the rearrangements as the first part of our test
statistic is based on one more value than the second. To accomplish the desired result, we first select n – 1
observations at random from the reduced data set to use in forming S and then
one more observation (of those not already selected) to calculate the mean.
Ideally, the parameter u would be chosen equal to p, but typically, p is not known. In our simulations, we used a value of
u=0.67 and selected data from an N(1,1) population for controls and from a
mixture of 50% N(1,1) and 50% N(2,2) for the treated group. At a significance level of 10%, and using
two samples of size 5 each, the t-test yielded a power of 20%, a
permutation test using Student’s t as its test statistic had a power of 21%,
and a permutation test that made use of our new statistic had a power of 33%.
REFERENCES
Aly E-E AA. (1990) Simple tests for dispersive ordering. Stat. Prob. Ltr. 9: 323-325.
Baker RD. (1995) Two permutation tests of equality of variance. Statist. Comput. 5(4): 289-96.
Balakrishnan N; Ma CW. (1990) A comparative study of various tests for the equality of two population variances. Statist. Comp. Simul. 35: 41-89.
Box GEP; Anderson SL. (1955) Permutation theory in the development of robust criteria and the study of departures from assumptions. J. Roy. Statist. Soc B. 17: 1-34 (with discussion).
Brown MB; Forsythe AB. (1974) Robust tests for equality of variances. JASA. 69: 364-367.
Chernick MR; Liu CY. (2002) The saw-toothed behavior of power versus sample size and software solutions: single binomial proportion using exact methods. Amer. Statist. 56: 149-155.
Conover WJ; Johnson ME; Johnson MM. (1981) Comparative study of tests for homogeneity of variances: with applications to the outer continental shelf bidding data. Technometrics. 23: 351-361.
Conover WJ; Salsburg D. (1988) Locally most powerful tests for detecting treatment effects when only a subset of patients can be expected to "respond" to treatment. Biometrics. 44: 189-196.
Duran BS. (1976) A survey of nonparametric tests for scale. Commun. Statist. Theor-Meth. A5: 338-370.
Fisher RA. (1925) Statistical Methods for Research Workers. 1st ed. Edinburgh: Oliver and Boyd.
Good PI. (1979) Detection of a treatment effect when not all experimental subjects respond to treatment. Biometrics. 35: 483-489.
Good PI. (1994) Permutation Tests. 1st ed. New York: Springer-Verlag.
Hayes AF. (1997) Cautions in testing variance equality with randomization tests. J. Statist. Compu. Simul. 59: 25-31.
Lehmann EL. (1986) Testing Statistical Hypotheses. 2nd ed. New York: John Wiley and Sons.
Maritz JS. (1996) Distribution Free Statistical Methods. 2nd ed. London: Chapman and Hall.
Miller RG. (1968) Jackknifing variances. Annals Math. Statist. 39: 567-582.
Sukhatme BV. (1958) A two sample distribution free test for comparing variances. Biometrika. 45: 544-8.
Welch BL. (1937) On the z-test in randomized blocks and Latin squares. Biometrika. 29: 21-52.