Editors:
U. Abel,
A. Koch

Published by Symposion

Nonrandomized Comparative Clinical Studies

Proceedings of the International Conference on Nonrandomized Comparative Clinical Studies in Heidelberg, April 10-11, 1997

A Nonparametric Test for Evaluating Coherent Alternatives in Nonrandomised Studies

O. Gefeller, L. Pralle

Abstract

When considering the effect of treatments or exposures on some outcome variable in a nonrandomised study, the presence of coherence provides supporting evidence that an observed relationship between the factors of interest might reflect a causal treatment or exposure effect. In our understanding, coherence means that we have a specific and detailed description of what an actual treatment or exposure effect would look like. The concept of coherence can then be used to formulate a "coherent pattern" of expected results, indicative of a real effect of the treatment or exposure under study, that can be tested using the observed data. In this paper, we review a simple nonparametric rank test, developed by Rosenbaum, for testing the null hypothesis of no treatment/exposure effect against arbitrarily complicated coherent alternatives. In addition, we introduce a new measure of coherence that quantitatively summarises the coherence present in the data. Two empirical examples, one epidemiological investigation and one nonrandomised clinical trial, illustrate the application of the methodology.

1. Introduction

When the results of nonrandomised experiments or observational studies have to be interpreted, the dilemma often arises whether the apparent difference between the groups to be compared can be causally attributed to the characteristic defining membership in the group or not. No sophisticated statistical testing procedure comparing nonrandomised samples with respect to the distribution of some primary endpoint variable(s) can remedy the lack of randomisation at the design stage of the study. Any nonrandomised comparison is potentially subject to (overt and hidden) biases that can distort the picture of results. However, supporting evidence for the assertion that the observed difference might reflect a real treatment or exposure effect can be drawn from the presence of "coherence" in the pattern of results.

Coherence means that we have a specific and detailed description of what an actual treatment or exposure effect would look like. Many epidemiological textbooks [1-4] and historical papers about epidemiological inference [5,6] discuss coherence as one of the criteria for judging causality when interpreting empirical findings from observational studies on the relationship between some exposure factor and an outcome variable. In all references, the concept of coherence is illustrated through examples rather than defined formally. In this paper, we do not attempt to give a formal description of coherence either. Instead, the two practical examples in section 4 delineate what a "coherent pattern of results" means in the framework of the two corresponding studies. Our focus is on a simple nonparametric approach, developed by Rosenbaum [7], to test for coherence, and on a new measure of coherence that provides a quantitative description of the degree of coherence present in the observed data.

The rest of the paper is organized as follows: in the next section we derive the so-called poset statistic, which can be viewed as a generalisation of standard two-sample rank tests, for testing the null hypothesis of no treatment/exposure effect against arbitrarily complicated coherent alternatives. In section 3, we introduce a simple procedure for estimating the level of coherence that yields a quantitative summary measure of the compatibility of the data with the pattern of results specified in the definition of the coherent alternative. Section 4 consists of two empirical examples illustrating the application of the methodology. One example is drawn from an observational study in occupational epidemiology [8], the other uses data from a nonrandomised clinical trial in neonatology [9]. The final section gives additional discussion of the concept of coherence and points to further topics that can be addressed in this framework.

2. Derivation of the Poset Statistic

Consider the typical structure of the two-sample layout: there are N units of observation, numbered i=1,...,N. The observations consist of K-dimensional vectors $y_i \in \mathbb{R}^K$, among which the first N1 (i.e. $y_1,\dots,y_{N_1}$) belong to group 1 and the remaining N2=N-N1 (i.e. $y_{N_1+1},\dots,y_N$) belong to group 2. To formally express a coherent hypothesis, a relation "<c" is defined on $\mathbb{R}^K$. This relation "<c" has to be asymmetric in the sense that if a <c b then it is false that b <c a; no other formal assumptions on "<c" have to be made. Note that "<c" is fundamentally different from an ordinary inequality. In particular, it is possible that for $a, b \in \mathbb{R}^K$ none of the three conditions (a <c b, b <c a, a = b) is true. Thus, there are vectors that cannot be ordered using "<c", which defines only a partial ordering on $\mathbb{R}^K$.

Now consider the random variables Uij, i,j=1,...,N, defined by

$$U_{ij} = \begin{cases} 1 & \text{if } y_j <_c y_i, \\ -1 & \text{if } y_i <_c y_j, \\ 0 & \text{otherwise,} \end{cases}$$

which can be viewed as indicator-like variables providing the information whether yi and yj can be ordered using "<c" and, if yes, in which direction. Note that Uii=0 by asymmetry and Uij=-Uji by definition. For a fixed i the sum $S_i = \sum_{j=1}^{N} U_{ij}$ defines a score value for yi that can be interpreted as a rank score indicating the position of yi among all observations according to "<c". More precisely, the value of Si is the number of observations with "lower" (according to "<c") values than yi minus the number of observations with "higher" values. For example, if Si=N-1 holds, all other observations have "lower" values than yi, which attains its maximum possible rank score in this case. Note that from Si<N-1 it cannot directly be inferred that there are observations with "higher" values than yi, as missing the maximum rank score can also reflect the inability of "<c" to order all observations.

These rank scores can now be used to test the null hypothesis of no treatment/exposure effect against a coherent alternative. Consider the poset test statistic $T = \sum_{i=1}^{N_1} S_i$ that sums up all rank scores in the first group. Under the null hypothesis P(yi <c yj) = P(yj <c yi) holds, which yields E(T)=0 in this situation. The variance of T under H0 can also be derived using standard arguments from the theory of linear rank statistics as $\mathrm{Var}_{H_0}(T) = \frac{N_1 N_2}{N(N-1)} \sum_{i=1}^{N} S_i^2$. The central limit theorem for rank statistics is applicable to T so that the standardised version of T is asymptotically normal. Quantiles from the standard normal distribution can thus be used to derive the critical values for the poset test or to calculate the corresponding p-values.
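The computation described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `poset_test` and the callable `less_c` (returning True when its first argument is "lower" than its second) are hypothetical, the sign convention assumes Uij=1 when yj <c yi, and the null variance is assumed to be the usual permutation variance of a sum of N1 scores drawn without replacement, N1 N2 / (N (N-1)) times the sum of the squared scores.

```python
import math

def poset_test(y, n1, less_c):
    """Poset test sketch: y holds all observations, group 1 = first n1 entries.

    less_c(a, b) must return True if a <c b (a is "lower" than b).
    Returns (T, null variance of T, standardised T, two-sided p-value).
    """
    n = len(y)
    n2 = n - n1
    # U[i][j] = +1 if y[j] <c y[i], -1 if y[i] <c y[j], 0 if not ordered
    u = [[int(less_c(y[j], y[i])) - int(less_c(y[i], y[j]))
          for j in range(n)] for i in range(n)]
    s = [sum(row) for row in u]            # rank scores S_i
    t = sum(s[:n1])                        # poset statistic T
    # assumed permutation variance of T under H0 (scores sum to zero)
    var_t = n1 * n2 / (n * (n - 1)) * sum(v * v for v in s)
    z = t / math.sqrt(var_t)               # fails if no pair can be ordered
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return t, var_t, z, p

# One-dimensional toy data with the ordinary "<" as relation:
# group 1 = (3, 4, 5) lies entirely above group 2 = (1, 2), so T = N1*N2 = 6.
t, var_t, z, p = poset_test([3, 4, 5, 1, 2], 3, lambda a, b: a < b)
```

The relation is passed in as a function so that the same code serves any coherent alternative; only `less_c` changes from one application to the next.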

The poset test was proposed by Rosenbaum in 1991 [7] and has been discussed further by the same author in [10] and [11]. His original proposal uses a different test statistic which is, however, algebraically equivalent to our version, as can easily be seen using the device of Mantel [12]. The poset approach generalises several familiar nonparametric tests. If the outcome is one-dimensional and "<c" is defined as the ordinary inequality "<", then the poset statistic is equivalent to the Wilcoxon-Mann-Whitney statistic. If the outcome values are censored (as in many survival analytical problems) and "<c" is appropriately defined, then the poset test corresponds to Gehan's test [13].
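The Wilcoxon-Mann-Whitney special case can be made explicit under two assumptions that are not spelled out above: the outcome is one-dimensional without ties, and Uij=1 whenever yj < yi. Writing R_i for the ordinary rank of yi among all N observations and W for the rank sum of group 1 (both symbols are introduced here only for illustration), a short calculation gives:

```latex
S_i = (R_i - 1) - (N - R_i) = 2R_i - (N+1),
\qquad
T = \sum_{i=1}^{N_1} S_i = 2W - N_1 (N+1),
\qquad
W = \sum_{i=1}^{N_1} R_i .
```

T is therefore a linear function of the rank sum W, so both statistics order all possible samples identically and lead to the same test.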

3. A Measure of Coherence

In the preceding section we formulated a rank score statistic T to test the hypothesis P(yi <c yj) = P(yj <c yi), for i=1,...,N1, j=N1+1,...,N. A straightforward way to measure deviation from this null hypothesis is to consider the quantity

$$\gamma = P(y_j <_c y_i) - P(y_i <_c y_j), \qquad i \le N_1 < j.$$

The coherence coefficient $\gamma$ ranges from -1 to 1 and takes the extreme values if the relation "<c" completely separates the two treatment/exposure groups. Note that in general P(yi <c yj) + P(yj <c yi) < 1 since "<c" only defines a partial ordering on the observation space. Values of $\gamma$ around 0 can thus arise from either of the following situations:

  1. a large proportion of undecidable comparisons (i.e. Uij=0)
  2. (nearly) same proportions of Uij=1 and Uij=-1.

Both situations reflect a poor coherence of the expected result pattern and the observed data with respect to the given partial ordering "<c". The coherence coefficient can for instance be used to compare several partial orderings "<ci" on the same dataset.

The idea of this coherence coefficient resembles that of correlation coefficients, which describe the degree of joint variation of two variables. In fact, the definition of the coherence coefficient $\gamma$ recalls that of Kendall's $\tau$ correlation coefficient, which is defined as the difference of the probabilities of concordant and discordant pairs in bivariate observations (cf. [14]).

We can give an unbiased estimate of the coherence coefficient $\gamma$ by a suitable normalisation of the statistic T. When summing the scores within one group, the intra-group comparisons vanish (as mentioned above, Uik = -Uki, and if $i, k \le N_1$ then both Uik and Uki appear in $T = \sum_{i=1}^{N_1} S_i$). Thus we can rewrite $T = \sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N} U_{ij}$, and we estimate $\gamma$ by dividing T by the number of between-group comparisons, which gives

$$\hat\gamma = \frac{T}{N_1 N_2}.$$

This is an unbiased estimator of $\gamma$ since the proportion of index pairs (i,j) with $i \le N_1 < j$ yielding Uij=1 is an unbiased estimate of P(yj <c yi), and the proportion of index pairs yielding Uij=-1 is an unbiased estimate of P(yi <c yj).
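The estimate and its two components can be obtained directly from the between-group comparisons. The following sketch assumes that the coefficient is estimated as T divided by N1 N2, i.e. as the proportion of "positive" comparisons (group 2 observation lower than group 1 observation) minus the proportion of "negative" ones; the function name `coherence_estimate`, the callable `less_c` and the toy data are hypothetical.

```python
def coherence_estimate(y, n1, less_c):
    """Estimate the coherence coefficient from between-group comparisons.

    y holds all observations, group 1 = first n1 entries;
    less_c(a, b) returns True if a <c b.
    """
    group1, group2 = y[:n1], y[n1:]
    pairs = len(group1) * len(group2)
    # "positive" comparison: the group 2 observation is lower (U_ij = +1)
    pos = sum(bool(less_c(b, a)) for a in group1 for b in group2)
    # "negative" comparison: the group 1 observation is lower (U_ij = -1)
    neg = sum(bool(less_c(a, b)) for a in group1 for b in group2)
    return {
        "positive": pos / pairs,
        "negative": neg / pairs,
        "coherence": (pos - neg) / pairs,   # equals T / (N1 * N2)
        "decidable": (pos + neg) / pairs,   # share of comparable pairs
    }

def componentwise(a, b):
    """a <c b if a <= b in every component, with at least one strict."""
    return all(x <= v for x, v in zip(a, b)) and any(x < v for x, v in zip(a, b))

# Hypothetical two-dimensional data: 2 observations per group.
result = coherence_estimate([(2, 2), (3, 1), (1, 1), (0, 3)], 2, componentwise)
```

In this toy data set two of the four between-group pairs cannot be ordered, so the proportion of decidable comparisons is 0.5 and the coefficient cannot exceed 0.5 in absolute value.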

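The variance of the estimator $\hat\gamma$ under the null hypothesis is

$$\mathrm{Var}_{H_0}(\hat\gamma) = \frac{\mathrm{Var}_{H_0}(T)}{(N_1 N_2)^2} = \frac{1}{N_1 N_2 \, N (N-1)} \sum_{i=1}^{N} S_i^2 .$$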

This quantity might be used to give an upper bound of the non-null variance, similar to known expressions for Kendall's $\tau$, of the form

,

where k is some positive constant. Such a result can be used to compute conservative approximations of confidence intervals for the coherence coefficient $\gamma$. Better versions of the asymptotic confidence intervals for $\gamma$ using the exact variance of $\hat\gamma$ under a given alternative may be constructed in analogy to the ideas applied to Kendall's $\tau$ in Gibbons [14].

4. Practical Examples

4.1 An Example from a Clinical Trial

As a first example explaining and illustrating the concept of coherence in practical data analysis we use data from a small nonrandomised clinical trial [9] comparing two regimens for treating 26 premature neonates suffering from severe respiratory distress syndrome. Both treatment regimens involved the application of a natural porcine surfactant (Curosurf) as surfactant replacement therapy. One regimen consisted of administering the surfactant dose early (i.e. within 15 hours after birth), whereas members of the other treatment group received their surfactant dose later (i.e. between 15 and 48 hours after birth). The intention of the trial was to analyse whether severely diseased neonates can benefit from an early start of treatment. The primary endpoint for the evaluation was defined as survival of the patients up to 28 days after birth. Important additional outcome variables were the total time on supplemental oxygen (O2) during these first four weeks and the acute effect of the therapy on the neonates' respiratory situation (measured by the fraction of inspired oxygen (FIO2) value at 24 hours after starting the surfactant therapy).

Obviously, the information on the survival status dominates any information that can be drawn from the other two outcome variables, i.e. only among surviving patients the duration on O2 and the FIO2 values at 24 hours provide meaningful pieces of information for analysing treatment effects. Given the three-dimensional outcome data a coherent pattern of responses indicative of a beneficial effect of starting the treatment early would thus show a reduced proportion of deaths in this group combined with a shorter duration on O2 and lower FIO2 values at 24 hours among the survivors in this treatment group, always compared to patients of the late-treatment group.

Formally, X1i denotes the survival status (0 = alive, 1 = dead), X2i the total time on O2 (continuous variable, potential range: 0 - 672 hours) and X3i the FIO2 value at 24 hours (continuous variable, potential range: 0.21 - 1.0) of the i-th child. Then the outcome vectors $y_i = (X_{1i}, X_{2i}, X_{3i})$, $i = 1,\dots,26$, are given by the collection of the measurements on these three variables, where the first 19 observations belong to the early-treatment group and the remaining 7 observations were made on patients of the late-treatment group. The coherent alternative to be detected can be formulated using an appropriate definition of "<c" on $\mathbb{R}^3$. Incorporating the hierarchy of outcome factors as described above, "<c" is defined as follows:

$$y_i <_c y_j \iff (X_{1i} = 1 \text{ and } X_{1j} = 0) \ \text{ or } \ (X_{1i} = X_{1j} = 0 \text{ and } X_{2i} \ge X_{2j} \text{ and } X_{3i} \ge X_{3j}),$$

where at least one of the inequalities "≥" has to be strict. Given this definition of "<c", patients with "lower" vector values exhibit a poorer response to the treatment than those with "higher" vector values. However, many outcome vectors are not ordered by "<c". For example, in the subgroup of dead patients (who receive "lower" values than all surviving patients) no further ordering is performed.
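The hierarchical relation for this trial can be written as a small predicate. The sketch below assumes the reading given in the text: a dead patient is "lower" than every survivor, two survivors are ordered only if one has both a longer (or equal) time on O2 and a higher (or equal) FIO2 value with at least one strict inequality, and dead patients are never ordered among themselves. The function name and the numerical values are hypothetical and not taken from the trial data.

```python
def less_c(a, b):
    """True if outcome a is "lower" (poorer response) than outcome b.

    Each outcome is a tuple (dead, o2_hours, fio2) with dead = 1 for death
    within 28 days and dead = 0 for survival.
    """
    dead_a, o2_a, fio2_a = a
    dead_b, o2_b, fio2_b = b
    if dead_a == 1 and dead_b == 0:
        return True                      # death is poorer than any survival
    if dead_a == 0 and dead_b == 0:
        no_better = o2_a >= o2_b and fio2_a >= fio2_b
        strictly_worse = o2_a > o2_b or fio2_a > fio2_b
        return no_better and strictly_worse
    return False                         # a survived and b died, or both died

# Hypothetical outcome vectors for illustration only.
died = (1, 100.0, 0.60)
long_o2 = (0, 300.0, 0.40)
short_o2 = (0, 200.0, 0.30)
mixed = (0, 150.0, 0.50)   # shorter time on O2 but higher FIO2 than short_o2
```

With these values `less_c(died, short_o2)` and `less_c(long_o2, short_o2)` are True, while `mixed` and `short_o2` cannot be ordered in either direction because the two secondary outcomes point in opposite directions.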

The application of the poset test to the data of this small clinical trial yields a value of 84 for the test statistic T and an estimate of 1787.2 for its variance under H0. This gives a standardised T-value of 1.987, which leads to an asymptotic (two-sided) p-value of 0.047. Thus, there is a significant difference between the two treatment groups in the response pattern of outcomes in the direction of the coherent alternative, demonstrating that those in the early-treatment group benefit from this therapeutic regimen. The estimated coefficient of coherence takes a value of 0.632. This quantity is obtained by subtracting the proportion of "negative" comparisons (0.135) from the proportion of "positive" comparisons (0.767). The test performed earlier ensures that this difference is significant. Furthermore, we see that 90.2% of all possible comparisons between the two treatment groups are decidable in the "<c"-sense. The variance of the estimated coefficient under H0 is estimated as 0.101.
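The figures reported for this example can be retraced from T, its null variance and the two group sizes. The sketch below only redoes this arithmetic; it assumes that the coherence coefficient is estimated as T divided by the number of between-group comparisons (19 x 7 = 133) and that the p-value comes from the standard normal distribution. The variable names are illustrative.

```python
import math

t = 84            # poset statistic reported in the text
var_t = 1787.2    # its estimated variance under H0
n1, n2 = 19, 7    # early-treatment and late-treatment group sizes

z = t / math.sqrt(var_t)                      # standardised statistic
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
coherence = t / (n1 * n2)                     # T / (N1 * N2)
var_coherence = var_t / (n1 * n2) ** 2        # null variance of the estimate

print(round(z, 3), round(p_two_sided, 3),
      round(coherence, 3), round(var_coherence, 3))
```

The rounded results agree with the values stated above (1.987, 0.047, 0.632 and 0.101), and the coefficient also equals the difference 0.767 - 0.135 of the reported proportions.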

The results of this poset approach are now compared to a separate analysis of the three outcome variables. The proportion of deaths in the late-treatment group (1/7 = 14.3%) is slightly higher than in the early-treatment group (2/19 = 10.5%), but due to the small numbers this difference is far from statistical significance. The analysis of the duration on O2 treats the values of dead subjects as censored observations. A comparison of the drastically different distributions in the two groups by an exact version of the logrank test yields a p-value of 0.01. The samples of FIO2 measurements at 24 hours show only minor differences, resulting in a p-value of 0.48 obtained by a Wilcoxon-Mann-Whitney test restricted to the subgroup of survivors. Thus, the univariate analysis revealed a significant benefit for the early-treatment group in only one of the three outcome variables. The strength of the poset approach is that it not only combines the evidence of the three outcome variables in some formal way but also gives the opportunity to look for a coherent pattern of results with respect to all three outcome factors. Given that our understanding of what a beneficial treatment effect has to look like is correct, the result of the poset analysis in our clinical trial strengthens and justifies the claim that the early-treatment regimen has beneficial effects when treating neonates with severe respiratory distress syndrome by surfactant replacement therapy. This finding based on our small nonrandomised clinical trial was later confirmed in larger randomised trials [15]. It is now the accepted standard treatment regimen to start the surfactant replacement therapy in this clinical situation as soon as possible.

4.2 An Epidemiological Example

The second example illustrating the application of the techniques introduced in sections 2 and 3 deals with data from an observational study in occupational epidemiology. Briefly, Morton et al. [8] examined the distribution of lead in the blood of children whose parents were employees in a factory that used lead in the production of batteries. Only children of employees were enrolled in the study; no sample of controls was available. The aim of the investigation was to analyse a potential relationship between the level of lead in the children's blood and the intensity of lead exposure at the parents' workplace. An important confounder of this relation, which was measured in the study, is the parents' individual hygiene practices, which can reduce the lead contamination of the children's home environment. Information on the parents' lead exposure was dichotomised into two groups of high and low exposure, respectively, whereas data on hygiene practices were available on an ordinal three-point scale.

A coherent pattern in the data indicative of an exposure effect of the parental lead contamination on the children's level of lead in the blood would look like the following: children in the group of highly exposed parents should have higher levels of blood lead than those in the low-exposure group, and simultaneously the lead values from children with parents showing poor hygiene practices should be higher than those in the medium hygiene category, which themselves should be higher than those in the good hygiene stratum. Thus, if yi = (X1i, X2i) denotes the pair of observations on the i-th child in the low-exposure group consisting of the lead level (X1, continuous variable) and the parents' hygiene practices (X2, 0=poor, 1=medium, 2=good) and yj = (X1j, X2j) denotes the data of child j in the high-exposure group, then if $X_{1i} \le X_{1j}$ and $X_{2i} \le X_{2j}$ (with at least one strict "<") the observed data would be compatible with the coherent pattern described above. The relation "<c" is then defined for arbitrary pairs of outcome vectors accordingly, so that yi <c yj means that both components of the vector yi have to be less than or equal to those of yj (with at least one strict inequality).
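The componentwise relation used in this example is easy to express in code, and doing so shows why some pairs of children cannot be compared. The sketch assumes outcome pairs of the form (lead level, hygiene code) with the coding 0=poor, 1=medium, 2=good given in the text; the function name and the numerical values are hypothetical and not taken from the study data.

```python
def less_c(a, b):
    """True if a <= b in both components with at least one strict inequality.

    Each outcome is a pair (lead_level, hygiene) with hygiene coded
    0 = poor, 1 = medium, 2 = good.
    """
    all_leq = all(x <= v for x, v in zip(a, b))
    one_strict = any(x < v for x, v in zip(a, b))
    return all_leq and one_strict

# Hypothetical pairs for illustration only.
low_child = (20.0, 1)        # lower lead level, medium hygiene
high_child_good = (35.0, 2)  # higher lead level despite good hygiene
high_child_poor = (35.0, 0)  # higher lead level, but poor hygiene
```

Here `less_c(low_child, high_child_good)` is True, whereas `low_child` and `high_child_poor` are not ordered in either direction: the higher lead level of the second child could be attributed to poorer hygiene, so the pair is left undecided.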

Given this definition of coherence for the data under study, the poset test statistic T comparing the high-exposure group (n=19) with the low-exposure group (n=15) attains a value of 120 with an estimated variance of 2745.5. Hence, the standardised value of T is 2.29, yielding an asymptotic (two-sided) p-value of 0.022. The estimate of the coherence coefficient $\gamma$, the measure of coherence introduced in section 3, is 0.421 (variance estimate 0.0338), obtained as the difference of 62.5 % "positive" and 20.4 % "negative" comparisons. Thus 82.9 % of all possible pairs of outcome vectors resulting from comparing subjects of the two groups can be ordered using "<c".
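These values can be retraced by hand. Assuming that the coherence estimate is T divided by the number of between-group comparisons (19 x 15 = 285) and that its null variance is the variance of T divided by the square of that number, the arithmetic reads:

```latex
\frac{T}{\sqrt{\widehat{\mathrm{Var}}_{H_0}(T)}} = \frac{120}{\sqrt{2745.5}} \approx 2.29,
\qquad
\frac{T}{N_1 N_2} = \frac{120}{285} \approx 0.421 = 0.625 - 0.204,
\qquad
\frac{\widehat{\mathrm{Var}}_{H_0}(T)}{(N_1 N_2)^2} = \frac{2745.5}{285^2} \approx 0.0338 .
```

The proportion of decidable comparisons is the sum of the two proportions, 0.625 + 0.204 = 0.829.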

Despite the small samples there is compelling evidence in this epidemiological study that the parents' lead exposure at the workplace affects their children's blood level of lead. The same data analysed with a simple Wilcoxon-Mann-Whitney test (ignoring the confounding influence of the parents' hygiene practices, which seems to be only modest) yield a similar p-value of 0.016. Thus, in this case, the application of the poset idea does not materially change the conclusion that can be drawn from the data. Separate two-sample comparisons in the strata defined by the confounder are not possible due to the limited sample sizes.

5. Discussion

In this paper, we have described a simple nonparametric approach to test for coherent alternatives. This methodology has its special domain in the analysis of nonrandomised studies since coherence is of specific importance in nonrandomised investigations to support the claim of attributability of the observed effect to the treatment/exposure under study. The idea can, however, equally be applied to randomised experiments, offering the opportunity to look for coherent patterns of treatment effects in vector-valued outcome structures. In doing so, the poset test provides a simple alternative to conventional multivariate statistical techniques. In a randomised trial the result of the poset test can then be directly interpreted as reflecting the strength of evidence for a discrepancy in the response patterns between the groups attributable to the treatment. Contrary to the nonrandomised case, no further attempts to clarify the sensitivity of the results to hidden biases have to be made (given that the randomisation has been performed properly).

In our opinion, the general idea of this approach has a high potential of practical applicability in clinical and epidemiological trials. On the one hand, the mathematical and computational complexity of the methodology is very low; for moderate sample sizes the test statistics can even be computed using a pocket calculator. On the other hand, the interpretation of the test results is straightforward and gives useful application-oriented information on the topic of interest.

Of course, the validity of the whole procedure with respect to providing supporting evidence for the presence of treatment/exposure effects depends critically on the correct specification of the coherent alternative. In other words, if our understanding of how some treatment/exposure affects the outcome variables is wrong and consequently the specified alternative does not adequately describe the pattern in outcome factors indicative of a causal treatment/exposure effect, the results of the poset test can be misleading. Furthermore, the concept of coherence is by no means the solution of all problems connected with nonrandomised studies. The problem of (overt and hidden) biases applies to the interpretation of poset test results as well, especially in the case of strongly correlated outcome variables where the presence of a specific source of bias affecting one outcome factor is automatically carried over to the other ones. Rosenbaum [10] addressed the problem of hidden bias by extensive sensitivity analyses. He argued that in most cases there is a substantial gain in insensitivity to bias for the results of a test against a coherent alternative when compared to the sensitivity analyses for the individual outcomes used in formulating the coherent alternative.

The original poset test idea has been accompanied here by a straightforward suggestion to measure the degree of coherence present in the data, which is interpretable as the difference of two probabilities. Moreover, the proportion of decidable comparisons can give useful information on the appropriateness of the partial order relation. For practical use of the coherence measure the computation of a confidence interval is crucial. Although we do not give an explicit formula (yet), the similarities of the coherence coefficient $\gamma$ to Kendall's $\tau$ indicate that asymptotic confidence intervals for $\gamma$ can be constructed borrowing from the ideas applied in the framework of Kendall's $\tau$.

Further topics to be addressed in future work on this issue are the extension to multiple group comparisons and the construction of an exact poset test for small sample sizes by deriving the finite distribution of the test statistic T under the null hypothesis. Both areas seem to pose no serious difficulties so that first results on the generalisation of the poset approach to these problems should be available soon.

References

[1] Susser, M. (1973). Causal Thinking in the Health Sciences: Concepts and Strategies in Epidemiology. New York: Oxford University Press.
[2] Lilienfeld, A., Lilienfeld, D. (1980). Foundations of Epidemiology. New York: Oxford University Press.
[3] Kleinbaum, D.G., Kupper, L.L., Morgenstern, H. (1982). Epidemiologic Research. Principles and Quantitative Methods. Belmont: Lifetime Learning Publications.
[4] Rothman, K. (1986). Modern Epidemiology. Boston: Little, Brown & Co.
[5] Hill, A.B. (1965). The environment and disease: association or causation? Proceedings of the Royal Society of Medicine 58, 295-300.
[6] Evans, A.S. (1978). Causation and disease: a chronological journey. American Journal of Epidemiology 108, 249-258.
[7] Rosenbaum, P.R. (1991). Some poset statistics. Annals of Statistics 19, 1091-1097.
[8] Morton, D., Saah, A., Silberg, S., Owens, W., Roberts, M., Saah, M. (1982). Lead absorption in children of employees in a lead related industry. American Journal of Epidemiology 115, 549-555.
[9] Speer, C.P., Harms, K., Herting, E., Neumann, N., Curstedt, T., Robertson, B. (1990). Early versus late surfactant replacement therapy in severe neonatal respiratory distress syndrome. Lung 168(Suppl), 870-876.
[10] Rosenbaum, P.R. (1994). Coherence in observational studies. Biometrics 50, 368-374.
[11] Rosenbaum, P.R. (1995). Observational Studies. New York: Springer-Verlag.
[12] Mantel, N. (1967). Ranking procedures for arbitrarily restricted observations. Biometrics 23, 65-78.
[13] Gehan, E. (1965). A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika 52, 203-223.
[14] Gibbons, J.D. (1971). Nonparametric Statistical Inference. New York: McGraw-Hill.
[15] Jobe, A.H. (1993). Pulmonary surfactant therapy. New England Journal of Medicine 328, 861-868.