The mythology of randomization
In biostatistics and medicine one sometimes encounters an extremely negative view, or even a categorical rejection, of nonrandomized studies. This attitude may be understandable from a historical, pragmatic, or educational viewpoint, but it is not well founded on epistemological grounds. In addition, it is potentially harmful. Randomization is usually credited with advantages that it does not possess or confer, and the criticism of nonrandomized studies is based on a catalogue of admittedly unhappy examples from the medical literature. Neither these examples nor the existing empirical investigations into the differences between published randomized and nonrandomized studies tell us whether well-designed and carefully analyzed nonrandomized studies would have yielded results that are distinctly or even qualitatively different from those of the randomized trials.

Dogmatic beliefs about randomization
In this contribution we criticize widespread opinions about randomization. We do not mean to criticize randomization itself. Of course, randomization is a reasonable methodological principle and, to put it as a slogan, you should randomize whenever you can. Since its introduction into therapeutic research in 1948, randomization has been a great success story. According to Horwitz [26], the randomized trial may rightly be called a scientific paradigm today. Olkin [34] estimated that some 9000 randomized clinical trials are currently performed every year, although it remains unclear whether this was meant to be the annual incidence of new studies or the prevalence of studies being conducted. Many authors believe that randomization ultimately owes its attractiveness to the fact that it introduces the method of scientific experimentation into therapeutic research [11, 17, 42, 53].
Nevertheless, there has been a long-standing and often bitter controversy about randomized trials that continues to this day. Virtually all imaginable aspects are at issue: the feasibility of randomized studies, their adequacy, their ethical justification, and their conclusiveness. Roughly speaking, the front lines of the dispute are as follows: critics of randomization are mainly found in unconventional therapies and in surgery, whereas its supporters, or rather the emphatic critics of nonrandomized studies, are particularly frequent in internal medicine, among biostatisticians, and in the regulatory agencies. It is no exaggeration to say that the controversy about the necessity and the value of randomization is sometimes like a religious dispute. In 1978 Rimm and Bortin [41] remarked that the randomized trial is not only a ritual but has all the elements of a religion (they called it TRIALISM), with gods, devils, and 10 commandments, the first of which is: Thou shall randomize.

One should admit that a lack of sober and rational consideration of randomization is common among biostatisticians as well [18]. This shows up in two phenomena. On the one hand, some scientists seem to regard randomization as a sort of quality mark without which the products of clinical research, that is to say, the study results, are valueless from the start. Thus, in 1981 Sackett gave the following advice to doctors about how to read clinical journals: "....discard at once all articles on therapy that are not about randomized trials" [14]. An even more extreme view was held by Cowan, who wrote: "With some exceptions participation of any group of patients in a nonrandomized trial is wholly unjustified and unethical since nothing can be learned from it" (quoted from Royall [43]). And finally, Sir Richard Doll stated in 1994: "Biases...make outcome research ...as inadequate a means for assessing the value of a specific form of treatment as the outdated technique of comparing the results in a current series of patients with those obtained on other patients in the past" [15].

On the other hand, randomization is often credited with a number of almost mystical properties. It is said
that randomization guarantees the validity of statistical tests, that it is the basis for causal inferences, that it makes blinding possible, and that it produces balanced treatment groups [e.g., 3, 5, 20, 29, 46, 47, 48]. All these properties are believed to be theoretical advantages of randomization, and they seem to tell us why a nonrandomized study is of little value when compared to a randomized one. There is only one problem with these claims. Except for the last one, which is vague to the point of being meaningless, they are incorrect. Since they are stubbornly repeated in the literature on clinical studies, we will briefly discuss them one by one.

Without additional assumptions, such as identical observation in both groups, randomization does not "guarantee the validity of statistical tests", nor does it support any of the similar statements found in the literature. This is simply because, if there are observational differences between the groups, the distributions of the outcome variable will be different even in the absence of any treatment effect. Nor is randomization necessary for testing. This follows not only from the fact that hypothesis testing is a mathematical theory that does not rely on any physical property or activity. It also becomes obvious when one looks at real-world research that does not use randomization, such as epidemiology or the evaluation of prognostic factors or diagnostic tests. As Feinstein [17] put it: "If randomization was really required for stochastic decisions on statistical significance, a massive amount of scientific literature would have to be expunged of all the p-values". In reality, p-values are conditional probabilities, the condition being a theoretical one about probability distributions that cannot be guaranteed in practice. In both randomized and nonrandomized studies a significant p-value has essentially the same interpretation, namely, it indicates that the results are not due to chance alone [23].

As for the second claim, to say that randomization is the basis for causal inferences is both vague and entirely erroneous, whatever the interpretation may be.
The claim lacks a precise statement of what causal inference means, although precise statements do exist [25]. It also reflects an overestimation of the implications and meaning of randomization, which Royall [42] has mockingly called "the closurization principle": "You randomize and then you close your eyes". In truth, randomization is not a panacea and is by no means sufficient for causal inferences. Apart from chance imbalance, randomized trials can also suffer from many sorts of severe systematic bias and mistakes, and there are many examples of incredibly poor randomized trials in the literature. Moreover, doctors themselves are well aware of the limits of randomized studies and are not necessarily convinced of a treatment effect even if this effect is shown in a series of successive randomized studies.
That randomization is not necessary for causal inferences is quite clear. It follows not only from the history of epidemiology but also from early medical breakthroughs like penicillin or insulin, which were introduced without randomized studies and for which there has hardly been any doubt about causal effects.

The third claim, concerning blinding, is somewhat amusing. It exploits the fact that the expressions "make possible", "are the basis for", or "are an opportunity" are ambiguous, both in English and in German. They all leave open whether they designate a necessary condition, a sufficient condition, or both. Clearly randomization is not sufficient for blinding. And, although it is a reasonable requirement for blinding, it is not a necessary one in the logical sense. Even if it were necessary, this could hardly be sold as an advantage of randomization, just as it is not an advantage of beer to be a necessary requirement of the Oktoberfest. Rather the reverse: it is an advantage of double-blind studies that they are randomized.

The last point, which is about balance, is the most important one. Obviously randomization has something to do with the concept of comparability of treatment groups. Although most biostatisticians constantly talk about comparability in clinical trials, the vast majority are unable to give a really precise definition in statistical terms of what comparable means. There seems to be no definition in the literature, either. However, one can derive one from the excellent paper by Holland published in JASA in 1986 [25], which is based on earlier work by Rubin [44]. Holland explains precisely what a causal treatment effect is and under what conditions a comparison of two groups yields an unbiased estimate of the causal effect. By analyzing Holland's argument, it becomes clear what a precise and reasonable definition of comparability should look like. It is simple but not obvious.
Comparability of two groups is a context-dependent term. The context consists of the two treatments to be compared and the outcome variable of interest. Within a given context, one can define two groups as comparable if the distribution of the outcome variable conditional on the choice of treatment T1 or T2 does not depend on the treatment group. This definition does not make use of words like "factors" or "structure" or anything else that one does not quite understand. It is nice in another respect: following Holland's ideas, one can show that if groups are comparable in this sense then one obtains an unbiased estimate of the causal treatment effect of T1 relative to T2. This is a property that the concept of comparability should indeed have.

Intuitively it is obvious that comparability as we have defined it can be violated if, apart from the therapy, the groups have further differences. These aspects have been analyzed and illustrated in a large number of publications [e.g., 1, 6, 16, 18, 21, 22, 27, 28, 33, 40, 45, 46, 51, 54]. Table 1 gives a fairly comprehensive list of relevant causes for treatment group differences in an outcome variable. The most important ones can be classified under the catchwords "differences in structure, in observation, and in experimental environment".

The achievements of randomization
Now, let us see what randomization does achieve. It guarantees a control of imbalance in the sense that, for all patient variables measurable at the time of randomization, the probability distributions are the same in all treatment arms of the study. This implies that randomization enables one to make probability statements on differences between the groups regarding these variables. Randomization by itself does not guarantee balance with respect to any other aspect listed in Table 1. In particular, it does not guarantee comparability in the strict sense defined above.
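
The definition of comparability given above can be written out formally. The following is one possible formalization, not a formula taken from the cited papers: the symbols G (treatment group, with group 1 receiving T1 and group 2 receiving T2) and Y_t (the outcome a patient would show under treatment t) are notation assumed here in the spirit of the potential-outcome framework of [25, 44], and the expectations without a condition refer to the combined population of both groups.

```latex
% Assumed notation: G \in \{1, 2\} is the treatment group,
% Y_t is the outcome under treatment t \in \{T_1, T_2\}.
\begin{align*}
&\text{Comparability:} &&
  P(Y_t \le y \mid G = 1) = P(Y_t \le y \mid G = 2)
  \quad \text{for all } y \text{ and } t \in \{T_1, T_2\}. \\
&\text{Consequence:} &&
  E[Y_{T_1} \mid G = 1] - E[Y_{T_2} \mid G = 2]
  = E[Y_{T_1}] - E[Y_{T_2}]
  = E[Y_{T_1} - Y_{T_2}].
\end{align*}
```

The left-hand side of the second line is what a two-group comparison estimates, and the right-hand side is the average causal effect of T1 relative to T2, so under comparability the comparison is unbiased.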
One should emphasize, however, that balance with respect to prognostic variables is indeed a point of eminent importance. Lack of balance in these variables is usually the main objection raised against nonrandomized studies, for in these studies an adjustment is not possible for unknown prognostic variables. This is particularly disturbing if the treatment effects to be investigated are small [37]. Several investigations have shown that these unknown variables may be of considerable importance [e.g., 49, 21]. But even an adjustment for known variables may fail if the ways to measure these variables change. In oncology, for example, it is often overlooked that if stage migration [19] has occurred in the past, then matching with respect to stage not only fails to reduce imbalance but may even be counterproductive, because it results in comparisons of nominally matched pairs that in reality are not matched at all. It is not surprising, therefore, that some authors come to a very negative and pessimistic judgment on nonrandomized studies. Thus Sacks et al. [46] wrote: "...biases in patient selection may irretrievably weight the outcome of the HCT" (HCT = historically controlled trial). And: "Can the accuracy of HCTs be increased? We fear there is little room for improvement in this area."

However, randomized studies are very often credited with nice properties that are not actively induced by the act of random allocation itself. They are, as it were, "inherited" properties, following from the fact that randomized studies are part of a larger category of high-quality studies, namely prospective parallel comparisons with a written protocol specifying important aspects of patient enrollment, treatment, observation, analysis, and other procedures. One can also put it like this: the advantages of randomized studies are not identical to the advantages conferred by randomization.
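
The role of an unknown prognostic variable, discussed above, can be illustrated with a small simulation. This is a minimal sketch and not an analysis from the literature: the allocation probabilities (0.8 and 0.2), the normal distributions, and the function name `mean_difference_under_null` are assumptions chosen purely for illustration. The simulated treatment has no effect at all, so any sizeable difference between the groups reflects imbalance alone.

```python
import random
import statistics


def mean_difference_under_null(n, randomized, seed=0):
    """Mean outcome difference (treated minus control) when the treatment has no effect.

    Each patient has an unmeasured prognostic variable u that shifts the outcome.
    With randomized=True, allocation is a fair coin flip.
    With randomized=False, patients with a good prognosis (u > 0) are treated
    more often (hypothetical probabilities 0.8 versus 0.2).
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # unknown prognostic variable
        if randomized:
            p_treat = 0.5
        else:
            p_treat = 0.8 if u > 0 else 0.2  # assumed selection mechanism
        gets_treatment = rng.random() < p_treat
        outcome = u + rng.gauss(0.0, 1.0)  # outcome does not depend on treatment
        (treated if gets_treatment else control).append(outcome)
    return statistics.mean(treated) - statistics.mean(control)


if __name__ == "__main__":
    print("randomized:   ", mean_difference_under_null(20000, randomized=True))
    print("nonrandomized:", mean_difference_under_null(20000, randomized=False))
```

Under these assumptions the randomized comparison stays close to zero, while the selective allocation produces a clearly positive spurious difference. Note that the sketch only covers imbalance in a baseline variable; differences in observation or experimental environment are not modelled and would not be removed by randomization.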
Whatever the reasons, randomized trials have a good reputation; they are well accepted by the scientific community and have a relatively high impact in medicine. Therefore, needlessly refraining from randomization is unwise if one wants to convince other scientists [5]. Note, however, that the consequences of not randomizing depend on the situation. While "human" aspects such as the impact of the results have to be taken into account in the planning phase of a study, the only aspect that matters to the reader of the finished and published study is an epistemological one, namely, the potential bias due to imbalance.

Let us summarize the arguments developed so far. Although randomization clearly adds to the conclusiveness and credibility of studies, especially when it comes to small treatment effects, we have shown that there is no fundamental difference in the conclusiveness of randomized and nonrandomized studies. This is simply because there is only a loose, vague link between the balance induced by randomization and the comparability of the groups. Therefore, a rejection of nonrandomized studies is unjustified on theoretical grounds! This is important to note because there are situations in which randomizing is impossible or inappropriate [4], or in which it is possible but no randomized study exists (an example of this is high-dose chemotherapy of many carcinomas), or in which there are both randomized and nonrandomized studies of the same question. In surgery, randomization seems to be especially problematic and relatively infrequent [30, 31, 35, 45, 50, 52]. This shows that there is a definite demand for nonrandomized treatment evaluations. So when biostatisticians are reluctant to deal with nonrandomized studies and neglect their further methodological development, they are not only somewhat unrealistic but also partly responsible if studies in these situations are worse than they could be.

Randomized vs nonrandomized studies: empirical comparisons
So far, we have analyzed theoretical differences between randomized and nonrandomized studies and have tried to unveil some unfounded dogmas about the value of randomization. Let us now see whether there is good empirical evidence against nonrandomized studies. The published material on systematic bias in nonrandomized studies is of the following three types:
1. Horror stories.
2. Systematic investigations into the literature of how observed treatment effects depend on the study design.
3. Investigations of the "history effect" or "chronology bias" [22].

Horror stories are extremely popular not only with doctors but even more so with biostatisticians. The reason for this is above all an educational one, because horror stories are wonderfully suited to making clear the dangers that loom when one strays from the virtuous path of randomization. They are an instrument for threatening unruly clinicians who refuse to randomize.
Table 2 gives a catalogue of more or less well-known horror stories. Some of the therapies on the list are still being discussed or are even in widespread use today.
In the light of such examples, any claim as to treatment efficacy based on a nonrandomized trial must necessarily appear unfounded and almost a violation of critical science. However, this may not be the whole truth. No doubt, horror stories are impressive and easily remembered. But there are at least three reasons why they do not tell us much about the importance of randomization for the conclusiveness of studies.

Firstly, one may safely assume that the published horror stories are the result of a biased selection. If, on the contrary, a randomized trial confirms the positive result of a previous observational study, this is not very exciting from a methodological point of view and will hardly become known among biostatisticians. The second point is publication bias for the single studies constituting the basis of the horror stories. Poorly controlled studies with a null result are more easily withheld from publication than randomized studies with a null result. This contributes to a tendency for published nonrandomized studies to show positive results more often than published randomized studies. The third objection is that, in general, the published horror stories are comparisons of randomized studies with historically controlled studies, many of them of poor quality by today's standards. Often the fact that randomization is needlessly refrained from is in itself a strong indicator of poor science and of extensive methodological defects in the studies. So horror stories do not tell us anything about the role and impact of bias in well-designed nonrandomized studies, that is, in studies which are planned and conducted with the same methodological care as a randomized trial.

One must emphasize that it is entirely possible, and it has occurred, that results obtained in nonrandomized studies are convincing. To give just one example: in 1991 Cassileth et al.
[8] published a prospective parallel-group matched-pair study of unconventional treatment compared to conventional treatment of advanced cancer. The study gave a perfect null result, which is probably not much less convincing than a null result from a randomized study of the same size.

In this context, an often overlooked asymmetry of conclusiveness is important. There are several reasons why null results from a large study, whether randomized or not, are more convincing than positive results. This also has an implication for the motivation to initiate studies. Since null results undoubtedly contain information, it can be justified to do a nonrandomized study (rather than no study at all), even if it is clear from the start that only a null result will be accepted by the scientific community without further investigations.

Note also that there are some new proposals for designing and evaluating nonrandomized studies in such a way that the results can be convincing, whether positive or negative. One powerful instrument is the evaluation of the total impact of a new therapy in an institution, as in the paired availability design [2], combined with an analysis of the immediate and sudden intervention effect caused by the introduction of the new treatment [24]. In the light of such developments, the sweeping negative opinion about nonrandomized studies, which is probably determined by a simplistic idea of historical two-group comparisons, is too pessimistic. It is perhaps understandable given the impression left by horror stories, but it is not really justified.

But is this negative view supported by systematic investigations? The association between the study design and the observed treatment effect has been examined in at least four papers. Chalmers et al. [9] analyzed 145 published parallel-group studies of treatment of acute myocardial infarction. Short-term mortality was chosen as the variable of interest.
The percentages of positive results, that is, of studies with a significant advantage of the therapy group over the control group, were as follows:
According to Chalmers et al., this result indicates that a distinct selection bias must be reckoned with in all studies in which the treatment allocation is known to the clinician. Colditz, Miller, and Mosteller analyzed published reports of 113 studies of medical therapies [12] and, in another paper, 221 studies of surgical therapies [31]. In these studies, an innovation was compared to a standard therapy. It was found that the standardized treatment effect did depend on the study design. As one may expect, the largest mean effect was found for studies with external controls. Surprisingly, however, the mean effect found for observational studies using retrospective record reviews was on average smaller than that found for randomized studies. Finally, Ottenbacher [36] analyzed 30 randomized and 30 nonrandomized parallel comparisons of a therapy group with a no-treatment control group that had been published in 1989 or earlier in JAMA or the New England Journal of Medicine. He did not find any influence of randomization on the mean observed treatment effect.

The results of these investigations are difficult to interpret, and not only because they are somewhat contradictory. At best, they can give a vague idea of the order of magnitude of bias in nonrandomized studies. However, the investigations themselves are biased in three respects. Firstly, the therapies were not identical in the randomized and nonrandomized studies. One cannot exclude that the true treatment effect depends on the type of therapy, so that the differences in observed effects possibly just reflect reality. Secondly, true effects in randomized trials may be smaller because randomization is done only if, a priori, clinicians are not convinced that any one of the treatments to be compared is superior to the others. Since their prior judgement, based on experience, is often correct, positive results in randomized studies are rare [10].
And finally, publication bias probably contributes to the differences between randomized and nonrandomized studies.

Chronology bias was investigated by Pocock [39]. Pocock identified 19 instances where co-operative groups used the same entry criteria for two successive randomized cancer chemotherapy trials which both included the same control treatment. When comparing the identical treatment arms in these pairs of trials, Pocock found that the differences in annual death rates ranged from -46% to +24%. Four comparisons yielded differences that were significant at the 2% level; the smallest p-value was 0.0001 [37]. One should note, however, that these comparisons were not adjusted for explanatory information or for secular trends.

In summary, the existing investigations do not give any information about the value of carefully designed and conducted nonrandomized studies. In particular, they do not tell us to what extent, if any, possible imbalance in these studies might influence the judgement on the therapies. One way to address this question systematically would be to embed synthetic nonrandomized studies in randomized trials and to compare the results obtained with the different designs. Synthetic parallel-group studies, for example, can be carried out in multicenter trials by comparing the results of treatment groups obtained in different institutions. If apparent imbalance occurs, then of course it should be adjusted for. Likewise, synthetic historically controlled studies can be implemented within long-term randomized trials by partitioning the period of patient entry into different intervals and comparing the results obtained in these intervals.
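
The idea of a synthetic parallel-group study embedded in a multicenter randomized trial can be sketched in a few lines. This is only an illustration under stated assumptions: the `Patient` record, the simulated center shifts, the arm labels "new" and "control", and all function names are hypothetical and not taken from any of the cited studies, and no adjustment for imbalance is performed.

```python
import random
import statistics
from typing import NamedTuple


class Patient(NamedTuple):
    center: str
    arm: str  # "new" or "control"
    outcome: float


def simulate_trial(center_shifts, n_per_center, true_effect, seed=0):
    """Simulate a multicenter randomized trial (all parameters are hypothetical)."""
    rng = random.Random(seed)
    patients = []
    for center, shift in center_shifts.items():
        for _ in range(n_per_center):
            arm = "new" if rng.random() < 0.5 else "control"
            effect = true_effect if arm == "new" else 0.0
            patients.append(Patient(center, arm, shift + effect + rng.gauss(0.0, 1.0)))
    return patients


def mean_outcome(patients, center, arm):
    """Mean outcome of one treatment arm within one center."""
    return statistics.mean(
        p.outcome for p in patients if p.center == center and p.arm == arm
    )


def randomized_estimate(patients):
    """Average of the within-center differences between the randomized arms."""
    centers = sorted({p.center for p in patients})
    return statistics.mean(
        mean_outcome(patients, c, "new") - mean_outcome(patients, c, "control")
        for c in centers
    )


def synthetic_parallel_estimate(patients, new_center, control_center):
    """Synthetic nonrandomized comparison: new arm of one center
    versus control arm of a different center."""
    return (mean_outcome(patients, new_center, "new")
            - mean_outcome(patients, control_center, "control"))


if __name__ == "__main__":
    trial = simulate_trial({"A": 0.0, "B": 1.0}, n_per_center=4000, true_effect=0.5, seed=1)
    print("randomized estimate:      ", randomized_estimate(trial))
    print("synthetic (B new, A ctrl):", synthetic_parallel_estimate(trial, "B", "A"))
    print("synthetic (A new, B ctrl):", synthetic_parallel_estimate(trial, "A", "B"))
```

In this simulated setting the gap between the synthetic and the randomized estimates is driven entirely by the assumed difference between the centers. In a real trial one would first adjust for apparent imbalance, as noted above, and then examine how much discrepancy remains. A synthetic historically controlled study could be sketched in the same way by replacing the center label with the interval of patient entry.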