The Analysis of Intervention Effects Using Observational Data Bases

If a new treatment is introduced in a clinical unit within a short time period, the problem arises of how to evaluate its immediate impact on the patients' prognosis, i.e., the (possible) intervention effect. An exploratory tool is described which can be employed to examine this effect. The method is illustrated by means of a clinical example.
1. Introduction

Whenever a new treatment, procedure, or care program is introduced in a clinical institution or a clinical population, it is of interest to evaluate the impact, or intervention effect, of this innovation on certain health outcomes of the patients. It is the exception rather than the rule that a randomized study is carried out to answer this question. Very often one has to rely on routinely collected data from a clinical database.

The standard approach for assessing the intervention effect of an innovation, say NEW, using patient registers is to compare patients admitted after the introduction of NEW with a historical control group of patients admitted before the introduction of NEW. Usually, this includes an adjustment for observed imbalance in potential confounders by means of matching, stratification, or statistical modeling. This approach has a number of well-known problems and limitations. For example, there may be changes over time in referral patterns, patient structure (including unknown or unobserved prognostic variables), diagnostic procedures, staff and skill, concomitant treatments and ancillary care, methods of data collection, dropout mechanisms, etc. It is hardly surprising, therefore, that some authors consider historical comparisons intrinsically biased and faulty [4,5,6,8,9]. In particular, this applies to evaluations based on databases that were not specifically designed for the purpose of effectiveness research or the assessment of treatment efficacy. It is difficult to see how the various sources of bias of historical comparisons can be eliminated even with the most careful data checks and analyses.

Here, we consider a special situation where the innovation (we will assume that it is a new treatment for some disease) is introduced at a clinical unit or hospital at a well-defined point in time. In this situation, one may carry the analysis beyond the simple two-sample comparison.
Instead of examining whether or not (and, if so, to what degree) a treatment group is superior to a historical control group, one may ask a different question: Was there any sudden change in the treatment result at the time the innovation was implemented? In order to address this question, we propose to use calendar time t (time of patient admission) as a surrogate for the treatment variable and to examine the time-dependence, and especially the existence of change-points, of the target variable representing the treatment results. Various statistical methods might be of use for this investigation, for example change point analysis in a time series setting or classification and regression trees (CART [3]). For our purposes we have chosen a simple exploratory tool named CRITLEVEL [1], which will be outlined in the next section.

Originally, CRITLEVEL was designed for the evaluation of arbitrary quantitative prognostic factors like tumor markers [1]. Its basic idea is quite simple. Essentially, it is as follows:

1. If F is the prognostic factor under investigation, order the patient sample (of size n) by increasing values of F. Let n* be an even number < n and let W(f) be the "window" of patients defined by n* consecutive values in the ordered sample, where f is the centre of the window (i.e., the median value of F in this window).
2. Given the window defined in step 1, perform a two-group comparison of the patients that lie to the left and to the right of the centre of the window. Let g(f) be the effect measure derived for the window W(f). For example, g(f) may be the p-value of a test for differences between the two groups.
3. Let the centre of the window move through all possible sample values of the prognostic factor. This creates a family of windows, comparisons, and effect measures.
4. For each member of the family of comparisons thus generated, plot g(f) versus f.
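The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original CRITLEVEL implementation: the function names `critlevel` and `welch_t` are hypothetical, the default effect measure is a Welch t statistic rather than a p-value, and only the standard library is used. Plotting (step 4) is left to the reader; the function returns the pairs (f, g(f)) to be plotted.

```python
import math
import statistics


def welch_t(left, right):
    """Welch two-sample t statistic (right mean minus left mean).

    Assumption: chosen here only as one possible effect measure g(f).
    """
    se = math.sqrt(statistics.variance(left) / len(left)
                   + statistics.variance(right) / len(right))
    return (statistics.fmean(right) - statistics.fmean(left)) / se


def critlevel(factor, outcome, window_size, effect=welch_t):
    """Return a list of (f, g(f)) pairs for a moving window of fixed size.

    factor      -- values of the prognostic factor F, one per patient
    outcome     -- target variable, one per patient
    window_size -- n*, an even number smaller than the sample size
    effect      -- function(left_group, right_group) -> effect measure
    """
    n = len(factor)
    if len(outcome) != n:
        raise ValueError("factor and outcome must have the same length")
    # At least two observations per half are needed for a sample variance.
    if window_size % 2 != 0 or not 4 <= window_size < n:
        raise ValueError("window_size must be even, >= 4 and < sample size")
    half = window_size // 2

    # Step 1: order the sample by increasing values of F.
    order = sorted(range(n), key=lambda i: factor[i])
    f_sorted = [factor[i] for i in order]
    y_sorted = [outcome[i] for i in order]

    curve = []
    # Step 3: move the window through the ordered sample.
    for start in range(n - window_size + 1):
        mid = start + half
        end = start + window_size
        # Centre f: the median of F within the window.
        centre = statistics.median(f_sorted[start:end])
        # Step 2: compare the left half with the right half.
        curve.append((centre, effect(y_sorted[start:mid], y_sorted[mid:end])))
    return curve
```

Passing the effect measure as a function keeps the window logic separate from the choice of test, so a p-value, a log p-value, or a survival test statistic could be substituted without changing the loop.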
We suggest applying CRITLEVEL to "calendar time" t as a prognostic factor in patient registers in order to investigate whether the outcomes of the patients have changed suddenly at certain values of t. The assumption we make is that the rigorous introduction of an innovation will lead to such changes in the target variable if it is effective.

We will first illustrate the method by a constructed example. Let X be a variable (e.g., some laboratory parameter) that is recorded daily for a single patient during an observation period of t = 1, ..., 250 days. Assume that $X(t)$, the value of X at day t, follows a normal distribution $N(\mu_t, 1)$, where $\mu_t$ is a step function ($\mu_t \equiv 2.5$ for $t \le 60$, $\mu_t \equiv 2.0$ for $60 < t \le 120$, and $\mu_t \equiv 3.0$ for $t > 120$) with two systematic effects (jumps) at days 60 and 120, and $\sigma(X(t), X(t')) = 0$ for $t \ne t'$. For the CRITLEVEL method we choose a window size of 100. The effect measure is chosen to be the logarithm of the p-value of a two-sample t-test with sample sizes 2 × 50. Thus, the cutpoints (window centres) for the CRITLEVEL method range from t = 50 to t = 200, so that the method generates a family of 151 comparisons and p-values.

Figure 1 shows the results of the method using a simulated sample of observations $x_t$ (t = 1, ..., 250). Also shown are the underlying step function and a cubic spline (solid line) fitted to the data, indicating the mean change of $x_t$ over time due either to changes in $\mu_t$ or to random variation. As can be seen, the jumps of the underlying step function at days 60 and 120 are well reflected in the downward peaks of the plot of $\log(p_t)$.
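The constructed example can be reproduced approximately with the following sketch. Several details are assumptions not taken from the text: the random seed, the use of the natural logarithm (the base is not stated), and, to stay within the Python standard library, a normal approximation to the two-sample t-test p-value, which is close to the exact value with 50 observations per group but not identical. The helper names `step_mean` and `log_p` are hypothetical.

```python
import math
import random
import statistics


def step_mean(t):
    """Underlying step function: 2.5 up to day 60, 2.0 up to day 120, then 3.0."""
    if t <= 60:
        return 2.5
    if t <= 120:
        return 2.0
    return 3.0


def log_p(left, right):
    """Natural log of a two-sided p-value for a difference in means.

    Assumption: normal approximation to the two-sample t-test.
    """
    se = math.sqrt(statistics.variance(left) / len(left)
                   + statistics.variance(right) / len(right))
    z = (statistics.fmean(right) - statistics.fmean(left)) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return math.log(max(p, 1e-300))         # guard against log(0)


rng = random.Random(1)                      # hypothetical seed
days = list(range(1, 251))                  # t = 1, ..., 250
x = [rng.gauss(step_mean(t), 1.0) for t in days]

half = 50                                   # window size 100 = 2 x 50
curve = []
for c in range(half, len(days) - half + 1):  # cutpoints t = 50, ..., 200
    left = x[c - half:c]                    # days c-49, ..., c
    right = x[c:c + half]                   # days c+1, ..., c+50
    curve.append((c, log_p(left, right)))
```

Plotting the second component of `curve` against the first gives the kind of curve described for Figure 1; because the data are random, the exact location and depth of the downward peaks will vary with the seed.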
How are these peaks generated? If at a certain "critical" level $t_c$ an abrupt systematic change takes place, then the absolute sample value of the effect measure g (here: $\log(p_t)$) can be expected to increase as soon as the right-hand side of the window reaches $t_c$. The local peak of |g| will be situated approximately at $t_c$, because it is at this point that the jump leads to the most pronounced discrimination between the groups compared within one window. As the window shifts to increasing values of t, the sample values of |g| tend to decrease again.

We will apply the method to a semi-real situation, i.e., the presentation of the example is hypothetical but is very much inspired by a real situation. The data of the original example are proprietary, and the results are still subject to dispute and not open for publication.

Assume that a new postoperative therapy for operable lung cancer was routinely introduced in a hospital in January 1985. The value of this innovation is to be assessed using a clinical database containing baseline data and end results of all lung cancer patients admitted to the institution between 1980 and 1990. In a simple standard analysis, the patients receiving the new therapy showed a highly significant survival advantage compared to patients treated before January 1985, the composition of tumor stages being similar in both groups. On the basis of this observed advantage, a claim of a real beneficial effect of the innovation was made. In this situation, it makes sense to examine whether the innovation has had an immediate intervention effect which might be the cause of the observed differences in survival. Preferably, the intervention effect should be evaluated in all eligible lung cancer patients (both operable and nonoperable).
Thus, by investigating the total impact one avoids an artefact which might arise if the criteria for operability have changed, consciously or unconsciously, at about the same time the therapy was introduced.

The CRITLEVEL method was applied to all eligible patients, using a window size of 2 × 45. The result of this analysis is shown in figure 2. The plot suggests at least one sudden change in the target variable occurring at approximately 1230 days after the beginning of the observation period. On the other hand, at the time of the introduction of the intervention in January 1985 (1830 days after the beginning of the observation period), no effect is perceivable. The interpretation is plain enough: the advantage of patients treated by the innovation can be assumed to be due to some change in the institution that took place before January 1985 and is therefore unrelated to the innovation itself. This interpretation becomes even more plausible when one takes into consideration that further sudden changes were detected in the data at the date corresponding to the peak, e.g., changes in the patient characteristics and admission rate.

In this special example the findings of the CRITLEVEL method are not only in contrast to the positive finding from the conventional approach, but they also make a rather strong case against the effectiveness of the innovation. In fact, we believe an argument of this type can sometimes be more devastating than a mere null result from a randomized trial.

Figure 2: Result of the CRITLEVEL analysis for all eligible patients (window size 2 × 45).
Now assume hypothetically that a distinct peak like the one at day 1230 in figure 2 had coincided with the introduction of the new therapy. Clearly, this could then have been a rather convincing argument in favor of the new therapy. In this case, since internal selection cannot have had any influence on the results, one could have been reasonably confident that the intervention effect (and thus the survival benefit) of the introduction of the new therapy is real, unless one assumes one of the following alternatives:

1. Referral of patients to the institution changed abruptly at day 1230 and, moreover, this change went unnoticed and did not show up in the baseline data, but did affect the survival;
2. Some further changes took place in the institution at about day 1230 which were unrelated to the patients' characteristics but were related to their survival.

4. Discussion

The usual, more or less negative view of historical controls or clinical databases for the evaluation of the efficacy of new therapeutic innovations is dominated by the concept of the two-group comparison of patients who have received the new treatment with those who have not. Admittedly, these simple comparisons are susceptible to severe bias. This view, however, may be somewhat simplistic. To investigate the effect of the introduction of a new therapy at a clinical unit based on clinical registers, we strongly recommend doing some kind of change point analysis. We think that the CRITLEVEL method may be useful in this situation. It is an exploratory tool for data analysis and should be used as a complementary concept. Of course, it often happens that the result of the CRITLEVEL method is entirely inconclusive. This, however, is not specific to our procedure and does not lower its general value as an alternative tool.

The idea of considering a family of comparisons is somewhat analogous to that used in the CART procedure for quantitative prognostic factors [3].
However, CART does not restrict the comparisons to fixed-size windows but rather compares all patients to the left and right of the sample values of T. Therefore, it is less suited for detecting a sudden change in the effect measure at a critical level $t_c$, because such an effect would be diluted by the fact that all comparisons cover patients both below and above $t_c$.

Our recommendation for the analysis with CRITLEVEL is to apply it to all eligible patients with the disease or treatment indication, much in the same way as treatment effects are evaluated within the paired availability design proposed by Baker and Lindeman [2], because by investigating the total impact one avoids artefacts which might arise if the selection criteria have changed at about the same time the therapy was introduced. Also, we recommend using only hard variables for defining treatment indication and eligibility, i.e., variables (such as age, gender, calendar time, mortality, tumor site, survival time, etc.) that are not seriously affected by changes in diagnostics or technical equipment.

Some further general remarks on the method are in order. As for the effect measure g, it should be adjusted for random variation because the method does not give confidence intervals along with the estimates. Thus, test statistics or p-values are preferable to differences of mean values. One may be concerned about the multiplicity of estimation or testing connected with the method, especially if p-values are plotted versus the prognostic factor in question. For minimal p-value selection methods it is well known that the p-value tends to overestimate the significance of two-group comparisons at the optimal cutoff point. Several types of adjustment have been proposed, based, e.g., on the Bonferroni inequality or on the distribution of the maximally selected rank statistics (see [7] for a brief review). In our context, however, adjustment, while possible, is probably of little use.
First, our aim is not to actually choose a cut-off point for the prognostic factor. Second, it is less the absolute sizes of the effect measures (e.g., p-values) that are of interest, because the question is not primarily whether or not the two groups within a window centered at T = $t_c$ differ at all with respect to the effect measure. Rather, it is the increase in the effect measure that one is interested in.

One obvious caveat of the method is that it requires rather large sample sizes. This may preclude its application or lead to entirely inconclusive and unconvincing results. In survival analysis, this problem is exacerbated if there are many censored data.

In the form presented here, our method is entirely exploratory. We believe that further work along these lines may be very useful. One extension that might prove helpful in quality assessment would be to relate the degree of implementation of an innovation to the size of the effect measure of a window method. Also, it would be interesting to develop precise hypotheses concerning sudden changes of prognosis, and confirmatory statistics to test these hypotheses.