|
Can postmarketing surveillance studies (Anwendungsbeobachtungen) give meaningful answers to important questions? A critical discussion of 5 examples.
Five conventional postmarketing surveillance studies with 86 to 5702 patients are discussed regarding their potential to answer essential questions and the quality of their implementation. The first question is concerned with the quality of the actual treatment compared to the accepted standards. In studies of treatment with antihypertensive and lipid-reducing medications, it was found that while dose increases did occur, therapeutic targets were not reached for most patients. Similarly, in another study, observed dosing of a protective inhalant was far off the recommended regimen. The second question centers around adverse drug reactions. In a study with a calcium antagonist, no signs of tachycardia were observed, a finding that is in conflict with claims that this class of drugs generally is associated with an increased mortality risk. The third question deals with efficacy measures. They seem to be slightly overestimated in comparison with controlled studies. However, they can be validated using information on their dependence on initial values, their time course, etc. The proportion of invalid data can be judged by means of plausibility checks. Particularly useful are checks of the agreement between case report forms and investigator-independent protocols (patient protocols, printouts), or checks of known relations between consecutive values, as in lung function or oral glucose tolerance tests. Rates of detected invalid data were in the range of 20%. It is shown that routine statistical techniques are not useful for detecting fraud. Finally, general epistemological limitations and possibilities to improve commercial postmarketing surveillance studies are discussed. |
||||||
| Introduction

Postmarketing surveillance studies (AWBs) are commonly used by pharmaceutical companies as marketing instruments. Large amounts of data are thus available at low cost, a potential which seems to be underused. Being charged with the evaluation of 5 AWBs, I attempted to extract valuable information with simple exploratory statistical tools. In the following, I will first describe the organizational background of these studies and then formulate three important questions for which answers were possible. Each point is exemplified with one AWB in detail. In a second section the reliability of data and implementation is examined, and finally recommendations for improvement of AWBs are given.

The 5 AWBs and their organizational background

The evaluations presented here were done as contract research. The patient populations and treatments of the 5 AWBs can be summarized as follows:

1) 4702 patients in Germany with chronic obstructive lung disease were treated with an anti-inflammatory agent for 3 months with intended visits at week 0, 6 and 12. Symptoms (dyspnea, coughing, sputum), general state of health and capability to manage daily, professional and sport activities and to engage in social activities were recorded. In many cases routine lung function parameters (peak expiratory flow (PEF), vital capacity (VK), forced expiratory volume in 1 second (FEV1), mean expiratory flow (MEF25,50,75), and breathing resistance (R)) were measured. Additionally, the patients were asked to measure the peak flow three times a day with a small device given to them and to note the values in a diary ("Lung Study").

2) 391 patients in Germany with high blood pressure were treated for 4 weeks with a slow-releasing calcium antagonist. At the beginning and at the end of the observational period, a 24-hour ambulatory blood pressure measurement (ABPM) as well as manual blood pressure measurements were requested ("ABPM Study").
3) 86 patients in Germany with high blood pressure and high serum lipid levels were treated for 1 year with an ACE inhibitor and a serum-lipid-reducing drug. Visits were planned at week 0, 2, 8, 16, 28, 40 and 52. Blood pressure and serum levels of lipid and glucose metabolites, as well as safety parameters, were obtained ("KHK Study").

4) 786 patients in Austria with high blood pressure were treated for 4 weeks with the same drug as in the ABPM Study. Blood pressure was determined manually at the beginning and the end of the observational period and at 2 visits in between (about 1 and 2 weeks after start) ("Blood Pressure Study").

5) 313 Swiss patients with conjunctivitis were treated for 4 weeks with an anti-inflammatory eye lotion. The intensity of 8 symptoms (chemosis, hyperemia, lid cramps, size of pupil, activity of lacrimal glands, sensitivity to light, itching, sensation of foreign body) was recorded in 4 categories at the beginning and the end of the observational period ("Eye Study").

In all AWBs, demographic and anamnestic data, data on study medication and concomitant therapies, as well as statements on continuation of therapy were requested. Doctors and, in most studies, patients were asked for an efficacy and safety assessment on a 4-point scale. Information on adverse events (AE) was requested without prescribing a specific format.

Can important questions be answered by means of AWBs?

I will show that even marketing-guided AWBs can provide data which, if carefully evaluated, make it possible to answer important questions. Three questions will be considered in detail.

Are actual therapeutic regimes in accordance with accepted standards?

In the KHK Study, only 6 of 86 patients finished the AWB with a blood pressure ≤80/140 mmHg and serum cholesterol < 200 mg/dl, and only 13 patients had final values of blood pressure ≤80/140 mmHg and a cholesterol/high-density-lipid (CHG/HDL) ratio < 5. 
The mean dose of the ACE inhibitor remained at about 50 mg per day, and the mean dose of the metabolic blocker at 20 mg per day, which are clearly below the maximum allowed doses of 75 and 40 mg per day, respectively. The time course of the mean dose and its standard errors does not give any indication of a further increase in ACE inhibitor dose after the second visit, despite the fact that at a dose of 50 mg per day only 50% of the patients were responders. Even the fact that a dose increase from 10 to 20 mg per day clearly raised the response rate (from 28% to 45%) did not stimulate the doctors to try the maximum dose. Since the rate of adverse drug reactions was low (7%) and all shifts of laboratory values were obviously caused by alcohol problems, I did not investigate further to what extent adverse drug reactions prohibited the dose increase. However, it is obvious that therapeutic targets were not reached in this study. Similarly, in the Lung Study the concept of a preventive treatment - high initial dose and regular intake of the medication, along with a maintenance therapy at constant dose after cessation of symptoms - was often not followed. By contrast, in the ABPM Study and the Blood Pressure Study the therapeutic aims as defined by the Liga gegen den Hochdruck [1] were reached to a high degree. Thus, AWBs are suitable instruments for documentation of drug utilization and for medical quality assurance.

Can relevant aspects of adverse drug reactions be detected?

Adverse drug reactions were the central issue of the evaluation of the ABPM Study. Calcium antagonists as used in this study can increase mortality risk in patients with heart insufficiency [2, 3] or after myocardial infarction [4, 5]. One likely mechanism is reflex tachycardia due to rapidly increasing initial drug levels and a corresponding sudden reduction of pressure. This is avoided by a slow-releasing retarded formulation with a slow increase in serum levels. 
Many authorities, therefore, believe that a change in the recommendations for antihypertensive treatment towards beta-blocking agents is not warranted [e.g. 6]. While this can be settled only in a large controlled study, it would be reassuring, in the meantime, to see that the heart rate does not necessarily increase with routine use of the slow-releasing calcium antagonist. A plot of post- against pretreatment values of mean heart rates did not show any trend or single outliers. Since hourly mean values from 52 patients were available, it was possible to search for heart rate increases on a finer time scale. In the 24-hour profile, a very slight increase of heart rate was found only for late night hours (0.00h to 6.00h) and, unexpectedly, a slight reduction immediately after pill intake (6.00h to 12.00h) was observed. When patients with heart insufficiency (N = 21) were removed from the data pool, the late-night heart rate increase disappeared. The mean increase of the heart rate at night and its 95% confidence interval was 0.21 (-0.90 to +1.33) b/min for the entire population, whereas patients with treatment for heart insufficiency showed an increase of 2.48 (-1.92 to +6.88) b/min. On the other hand, in the subsample of patients treated for the first time with a calcium antagonist the mean heart rate at night increased only by 0.18 (-1.19 to +1.55) b/min. Since the heart rate increase does not occur immediately after drug intake, heart insufficiency in combination with calcium antagonists, and not the calcium antagonist by itself, is a likely explanation for that rate increase. In the Blood Pressure Study with the same drug, close investigation of CRFs revealed about 5 times more adverse events (27%) than the spontaneous reports from the doctors (5%). Low rates of adverse drug reactions, which, however, differed greatly between investigators, were found in the KHK Study. In the Eye Study a possible interaction with co-medication was detected. 
Fast information on important aspects of adverse drug reactions, documentation of the reporting behavior of investigators, and drug interactions are all very important for the evaluation of drug safety, and these aspects can be investigated in AWBs.

Are observed effects plausible?

A comparison of the pre/post change in physiological parameters under treatment with the specific effect of the drug obtained in controlled studies is certainly essential to judge how useful a drug is in the field. In the ABPM Study mean systolic and diastolic blood pressure during the day were reduced by 18.7±15.8 mmHg and 10.2±10.2 mmHg, respectively. This is in the range of the specific effects obtained in controlled studies with manual measurements, and about two times that obtained in a small study using 24h-ABPM. Since in that study baseline blood pressure values were low and nifedipine is known to rarely lower blood pressure below the normal range, the observed difference is in line with current theory. In the Blood Pressure Study with the same drug, pre/post changes were 23.0±11.1 mmHg systolic and 12.3±110.8 mmHg diastolic, respectively. The main effect was present within the first week, but increased slightly within the next three weeks. A clear dependence on the baseline (Pearson rho = 0.73; reduction of about 10 mmHg at a baseline diastolic pressure of about 105 mmHg) was observed. These findings agree well with results of a meta-analysis of three controlled phase III studies with 105 patients. Similarly, in the KHK and the Lung Study the size and time course of the effects were in good agreement with controlled studies. However, this may be due to different mechanisms, as discussed below.

Quality of data and implementation

Our examples show that clear answers to important questions can be extracted from conventional AWBs. However, it remains open how trustworthy these answers are. 
This problem can be divided into three aspects: the quality of the data, epistemological and statistical limitations of AWBs, and the quality of implementation of AWBs.

Proportion of implausible data in AWBs

GCP-like monitoring or source data checks are never performed in present AWBs. Furthermore, it is investigator compliance, not patient compliance, that poses a problem in AWBs. (In fact, the noncompliance of patients is an important object of investigation in any AWB.) Therefore, wrong or missing items in the CRFs due to inexperience, lack of compliance or even intended fraud are a crucial issue for data quality in AWBs. On several occasions I was able to check the quality of the data provided by the investigators against investigator-independent protocols. In the ABPM Study, doctors were asked to provide print-outs of the 24h blood pressure measurement along with the CRFs. For 202 CRFs, an original or a copy of the print-out was sent in. Under a technical operating procedure which regulated the data entry, ABPM data were rejected if one of the following conditions was satisfied: a) the date of ABPM differed by more than 7 days from the visit date in the CRF, b) the date on the print-out was made unreadable, c) day or night means were not printed out separately, d) the definition of "day" differed by more than two hours from the norm (7.00h-22.00h [1]). Fraud had to be considered if the date of the print-out was made unreadable or cut away, if the date of visit differed by more than one month from the date of ABPM, if patient diaries - sometimes also added to the protocol - noted other concomitant medication than stated in the CRF, or if the doctor marked "good effect" without having any blood pressure value recorded at visit 2. In 57 out of a total of 391 patients, one or both ABPMs were missing or were excluded according to one of the above criteria, in addition to 24 patients for whom missing data were identified as due to therapy drop-out. 
Since only 202 printouts were available for checking, we ended up with a rough upper estimate of 28% of CRFs with major compliance problems. In addition, there were 32 CRFs with errors of transcription from the print-out to the CRF. In summary, in this particular AWB, we were able to correct mistakes in a large subsample of patients, and to identify and eliminate data from patients who were not suited for inclusion in the study.

Another kind of investigator-independent protocol is the patient diary. Diaries were used in the Lung Study to record 3 measurements per day. We included only diaries with at least one measurement on days 1-10 and 75-84, and with no indication of fraud. Criteria for possible fraud were a) diary sheets are clean and without wrinkles, b) diary sheets are copies, c) homogeneous handwriting and/or the same type of pencil used for all markings. PEF diaries of 661 of the 4702 patients were available for evaluation; the rest were missing or met one of the above criteria. We checked the correspondence between measurements made by patients and those obtained by doctors. The mean PEF after treatment (as percent of baseline) as measured by the patients was 132±45%, compared to a mean value of the doctors' measurements of 137±34%. A plot of patients' versus doctors' measurements showed good agreement of the corresponding values. The fact that the correlation was only moderate (0.562 at baseline and 0.490 after treatment, respectively) is due to extreme values which (apart from probable protocol errors in the case of very small values) can be explained by greater expiratory efforts undertaken by patients in the presence of doctors. (Note also that the correlation coefficient is not particularly suited for measuring agreement between two methods [7].) Therefore it is justified to conclude that patient measurements can be of sufficient quality.

It is also possible to check for plausible relations within the data. 
Again, the Lung Study provides an example. It was planned to measure the lung function via PEF, VK, FEV1, and MEF. We used the measurements only if 0.5 ≤ VK ≤ 9.99 l, 0.5 ≤ FEV1, MEF ≤ 9.99 l/min, FEV1 < VK, and if MEF25,50,75 increased or declined monotonically, depending on whether "air expired" or "air to be expired" was recorded. With these criteria, the data of 54% (2541 of 4702 patients) could be used. A similar strategy can be employed for data of the oral glucose tolerance test (OGTT), which was used by some of the investigators of the KHK Study. We excluded data of the OGTT as implausible if the fasting value was larger than the value after 30 min, or if the value at 60 min was smaller than both the 30 min and the 120 min value. 37 of 47 measurements, i.e. 79%, were judged as plausible.

Clearly, the proportion of data that were missing or had to be excluded from conventional AWBs, viz., 10%-30% or sometimes up to 85%, was very high. The lesson to be learned is that, if one restricts the analysis to patients with complete data in all essential variables, AWBs may become useless.

Routine statistical techniques are not useful for detection of fraud

Plausibility checks like those mentioned above help to detect erroneous data, but rarely detect fraud, since in most AWBs the CRFs are short enough to permit inventing reasonable data. Normal-looking and complete data free of mistakes should even arouse suspicion [8]. I will illustrate this with one of the centers participating in the Lung Study. I became suspicious, since this center (C) recruited 80 instead of the expected 10 patients. Although the sponsor attributed the high recruitment rate to the good relation between the doctors of C and the sales representative, he nevertheless asked me to look for indications of fraud. 
A comparison of the distribution of the data from C with the total study population showed plausible values for VK, for changes of VK from visit 1 to 3, and for the relation between the calculated normal values and the actually measured values of VK. Recruitment dates and rates were inconclusive as well, since a recruitment of 80 patients within 80 days, with a rate of up to 6 patients per day, might be possible for a much-frequented clinic of a lung specialist. Only for the symptom index (a score I used to summarize the information on symptoms) did C end up at the extreme favorable end at the final visit. In summary, the usual statistical evaluations with means, distributions and correlations did not give any clear evidence of fraud.

When inspecting each CRF separately, I found one patient who was 10 years older, 10 kg heavier and had been ill 10 years longer than the previous patient. The type of diagnosis showed some variation for the first 30 patients, but almost all later patients had the same diagnosis. Furthermore, 61 patients had been treated with a co-medication not approved in Germany but approved in neighboring Switzerland. All CRFs were accompanied by patient protocols with peak flow measurements. In 39 cases, a close inspection revealed that at least one criterion for the suspicion of fraud was satisfied. Often the same pencil and the same handwriting were used for the markings as in the CRF, or copies of the original patient protocol were sent in. Visual inspection, as well as the investigation of patterns of sequences, seems to be more successful in detecting fraud than formal statistical analysis. Separate analysis for each participating center and tracing of the CRF flow are needed for such an examination.

Quality of implementation of AWBs

Next, I want to check whether the major objectives and requirements specified in current guidelines for AWBs [9, 10, 11, 12] were met in our 5 AWBs. 
The guidelines stress, among other things, a) the observation of normal, routine use of medications, b) priority of medical over marketing intentions, c) valid observational and evaluation plans, and d) the collection of a representative sample.

Ad a): The strict "non-reactivity principle" [12] with wide inclusion criteria is in conflict with the effort to obtain complete data, especially if complex measurements, such as in the Lung Study or ABPM Study, are involved. The ABPM Study was the only one in which the motivation to use that particular drug was inquired about.

Ad b): All AWBs, except the KHK Study, were initiated by the marketing department. Apart from the KHK and the Lung Study, no medical aims beyond those contained in the original definition of an AWB were declared by the sponsor at the time of initiation. I presume that marketing aims dominated and that the AWBs were not carried out with the aim of observing the long-term use of a drug. Neither was there any planning to include the results in a meta-analysis, despite the fact that the Lung Study was already the eighth structurally similar AWB with that anti-inflammatory inhalant.

Ad c): Though all studies were based on a protocol specifying the details of the observational plan, none of them had a biometrician involved at the planning stage or in the CRF design. However, I was at least able to define the main direction of evaluation before I saw the data. A sophisticated file cleaning was possible, except in the Blood Pressure and the Eye Study. Double data entry was mostly used. Since drop-out rates were high, the evaluation was largely guided by the availability of data.

Ad d): The ABPM Study was the only AWB in which I was able to compare the study population with representative data [13]. Age, sex ratio, and proportion of patients with first diagnosis of high blood pressure coincided well. 
In the Lung Study with almost 5000 patients, an analysis of recruited patients by sales regions showed increasing absolute numbers per region from west to east, though with great disparity. By contrast, the KHK Study had only a small and certainly not representative sample. However, conclusions from this study regarding the achievability of therapeutic targets are useful as reasonable upper limits, since participating doctors were well-motivated specialists.

As for bias, one must keep in mind that studies relying on pre/post changes tend to overestimate the specific effect. This is due to regression to the mean, spontaneous improvement of the disease, and placebo effects. The observed changes in my examples were mostly in the range of effects found in controlled studies. It seems as if compliance problems and greater variation within the study population contributed to a reduction of the observed effect and made the effect size in these AWBs comparable to that of controlled studies. For detection of adverse drug reactions, it is important not to rely on reported cases, but to check all comments, even those not assigned to certain CRF fields. Low rates of reported adverse drug reactions and rare critical comments are a strong indication of selection bias caused by a positive attitude towards the drug or the sponsor. These examples are probably a positive selection with respect to quality of data, file cleaning and evaluation (as judged by reports of earlier AWBs), motivation and money allotted by the sponsor, and the sponsor's reactions to the reports.

Improvement of quality of AWBs

In commercial AWBs, sales representatives recruit the centers. A strict control of investigator compliance is in conflict with the desired increase in sales. On the other hand, sales are the reason why valuable information is available at low cost. 
There are several practical recommendations as to how to increase quality in AWBs which may help resolve some of this dilemma:

a) involve the medical department and consult a biometrician at an early stage of the planning, and define specific questions to be answered by the AWB;
b) restrict the number of patients recruitable by a single doctor;
c) narrow inclusion criteria of patients and centers, but offer modified versions of the observational plan;
d) use protocols independent of the investigator, like patient diaries, equipment print-outs, or laboratory print-outs, in order to achieve measurement blindness;
e) restrict endorsement to complete and plausible CRFs;
f) identify each center and document the flow of each CRF;
g) do an internal rating of the reliability of a center;
h) allot more time and money to file cleaning procedures;
i) eliminate implausible data and report the numbers used in each analysis [14]. The effect that N changes slightly with each variable should be accepted;
j) although the results of an AWB may be regarded only as "working hypotheses", use all techniques for circumventing the problem of multiple testing in order to increase credibility, including the specification of the analysis plan as soon as the data are available;
k) let the specific configurations of the data set direct the final path of evaluation to answer specific questions;
l) keep the doctor informed on the results of the AWB.

While one may continue to use standard designs for AWBs in the future, one should encourage the use of concurrent controls. In order to do so, one needs to clarify and resolve legal problems with the randomization of centers, the concurrent use of proven equivalent therapies [15], and the observation of competing drugs. It is clear that confirmatory statements, especially on the specific effect of a treatment, cannot be obtained by means of AWBs. 
However, there are many more questions that need to be answered in order to optimize the health system, and only a fraction of these can be examined in full-blown randomized, controlled and blinded studies. For many important questions it appears worthwhile to develop reasonable working hypotheses based on high-quality, repeated and cheap observational studies, rather than to rely on gut feelings and conventions. At the same time, this allows many institutions with a low budget to participate in medical research. AWBs or similar devices are particularly well suited for this and should be taken more seriously by sponsors, such as insurance or pharmaceutical companies or health authorities, by practitioners and by journal editors.
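
As an illustration of recommendation i), the plausibility rule described above for the oral glucose tolerance test can be written down in a few lines. The following Python sketch is not the procedure actually used in the evaluation; it only encodes the two exclusion rules stated in the text (fasting value larger than the 30 min value, or 60 min value smaller than both the 30 min and the 120 min value). The function names and the example profiles (in mg/dl) are hypothetical and were invented for illustration only.

```python
from typing import Iterable, Tuple

# One OGTT profile: glucose at 0 (fasting), 30, 60 and 120 minutes.
Profile = Tuple[float, float, float, float]


def ogtt_is_plausible(fasting: float, at_30: float, at_60: float, at_120: float) -> bool:
    """Apply the two exclusion rules stated in the text to one OGTT profile."""
    # Rule 1: the fasting value must not be larger than the 30 min value.
    if fasting > at_30:
        return False
    # Rule 2: the 60 min value must not be smaller than BOTH the 30 min
    # and the 120 min value (a dip in the middle of the curve).
    if at_60 < at_30 and at_60 < at_120:
        return False
    return True


def plausible_fraction(profiles: Iterable[Profile]) -> float:
    """Return the share of profiles that pass both rules."""
    flags = [ogtt_is_plausible(*p) for p in profiles]
    if not flags:
        raise ValueError("no profiles given")
    return sum(flags) / len(flags)


# Hypothetical example profiles, not data from the KHK Study.
example_profiles = [
    (90.0, 150.0, 170.0, 120.0),   # passes both rules
    (110.0, 100.0, 140.0, 120.0),  # fails rule 1: fasting > 30 min value
    (90.0, 150.0, 110.0, 140.0),   # fails rule 2: dip at 60 min
]

if __name__ == "__main__":
    print(f"plausible: {plausible_fraction(example_profiles):.0%}")
```

Applied to the real measurements, the share returned by plausible_fraction would correspond to the 37 of 47 (79%) reported for the KHK Study. Keeping each rule as a separate, commented condition makes it easy to report how many records were excluded by which rule, in line with the recommendation to document the numbers used in each analysis.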
I wish to thank Prof. Lehmacher for stressing simplicity of evaluations and making clear their epistemological foundations, Dr. Simek for explaining science in the commercial context, and R. Wolf for comments on the manuscript. |