The Grading of Recommendations Assessment, Development and Evaluation (GRADE) system is gaining acceptance and prominence as a uniform and transparent approach to translating research evidence into recommendations (1). As more groups such as the World Health Organization (WHO), the American College of Physicians (ACP), and the British Medical Journal (BMJ) adopt this process, it is important to review the advantages and potential disadvantages of the approach.
There are a number of different evidence grading and recommendation systems in use across the world (2, 3). In the United States, at least three nationally-vetted systems for reviewing evidence and developing recommendations in the area of preventive services have common origins and elements: the system used by the U.S. Preventive Services Task Force (USPSTF) (4), that used by the Centers for Disease Control and Prevention's Evaluation of Genomic Applications in Practice and Prevention Work Group (EGAPP) (5), and most recently, that adopted by the Department of Health and Human Services (DHHS) Secretary's Advisory Committee on Heritable Disorders of Newborns and Children (ACHDNC).
GRADE has definite advantages over all of these systems. Notably, it is usable across a wider range of topic areas than screening, genetic testing, and other preventive services; it can account for personal preferences and values; and it allows for the explicit consideration of resource use (1, 6-10). By comparison, the USPSTF process and its derivatives largely exclude preferences and values and pay only limited attention to costs and resource consumption. These GRADE additions are important enhancements that broaden the utility of the system. In addition, the GRADE process involves reporting the evaluation of the quality of evidence and the rating of the strength of the recommendation separately, in a transparent fashion based on explicit criteria. While this is not unique to GRADE, the reports and recommendations using this system allow a reader to easily understand how the developers reached their decisions.
GRADE uses two related axes in the evaluation and translation of evidence: the quality of evidence and the strength of the resulting recommendation. The quality of the evidence is rated as high, moderate, low, or very low (see Table 1). After accounting for study design, explicit criteria are used to up- or down-grade evidence quality; these include consistency of results, directness of evidence, precision, and assessment of the source, direction, and magnitude of potential bias.
Table 1: Quality of Evidence and Definitions
- High quality — Further research is very unlikely to change our confidence in the estimate of effect
- Moderate quality — Further research is likely to have an important impact on our confidence in the estimate of effect and may change the estimate
- Low quality — Further research is very likely to have an important impact on our confidence in the estimate of effect and is likely to change the estimate
- Very low quality — Any estimate of effect is very uncertain
Source: Guyatt G, et al, GRADE: an emerging consensus on rating quality of evidence and strength of recommendations, BMJ 2008;336:924-6.
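The up- and down-grading described above can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the GRADE Working Group's procedure: the starting levels (randomized trials begin at high, observational studies at low) come from general GRADE guidance rather than from this commentary, the function name and parameters are hypothetical, and moving exactly one level per concern is a simplification of what is in practice a matter of judgment.

```python
from enum import IntEnum


class Quality(IntEnum):
    """The four quality-of-evidence levels listed in Table 1."""
    VERY_LOW = 1
    LOW = 2
    MODERATE = 3
    HIGH = 4


def rate_quality(randomized: bool, downgrades: int = 0, upgrades: int = 0) -> Quality:
    """Hypothetical helper: start from study design, then move up or down.

    Assumption (not stated in this commentary): randomized evidence starts
    at HIGH and observational evidence starts at LOW. Each downgrade stands
    for one serious concern (for example inconsistency, indirectness,
    imprecision, or likely bias); each upgrade stands for one factor that
    raises confidence.
    """
    start = Quality.HIGH if randomized else Quality.LOW
    level = int(start) - downgrades + upgrades
    # Clamp the result to the four defined categories.
    level = max(int(Quality.VERY_LOW), min(int(Quality.HIGH), level))
    return Quality(level)


# Randomized trials with one serious concern about precision.
print(rate_quality(randomized=True, downgrades=1).name)
# Observational studies with two serious concerns.
print(rate_quality(randomized=False, downgrades=2).name)
```

The arithmetic is the trivial part. Deciding whether a concern is serious enough to count as a downgrade is where evaluators can reasonably differ.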
The second axis is the translation of the evidence into one of four recommendation categories: a strong or weak recommendation to provide the service, or a strong or weak recommendation to not provide the service. Factors that influence this categorization include the quality of the evidence, the balance between desirable and undesirable effects, personal values and preferences, and whether the intervention represents a good use of resources (see Table 2). In addition, the GRADE system employs a unique system for rating the importance of the health outcomes considered in making the recommendation, from not important to critical for decision making. The interaction between the importance of the outcome and the quality of evidence for affecting that outcome influences the strength of the recommendation. For example, very low quality evidence for a critical outcome such as mortality could lead to a weak recommendation even in the face of good evidence for a less important outcome such as short-term improvement in symptoms.
Table 2: Factors That Affect the Strength of a Recommendation
| Factor | Examples of Strong Recommendations | Examples of Weak Recommendations |
|---|---|---|
| Quality of evidence | Many high quality randomised trials have shown the benefit of inhaled steroids in asthma | Only case series have examined the utility of pleurodesis in pneumothorax |
| Uncertainty about the balance between desirable and undesirable effects | Aspirin in myocardial infarction reduces mortality with minimal toxicity, inconvenience, and cost | Warfarin in low risk patients with atrial fibrillation results in small stroke reduction but increased bleeding risk and substantial inconvenience |
| Uncertainty or variability in values and preferences | Young patients with lymphoma will invariably place a higher value on the life prolonging effects of chemotherapy than on treatment toxicity | Older patients with lymphoma may not place a higher value on the life prolonging effects of chemotherapy than on treatment toxicity |
| Uncertainty about whether the intervention represents a wise use of resources | The low cost of aspirin as prophylaxis against stroke in patients with transient ischemic attacks | The high cost of clopidogrel and of combination dipyridamole and aspirin as prophylaxis against stroke in patients with transient ischaemic attacks |
Source: Guyatt G, et al, GRADE: an emerging consensus on rating quality of evidence and strength of recommendations, BMJ 2008;336:924-6.
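One way to picture the second axis is as a record that stores the recommendation category together with the four judgments behind it, so that a reader can see how the category was reached. The sketch below is illustrative only: the class and field names are hypothetical, nothing in it is prescribed by GRADE, and it deliberately does not compute the strength from the factors, since that step is a judgment rather than a formula.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Direction(Enum):
    FOR = "for"
    AGAINST = "against"


class Strength(Enum):
    STRONG = "strong"
    WEAK = "weak"


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical record pairing a category with the judgments behind it."""
    intervention: str
    direction: Direction
    strength: Strength
    # Free-text notes on the four factors in Table 2.
    quality_of_evidence: str
    balance_of_effects: str
    values_and_preferences: str
    resource_use: str

    def label(self) -> str:
        return f"{self.strength.value} recommendation {self.direction.value} {self.intervention}"


# The four recommendation categories: strong/weak crossed with for/against.
CATEGORIES = list(product(Strength, Direction))

# Example built from the aspirin row of Table 2. Fields the table does not
# address are left as explicit placeholders rather than filled in.
aspirin = Recommendation(
    intervention="aspirin in myocardial infarction",
    direction=Direction.FOR,
    strength=Strength.STRONG,
    quality_of_evidence="not stated in Table 2 (placeholder)",
    balance_of_effects="reduces mortality with minimal toxicity, inconvenience, and cost",
    values_and_preferences="not stated in Table 2 (placeholder)",
    resource_use="not stated in Table 2 (placeholder)",
)
print(aspirin.label())
```

Keeping the factor notes next to the category mirrors the transparency that GRADE asks for: the label alone says little about why a recommendation is strong or weak.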
Something is lost, however, in the GRADE recommendation categories when compared with other systems: the conclusion that the evidence is insufficient to support a recommendation either for or against the intervention. On the positive side, this eliminates the "no recommendation" category, which draws criticism from clinicians and patients, as it provides little direction for decision-making. To maintain important flexibility in the setting of insufficient evidence, the GRADE process allows for recommendations for use only in the context of research (8), which somewhat mitigates the problem of not having a category for insufficient evidence/no recommendation. However, the "I" letter grade for a conclusion of insufficient evidence to recommend for or against providing a service is the most commonly used grade for the USPSTF and remains a critical category for evaluating evidence for preventive services (11). It seems that more explicit attention to this common issue, at least in the area of prevention, is warranted.
Another potential disadvantage of GRADE is the possibility that weak recommendations, either for or against providing the service, might combine recommendations that are weak due to insufficient evidence with those due to other factors, such as closely matched benefits and harms, influence of values and preferences, and resource utilization. This potential heterogeneity within GRADE categories could lead recommendation panels to recommend for or against an intervention based on factors other than evidence of effectiveness, such as potential but unproven benefit. If a weak recommendation, made on the basis of potential net benefit but without at least moderate quality evidence, leads to the increased use of an intervention, and further research finds the intervention harmful, then the uncertainty inherent in the system will have led to a poor outcome.
One example of the application of the GRADE process is the WHO recommendations on the treatment of H5N1 avian influenza with oseltamivir in humans (12). Here, the review panel rated the quality of the evidence as very low. Nevertheless, the panel gave this therapy a strong recommendation, citing the high mortality of the disease, the low risk of harms associated with the drug, the lack of other potential therapies, and the value judgment that treatment might be helpful even though there is no direct evidence of efficacy. There are potential hazards with this conclusion: in the face of a strong recommendation, there may be no motivation to collect even observational data that could help clarify potential efficacy. Also, while not the intent of the panel, this recommendation has been referenced by policymakers in spending millions of dollars to stockpile a drug whose benefit in the response to an influenza pandemic, outside of the potential use for prophylaxis, is uncertain at best. On the other hand, while others may or may not agree with the panel's conclusion, the advantage of the GRADE process is that the authors were transparent in how they reached this conclusion, and therefore the process allows for the recommendation to be evaluated with additional perspectives.
Inherent in all evidence rating and recommendation development is the need for the individuals using the process to exercise judgment at key points. The criteria for judging evidence quality are laid out very clearly by GRADE, but the application of these criteria requires judgment, and the same body of evidence can be rated differently by individuals who bring their own biases to judging the evidence.
The variation in judgments regarding the translation of the evidence into a recommendation is likely to be even greater than that in rating evidence, as the criteria for categorizing the recommendation as weak or strong, or even for or against the intervention, are more subjective. Weighing benefits and harms involves comparing rates of desirable and undesirable outcomes that may be quite disparate. For example, breast cancer chemoprophylaxis with tamoxifen requires balancing the prevention of invasive breast cancer against the risk of thromboembolic events, stroke, and endometrial cancer. Different evaluators, and ultimately different patients, are likely to judge the tradeoff of benefits and harms differently.
Allowing judgments regarding values and preferences, as well as considerations of resource use, to influence the strength of a recommendation is an important advantage of GRADE. However, there will inevitably be variation in how these issues are weighed; for example, judgments about what resource use or costs are justified by potential outcomes could vary widely. While the GRADE process again requires a transparent description of how the recommendation was reached, the process may well lead to inconsistency in the translation of evidence into recommendations, very likely between guideline development groups, and possibly within the same group across different interventions.
Finally, the application of the recommendation categories may not be straightforward. While the strong recommendations for or against providing the intervention are clear, the weak recommendations may be more problematic. Should clinicians spend precious time, and incur the associated opportunity costs, explaining the risks and benefits of an intervention in those situations where the evidence is insufficient to recommend for or against the service? From the policy perspective, while it seems clear that interventions with strong recommendations should be covered by insurance or public health care, does this mean that weak recommendations in favor of an intervention should not be covered?
The GRADE process is an important step forward in providing a consistent way of approaching evidence review and guideline development. Successful application of the GRADE process in translating to health improvement at the individual and population level requires a good understanding of the process, and a commitment of the clinician to read beyond the recommendation category to understand the judgments made by the evaluators in terms of tradeoffs between benefits and harms, influence of preferences and values, and resource utilization. A wider application of the process over a larger number of topics will be required to determine whether GRADE realizes its full potential of promoting effective, evidence-based practice.
Ned Calonge, MD, MPH
Dr. Calonge is Chair of the U.S. Preventive Services Task Force (USPSTF). This commentary is a reflection of his views and opinions, and not those of USPSTF.
The views and opinions expressed are those of the author and do not necessarily state or reflect those of the National Guideline Clearinghouse™ (NGC), the Agency for Healthcare Research and Quality (AHRQ), or its contractor, ECRI Institute.
Potential Conflicts of Interest
Dr. Calonge is Chief Medical Officer, Colorado Department of Public Health and Environment. He reports no potential financial conflicts of interest.
References
1. Guyatt G, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ. 2008;336:924-6.
2. Atkins D, et al. Systems for grading the quality of evidence and the strength of recommendations I: Critical appraisal of existing approaches. BMC Health Serv Res. 2004;4:38.
3. Atkins D, et al. Systems for grading the quality of evidence and the strength of recommendations II: Pilot study of a new system. BMC Health Serv Res. 2005;5:25.
4. Sawaya GF, et al. Update on the methods of the U.S. Preventive Services Task Force: estimating certainty and magnitude of net benefit. Ann Intern Med. 2007;147:871-5.
5. Teutsch SM, et al. The Evaluation of Genomic Applications in Practice and Prevention (EGAPP) initiative: methods of the EGAPP Working Group. Genet Med. 2009;11(1):3-14.
6. Atkins D, et al. Grading quality of evidence and strength of recommendations. BMJ. 2004;328:1490-4.
7. Guyatt G, et al. GRADE: what is "quality of evidence" and why is it important to clinicians? BMJ. 2008;336:995-8.
8. Guyatt G, et al. GRADE: going from evidence to recommendations. BMJ. 2008;336:1049-51.
9. Schünemann H, et al. GRADE: grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ. 2008;336:1106-10.
10. Guyatt G, et al. GRADE: Incorporating considerations of resources use into grading recommendations. BMJ. 2008;336:1170-3.
11. Petitti D, et al. Update on the methods of the U.S. Preventive Services Task Force: insufficient evidence. Ann Intern Med. 2009;150:199-205.
12. Schünemann H, et al. WHO rapid advice guidelines for pharmacological management of sporadic human infection with avian influenza A (H5N1) virus. Lancet Infect Dis. 2007;7. http://infection.thelancet.com