5 December 2012. Clinicians trying out the new version of the Diagnostic and Statistical Manual of Mental Disorders, the DSM-5, largely come to the same conclusions, according to a trio of papers published online October 30 in the American Journal of Psychiatry. The studies were conducted by the Research Group of the DSM revision team, led by Darrel Regier of the American Psychiatric Association (APA) and David Kupfer of the University of Pittsburgh, and including members from academia and the APA. For 14 of 23 disorders with adequate sample sizes, the revised diagnostic criteria produced consistent results when applied by different clinicians, as did new dimensional measures of symptoms shared across disorders.
For schizophrenia, reliability in diagnosis and in dimensional measures of psychosis scored in the “good” range. This is reassuring, given that for this revision of the DSM, the APA has abandoned the subtypes of schizophrenia (e.g., paranoid, disorganized, etc.) present in previous editions of the manual.
“For the most part, the DSM-5 worked,” said Helena Kraemer, a biostatistician at Stanford University, who was part of the Research Group. “It’s probably the most scientifically credible revision of the manual that’s ever been done.”
These trials mark the endgame of a 14-year-long revision process of the DSM, a catalogue of psychiatric disorders that helps clinicians accurately and consistently make their diagnoses. The new results may help assuage some concerns over revisions to the manual, and identify sections needing further tweaking before its release in 2013.
The new studies focused on “test-retest” reliability of the manual, asking whether two different clinicians using it to evaluate the same patient would come up with the same diagnosis. This was explored under real-world conditions, in which clinicians with different backgrounds diagnosed a random sample of patients, including those with symptoms that crossed diagnostic boundaries. This approach contrasts with the previous DSM-IV field trials, which estimated reliability under ideal conditions: patients with comorbid symptoms were excluded, and clinicians were highly trained experts in a particular disorder.
The Research Group charged with testing the DSM-5 had a different mindset from the get-go, Kraemer told SRF. “A decision was made early on that the purpose of the DSM was for patient care,” she said. “So we had to evaluate the quality of these diagnoses for real patients in the hands of real clinicians in the real world.”
This real-world mentality could account for the somewhat lower measures of reliability obtained for the DSM-5 in these studies compared to those for the DSM-IV, though they still compare favorably to those in other branches of medicine, according to Kraemer and colleagues (Kraemer et al., 2012).
In the first paper, first author Diana Clarke and colleagues laid out the design of the field tests, which took place at 10 sites in the U.S. and one site in Canada. Several features of the trials allowed assessment of how the manual would perform in everyday clinical settings: 1) 279 clinicians of different levels of expertise and experience, including psychiatrists, psychologists, and mental health nurses, interviewed the patients; 2) the clinicians were trained to use the DSM-5 through one hour of Web-based training, and three hours of in-person training, similar to how training is expected to proceed after it is published; 3) standardized diagnostic interviews were not used, because they are not routinely used in clinical practice; and 4) patients were selected randomly, with 2,246 patients ultimately enrolled and 86 percent completing both interviews.
To amass enough data in 10 months on 21 adult and 12 pediatric diagnosis categories contained in the DSM-5, researchers employed a stratified sampling design in which different sites were assigned to collect patients with four to seven different target disorders. The researchers aimed to get 50 people per target diagnosis per site to make precise measures of reliability for even rare disorders. Initially, arriving patients were screened and sorted into one of these targeted disorders if they met DSM-IV criteria (or likely had symptoms associated with DSM-5 criteria), or into an “other diagnosis” catch-all. Then, a DSM-5 trained clinician, blind to this assignment, would interview the patient and diagnose according to DSM-5 criteria. Four hours to two weeks later, a second clinician, also blind to this information, would do the same.
The clinicians entered their data into a centralized database, which was then analyzed separately by the field test organizers. This provided a more objective measure of reliability, Kraemer said, than the previous DSM-IV field trials, in which data could be analyzed by the same people who had developed the revisions to the manual.
In the second paper, first author Regier and colleagues reported that adequate sample sizes were obtained for 15 adult diagnoses and eight pediatric diagnoses. Reliability was quantified with intraclass kappa, a probability-based measure that reflects the predictive value of the first diagnosis, with values close to 1 being predictive (i.e., a high chance that the second diagnosis would agree), and values close to 0 being unreliable. Intraclass kappa takes into account the possibility of chance agreement, which other measures, such as a simple percentage of cases in which clinicians agreed, do not. It also provides confidence intervals, giving researchers a sense of precision for their reliability estimates.
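As a rough illustration of the statistic (not the analysis code used in the field trials), intraclass kappa for a set of paired binary diagnoses can be sketched in a few lines of Python; the `intraclass_kappa` function and the toy data below are hypothetical:

```python
def intraclass_kappa(pairs):
    """Test-retest intraclass kappa for paired binary diagnoses.

    `pairs` holds (first_diagnosis, second_diagnosis) tuples, coded 1
    (disorder present) or 0 (absent), one tuple per patient.
    """
    # Observed agreement: fraction of patients whose two diagnoses match.
    p_o = sum(a == b for a, b in pairs) / len(pairs)
    # Chance agreement uses a single pooled diagnosis rate, because the
    # intraclass version treats the two raters as interchangeable.
    ratings = [r for pair in pairs for r in pair]
    p = sum(ratings) / len(ratings)
    p_e = p * p + (1 - p) * (1 - p)
    return (p_o - p_e) / (1 - p_e)

# Toy data: five patients, four of whom received matching diagnoses.
print(round(intraclass_kappa([(1, 1), (1, 0), (0, 0), (1, 1), (0, 0)]), 2))  # prints 0.6
```

Note how chance agreement is subtracted from both numerator and denominator: two clinicians who simply diagnosed everyone as positive would agree 100 percent of the time, yet score a kappa of 0.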
Overall, the DSM-5 performed well, with a majority of disorders scoring in the “very good” and “good” ranges of reliability, and giving prevalence rates in the same ballpark as those obtained with the DSM-IV. Disorders in the very good range (kappa 0.60-0.79) included post-traumatic stress disorder (PTSD), complex somatic symptom disorder, major neurocognitive disorder, autism spectrum disorder, and attention deficit hyperactivity disorder. Those scoring in the “good” range (kappa 0.40-0.59) included schizophrenia, schizoaffective disorder, bipolar I disorder, binge eating disorder, alcohol use disorder, mild neurocognitive disorder, borderline personality disorder, avoidant/restrictive food intake disorder, and oppositional defiant disorder.
Schizophrenia fared well enough, with kappa scores of 0.46 for schizophrenia itself and 0.50 for schizoaffective disorder. The prevalence of schizophrenia diagnoses was lower at one site than that found using the DSM-IV on the same patients (0.53 vs. 0.37), but whether this will translate to population-wide changes in prevalence awaits epidemiological studies based on DSM-5 criteria. Overall, the results suggest that folding the DSM-IV’s different schizophrenia subtypes into a single schizophrenia diagnosis in the DSM-5 did not hurt diagnosis reliability. Similarly, combining different autism-related DSM-IV diagnoses into a single autism diagnosis gave a consistent kappa (0.69).
The hotly contested attenuated psychosis syndrome (APS) (see SRF Live Discussion), considered a potential precursor of psychotic disorders like schizophrenia, did not accrue enough cases to precisely estimate reliability. Though the kappa was 0.46, the confidence interval was too wide for this to be meaningful. Kraemer suggests that this reflected a design flaw that overestimated how many such patients would be seen at the sites tasked with enrolling APS cases, rather than problems with the APS diagnostic criteria or clinician training on it. Earlier this year, APS was stricken from the main text of the DSM-5 (see SRF related news story).
Disappointing results came for major depressive disorder and generalized anxiety disorder, which both scored in the questionable range. Because the criteria for these disorders did not change substantially in the DSM-5 revision, Kraemer suggests the low reliabilities may reflect the plethora of other symptoms that tend to come with these disorders, and their variation over time. Among field trial patients with diagnoses of major depressive disorder, generalized anxiety disorder, PTSD, and alcohol use disorder, comorbidity was the rule, rather than the exception—only a minority had “pure” versions of a disorder that did not include symptoms from other disorders.
Because different psychiatric disorders share features, and because the boundaries between diagnoses may be a difference in degree rather than type of symptom, the third study tried out new dimensional measures of these “cross-cutting” symptoms. First author William Narrow of the Division of Research at the APA and colleagues reported the consistency of the assessments of each patient in 14 psychological domains, including depression, anger, mania, anxiety, and substance abuse. Using questionnaires developed to capture these features, patients rated themselves (or informants rated patients unable to rate themselves, either because of age or ability) on a scale of 1-5 on each item at each of two visits before their interviews with a clinician. The two independent clinicians also rated patients in two domains—psychosis and suicide risk.
These assessments were remarkably consistent across the two visits, as measured by intraclass correlation coefficients (ICCs). For adult patients scoring themselves, and for parents scoring their children, ICCs fell in the good or excellent range, whereas less reliable scores emerged when children under 11 rated themselves. For the two domains rated by clinicians, however, the results were less consistent: in adults, psychosis was rated with good reliability, but less well in children. Suicide risk was also judged inconsistently, with ICCs in the questionable range for adults and the unacceptable range for children, suggesting that some troubleshooting remains to be done.
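For readers unfamiliar with ICCs, a minimal sketch of a one-way, two-visit ICC is below. It contrasts between-patient variability with visit-to-visit disagreement; the `icc_two_visits` function and the sample scores are illustrative assumptions, not the study’s actual analysis:

```python
def icc_two_visits(scores):
    """One-way intraclass correlation for two-visit test-retest ratings.

    `scores` holds (visit1, visit2) rating tuples, one per patient.
    """
    n, k = len(scores), 2
    grand = sum(a + b for a, b in scores) / (n * k)
    # Between-patient mean square: spread of each patient's mean rating
    # around the grand mean.
    msb = k * sum(((a + b) / k - grand) ** 2 for a, b in scores) / (n - 1)
    # Within-patient mean square: disagreement between the two visits.
    msw = sum((a - (a + b) / 2) ** 2 + (b - (a + b) / 2) ** 2
              for a, b in scores) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Identical ratings at both visits give an ICC of 1.0 ...
print(icc_two_visits([(1, 1), (2, 2), (3, 3)]))          # prints 1.0
# ... while visit-to-visit drift lowers it.
print(round(icc_two_visits([(1, 2), (3, 3), (5, 4)]), 3))  # prints 0.862
```

An ICC near 1 means patients’ scores at the second visit track their scores at the first; an ICC near 0 means the two visits disagree about as much as two randomly paired patients would.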
Though it is not yet clear whether this dimensional view will be included in the DSM-5, the results position dimensions as promising complements to the category-dominated view of psychiatric diagnoses. A forthcoming study of “convergent validity” will explore how well dimensional measures predict diagnosis, and vice versa. This will give a sense of the overlap between these two styles of measurement, and also help address issues of validity, by focusing on how well the disorders are captured by the measures for which they were designed.
Although the overall mix of patients varied in their age, ethnicity, and education, the authors noted that the types of patients seen at these large, academic clinical sites might not catch the full spectrum of psychiatric patients. Another forthcoming study will report how the DSM-5 fared in smaller, general practice clinics. Because their small size precludes any measure of reliability, the focus will be on how user friendly the DSM-5 is in this setting, asking whether it is clear, practical to use, and useful.
This represents a tremendous amount of work for one manual, and Kraemer said it may be the last time a full-scale revision of the DSM happens. Instead, the DSM-5 is planned as a kind of “living document,” with advances in knowledge incorporated into the manual within a year or two rather than after the roughly 15-year wait for a full-blown revision of the entire manual. This rolling process of improvement via piecemeal adjustment would make the DSM-5 more responsive to scientific progress in psychiatric disorders, and could disseminate the fruits of science into the clinical world more quickly.—Michele Solis.
Clarke DE, Narrow WE, Regier DA, Kuramoto SJ, Kupfer DJ, Kuhl EA, Greiner L, Kraemer HC. DSM-5 Field Trials in the United States and Canada, Part I: Study Design, Sampling Strategy, Implementation, and Analytic Approaches. Am J Psychiatry. 2012 Oct 30. Abstract
Regier DA, Narrow WE, Clarke DE, Kraemer HC, Kuramoto SJ, Kuhl EA, Kupfer DJ. DSM-5 Field Trials in the United States and Canada, Part II: Test-Retest Reliability of Selected Categorical Diagnoses. Am J Psychiatry. 2012 Oct 30. Abstract
Narrow WE, Clarke DE, Kuramoto SJ, Kraemer HC, Kupfer DJ, Greiner L, Regier DA. DSM-5 Field Trials in the United States and Canada, Part III: Development and Reliability Testing of a Cross-Cutting Symptom Assessment for DSM-5. Am J Psychiatry. 2012 Oct 30. Abstract