Hong Kong J Psychiatry 2004;14(3):19-25


The Chinese Bilingual SCID-I/P Project: Stage 3 — Multi-site Inter-rater Reliability
E So, I Kam, L Lam

Dr Eddie So, FHKCPsych, FHKAM, Department of Psychiatry, Tai Po Hospital, Tai Po, Hong Kong, China.
Dr Irene Kam, MRCPsych, FHKAM, Department of Psychiatry, Shatin Hospital, Shatin, Hong Kong, China.
Professor Linda Lam, MRCPsych, FHKAM, Department of Psychiatry, Chinese University of Hong Kong, Shatin, Hong Kong, China.

Address for correspondence: Dr Eddie So, Department of Psychiatry, Tai Po Hospital, Tai Po, Hong Kong, China.
E-mail: somp@ha.org.hk

Submitted: 27 October 2004; Accepted: 1 December 2004

Objective: The Chinese Bilingual Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders-IV is a recently introduced semi-structured psychiatric diagnostic instrument for use with Chinese patients. The aim of this study was to ascertain its multi-site inter-rater reliability.

Patients and Methods: Newly registered outpatients from 4 centres were consecutively recruited. Multi-site inter-rater reliability is expressed in percentage agreement.

Results: Forty-one outpatients from 14 diagnostic categories were recruited. When principal diagnosis, comorbidity, and lifetime diagnosis were considered together, the overall percentage agreement was 74%: 75% for the Li Ka Shing Psychiatric Centre, 74% for the Alice Ho Nethersole Hospital Psychiatric Clinic, 60% for the North District Hospital Psychiatric Clinic, and 85% for the Alice Ho Nethersole Hospital Pain Clinic. When principal diagnosis alone was considered, the percentage agreement was 82% for the Li Ka Shing Psychiatric Centre, 83% for the Alice Ho Nethersole Hospital Psychiatric Clinic, 63% for the North District Hospital Psychiatric Clinic, and 90% for the Alice Ho Nethersole Hospital Pain Clinic. The overall percentage agreement for principal diagnoses was 80%, which improved to 87% when only the psychiatric patients were considered.

Conclusions: The Chinese Bilingual Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders-IV is a reliable diagnostic instrument with a satisfactory degree of agreement among different raters and across different sites for use in the outpatient setting. Specifically, inter-rater reliability is higher in psychiatric than in non-psychiatric settings. At this stage, use for comorbid diagnosis is not recommended.

Key words: Asian continental ancestry group, Outpatients, Reproducibility of results


The Chinese Bilingual Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders-IV (DSM-IV; CB-SCID-I/P) is a semi-structured diagnostic instrument for making the Axis I DSM-IV diagnoses.1 The CB-SCID-I/P is a semi-translated Chinese version of the SCID-I/P,2 in which the DSM-IV diagnostic criteria remain in their original English format. Translation followed the standard back-translation procedure.3 Research methodology paralleled that of the original authors.4,5 The study is divided into the following 5 stages:

  • stage 1 — translation of the SCID-I/P. Test-retest reliability and clinician-rater reliability of the CB-SCID-I/P on mood disorders and schizophrenia and related psychotic disorders in an inpatient population
  • stage 2 — clinician-rater reliability of the CB-SCID-I/P on anxiety disorders, adjustment disorders, and ‘no diagnoses’ in an outpatient population
  • stage 3 — multi-site inter-rater reliability of the CB-SCID-I/P in an outpatient population
  • stage 4 — reliability of the CB-SCID-I/P for low-prevalence disorders by enhanced sampling
  • stage 5 — reliability of the SCID-I/P major psychiatric diagnoses against pooled data from stages 1, 2, and 4.

Stages 1 and 2 have already been completed.6,7 In stage 1, test-retest reliability between the SCID raters achieved a percentage agreement of 89.6% with a κ of 0.84. When compared with the clinicians’ best-estimated diagnoses, the overall κ for clinician-rater reliability was 0.77.6,7

The results from stage 2 gave an overall κ of 0.71 for clinician-rater reliability.6 Taken together, stages 1 and 2 support the CB-SCID-I/P as a sensitive and specific diagnostic instrument for ascertaining DSM diagnoses in Chinese people.
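Percentage agreement and κ measure related but different things: κ discounts the agreement that two raters would reach by chance alone. The following is a minimal Python sketch of how both statistics can be computed from paired diagnoses. The function names and the 10 diagnosis labels are hypothetical and chosen for illustration only; they are not data from any stage of this study.

```python
from collections import Counter


def percent_agreement(rater1, rater2):
    """Proportion of cases in which both raters gave the same label."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater1, rater2))
    return matches / len(rater1)


def cohen_kappa(rater1, rater2):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(rater1)
    p_o = percent_agreement(rater1, rater2)  # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of each rater's marginal proportions per label.
    p_e = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for 10 patients (illustration only, not study data).
r1 = ["MDD", "MDD", "SCZ", "SCZ", "none", "MDD", "SCZ", "none", "MDD", "SCZ"]
r2 = ["MDD", "MDD", "SCZ", "MDD", "none", "MDD", "SCZ", "none", "SCZ", "SCZ"]

print(f"agreement = {percent_agreement(r1, r2):.2f}")
print(f"kappa     = {cohen_kappa(r1, r2):.4f}")
```

In this made-up example the raters agree on 8 of 10 patients, yet κ is lower than the raw agreement because part of that agreement would be expected by chance.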

7 inter-rater reliability between inexperienced and experienced raters would therefore become more sensitive when used for an outpatient cohort. In addition, the use of 2 new recruitment sites, the North District Hospital Psychiatric Clinic (NDH) and the Alice Ho Nethersole Hospital Pain Clinic (PC), a non-psychiatric facility that also acted as a control towards over-diagnosis, ascertained the broader applicability of the instrument.

Patients and Methods

Between 2002 and 2004, patients were independently and consecutively recruited from the Li Ka Shing Psychiatric Centre (LKS), Alice Ho Nethersole Hospital Psychiatric Clinic (AHNH), NDH, and PC. There was no prescreening and the same recruitment criteria were used at every site. A test-retest design was employed. Data from the clinics were pooled. Population characteristics were similar across the sites. The degree of inter-rater reliability was expressed as percentage agreement. The recruitment criteria were as follows:

  • fluent in the Cantonese dialect
  • age from 16 to 65 years
  • ability to give written informed consent
  • absence of history of major head injury, serious neurological or medical problems, severe pervasive developmental disorder (autism), or significant cognitive deficits (mental retardation, dementia).

Rater 1 consisted of 2 experienced SCID raters, both of whom had been extensively involved in the development of the instrument and in stages 1 and 2 of the reliability study. There were 6 novice raters (rater 2) from the 4 sites. Two raters (from LKS and AHNH) completed the SCID workshop in 2000. The others (from NDH and PC) were individually trained to administer the instrument. All raters were clinicians. They were allowed to use whatever information was available to them at the time of the assessment. Diagnoses were generated according to the CB-SCID-I/P manual. The number of diagnoses generated was limited to a maximum of 3, namely, principal diagnosis, comorbid diagnosis, and lifetime diagnosis. The novice conducted the first SCID interview. The retest interviews took place immediately after the initial interview, with the exception of 3 patients for whom the interviews took place 2 weeks apart. The second rater was blinded to the first interviewer’s findings. A principal diagnosis according to DSM-IV was generated for each patient. When indicated, a comorbid diagnosis and/or lifetime diagnosis was also entered. Only Axis I diagnoses were recorded.


It took between 35 and 80 minutes to conduct an SCID interview. All data were processed by the Statistical Package for the Social Sciences version 11.

Results

Forty-one patients were recruited, as follows: 11 from LKS, 12 from AHNH, 8 from NDH, and 10 from PC. All patients gave written informed consent and completed the study. Table 1 shows the patients’ characteristics according to the study sites. The demographic characteristics of the 4 groups of patients were comparable. Women were over-represented.

Diagnostic Profile

For the ‘principal diagnosis’, rater 1 and rater 2 each generated 31 DSM diagnoses and 10 ‘no diagnosis’ for the 41 patients; there was disagreement on the principal diagnosis for 8 patients. Six patients were given a ‘comorbid diagnosis’, of which half were in agreement. Rater 1 diagnosed 5 patients, whereas rater 2 diagnosed 4 patients, with comorbid conditions. Four of these patients were from the AHNH site. Seven patients were given a ‘lifetime diagnosis’, of which 4 were in agreement. Rater 1 entered 7 lifetime diagnoses and the novices entered 4. Overall, there were 14 DSM categories. Rater 1 and rater 2 disagreed with each other for 11 patients, with 14 discrepant diagnostic pairs. The diagnostic profiles from each site are represented in Tables 2 to 5.

Table 2 shows the diagnostic profile of the 11 patients from LKS. Twelve SCID diagnostic pairs were generated. The raters differed in diagnoses for patients 2, 6, and 11. Both raters agreed that patients 4 and 8 had no psychiatric diagnoses.

Table 3 shows the profile of diagnoses at the AHNH site. There were 18 DSM diagnostic pairs among the 12 patients. Rater 1 entered 16 diagnoses and rater 2 entered 14 diagnoses. The raters differed for patients 12, 14, and 18, amounting to 6 discrepant diagnostic pairs from 3 patients.

Table 4 shows the profile of the 8 patients from NDH. There were 10 diagnostic pairs and 4 were discrepant. Rater 1 made 10 SCID diagnoses versus 9 for rater 2. The raters differed for patients 24, 30, and 31.

Table 5 shows the profile of patients from PC. Rater 1 and rater 2 generated equal numbers of SCID diagnoses. Both raters agreed that 6 of the patients had no principal psychiatric diagnoses, but disagreed on the diagnosis of 2 patients.

Multi-site Inter-rater Reliability

Table 6 compares the principal diagnoses of rater 1 with those of rater 2 at all sites. The diagnosis of patient 18 is considered to be equivalent to major depressive disorder. The diagnosis of ‘substance-induced psychotic disorder’ is incorporated into ‘schizophrenia and related disorders’. ‘Panic disorder with agoraphobia’ is included in the ‘panic disorder’ category. ‘Pain disorder’ and ‘undifferentiated somatoform disorder’ are included in ‘somatisation disorders’. The number of diagnoses made by the experienced raters was the same as for the novices. There was disagreement about the diagnosis for 8 patients. The percentage agreement between raters 1 and 2 was 80%.

Table 7 shows the correlation between the 3 psychiatric sites. When the PC was excluded, there were 31 patients and those with no diagnosis decreased to 4, resulting in a more specific inter-rater reliability evaluation for the psychiatric cohort. Rater 1 diagnosed 1 more patient than rater 2. The percentage agreement was 87%.

Table 8 shows the data for all sites and all diagnoses. The number of diagnostic pairs increased to 54 and non-concordance increased to 14. The rate of agreement was 74%.
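The percentage agreements quoted above follow directly from the counts of concordant and discordant pairs. As a minimal sketch (the function name `agreement_pct` is an assumption introduced here for illustration, not part of the study's SPSS analysis), the arithmetic can be reproduced in Python from figures stated in the text:

```python
def agreement_pct(total_pairs, discordant_pairs):
    """Percentage of diagnostic pairs on which the two raters agreed."""
    if total_pairs <= 0 or not 0 <= discordant_pairs <= total_pairs:
        raise ValueError("counts must satisfy 0 <= discordant <= total, total > 0")
    return 100.0 * (total_pairs - discordant_pairs) / total_pairs


# Counts taken from the text of this report.
principal_all_sites = agreement_pct(41, 8)    # 33/41, about 80.5, reported as 80%
all_diagnoses = agreement_pct(54, 14)         # 40/54, about 74.1, reported as 74%
ndh_all_diagnoses = agreement_pct(10, 4)      # 6/10 = 60.0, reported as 60%

for label, value in [
    ("Principal diagnosis, all sites", principal_all_sites),
    ("All diagnoses, all sites", all_diagnoses),
    ("All diagnoses, NDH", ndh_all_diagnoses),
]:
    print(f"{label}: {value:.1f}%")
```

Percentage agreement of this kind does not correct for chance agreement, which is why it tends to read higher than a κ computed on the same data.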


Discussion

The patients’ demographics and the range of DSM diagnoses resembled the characteristics expected from an average outpatient clinic. The profiles for principal diagnosis of rater 1 and rater 2 were identical, with no observable bias towards any particular diagnostic pattern. Experienced raters are more likely to commit to comorbid and lifetime diagnoses than less experienced raters (in stage 1, SCID raters made twice as many comorbid diagnoses as the clinicians).6,7 However, this trend was not borne out when only non-psychiatric patients were considered. In line with the results from stage 2, this finding refutes the argument that the instrument has a propensity towards over-diagnosis. A sample size of 41 patients is too small to ascertain individual scores for each type of disorder, but is sufficient to demonstrate the overall degree of inter-rater reliability between the 2 groups of raters. When restricted to principal diagnosis, the percentage agreement was 80%, which improved to 87% when only psychiatric patients were considered, closely approximating the 90% agreement in stage 1 of the study. It should be emphasised that stage 1 involved inpatients while stage 3 involved outpatients.

As argued by Robins8,9 and demonstrated in stage 2,7 a lower degree of inter-rater reliability is expected owing to the reduced prevalence of psychopathology among outpatients. However, the fact that the percentage agreement falls when ‘all patients’ rather than ‘psychiatric patients’ are considered suggests that the CB-SCID instrument has less reliability and therefore less credibility for non-psychiatric application. Similarly, when all diagnoses are taken into consideration, the percentage agreement decreases from 87% to 74%, as found in stages 1 and 2, so the authors cannot recommend the use of the instrument for studying psychiatric comorbidity. Table 9 shows the different percentage agreements from each site and with different cohorts.

The strengths of the present investigation are its multi-site, multi-rater design and its consecutive recruitment process, which are crucial for ensuring general application of the instrument. The study demonstrated a satisfactory degree of test-retest inter-rater reliability of the CB-SCID-I/P when used by trained novice interviewers from 4 sites. Novices who had attended the training workshop showed better inter-rater reliability than raters who were trained individually. When standardised training is given, the degree of diagnostic accuracy among the novice raters closely approaches that among the experienced raters.10 The results attest that the administration of the CB-SCID-I/P can be reliably conducted by trained clinicians and does not lead to over-diagnosis.


Acknowledgements

The authors would like to thank Drs Sandra Chan, Teddy Chan, PT Ho, Simon Lui, Arthur Mak, and TS Wong for their contribution towards the study. Material support from the Prince of Wales Hospital, Alice Ho Nethersole Hospital, and North District Hospital is gratefully acknowledged.


