Population-Based Identification of Biopsy-Proven IgA Nephropathy using Natural Language Processing: The Knight Study

Rishi Parikh, Thida Tan, Ajit Mahapatra, Weijia Wang, Robert Perkins, Alan Go


Background and aims
IgA nephropathy (IgAN) is the most common form of glomerulonephritis worldwide and is typically confirmed through kidney biopsy. Although clinical disease coding systems, including the International Classification of Diseases, 10th Revision (ICD-10), have added approximate diagnosis codes for IgAN, their use varies across healthcare delivery systems and they have low sensitivity and specificity for biopsy-proven IgAN. To facilitate more effective population-based disease management, we developed and validated natural language processing (NLP) algorithms to identify adults with IgAN at a population level using electronic health records within a large, integrated healthcare delivery system.

Method
We identified an initial sample of all adults with ICD-10 codes for IgAN, glomerulonephritis or nephrotic syndrome in Kaiser Permanente Northern California (KPNC) who were alive and health plan members between 1 January 2010 and 31 December 2020. A random sample of 200 electronic health records, including progress notes, laboratory results and biopsy reports, was manually adjudicated by two physicians for the presence of biopsy-proven IgAN and the associated IgAN diagnosis date. We developed rule-based NLP algorithms using i2E software (v6.2.0, Linguamatics, Cambridge, UK) in 50% of this manually adjudicated sample (derivation set) and calculated performance metrics in the remaining 50% (validation set). Finally, we applied the validated algorithms to the full KPNC population with diagnosis codes for potential IgAN to identify the IgAN cohort from 2010 to 2020. We also described the characteristics of the IgAN cohort at the identified diagnosis date.
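
The derivation/validation design and the idea of a rule-based text match can be pictured with a small sketch. This is not the i2E workflow used in the study: plain Python regular expressions stand in for the commercial software, and the function names, search terms and negation phrases below are all hypothetical choices made for illustration only.

```python
import random
import re


def split_derivation_validation(record_ids, seed=0):
    """Randomly split adjudicated record IDs into two equal halves.

    The seed is an arbitrary illustrative value, not one from the study.
    """
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Hypothetical search terms for an IgAN mention in free text.
IGAN_PATTERN = re.compile(
    r"\b(iga\s+nephropathy|igan|berger'?s\s+disease)\b", re.IGNORECASE
)

# Hypothetical negation cues that must appear before the mention,
# within the same sentence, for the mention to be ignored.
NEGATION_PATTERN = re.compile(
    r"\b(no evidence of|negative for|not consistent with|rule out|ruled out)\b[^.]*$",
    re.IGNORECASE,
)


def flags_igan(report_text):
    """Return True if any sentence contains a non-negated IgAN mention."""
    for sentence in re.split(r"(?<=[.!?])\s+", report_text):
        match = IGAN_PATTERN.search(sentence)
        if match and not NEGATION_PATTERN.search(sentence[: match.start()]):
            return True
    return False


derivation_ids, validation_ids = split_derivation_validation(range(200))
print(len(derivation_ids), len(validation_ids))
print(flags_igan("Immunofluorescence findings are consistent with IgA nephropathy."))
print(flags_igan("There is no evidence of IgA nephropathy."))
```

In a design like this, the rules are tuned only against the derivation half, so that performance measured on the validation half is not inflated by the tuning. A production system would also need to handle section headers, historical mentions and uncertainty, which this sketch ignores.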

Results
Compared with ‘gold-standard’ physician adjudication, our NLP algorithms identified IgAN with a sensitivity of 93.9%, specificity of 98.5%, positive predictive value of 96.9% and negative predictive value of 97.0% in the validation set. Among 31 471 adults with diagnosis codes of IgAN, glomerulonephritis or nephrotic syndrome who were alive and health plan members between 2010 and 2020, the NLP algorithms identified 1735 IgAN patients. IgAN-specific diagnosis codes were neither sensitive nor specific for IgAN as identified by NLP (Table 1). The IgAN cohort had a mean (±SD) age of 40.9 (±16.2) years; 47.8% were women, 40.6% were Asian or Pacific Islander and 18.4% were of Hispanic ethnicity (Table 2). The comorbidity burden was moderate, with the most common comorbid conditions being hypertension (40.7%), dyslipidemia (40.0%) and chronic lung diseases (16.9%). The median estimated glomerular filtration rate by CKD-EPI was 61.2 mL/min/1.73 m² and 51.4% had severe proteinuria (equivalent urine albumin-to-creatinine ratio of ≥300 mg/g) (Table 2).
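
For readers less familiar with the four performance measures reported above, the following sketch shows how each is derived from a 2×2 table of algorithm result against physician adjudication. The counts are hypothetical and chosen only for illustration; the abstract does not report the underlying confusion matrix, and the function name is our own.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute standard diagnostic accuracy measures from 2x2 counts.

    tp: algorithm positive, adjudicated positive
    fp: algorithm positive, adjudicated negative
    fn: algorithm negative, adjudicated positive
    tn: algorithm negative, adjudicated negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that were found
        "specificity": tn / (tn + fp),  # share of non-cases correctly excluded
        "ppv": tp / (tp + fp),          # share of flagged records that are true cases
        "npv": tn / (tn + fn),          # share of unflagged records that are non-cases
    }


# Hypothetical counts, NOT the study's data.
counts = {"tp": 40, "fp": 2, "fn": 3, "tn": 55}
metrics = diagnostic_metrics(**counts)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity describe the algorithm itself, whereas the predictive values also depend on how common IgAN is in the sample being evaluated, so they may differ when the same algorithm is applied to a population with a different case mix.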

Conclusion
Within an integrated healthcare delivery system, we developed NLP techniques that use structured and unstructured data from a robust electronic health record system to accurately identify and characterize patients with biopsy-proven IgAN at a population level. These methods can enable health systems to conduct large-scale identification, centralized management and long-term follow-up of patients with IgAN. Future studies should use these methods to characterize disease burden and identify modifiable predictors of health outcomes in adults with IgAN.
