JMIR Med Inform. 2020 Dec 12. doi: 10.2196/25457. Online ahead of print.
BACKGROUND: Medical notes are a rich source of patient data; however, the unstructured nature of the text has largely precluded using these data in large retrospective analyses. Transforming clinical text into structured data can enable large-scale research studies with electronic health record (EHR) data. Natural language processing (NLP) can be used for text information retrieval, reducing the need for labor-intensive chart review. Here we present an application of NLP to large-scale analysis of medical records at two large hospitals for patients hospitalized with COVID-19 infection.
OBJECTIVE: Our study goal was to develop an NLP pipeline to classify the discharge disposition (home, inpatient rehabilitation, skilled inpatient nursing facility [SNIF], and death) of patients hospitalized with COVID-19 based on hospital discharge summary notes.
METHODS: Text mining and feature engineering were applied to unstructured text from hospital discharge summaries. The study included patients with COVID-19 discharged from 2 hospitals in the Boston, Massachusetts area (Massachusetts General Hospital and Brigham and Women’s Hospital) between March 10, 2020, and June 30, 2020. The data were divided into a training set (70%) and a hold-out test set (30%). Discharge summaries were represented as bags-of-words consisting of single words (1-grams), 2-grams, and 3-grams. The number of features was reduced during training by excluding n-grams that occurred in fewer than 10% of discharge summaries, and further by applying LASSO regularization while training a multiclass logistic regression model. Model performance was evaluated in the hold-out test set.
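To make the feature-engineering step concrete, the following is a minimal sketch of a bag of 1- to 3-grams with a document-frequency cutoff. It is not the authors' code: the whitespace tokenizer, the function names (`ngrams`, `build_vocabulary`, `vectorize`), and the toy notes are assumptions for illustration, and the LASSO-regularized logistic regression step is not shown.

```python
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return all 1- to n_max-grams of a token list as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(documents, min_doc_fraction=0.10, n_max=3):
    """Keep n-grams that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for text in documents:
        # Use a set so each n-gram is counted once per document.
        doc_freq.update(set(ngrams(text.lower().split(), n_max)))
    threshold = min_doc_fraction * len(documents)
    return sorted(g for g, df in doc_freq.items() if df >= threshold)


def vectorize(text, vocabulary, n_max=3):
    """Map one document to a count vector aligned with the vocabulary."""
    counts = Counter(ngrams(text.lower().split(), n_max))
    return [counts[g] for g in vocabulary]


# Toy, made-up notes (assumption); not real patient text.
notes = [
    "discharged home with home health",
    "discharged to rehab after intubation",
    "discharged home with follow up",
    "transferred to nursing facility",
]
# A 50% cutoff is used here only because the toy corpus is tiny;
# the abstract reports a 10% cutoff.
vocab = build_vocabulary(notes, min_doc_fraction=0.5)
features = [vectorize(note, vocab) for note in notes]
```

With real data, the vocabulary would be built on the training split only and then applied unchanged to the hold-out test set, so that no information from the test notes leaks into feature selection.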
RESULTS: The study cohort comprised 1737 adult patients (median [SD] age, 61 years old; 55% men; 45% White and 16% Black; 14% non-survivors; 61% discharged home). The model selected 179 features from a vocabulary of 1056 engineered features, consisting of combinations of unigrams, bigrams, and trigrams. The features contributing most to the model's classification for each outcome were: ‘appointments specialty’, ‘home health’, and ‘home care’ (home); ‘intubate’ and ‘ARDS’ (inpatient rehabilitation); ‘service’ (SNIF); and ‘brief assessment’ and ‘covid’ (death). For prediction of discharge disposition in the test set, the model achieved a micro-average area under the receiver operating characteristic curve of 0.98 (95% CI 0.97-0.98) and a micro-average precision of 0.81 (95% CI 0.75-0.84).
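For readers unfamiliar with micro-averaging, the sketch below shows one common way to compute a micro-average area under the receiver operating characteristic curve for a multiclass model: every (patient, class) pair is treated as one binary decision, and a single curve is computed over the pooled decisions. The abstract does not describe the authors' implementation, so the function names and the toy scores are assumptions.

```python
def auroc(labels, scores):
    """Binary AUROC via the rank-sum (Mann-Whitney U) form, averaging tied ranks."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Items i..j-1 share a score; their 1-based ranks are i+1..j.
        avg_rank = (i + 1 + j) / 2
        rank_sum += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def micro_average_auroc(y_true, y_score, classes):
    """Pool one-vs-rest labels and scores over all classes, then compute one AUROC."""
    flat_labels, flat_scores = [], []
    for true_class, class_scores in zip(y_true, y_score):
        for k, cls in enumerate(classes):
            flat_labels.append(1 if true_class == cls else 0)
            flat_scores.append(class_scores[k])
    return auroc(flat_labels, flat_scores)


# Toy, made-up predictions (assumption); each row holds one score per class.
classes = ["home", "rehab", "SNIF", "death"]
y_true = ["home", "death", "rehab"]
y_score = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.2, 0.6, 0.1, 0.1],
]
micro_auc = micro_average_auroc(y_true, y_score, classes)
```

Because micro-averaging pools all decisions, frequent classes (here, discharge home) weigh more heavily in the result than rare ones, which is worth keeping in mind when reading a single pooled figure.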
CONCLUSIONS: A supervised learning-based NLP approach is able to classify discharge disposition of patients hospitalized with COVID-19 infection. This approach has the potential to accelerate and increase the scale of research on patients’ discharge disposition that is possible with EHR data.