Best Practices on Big Data Analytics to Address Sex-Specific Biases in Our Understanding of the Etiology, Diagnosis, and Prognosis of Diseases

Annu Rev Biomed Data Sci. 2022 May 13. doi: 10.1146/annurev-biodatasci-122120-025806. Online ahead of print.


A bias in health research toward understanding diseases as they present in men can have a grave impact on the health of women. This paper reports on a conceptual review of the literature on machine learning or natural language processing (NLP) techniques used to interrogate big data to identify sex-specific health disparities. We searched Ovid MEDLINE, Embase, and PsycINFO in October 2021 using synonyms and indexing terms for (a) “women,” “men,” or “sex”; (b) “big data,” “artificial intelligence,” or “NLP”; and (c) “disparities” or “differences.” From 902 records, 22 studies met the inclusion criteria and were analyzed. Results demonstrate that inclusion by sex is inconsistent and often unreported, although men are included in these studies disproportionately less often than women. Even though artificial intelligence and NLP techniques are widely applied in health research, few studies use them to take advantage of unstructured text to investigate sex-related differences or disparities. Researchers are increasingly aware of sex-based data bias, but progress toward correction is slow. We reflect on best practices for using big data analytics to address sex-specific biases in understanding the etiology, diagnosis, and prognosis of diseases. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 5 is August 2022.
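
The three-part search described in the abstract can be sketched as a Boolean query. This is a minimal illustration, not the authors' actual strategy: it uses only the example terms quoted above (the full synonym lists and database indexing terms are not given here), it assumes the three concept groups were combined with AND, and it uses generic Boolean syntax rather than the syntax of any particular database. The names CONCEPT_GROUPS, quote, and build_query are hypothetical.

```python
# Illustrative only: these groups hold just the example terms quoted in the
# abstract, not the authors' full synonym lists or indexing terms.
CONCEPT_GROUPS = {
    "sex": ["women", "men", "sex"],
    "methods": ["big data", "artificial intelligence", "NLP"],
    "outcome": ["disparities", "differences"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they read as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(groups: dict[str, list[str]]) -> str:
    """OR the terms within each group, then AND the groups together.

    Combining groups with AND is an assumption; the abstract lists the
    groups but does not state the operator.
    """
    clauses = [
        "(" + " OR ".join(quote(term) for term in terms) + ")"
        for terms in groups.values()
    ]
    return " AND ".join(clauses)


query = build_query(CONCEPT_GROUPS)

# Share of retrieved records that met the inclusion criteria (22 of 902).
screening_yield = 22 / 902

if __name__ == "__main__":
    print(query)
    print(f"{screening_yield:.1%} of records included")
```

With the figures reported in the abstract, roughly 2.4% of the retrieved records were retained for analysis.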

PMID:35562851 | DOI:10.1146/annurev-biodatasci-122120-025806