Deep Denoising of Raw Biomedical Knowledge Graph from COVID-19 Literature, LitCovid and Pubtator

J Med Internet Res. 2022 May 30. doi: 10.2196/38584. Online ahead of print.

ABSTRACT

BACKGROUND: The multiple types of biomedical associations in knowledge graphs, including COVID-19-related ones, are constructed from co-occurring biomedical entities retrieved from recent literature. However, applications derived from these raw graphs (e.g., association predictions among genes, drugs, and diseases) have a high probability of false-positive predictions, as co-occurrence in the literature does not always indicate a true biomedical association between two entities.
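
To make the source of the noise concrete, the following is a minimal sketch of how a raw co-occurrence graph might be assembled from entity annotations. The function name `build_cooccurrence_graph`, the entity identifiers, and the toy documents are hypothetical illustrations, not the authors' pipeline or the LitCovid/Pubtator data format. Every pair of entities mentioned in the same document becomes an edge, whether or not a real biomedical association exists.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(docs):
    """Count how often each unordered entity pair appears in the same document.

    `docs` is an iterable of entity-ID collections, one per document
    (a hypothetical input format chosen for illustration).
    """
    edge_counts = Counter()
    for entities in docs:
        # Deduplicate within a document and sort so (a, b) is a canonical key.
        for a, b in combinations(sorted(set(entities)), 2):
            edge_counts[(a, b)] += 1
    return edge_counts


# Toy annotations (invented for this sketch).
docs = [
    ["gene:ACE2", "disease:COVID-19", "drug:remdesivir"],
    ["gene:ACE2", "disease:COVID-19"],
    ["drug:aspirin", "disease:COVID-19"],
]
graph = build_cooccurrence_graph(docs)
for (a, b), count in sorted(graph.items()):
    print(a, b, count)
```

In this toy graph the edge between `drug:remdesivir` and `gene:ACE2` exists only because both were mentioned in one document, which is the kind of edge a denoising step would need to assess.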

OBJECTIVE: Data quality plays an important role in training deep neural network models; however, most of the current work in this area has focused on improving a model’s performance under the assumption that the preprocessed data are clean. Here, we studied how to remove noise from raw knowledge graphs with limited labeled information.

METHODS: The proposed framework utilized generative-based deep neural networks to generate a graph that can distinguish the unknown associations in the raw training graph. Two Generative Adversarial Network models, NetGAN and CELL, were adopted for the edge classification (i.e., link prediction), leveraging unlabeled link information based on a real knowledge graph built from LitCovid and Pubtator.

RESULTS: The performance of link prediction, especially in the extreme case of training data versus test data at a ratio of 1:9, demonstrated that the proposed method still achieved favorable results (AUCROC > 0.8 for the synthetic dataset and > 0.7 for the real dataset) despite the limited amount of training data available.
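
The evaluation protocol described here (a 1:9 split of edges into training and test sets, scored by AUCROC) can be sketched as follows. This is a hedged illustration only: it swaps in a simple common-neighbors score in place of NetGAN or CELL, and the helper names (`auc_roc`, `common_neighbors`, `evaluate_link_prediction`) and the toy two-community graph are assumptions made for this example, not part of the published method.

```python
import random
from itertools import combinations


def auc_roc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def common_neighbors(adj, u, v):
    """Baseline link score: number of shared neighbors in the training graph."""
    return len(adj.get(u, set()) & adj.get(v, set()))


def evaluate_link_prediction(edges, nodes, train_fraction=0.1, seed=0):
    """Hold out most edges, score them against sampled non-edges, return AUC."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    n_train = max(1, int(len(edges) * train_fraction))
    train, test = edges[:n_train], edges[n_train:]

    # Adjacency built from training edges only, so test edges stay unseen.
    adj = {}
    for u, v in train:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Negative examples: node pairs that are not edges anywhere in the graph.
    edge_set = {frozenset(e) for e in edges}
    non_edges = [p for p in combinations(nodes, 2) if frozenset(p) not in edge_set]
    negatives = rng.sample(non_edges, min(len(test), len(non_edges)))

    pos = [common_neighbors(adj, u, v) for u, v in test]
    neg = [common_neighbors(adj, u, v) for u, v in negatives]
    return auc_roc(pos, neg), len(train), len(test)


# Toy graph (invented): two fully connected communities of 10 nodes each.
nodes = list(range(20))
edges = list(combinations(range(10), 2)) + list(combinations(range(10, 20), 2))
auc, n_train, n_test = evaluate_link_prediction(edges, nodes, train_fraction=0.1)
print(f"train edges: {n_train}, test edges: {n_test}, AUC: {auc:.3f}")
```

With only 10% of edges visible, a neighborhood-based score like this one has very little structure to work with, which is the regime in which the abstract reports that the generative models remained effective.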

CONCLUSIONS: Our preliminary findings showed that the proposed framework achieved promising results for removing noise during data preprocessing of the biomedical knowledge graph, potentially improving the performance of downstream applications by providing cleaner data.

PMID:35658098 | DOI:10.2196/38584