TyG-er: An ensemble Regression Forest approach for identification of clinical factors related to insulin resistance condition using Electronic Health Records


Insulin resistance is an early stage in the deterioration towards Type 2 diabetes. Identifying and quantifying insulin resistance requires specific blood tests; however, the triglyceride-glucose (TyG) index can provide a surrogate assessment from routine Electronic Health Record (EHR) data. Because insulin resistance is a multi-factorial condition, this study aims to improve its characterisation by discovering non-trivial clinical factors in EHR data and thereby determining where the insulin-resistance condition is encoded.
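As an illustration of how the TyG index can be derived from routine laboratory values, the following minimal sketch uses the commonly cited definition, TyG = ln(fasting triglycerides [mg/dL] × fasting glucose [mg/dL] / 2). The abstract does not state the formula, so this definition, the function name `tyg_index` and the choice of mg/dL units are assumptions rather than the authors' implementation.

```python
import math


def tyg_index(triglycerides_mg_dl: float, glucose_mg_dl: float) -> float:
    """Triglyceride-glucose (TyG) index.

    Assumed definition (not given in the abstract):
        TyG = ln(triglycerides [mg/dL] * glucose [mg/dL] / 2)
    Both inputs are fasting values expressed in mg/dL.
    """
    if triglycerides_mg_dl <= 0 or glucose_mg_dl <= 0:
        raise ValueError("laboratory values must be positive")
    return math.log(triglycerides_mg_dl * glucose_mg_dl / 2.0)


# Hypothetical patient: triglycerides 150 mg/dL, fasting glucose 100 mg/dL.
example_tyg = tyg_index(150.0, 100.0)
print(f"TyG index: {example_tyg:.3f}")
```

Some publications place the division by 2 outside the logarithm, which yields a differently scaled index; the variant used should be checked against the full paper before comparing values.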


We proposed a highly interpretable Machine Learning approach, named TyG-er, based on an ensemble Regression Forest combined with data imputation strategies. We applied three different experimental procedures to test the reliability of TyG-er on the Italian Federation of General Practitioners dataset, named FIMMG_obsdataset, which is publicly available and reflects the clinical use case (i.e., not all laboratory exams are prescribed on a regular basis over time).


The results identified non-conventional clinical factors (i.e., uricemia, leukocytes, gamma-glutamyltransferase and protein profile) and provided novel insight into the best combination of clinical factors for detecting early glucose tolerance deterioration. The robustness of the extracted clinical factors was confirmed by the high agreement of the TyG-er approach across the different experimental procedures (Lin’s concordance correlation coefficient, rc, from 0.664 to 0.911). Moreover, the results of the three experimental procedures demonstrated the predictive power of the TyG-er approach (up to a mean absolute error of 5.68% and rc = 0.666, p < .05).
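For readers unfamiliar with the agreement measure, the sketch below computes Lin's coefficient rc in its standard form, rc = 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), using population (1/n) moments. The function name `lins_ccc` and the toy values are hypothetical, and the authors' exact estimator (for example, a 1/(n−1) variant) may differ.

```python
from typing import Sequence


def lins_ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient between two paired samples.

    Uses population (1/n) variances and covariance; this is an
    assumption, since the abstract does not specify the estimator.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("coefficient is undefined for two identical constant samples")
    return 2.0 * cov_xy / denom


# Toy example: the second series is the first shifted by a constant bias.
# The Pearson correlation would be 1, but the concordance is penalised by the shift.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 3.0, 4.0]
print(f"rc = {lins_ccc(observed, predicted):.3f}")
```

Unlike a plain correlation, rc decreases when predictions are systematically shifted or rescaled relative to the reference, which makes it suitable for checking agreement between experimental procedures.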


The TyG-er approach captures information relevant to identifying the TyG index, which is strictly correlated with the insulin-resistance condition, while extracting the most relevant non-glycemic features from routine data.