Using Statistical Techniques and Machine Learning to Assess Potential Biomarkers and Clinical Characteristics for Differences between Chronic Hepatitis C Virus and Hepatocellular Carcinoma
Hepatitis C virus (HCV) belongs to a family of viruses that infect and inflame the liver. The severity of HCV infection ranges from a mild illness lasting a few weeks (acute) to lifelong liver injury resulting in fibrosis (chronic). Approximately 75%–85% of people infected with HCV progress to chronic infection. Chronic hepatitis C causes the liver to progress through stages of fibrosis, and patients with stage IV hepatic fibrosis can progress to hepatocellular carcinoma (HCC). The poor prognosis of HCC stems from a lack of sensitive and specific screening techniques. However, several studies have suggested that metabolic changes in the liver may prove to be viable biomarkers for the early detection of HCC.
This research is a secondary analysis of metabolic and clinical variables for 29 patients with cirrhotic (stage IV fibrosis) HCV and 30 patients with cirrhotic HCV who have also progressed to HCC (HCV-HCC). Several statistical and machine learning techniques are used to identify candidate biomarkers that differ between patients with HCV-induced stage IV fibrosis and those who have progressed to HCC. First, the number of variables in the data set is reduced using t-tests (with Benjamini-Hochberg corrections) and Random Forest (RF) analysis. Models are then constructed using Multivariate Adaptive Regression Splines (MARS) and Logistic Regression (LR) to further reduce the number of variables, ideally leaving only biomarkers capable of predicting HCC. Although the performance statistics of every model were too low for direct use as a clinical test, several biomarkers of interest were identified for further study. MARS produced the highest performance statistics and appeared to model the data better overall. LR suffered from high standard errors and widely varying performance statistics. Interestingly, all models selected only metabolic biomarkers, even though clinical characteristics were also offered as candidate predictors. MARS identified N-methyl Proline, Octadecanedioate, Pantoprazole, and Xylitol. Logistic Regression identified 3-hydroxypropanoate, Laurylcarnitine, N4-acetylcytidine, N6-acetyllysine, Octadecanedioate, Oleoylcarnitine, Pantoprazole, and Xylitol.
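The Benjamini-Hochberg correction used in the first variable-reduction step can be illustrated with a minimal sketch. The software used in the study is not stated here, so the function name `benjamini_hochberg`, the 0.05 threshold, and the metabolite names and p-values below are all hypothetical and are not taken from the study data. The sketch assumes that one t-test p-value per candidate variable is already available.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return BH-adjusted p-values and a reject flag for each input p-value.

    Illustrative helper (hypothetical name), not the study's actual code.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values are non-decreasing in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected


# Hypothetical t-test p-values for five candidate variables (made-up numbers).
raw = {
    "metabolite_A": 0.001,
    "metabolite_B": 0.008,
    "metabolite_C": 0.039,
    "metabolite_D": 0.041,
    "metabolite_E": 0.60,
}
names = list(raw)
adjusted, rejected = benjamini_hochberg([raw[n] for n in names])

# Variables that survive the false discovery rate threshold.
kept = [n for n, keep in zip(names, rejected) if keep]
for n, q in zip(names, adjusted):
    print(f"{n}: adjusted p = {q:.5f}")
print("kept:", kept)
```

In this made-up example, two variables with raw p-values just under 0.05 are dropped after adjustment, which shows how the correction trims a large candidate list before any model is fitted.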