Date of Award

12-11-2025

Degree Type

Thesis

Degree Name

Master of Science (M.S.)

Department

Computer Science

First Advisor

Manar Samad

Abstract

Understanding causal relationships between input variables and outcomes is critical for scientific advancement. The famous adage in medicine, "correlation does not imply causation," holds profound significance for understanding disease etiology. Although explainable machine learning (ML) has shown major advances in predictive modeling, it remains unclear how ML-derived important variables relate to causal variables. This thesis investigates the association between causal variables and those identified as correlated and important for ML-based prediction to bridge a critical knowledge gap in data science. Specifically, the thesis explores two research questions: when and to what extent are (1) statistically correlated variables also causal, and (2) variables important for ML-based predictions also causal? To answer these questions, this work introduces a novel framework for nonlinear Causal Structure Discovery (CSD) to quantify causal strengths and enable CSD with mixed-type data. The proposed method is evaluated on real-world heart failure (HF) data sets and validated on 16 tabular data sets from diverse domains. Comparative experiments demonstrate that the nonlinear CSD model identifies more clinically meaningful causal variables than its linear counterpart. Findings show that the important features of the ML classifiers are strongly associated with causal variables in male and female heart failure diagnoses. A similar association is observed in 11 out of 16 tabular data sets. Overall, nonlinear CSD models are more accurate than their linear counterparts and produce results more closely aligned with ML-based variable importance. This thesis demonstrates an effective method to provide causal explainability of machine learning through variable importance.
