In the fight against cancer, identifying tumor cells among healthy tissue remains one of the most critical challenges in modern oncology. Our project, "Precision Oncology and Interpretable Machine Learning," combines machine learning with single-cell RNA sequencing to improve how we detect and understand cancer at the cellular level.
Cancer therapeutics traditionally rely on broad population-based treatment protocols and bulk tumor profiling, but these approaches often fail to account for the cellular heterogeneity that drives therapeutic resistance and treatment failure. Single-cell RNA sequencing has opened new frontiers, but identifying the specific cellular populations that respond to or resist targeted therapies remains computationally challenging.
We analyzed three cancer datasets (GSE176078, GSE161529, GSE180286) to ensure our models generalize across different patient populations and experimental conditions. This cross-dataset validation is crucial for developing clinically applicable tools.
We engineered novel cellular features, including CNV scores, cell cycle signatures, apoptosis markers, ribosomal read percentage, total RNA count, a proto-oncogene score, and oncogene activation patterns, to create comprehensive cellular fingerprints.
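As a rough illustration of how a few of these per-cell features could be computed, here is a minimal sketch using NumPy. It assumes a raw cells-by-genes count matrix, identifies ribosomal genes by the `RPS`/`RPL` name prefixes, and scores a gene set as the mean log-normalized expression of its members. The function name, the normalization to 10,000 counts per cell, and the example gene set are assumptions made for illustration, not the project's actual implementation.

```python
import numpy as np

def engineer_features(counts, gene_names, gene_sets, ribo_prefixes=("RPS", "RPL")):
    """Compute simple per-cell features from a raw cells x genes count matrix.

    Hypothetical helper: the scoring scheme here (mean log-normalized
    expression per gene set) is an assumption, not the project's method.
    """
    counts = np.asarray(counts, dtype=float)
    gene_names = np.asarray(gene_names)

    # Total RNA count per cell; guard against division by zero for empty cells.
    total = counts.sum(axis=1)
    safe_total = np.where(total > 0, total, 1.0)

    # Percentage of counts coming from ribosomal protein genes.
    is_ribo = np.array([g.startswith(ribo_prefixes) for g in gene_names])
    features = {
        "rna_count": total,
        "ribo_pct": 100.0 * counts[:, is_ribo].sum(axis=1) / safe_total,
    }

    # Library-size normalization (counts per 10k) followed by log1p.
    lognorm = np.log1p(counts / safe_total[:, None] * 1e4)

    # One score per gene set: mean log-normalized expression of its genes.
    for name, genes in gene_sets.items():
        mask = np.isin(gene_names, genes)
        if not mask.any():
            raise ValueError(f"no genes from set '{name}' found in the matrix")
        features[name] = lognorm[:, mask].mean(axis=1)
    return features


# Toy example: two cells, four genes.
genes = ["MYC", "CCND1", "RPS3", "RPL5"]
toy_counts = [[10, 0, 5, 5], [2, 2, 0, 0]]
toy_features = engineer_features(toy_counts, genes, {"oncogene_score": ["MYC", "CCND1"]})
```

The returned dictionary maps each feature name to an array with one value per cell, so it can be stacked into a feature table for a classifier.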
Using XGBoost, we developed interpretable machine learning models that not only classify cells but also explain the biological reasoning behind each prediction. We also performed a threshold analysis to find the optimal decision threshold for the model.
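A threshold analysis of this kind can be sketched as a sweep over candidate cutoffs on the predicted tumor probability. The example below picks the cutoff that maximizes F1; the choice of F1 as the criterion, the default grid, and the function name are assumptions for illustration, since the text does not state which metric was optimized. In practice `y_prob` would come from the trained classifier's predicted probabilities on a held-out set.

```python
import numpy as np

def best_threshold(y_true, y_prob, thresholds=None):
    """Return (threshold, F1) for the cutoff with the highest F1 score.

    Hypothetical helper: F1 is an assumed selection criterion.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_prob = np.asarray(y_prob, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95

    best_t, best_f1 = None, -1.0
    for t in thresholds:
        pred = y_prob >= t                   # cells called "tumor" at this cutoff
        tp = np.sum(pred & y_true)           # true positives
        fp = np.sum(pred & ~y_true)          # false positives
        fn = np.sum(~pred & y_true)          # false negatives
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:                     # keep the first best cutoff on ties
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


# Toy example with four cells and three candidate cutoffs.
cutoff, score = best_threshold([0, 0, 1, 1], [0.1, 0.4, 0.6, 0.9], [0.3, 0.5, 0.7])
```

Swapping the F1 line for another criterion (for example sensitivity at a fixed specificity) changes the selected cutoff without altering the rest of the sweep.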
We validated our predictions using orthogonal measures such as genomic instability and performed comprehensive pathway analysis to identify therapeutic targets. Our pathway analysis focused on the PI3K–MAPK–EGFR and CDK4–CDK6–CCND1 axes to map dysregulated genes to biological processes and signaling networks, revealing oncogenic dependencies and potential intervention points across multiple cancer-related pathways.
Integration with drug databases, including the Drug Gene Interaction Database (DGIdb) and Open Targets, revealed actionable therapeutic targets, bridging the gap between computational discovery and clinical application.
A critical breakthrough was using CNV (Copy Number Variation) scores as independent validation. Predicted tumor cells consistently showed higher genomic instability, confirming our models were capturing genuine cancer hallmarks rather than technical artifacts.
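One simple way to quantify this kind of separation is the probability that a randomly chosen predicted tumor cell has a higher CNV score than a randomly chosen predicted normal cell, which is the normalized Mann-Whitney U statistic. The sketch below is a plain-Python illustration under that assumption; the function name and the toy scores are hypothetical, and the text does not specify which statistic the project used.

```python
def cnv_separation(cnv_scores, predicted_tumor):
    """Probability that a predicted tumor cell outscores a predicted normal cell.

    0.5 means no separation, 1.0 means every predicted tumor cell has a higher
    CNV score than every predicted normal cell. Ties count as half a win.
    Hypothetical helper; quadratic in the number of cells.
    """
    tumor = [s for s, p in zip(cnv_scores, predicted_tumor) if p]
    normal = [s for s, p in zip(cnv_scores, predicted_tumor) if not p]
    if not tumor or not normal:
        raise ValueError("need at least one cell in each predicted class")

    wins = 0.0
    for t in tumor:
        for n in normal:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(tumor) * len(normal))


# Toy example: two predicted tumor cells, three predicted normal cells.
separation = cnv_separation([0.9, 0.8, 0.2, 0.1, 0.5], [1, 1, 0, 0, 0])
```

The pairwise loop is only practical for small samples; for full datasets a rank-based implementation such as `scipy.stats.mannwhitneyu` gives the same statistic along with a p-value.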
Our models achieved high accuracy by focusing on key oncogenes such as MYC, CCND1, ERBB2, and CDK4. These genes showed consistent upregulation in predicted tumor cells across all datasets, providing robust biomarkers for cancer detection.
We visualized complex oncogenic networks including the PI3K–MAPK–EGFR axis, CDK4–CDK6–CCND1 pathway, and MYC signaling cascade. These network analyses revealed interconnected pathways driving tumor progression and identified multiple intervention points.
Our analysis identified several drugs that target highly expressed oncogenes in our predicted tumor cells, suggesting immediate clinical applications.
1) FDA-approved drugs (Palbociclib - CDK4, CDK6; Erlotinib - EGFR; Venetoclax - BCL2 etc.)
2) Therapeutic Expansion drugs (Abemaciclib, Osimertinib, Alpelisib etc.)
3) Emerging Clinical drugs (Tirbanibulin, Palifermin etc.)
4) Underexplored drugs (Futibatinib etc.)
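The matching step behind a list like this can be sketched as an overlap between upregulated genes and known drug targets. The example below uses a small hand-entered table containing only the drug-target pairs named above; it stands in for a real query against DGIdb or Open Targets, and the function name and the example gene list are hypothetical.

```python
# Drug -> target genes, limited to the pairs named in the list above.
# A real pipeline would pull these interactions from a drug database.
DRUG_TARGETS = {
    "Palbociclib": {"CDK4", "CDK6"},
    "Erlotinib": {"EGFR"},
    "Venetoclax": {"BCL2"},
}

def match_drugs(upregulated_genes, drug_targets=DRUG_TARGETS):
    """Return {drug: sorted list of its targets found among the upregulated genes}."""
    genes = set(upregulated_genes)
    hits = {}
    for drug, targets in drug_targets.items():
        overlap = sorted(targets & genes)
        if overlap:  # keep only drugs with at least one matching target
            hits[drug] = overlap
    return hits


# Hypothetical gene list for illustration only.
candidates = match_drugs(["MYC", "CDK4", "EGFR"])
```

Genes without an entry in the table, such as MYC here, simply produce no match, which is one way candidates for the "underexplored" category could surface.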
Our computational pipeline integrates bioinformatics tools with advanced machine learning frameworks. The use of SHAP (SHapley Additive exPlanations) for model interpretation ensures that our predictions are not just accurate but also biologically meaningful and clinically interpretable.
This work represents a significant step toward precision oncology, where treatment decisions are guided by comprehensive cellular analysis rather than broad population averages. Our interpretable AI approach ensures that clinicians can understand and trust the computational predictions, accelerating adoption in clinical settings.
Dive into our comprehensive analysis, examine our code, and explore our approach to cancer detection.