Precision Oncology and Interpretable Machine Learning

Tumor vs Normal Cell Signatures from Single-Cell Data
Haider Rizvi & Rajeev Prasad

In the fight against cancer, identifying tumor cells among healthy tissue remains one of the most critical challenges in modern oncology. Our project, "Precision Oncology and Interpretable Machine Learning", leverages machine learning and single-cell RNA sequencing to improve how we detect and understand cancer at the cellular level.

3 Cancer Datasets
89% Model Accuracy
40+ Oncogenes Analyzed
300K+ Single Cells

The Challenge: Finding Needles in a Haystack

Cancer therapeutics traditionally rely on broad population-based treatment protocols and bulk tumor profiling, but these approaches often fail to account for the cellular heterogeneity that drives therapeutic resistance and treatment failure. Single-cell RNA sequencing has opened new frontiers, but identifying the specific cellular populations that respond to or resist targeted therapies remains computationally challenging.

Key Insight: Our machine learning models identified tumor cells with 89% accuracy by analyzing expression patterns of oncogenes, cell cycle markers, and genomic instability signatures (copy number variation, or CNV, scores) across three independent breast cancer datasets.

Our Innovative Approach

Multi-Dataset Validation Strategy

We analyzed three breast cancer datasets (GSE176078, GSE161529, GSE180286) to ensure our models generalise across different patient populations and experimental conditions. This cross-dataset validation is crucial for developing clinically applicable tools.
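One way to test generalisation across cohorts is to hold out one dataset at a time, train on the other two, and evaluate on the held-out one. The sketch below shows only the splitting logic. Whether the project used exactly this scheme is an assumption, and `leave_one_dataset_out` and the toy cell list are hypothetical names introduced for illustration.

```python
from collections import defaultdict

DATASETS = ["GSE176078", "GSE161529", "GSE180286"]  # accession IDs from the text


def leave_one_dataset_out(cell_datasets):
    """Yield (held_out, train_idx, test_idx) for each dataset in turn.

    cell_datasets: list giving the source dataset ID of every cell.
    """
    by_dataset = defaultdict(list)
    for idx, dataset_id in enumerate(cell_datasets):
        by_dataset[dataset_id].append(idx)

    for held_out in sorted(by_dataset):
        test_idx = by_dataset[held_out]
        # Every cell from the remaining datasets goes into the training set.
        train_idx = sorted(
            i for d, idxs in by_dataset.items() if d != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy example: six cells, two per dataset (illustrative only).
cell_datasets = [d for d in DATASETS for _ in range(2)]
splits = list(leave_one_dataset_out(cell_datasets))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Splitting by dataset rather than by random cell keeps cells from the same cohort out of both train and test sets, so the score reflects performance on an unseen cohort.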

Phase 1: Data Engineering & Feature Discovery

We engineered cellular features including CNV scores, cell cycle signatures, apoptosis markers, ribosomal gene percentage, total RNA counts, proto-oncogene scores, and oncogene activation patterns to create comprehensive cellular fingerprints.
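To make a few of these features concrete, here is a minimal, library-free sketch that computes total counts, ribosomal percentage, and a simple gene-set score for one cell. It is a simplified stand-in, not the project's pipeline: the `cell_features` helper, the RPS/RPL prefix rule, the counts-per-10,000 normalisation, and the toy counts are all assumptions made for illustration.

```python
import math

ONCOGENES = {"MYC", "CCND1", "ERBB2", "CDK4"}  # genes named in the text


def cell_features(counts, gene_set=ONCOGENES):
    """Compute simple per-cell features from a {gene: raw_count} mapping."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cell has no counts")
    # Rough convention: ribosomal protein genes start with RPS or RPL.
    ribo = sum(c for g, c in counts.items() if g.startswith(("RPS", "RPL")))
    # Log-normalise to counts per 10,000 (an assumed, common convention).
    lognorm = {g: math.log1p(c / total * 1e4) for g, c in counts.items()}
    in_set = [v for g, v in lognorm.items() if g in gene_set]
    background = sum(lognorm.values()) / len(lognorm)
    # Gene-set score: mean expression of the set minus the cell-wide mean.
    score = (sum(in_set) / len(in_set) - background) if in_set else 0.0
    return {
        "total_counts": total,
        "pct_ribosomal": 100.0 * ribo / total,
        "oncogene_score": score,
    }


# Hypothetical toy cell: counts are made up for illustration.
cell = {"MYC": 30, "CCND1": 20, "RPS6": 25, "RPL13": 15, "ACTB": 10}
features = cell_features(cell)
print(features)
```

A positive `oncogene_score` means the gene set is expressed above the cell's average; dedicated single-cell toolkits typically use a more careful background than the plain mean used here.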

Phase 2: Machine Learning Model Development

Using XGBoost, we developed interpretable machine learning models that not only classify cells but also explain the biological reasoning behind each prediction. We then performed threshold analysis to find the optimal classification threshold for the model.
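Threshold analysis can be sketched as a sweep over candidate cut-offs on the predicted tumor probability, keeping the one that maximises a chosen metric. The example below uses F1 as that metric, which is an assumption (the original does not say which criterion was optimised), and the labels and probabilities are made-up toy values.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when cells with probability >= threshold are called tumor."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted_tumor = prob >= threshold
        if predicted_tumor and label == 1:
            tp += 1
        elif predicted_tumor and label == 0:
            fp += 1
        elif not predicted_tumor and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))


# Toy data: 1 = tumor, 0 = normal; probabilities are illustrative only.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]
candidates = [k / 10 for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9

best = best_threshold(y_true, y_prob, candidates)
best_f1 = f1_at_threshold(y_true, y_prob, best)
print(f"best threshold = {best}, F1 = {best_f1:.3f}")
```

In practice the sweep should run on a validation split rather than the test set, so the chosen threshold does not leak information into the reported accuracy.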

Phase 3: Biological Validation & Pathway Analysis

We validated our predictions using orthogonal measures such as genomic instability and performed pathway analysis to identify therapeutic targets. The analysis focused on the PI3K–MAPK–EGFR and CDK4–CDK6–CCND1 axes, mapping dysregulated genes to biological processes and signaling networks and revealing oncogenic dependencies and potential intervention points across multiple cancer-related pathways.

Phase 4: Clinical Translation & Drug Target Discovery

Integration with drug databases, including the Drug Gene Interaction Database (DGIdb) and Open Targets, revealed actionable therapeutic targets, bridging the gap between computational discovery and clinical application.

Breakthrough Discoveries

🧬 Genomic Instability as Validation

A critical breakthrough was using CNV (Copy Number Variation) scores as independent validation. Predicted tumor cells consistently showed higher genomic instability, confirming our models were capturing genuine cancer hallmarks rather than technical artifacts.
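A simple way to quantify "consistently higher" is the probability that a randomly chosen predicted-tumor cell has a higher CNV score than a randomly chosen predicted-normal cell (equivalent to a normalised Mann-Whitney U statistic). The sketch below computes it by brute force; the CNV values are invented for illustration and the choice of this particular statistic is an assumption, not necessarily the check used in the project.

```python
from statistics import median


def prob_superiority(group_a, group_b):
    """P(a > b) for random a in group_a, b in group_b; ties count as 0.5."""
    wins = sum((a > b) + 0.5 * (a == b) for a in group_a for b in group_b)
    return wins / (len(group_a) * len(group_b))


# Hypothetical CNV scores for cells grouped by the model's prediction.
tumor_cnv = [0.42, 0.38, 0.51, 0.29, 0.47]
normal_cnv = [0.12, 0.31, 0.08, 0.15, 0.10]

ps = prob_superiority(tumor_cnv, normal_cnv)
median_gap = median(tumor_cnv) - median(normal_cnv)
print(f"P(tumor CNV > normal CNV) = {ps:.2f}, median gap = {median_gap:.2f}")
```

A value near 0.5 would mean the CNV scores do not separate the two predicted groups, while a value near 1.0 supports the claim that predicted tumor cells carry more genomic instability.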

🎯 Precision Tumor Detection

Our models achieved this accuracy by focusing on key oncogenes such as MYC, CCND1, ERBB2, and CDK4. These genes showed consistent upregulation in predicted tumor cells across all datasets, providing robust biomarkers for cancer detection.

🎨 Network Biology Insights

We visualized complex oncogenic networks including the PI3K–MAPK–EGFR axis, CDK4–CDK6–CCND1 pathway, and MYC signaling cascade. These network analyses revealed interconnected pathways driving tumor progression and identified multiple intervention points.

💻 Clinical Impact

Our analysis identified several drugs that target highly expressed oncogenes in our predicted tumor cells, suggesting immediate clinical applications.

[Figure: Drug targets visualization showing oncogene interactions]

1) FDA-approved drugs (Palbociclib: CDK4, CDK6; Erlotinib: EGFR; Venetoclax: BCL2; etc.)

2) Therapeutic expansion drugs (Abemaciclib, Osimertinib, Alpelisib, etc.)

3) Emerging clinical drugs (Tirbanibulin, Palifermin, etc.)

4) Underexplored drugs (Futibatinib, etc.)
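The matching step behind such a list can be sketched as a lookup: keep each drug whose known targets overlap the genes that are highly expressed in predicted tumor cells. The drug-target pairs below are the ones named above; the hard-coded table stands in for a real database query, and the `highly_expressed` set and `match_drugs` helper are hypothetical.

```python
# Drug -> target genes, taken from the FDA-approved examples in the list above.
DRUG_TARGETS = {
    "Palbociclib": ["CDK4", "CDK6"],
    "Erlotinib": ["EGFR"],
    "Venetoclax": ["BCL2"],
}


def match_drugs(drug_targets, expressed_genes):
    """Return {drug: [targets that are in expressed_genes]} for drugs with a hit."""
    matches = {}
    for drug, targets in drug_targets.items():
        hits = [gene for gene in targets if gene in expressed_genes]
        if hits:
            matches[drug] = hits
    return matches


# Hypothetical set of oncogenes found highly expressed in predicted tumor cells.
highly_expressed = {"CDK4", "EGFR", "MYC"}

matches = match_drugs(DRUG_TARGETS, highly_expressed)
for drug, hits in matches.items():
    print(f"{drug}: targets {', '.join(hits)}")
```

Genes with no matching drug (MYC in this toy set) drop out of the result, which is one way candidates for the "underexplored" tier could surface.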

Technical Innovation

Python, Scanpy, XGBoost, SHAP, NetworkX, Single-cell RNA-seq, Machine Learning, UMAP, Pathway Analysis

Our computational pipeline integrates bioinformatics tools with advanced machine learning frameworks. The use of SHAP (SHapley Additive exPlanations) for model interpretation ensures that our predictions are not just accurate but also biologically meaningful and clinically interpretable.
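The idea behind Shapley-based explanations can be shown on a tiny model: each feature's contribution is its average marginal effect on the prediction over all orderings in which features are switched from a baseline to their actual values. The brute-force sketch below is for intuition only and is not how the SHAP library computes values for tree models; `toy_model`, its coefficients, and the feature values are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every feature ordering."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = x[name]  # switch this feature to its actual value
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(f):
    """Hypothetical tumor score with one interaction term."""
    return (
        0.5 * f["cnv_score"]
        + 0.3 * f["oncogene_score"]
        + 0.2 * f["cnv_score"] * f["cell_cycle_score"]
    )


x = {"cnv_score": 1.0, "oncogene_score": 2.0, "cell_cycle_score": 1.0}
baseline = {"cnv_score": 0.0, "oncogene_score": 0.0, "cell_cycle_score": 0.0}

phi = shapley_values(toy_model, x, baseline)
print(phi)
print("sum of contributions:", sum(phi.values()))
```

The contributions add up to the difference between the prediction for the cell and the prediction for the baseline, which is the additive property that makes per-feature attributions readable. The interaction term is split evenly between the two features involved.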

Future Implications

This work represents a significant step toward precision oncology, where treatment decisions are guided by comprehensive cellular analysis rather than broad population averages. Our interpretable AI approach ensures that clinicians can understand and trust the computational predictions, accelerating adoption in clinical settings.

Next Steps

Explore Our Research

Dive into our analysis, examine our code, and explore our approach to cancer detection.

View on GitHub | Browse Notebooks | View Research Poster

Meet the Researchers

Haider Rizvi - GitHub Profile
Rajeev Prasad - GitHub Profile

Postgraduate students in Applied Data Science, passionate about applying data science to solve real-world problems in healthcare and beyond.