Precision Oncology through Interpretable Machine Learning

Tumor vs Normal Cell Signatures from Single-Cell Data

Research Team

Haider Rizvi & Rajeev Prasad

MADS Capstone Team 24 • Summer 2025

🎯 Abstract & Objectives

This project analyzes single-cell transcriptomics data from human breast cancers to distinguish tumor from normal cell populations. Our workflow integrates data acquisition, feature engineering (including OXPHOS activity and proto-oncogene categorization), copy number variation analysis, machine learning classification, and therapeutic target identification using DGIdb, Open Targets, and clinical trial datasets.

Key Goals

  • Feature Engineering: Extract biologically meaningful signatures from scRNA-seq data
  • Machine Learning Model: Develop interpretable ML models to distinguish tumor vs normal cells
  • Cross-Dataset Validation: Ensure model generalizability across multiple breast cancer datasets
  • Therapeutic Discovery: Identify clinically relevant gene targets for precision oncology

Analysis Workflow Overview

Analysis Workflow Diagram

📊 Datasets & Methodology

Single-Cell RNA-seq Datasets

  • GSE176078: Primary training dataset - Breast cancer scRNA-seq
  • GSE161529: Cross-dataset validation set - Additional breast cancer samples
  • GSE180286: Cross-dataset validation set - Additional breast cancer samples

🧬 Feature Engineering

  • Cell cycle scoring
  • Apoptosis signatures
  • Ribosomal content analysis
  • OXPHOS pathway activity
  • CNV burden quantification
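
The signature features listed above (cell cycle, apoptosis, ribosomal, OXPHOS) can all be thought of as gene-set scores. The sketch below is a minimal illustration, not the project's actual code: it scores one cell as the mean expression of a signature gene set minus the mean of a background set, which is the same basic idea behind gene-set scoring functions such as scanpy's `score_genes`. The function name, the gene sets, and the expression values are assumptions chosen for illustration.

```python
from statistics import mean


def signature_score(expression, signature_genes, background_genes):
    """Mean expression of signature genes minus mean of background genes.

    `expression` maps gene symbol -> normalized expression for ONE cell.
    Genes missing from the cell's profile are skipped.
    """
    sig = [expression[g] for g in signature_genes if g in expression]
    bkg = [expression[g] for g in background_genes if g in expression]
    if not sig or not bkg:
        return 0.0  # nothing to compare against
    return mean(sig) - mean(bkg)


# Toy cell with made-up log-normalized values (illustrative only).
cell = {"MKI67": 2.0, "TOP2A": 3.0, "ACTB": 1.0, "GAPDH": 2.0}

# PCNA is absent from this toy cell, so it is ignored.
score = signature_score(cell, ["MKI67", "TOP2A", "PCNA"], ["ACTB", "GAPDH"])
```

A positive score means the signature genes are expressed above the background level in that cell. A production pipeline would typically pick background genes matched on expression level rather than a fixed list.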

🤖 Machine Learning

  • XGBoost classification
  • Threshold optimization
  • Cross-validation
  • SMOTE for class balancing
  • SHAP for interpretability
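
Threshold optimization means choosing the probability cutoff at which a cell is called tumor, instead of defaulting to 0.5. The sketch below assumes the cutoff is picked to maximize F1 on held-out predictions. The source does not state which criterion the project used, so F1, the function names, and the toy labels and probabilities are all assumptions.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when cells with probability >= threshold are called tumor (1)."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, y_prob):
    """Try every observed probability as a cutoff and keep the best F1."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))


# Made-up labels (1 = tumor, 0 = normal) and model probabilities.
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.90, 0.20]
threshold = best_threshold(y_true, y_prob)
```

The cutoff should be tuned on validation data only, so that the reported test metrics are not inflated by the search.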

🔬 CNV Analysis

  • Chromosomal instability
  • inferCNV pipeline
  • Gencode v44 annotations
  • Copy number scoring
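
To make "copy number scoring" concrete, here is a simplified sketch of how a per-cell CNV burden could be derived from expression. It follows the general idea behind expression-based CNV inference (genes ordered by genomic position, expression taken relative to normal reference cells, then smoothed with a sliding window), but it is not the inferCNV algorithm itself. The function names, the window size, and the toy values are assumptions.

```python
from statistics import mean


def moving_average(values, window):
    """Centered moving average; windows are truncated at the ends."""
    half = window // 2
    return [
        mean(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def cnv_burden(cell_expr, reference_expr, window=3):
    """Mean squared smoothed deviation from the reference profile.

    Both inputs hold log-expression values for the same genes,
    already sorted by chromosome and genomic start position.
    """
    relative = [c - r for c, r in zip(cell_expr, reference_expr)]
    smoothed = moving_average(relative, window)
    return mean([x * x for x in smoothed])


# Toy profiles over six consecutive genes (illustrative values only).
reference = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # average of normal cells
normal_like = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
tumor_like = [1.0, 1.0, 2.0, 2.0, 2.0, 1.0]  # contiguous gain

normal_burden = cnv_burden(normal_like, reference)
tumor_burden = cnv_burden(tumor_like, reference)
```

Smoothing matters because a real copy number gain or loss shifts many neighboring genes together, whereas single-gene expression changes average out within the window.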

💊 Therapeutic Mapping

  • DGIdb druggability data
  • Open Targets associations
  • Pathway enrichment
  • Clinical trial integration

Analysis Workflow

Data Loading → Feature Engineering → CNV Analysis → ML Classification → SHAP Analysis → Therapeutic Mapping

Feature Importance Analysis

SHAP Feature Importance Comparison

📈 Key Results & Performance

🎯 Classification Performance

  • AUC: 0.89+ (excellent discrimination)
  • Precision: 0.95+ (high positive predictive value)
  • Recall: 0.89+ (comprehensive detection)
  • Robust cross-dataset validation
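
For readers less familiar with AUC, the sketch below shows one way to compute it per held-out dataset: AUC equals the probability that a randomly chosen tumor cell receives a higher score than a randomly chosen normal cell. The dataset names and all labels and scores are made-up placeholders, not project results, and `roc_auc` is a hypothetical helper rather than a library call.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (tumor, normal) pairs ranked correctly.

    Ties between a tumor and a normal score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one tumor and one normal cell")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Placeholder held-out sets: (labels, model scores). Values are invented.
held_out = {
    "dataset_A": ([1, 1, 1, 0, 0], [0.9, 0.8, 0.3, 0.4, 0.1]),
    "dataset_B": ([1, 0, 1, 0], [0.7, 0.2, 0.6, 0.5]),
}

auc_by_dataset = {name: roc_auc(y, s) for name, (y, s) in held_out.items()}
```

Reporting the metric separately for each external dataset, rather than pooling all cells, makes it easier to see whether performance holds up outside the training cohort.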

🧬 Biological Insights

  • CNV burden as key discriminator
  • Cell cycle dysregulation in tumors
  • Metabolic pathway differences
  • Oncogene expression patterns

🔍 Feature Importance

  • CNV Score: Top discriminative feature
  • Cell Cycle: S/G2M phase enrichment
  • OXPHOS: Metabolic reprogramming
  • Ribosomal: Translation activity
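
A feature ranking like the one above is commonly obtained by averaging the absolute SHAP value of each feature over all cells. The sketch below assumes a SHAP matrix (one row per cell, one column per feature) has already been computed. The feature names and values are generic placeholders, not the project's outputs, and `global_importance` is a hypothetical helper.

```python
def global_importance(shap_values, feature_names):
    """Rank features by mean absolute SHAP value across cells."""
    n_cells = len(shap_values)
    totals = [0.0] * len(feature_names)
    for row in shap_values:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    means = {name: total / n_cells for name, total in zip(feature_names, totals)}
    # Largest mean |SHAP| first.
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Placeholder SHAP matrix: 2 cells x 3 features (invented values).
feature_names = ["feature_a", "feature_b", "feature_c"]
shap_values = [
    [0.5, -0.1, 0.05],
    [-0.3, 0.2, -0.05],
]

ranking = global_importance(shap_values, feature_names)
```

Taking absolute values before averaging keeps features that push some cells toward tumor and others toward normal from cancelling out.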

💊 Therapeutic Targets

  • PI3K/MAPK/EGFR pathway genes
  • CDK4–CDK6–CCND1 pathway genes
  • Emerging and underexplored targets
  • DGIdb status, Open Targets score, breast cancer support categorization, and clinical trial mapping

🌟 Major Discoveries

  • Universal Tumor Signatures: Identified robust features that discriminate tumor cells across multiple datasets
  • CNV as Biomarker: Copy number variation burden emerged as the strongest predictor of malignancy
  • Pathway Networks: Mapped therapeutic targets within key oncogenic pathways (PI3K, MAPK, EGFR)
  • Clinical Relevance: Integrated druggability and clinical trial data for precision medicine applications

Model Performance Visualization

Model Performance Results

🎛️ Technical Implementation

🐍 Core Technologies

📊 Data Processing Pipeline

🔬 Validation Strategy

Category Sankey Plot Analysis

SHAP Feature Importance Comparison

🔮 Impact & Future Directions

🏥 Clinical Applications


🚀 Future Enhancements


‼️ Potential Challenges: