Scientific Journal of Biomedical Engineering

Detection of Epidermal Growth Factor Receptor Mutations in Non-Small Cell Lung Cancer Patients Using a Supervised Representation Learning Framework

Document Type: Full Research Paper

Authors

1 Department of Biomedical Engineering, Faculty of Electrical Engineering, K. N. Toosi University of Technology, Tehran, Iran

2 K. N. Toosi University of Technology (KNTU), Tehran, Iran

3 Department of Pulmonology, Imam Khomeini Hospital Complex (IKHC), Tehran University of Medical Sciences, Tehran, Iran

Abstract
Lung cancer remains one of the most prevalent malignancies worldwide and is a leading cause of cancer-related mortality. Accurate, automated detection of genetic mutations, particularly in the epidermal growth factor receptor (EGFR), is essential for selecting targeted therapies and improving clinical outcomes in patients with non-small cell lung cancer (NSCLC). In recent years, machine learning methods have shown considerable promise in analyzing clinical data to identify genetic alterations. However, data heterogeneity and class imbalance in clinical datasets remain persistent challenges that reduce predictive performance and bias models toward the majority class. In this study, we introduce a novel supervised representation learning framework designed for heterogeneous clinical data comprising both categorical and numerical features. Categorical features are first encoded through a trainable embedding layer, while numerical features are preprocessed by a normalization layer. The learned embeddings are then concatenated with the normalized numerical features, and the combined input is passed through a fully connected layer to produce robust representations that capture complex relationships across the heterogeneous data types. Finally, to address class imbalance and improve detection of the minority class, a weighted XGBoost classifier is employed, which assigns different weights to the classes to facilitate the identification of rare mutations. We evaluated the framework on the NSCLC Radiogenomics dataset from The Cancer Imaging Archive (TCIA), which contains data from 211 patients, using stratified five-fold cross-validation to ensure model reliability. The proposed method achieved 80.9% accuracy, 72.2% sensitivity, 83.7% specificity, 66.5% precision, a 62.6% F1-score, and an area under the ROC curve (AUC) of 0.82.
Comparison with state-of-the-art methods demonstrated that the proposed method significantly improves EGFR mutation detection in heterogeneous and imbalanced clinical data.
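The evaluation protocol summarized in the abstract (stratified five-fold splitting, class weighting for the minority class, and the reported metric types) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The function names, the toy labels, and the negatives-to-positives weighting rule are all assumptions made for illustration, since the abstract does not state which weighting scheme was used. That ratio is the heuristic commonly passed to XGBoost's `scale_pos_weight` parameter.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into folds that each keep roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share of that class.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def positive_class_weight(labels):
    """Negatives-to-positives ratio (an assumed weighting rule, not taken from the paper)."""
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples to weight")
    return n_neg / n_pos


def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity, precision, and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy labels (hypothetical, NOT the TCIA cohort): 20 mutation-positive, 80 negative.
toy_labels = [1] * 20 + [0] * 80
folds = stratified_folds(toy_labels)
weight = positive_class_weight(toy_labels)
print("positives per fold:", [sum(toy_labels[i] for i in fold) for fold in folds])
print("positive-class weight:", weight)

# Toy predictions (hypothetical) to exercise the metric formulas.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(binary_metrics(toy_true, toy_pred))
```

In a real pipeline, the per-fold metrics would be computed on each held-out fold and then averaged. This is one reason an averaged F1-score need not equal the harmonic mean of the averaged precision and sensitivity.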

Volume 19, Issue 2
Summer 2025
Pages 111-120

  • Received: 09 September 2025
  • Revised: 19 November 2025
  • Accepted: 29 November 2025