Article type: Full research paper
Authors
1 Department of Biomedical Engineering, Faculty of Electrical Engineering, K. N. Toosi University of Technology, Tehran, Iran
2 K. N. Toosi University of Technology
3 Department of Pulmonology, Imam Khomeini Hospital Complex, Tehran University of Medical Sciences, Tehran, Iran
Keywords
Subjects
Article title (English)
Authors (English)
Lung cancer remains one of the most prevalent malignancies worldwide and is a leading cause of cancer-related mortality. Accurate, automated detection of genetic mutations, particularly in the epidermal growth factor receptor (EGFR) gene, is essential for selecting targeted therapies and improving clinical outcomes in patients with non–small cell lung cancer (NSCLC). In recent years, machine learning methods have shown considerable promise in analyzing clinical data to identify genetic alterations. However, data heterogeneity and class imbalance in clinical datasets remain persistent challenges that reduce predictive performance and bias models toward the majority class. In this study, we introduce a supervised representation learning framework designed for heterogeneous clinical data comprising both categorical and numerical features. Categorical features are first encoded through a trainable embedding layer, while numerical features are preprocessed by a normalization layer. The learned embeddings are then concatenated with the normalized numerical features, and the combined input is passed through a fully connected layer to produce representations that capture relationships across the heterogeneous data types. Finally, to address class imbalance and improve detection of the minority class, a weighted XGBoost classifier is trained on these representations; it assigns different weights to the classes so that rare mutations are easier to identify. We evaluated the framework on the NSCLC Radiogenomics dataset from The Cancer Imaging Archive (TCIA), which contains data from 211 patients, using stratified five-fold cross-validation. The proposed method achieved 80.9% accuracy, 72.2% sensitivity, 83.7% specificity, 66.5% precision, a 62.6% F1-score, and an area under the ROC curve (AUC) of 0.82. Comparison with state-of-the-art methods demonstrated that the proposed method significantly improves EGFR mutation detection in heterogeneous and imbalanced clinical data.
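The pipeline described above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the class name `TabularEncoder`, the helper names, the layer sizes, the ReLU activation, and the toy data are all assumptions made for illustration, and the weights are random rather than trained with supervision. The class weight is computed as the negative-to-positive ratio, which is the convention used by XGBoost's `scale_pos_weight` parameter; the abstract does not state which weighting scheme was actually used.

```python
import random
import statistics


def zscore_columns(rows):
    """Normalize each numerical column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


class TabularEncoder:
    """Hypothetical encoder: one embedding table per categorical feature,
    concatenated with normalized numerical features, then one fully
    connected layer with ReLU. Weights are random here, not trained."""

    def __init__(self, cardinalities, emb_dim, n_numeric, out_dim, seed=0):
        rng = random.Random(seed)
        self.tables = [
            [[rng.gauss(0, 0.1) for _ in range(emb_dim)] for _ in range(card)]
            for card in cardinalities
        ]
        in_dim = emb_dim * len(cardinalities) + n_numeric
        self.W = [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
        self.b = [0.0] * out_dim

    def forward(self, cat_row, num_row):
        x = []
        for table, idx in zip(self.tables, cat_row):
            x.extend(table[idx])      # embedding lookup per categorical feature
        x.extend(num_row)             # append normalized numerical features
        return [
            max(0.0, sum(w * v for w, v in zip(w_row, x)) + b)  # dense + ReLU
            for w_row, b in zip(self.W, self.b)
        ]


def positive_class_weight(labels):
    """Negative-to-positive ratio, as used by XGBoost's scale_pos_weight."""
    neg = sum(1 for y in labels if y == 0)
    pos = sum(1 for y in labels if y == 1)
    return neg / pos


# Toy data (made up for illustration): two categorical and two numerical features.
cat_rows = [[0, 1], [1, 2], [0, 0], [1, 1]]
num_rows = [[61.0, 2.1], [70.0, 3.4], [55.0, 1.8], [66.0, 2.9]]
labels = [0, 0, 0, 1]  # 1 = mutation present (minority class)

num_norm = zscore_columns(num_rows)
encoder = TabularEncoder(cardinalities=[2, 3], emb_dim=4, n_numeric=2, out_dim=8)
features = [encoder.forward(c, n) for c, n in zip(cat_rows, num_norm)]
weight = positive_class_weight(labels)
```

In a full pipeline, `features` would be the input to the gradient-boosted classifier and `weight` would be passed as its positive-class weight, with the whole procedure repeated inside each stratified cross-validation fold so that normalization statistics and class weights are computed on training data only.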
Keywords (English)