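Research on a Pulmonary Disease Diagnosis Method Fusing Respiratory Sounds and Electrical Impedance Tomography with a Cross-modal Transformer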
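Affiliations:

1. College of Mechanical and Electronic Engineering, Nanjing Forestry University; 2. Department of Biochemistry and Molecular Biology, School of Medicine, Anhui University of Science and Technology; 3. School of Mechanical Engineering, Xi’an University of Technology; 4. College of Physics and Optoelectronic Engineering, Jinan University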

CLC number:

TH772

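Funding:

Supported by the National Natural Science Foundation of China (62501288, 82401526, 82302874), the China Postdoctoral Science Foundation General Program (2025M771376, 2025M771364), the Shaanxi Provincial Science and Technology Program (2025GH-YBXM-007), the Dongguan Key Science and Technology Project for Social Development (20231800935762), the Nanjing Forestry University College Students Innovation Training Program (202510298015Z), and the Nanjing Forestry University Self-made Instrument Project (nlzzyq202605).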


A Cross-modal Transformer for Pulmonary Disease Diagnosis by Fusing Respiratory Sounds and Electrical Impedance Tomography
Affiliation:

1. College of Mechanical and Electronic Engineering, Nanjing Forestry University, No. 159 Longpan Road, Xuanwu District, Nanjing 210037, Jiangsu Province, China; 2. School of Medicine, Anhui University of Science and Technology, Huainan 232000, China; 3. School of Mechanical Engineering, Xi’an University of Technology, Xi’an 710048, China; 4. College of Physics and Optoelectronic Engineering, Jinan University, Guangzhou 510632, China

Fund Project:

This work was supported by grants from the National Natural Science Foundation of China (62501288, 82302874, 82401526), the China Postdoctoral Science Foundation (2025M771376, 2025M771364), the Shaanxi Provincial Science and Technology Program (2025GH-YBXM-007), the Jiangsu Provincial Science and Technology Program Special Fund (BZ2024036), the Nanjing Forestry University College Students Innovation and Entrepreneurship Training Program (202510298015Z), and the Nanjing Forestry University Self-developed Experimental Teaching Instrument Project (nlzzyq202605).

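    Summary:

    Objective To address the difficulty of rapidly diagnosing pulmonary diseases, this paper proposes a multimodal fusion method based on a cross-modal Transformer (CMT) that combines respiratory sounds (RS) with electrical impedance tomography (EIT), in order to improve the accuracy and robustness of multi-class classification. Methods A dual-branch feature extraction framework is constructed. Convolutional neural networks (CNNs) extract features from RS time-frequency spectrograms and EIT ventilation images, and bidirectional long short-term memory networks (BiLSTMs) model temporal dependencies; a CMT attention mechanism is then introduced to fuse RS and EIT features across modalities. The proposed CMT method is validated on two datasets, BRACETS (three-class, 795 samples) and CleftPalate (two-class, 549 samples), and evaluated quantitatively using accuracy, balanced accuracy (BAcc) and macro-averaged F1 (MacroF1). Results On the BRACETS dataset, CMT achieves an accuracy of 87.21%, a BAcc of 88.24% and a MacroF1 of 87.40%; compared with the best-performing baseline, DCNN, MacroF1 improves by 8.73%, a clear performance advantage. Ablation experiments show that the CNN, BiLSTM and Transformer all contribute to performance: removing them lowers MacroF1 by 1.84%, 1.13% and 2.20%, respectively, with the Transformer having the largest effect. On the CleftPalate dataset, CMT achieves an accuracy of 96.58%, a BAcc of 96.60% and a MacroF1 of 96.56%, an improvement of 2.51% over DCNN. Conclusion The proposed method effectively achieves deep fusion of the heterogeneous modal information in RS and EIT, exploits the complementary relationship between respiratory airflow and regional ventilation, and shows good classification performance across different data distributions and task scenarios. It offers a new technical pathway for the non-invasive intelligent diagnosis of pulmonary diseases.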
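    For readers unfamiliar with the three evaluation metrics named above (accuracy, BAcc, MacroF1), the following is a minimal sketch of how they are commonly defined for a multi-class task. It uses only the Python standard library. The function names and the toy labels are illustrative assumptions and are not taken from the paper; the paper's own evaluation code may differ.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count per-class true positives, false positives and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the sample was not p
            fn[t] += 1  # a sample of class t was missed
    return tp, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred, labels):
    """Mean of per-class recall, so every class weighs the same."""
    tp, _, fn = confusion_counts(y_true, y_pred)
    recalls = [tp[c] / (tp[c] + fn[c]) for c in labels if tp[c] + fn[c] > 0]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    f1_scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical three-class toy labels (not data from the paper).
labels = [0, 1, 2]
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

acc = accuracy(y_true, y_pred)
bacc = balanced_accuracy(y_true, y_pred, labels)
mf1 = macro_f1(y_true, y_pred, labels)
print(f"accuracy={acc:.4f}  BAcc={bacc:.4f}  MacroF1={mf1:.4f}")
```

    BAcc and MacroF1 average over classes rather than over samples, which is why they are often reported alongside accuracy when class sizes are uneven.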

    Abstract:

    Objective To address the challenge of rapidly diagnosing pulmonary diseases, this paper proposes a fusion method based on a cross-modal Transformer (CMT) that integrates respiratory sounds (RS) and electrical impedance tomography (EIT), with the aim of improving the accuracy and robustness of multi-class classification. RS encodes the acoustic characteristics of the airways via the Mel spectrogram, whilst EIT depicts the regional distribution of lung ventilation through spatio-temporal image sequences. The two modalities are naturally complementary in functional and spatial information, which provides a physiological basis for multimodal fusion diagnosis.

    Methods A dual-branch feature extraction framework was constructed: convolutional neural networks (CNNs) extract local features from RS time-frequency spectrograms and EIT ventilation images, whilst bidirectional long short-term memory networks (BiLSTMs) model the temporal dependencies across modalities. A Transformer-based cross-modal attention fusion module was then introduced. The module uses a multi-head self-attention mechanism and gated convolutional units to achieve deep semantic alignment and complementary information fusion between RS acoustic features and EIT spatial ventilation features. The model is optimised jointly end to end under a cross-entropy loss, and a time-synchronisation mechanism keeps the two modal inputs strictly aligned within the respiratory cycle. The proposed CMT method was validated on two datasets: BRACETS (three-class classification, 795 samples) and CleftPalate (two-class classification, 549 samples). Quantitative evaluation used accuracy, balanced accuracy (BAcc) and macro-averaged F1 score (MacroF1).

    Results On the BRACETS dataset, the CMT method achieved an accuracy of 87.21%, a BAcc of 88.24% and a MacroF1 of 87.40%. Compared with the best-performing baseline, DCNN, MacroF1 improved by 8.73 percentage points, a clear performance advantage. Ablation experiments indicate that the three core modules (CNN, BiLSTM and Transformer) all contribute to performance: removing them decreased MacroF1 by 1.84%, 1.13% and 2.20%, respectively. The Transformer cross-modal fusion module had the largest impact, which supports its role in the interaction of heterogeneous modal information. Hyperparameter sensitivity analysis indicates that the best configuration is a sequence length of T=128 and a feature dimension of d=128, which balances classification performance against computational efficiency. Sequences that are too short or feature dimensions that are too small cannot adequately represent temporal dynamics, whilst excessively long sequences may introduce redundant information. On the CleftPalate dataset, the CMT method achieved an accuracy of 96.58%, a BAcc of 96.60% and a MacroF1 of 96.56%, an improvement of 2.51 percentage points over DCNN. This result supports the model's generalisation capability across different data distributions and task scenarios. Feature visualisation and case studies show that the fusion representation learned by CMT exhibits tighter intra-class cohesion and clearer inter-class separation boundaries (Smith et al., 2022). The model adaptively aggregates effective features when the reliability of the two modalities is uneven, and maintains correct classification even when a single modality is ambiguous.

    Conclusion The proposed method effectively achieves deep fusion of heterogeneous modal information from RS and EIT, and exploits the physiologically complementary relationship between the acoustic features of respiratory airflow and the regional distribution of lung ventilation. The model shows strong classification performance and stability across different data distributions and task scenarios, providing a new technical pathway for the non-invasive intelligent diagnosis of pulmonary diseases.
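    To make the idea of cross-modal attention between RS and EIT feature sequences more concrete, the following is a minimal sketch in plain Python. It is not the authors' fusion module: it assumes a single attention head, no learned projection matrices and no gated convolutional units, and the feature values, the `cross_attention` and `mean_pool` helpers and the pooling-then-concatenation step are all hypothetical choices made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: T_q x d list of lists (one modality)
    keys, values: T_k x d list of lists (the other modality)
    Returns (output of size T_q x d, attention weights of size T_q x T_k).
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    output = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return output, weights


def mean_pool(sequence):
    """Average a T x d sequence over time into one d-dimensional vector."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]


# Hypothetical toy features (assumed values): 3 RS time steps, 2 EIT frames, d = 2.
rs_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eit_feats = [[1.0, 0.0], [0.0, 1.0]]

# Each modality queries the other one.
rs_attends_eit, w_rs = cross_attention(rs_feats, eit_feats, eit_feats)
eit_attends_rs, w_eit = cross_attention(eit_feats, rs_feats, rs_feats)

# Assumed fusion step: pool each attended sequence and concatenate.
fused = mean_pool(rs_attends_eit) + mean_pool(eit_attends_rs)
print(len(fused), [round(x, 3) for x in fused])
```

    In this sketch each row of the attention weights sums to one, so every RS time step receives a convex combination of EIT frame features, and vice versa. A trained model would additionally learn query, key and value projections and would feed the fused vector to a classifier.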

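Cite this article

WU Yang, GU Yuying, ZHOU Haiyan, HU Liubing, SUN Bo, YAO Jiafeng. Research on a Pulmonary Disease Diagnosis Method Fusing Respiratory Sounds and Electrical Impedance Tomography with a Cross-modal Transformer[J]. Progress in Biochemistry and Biophysics,,():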

History
  • Received: 2026-04-11
  • Last revised: 2026-05-29
  • Accepted: 2026-06-01