1) College of Computer Science, Xi’an Shiyou University, Xi’an 710065, China; 2) School of Automation, Key Laboratory of Information Fusion Technology of Ministry of Education, Northwestern Polytechnical University, Xi’an 710072, China
This work was supported by grants from the National Natural Science Foundation of China (62173271), the Shaanxi Natural Science Basic Research Program (2023-JC-YB-591), and the Postgraduate Innovation and Practical Ability Training Program of Xi’an Shiyou University (YCS23213171).
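Objective Drug development is costly, slow, and has a low success rate. Accurate prediction of molecular properties is important for screening drug candidates effectively and for optimizing molecular structures. Traditional molecular property prediction methods based on feature engineering require researchers to have a deep disciplinary background and broad domain expertise. As artificial intelligence technology has matured, many molecular property prediction algorithms have emerged that outperform traditional feature-engineering methods. However, these models still suffer from scarce labeled data and poor generalization. To address this, we propose a molecular property prediction algorithm based on Bert+GCN multimodal data fusion (named BGMF), which integrates multimodal data of drug molecules and makes full use of large numbers of unlabeled drug molecules to train the model to learn useful information about them. Methods The BGMF algorithm extracts atom sequences, molecular fingerprint sequences, and molecular graph data from the SMILES expression of a drug, and learns features by combining the pre-trained model Bert with a graph convolutional network (GCN). While mining the global features of the “words” in a drug molecule, it fuses the local topological features of the molecular graph, and thus makes fuller use of the global-local contextual semantic relationships within the molecule. A dual-decoder design for the atom sequence and the molecular fingerprint sequence then strengthens the molecular feature representation. Results On 43 molecular property prediction tasks across 5 datasets, the AUC values of BGMF were all better than those of other existing methods. In addition, an independent test dataset was constructed to verify that the model generalizes well. t-SNE visualization of the generated molecular fingerprint representations showed that the BGMF model can capture the intrinsic structure and features of different molecular fingerprints. Conclusion By combining a graph convolutional network with the Bert model, BGMF integrates molecular graph data into the tasks of molecular fingerprint recovery and masked atom recovery, effectively captures the intrinsic structure and features of molecular fingerprints, and can therefore predict drug molecular properties efficiently.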
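The masked atom recovery task mentioned in the abstract can be illustrated with a minimal, standard-library-only sketch of how a masked training example might be prepared. This is not the authors' implementation: the `[MASK]` token name, the `mask_tokens` helper, the 25% mask rate, and the example atom sentence are all assumptions made for illustration, and the usual Bert refinement of sometimes substituting random or unchanged tokens is omitted.

```python
import random

MASK = "[MASK]"  # hypothetical mask token name, not taken from the paper


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Hide a random subset of tokens and remember what was hidden.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model would be asked to recover.
    The mask rate is an assumed hyperparameter.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token, and never more than the sequence length.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets


# Illustrative "atom sentence": one token per atom, in atom-index order.
atom_sentence = ["C", "C", "O", "C", "N", "C", "C", "O"]
masked, targets = mask_tokens(atom_sentence, mask_rate=0.25, seed=0)
print(masked)
print(targets)
```

The same helper would apply unchanged to a molecular fingerprint sentence, since it only assumes a list of tokens; the pre-training loss would then be computed only at the positions stored in `targets`.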
Objective Molecular property prediction plays a crucial role in drug development, especially in virtual screening and compound optimization. Advances in artificial intelligence (AI) have produced numerous deep learning-based methods that show significant potential for improving molecular property prediction. However, acquiring labeled molecular data is costly and time-consuming, and the scarcity of labeled data makes it difficult for supervised machine learning models to generalize across the vast chemical space. To overcome these limitations, we propose a novel Bert and GCN-based multimodal fusion method (called BGMF) to predict molecular properties. Methods BGMF extracts comprehensive molecular representations from atomic sequences, molecular fingerprint sequences, and molecular graph data, and combines them through pre-training and fine-tuning. Specifically, the method consists of three main parts: (1) molecular feature extraction; (2) Bert-GCN based pre-training; and (3) fine-tuning. During molecular feature extraction, the Morgan algorithm is employed to generate molecular fingerprints, transforming the input SMILES strings of drugs into molecular fingerprint sentences. Simultaneously, atom sentences are created based on the atom indices within the molecule. Consequently, each drug molecule is represented as both a molecular fingerprint sentence and an atom sentence. In the pre-training stage, BGMF applies a self-supervised learning strategy, specifically masked molecular fingerprint recovery and masked atom recovery, to a large unlabeled dataset using the Bert model. Here, molecular graph data are incorporated by merging a graph convolutional network (GCN) with the Bert model, effectively combining the global “word” features of drug molecules with the local topological features of molecular graphs.
We also developed a dual decoder for the atom and molecular fingerprint sequences to strengthen molecular feature expression. Finally, in the fine-tuning stage, a pooling layer and task-specific fully connected neural networks are added so that the pre-trained module can be applied to a variety of downstream molecular property prediction tasks. Results To validate the effectiveness of BGMF, we conducted experiments on 43 molecular property prediction tasks across 5 datasets. Compared with other recent state-of-the-art methods, BGMF achieved the best results in terms of area under the ROC curve (AUC). We also verified the generalization performance of BGMF by constructing an independent test dataset, showing that the BGMF model has the best generalization performance. Additionally, we conducted ablation studies to demonstrate the effect of the atomic sequence, the molecular fingerprint sequence, the GCN-based molecular graph module, and the pre-training module on the overall performance of the model. Conclusion In this paper, we propose a novel method for drug molecular property prediction, named BGMF, which integrates molecular graph data into the tasks of molecular fingerprint recovery and masked atom recovery by combining a graph convolutional network with the Bert model. The molecular fingerprint representations generated by BGMF were visualized using t-SNE, revealing that the BGMF model effectively captures the intrinsic structure and features of molecular fingerprints.
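To make the "local topological features" contributed by the GCN more concrete, the following is a minimal sketch of one graph convolution step on a tiny molecular graph, using only the Python standard library. It assumes the widely used propagation rule H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W); the abstract does not state which GCN variant BGMF uses, and the helper names, the three-atom C-C-O graph, the one-hot atom features, and the identity weight matrix are all illustrative assumptions.

```python
import math


def normalized_adjacency(n_atoms, bonds):
    """Build D^(-1/2) (A + I) D^(-1/2) for an undirected molecular graph."""
    a = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i in range(n_atoms):
        a[i][i] = 1.0  # self-loop so each atom keeps its own features
    for i, j in bonds:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n_atoms)]
        for i in range(n_atoms)
    ]


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def gcn_layer(a_hat, h, w):
    """One graph convolution: aggregate neighbours, project, apply ReLU."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Illustrative heavy-atom graph C-C-O: atoms 0, 1, 2 and two bonds.
bonds = [(0, 1), (1, 2)]
a_hat = normalized_adjacency(3, bonds)
h = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # assumed one-hot features [is_C, is_O]
w = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, standing in for learned ones
out = gcn_layer(a_hat, h, w)
for row in out:
    print(row)
```

After one step, each atom's feature vector mixes in its bonded neighbours: the terminal carbon still has no oxygen signal, while the middle carbon does. In a Bert+GCN design such per-atom vectors could be combined with the sequence token embeddings, though how BGMF performs that fusion is not specified in the abstract.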
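Yan Xiaoying, Jin Yanchun, Feng Yuehua, Zhang Shaowu. Drug molecular property prediction based on Bert+GCN multimodal data fusion[J]. Progress in Biochemistry and Biophysics, 2025, 52(3): 783-794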