1) College of Computer Science, Xi’an Shiyou University, Xi’an 710065, China; 2) School of Automation, Key Laboratory of Information Fusion Technology of Ministry of Education, Northwestern Polytechnical University, Xi’an 710072, China
This work was supported by grants from the National Natural Science Foundation of China (62173271), the Shaanxi Natural Science Basic Research Program (2023-JC-YB-591), and the Postgraduate Innovation and Practical Ability Training Program of Xi’an Shiyou University (YCS23213171).
Objective Molecular property prediction plays a crucial role in drug development, especially in virtual screening and compound optimization. Advances in artificial intelligence (AI) have produced numerous deep learning-based methods that show significant potential for improving molecular property prediction. However, acquiring labeled molecular data is costly and time-consuming, and the scarcity of labeled data makes it difficult for supervised machine learning models to generalize across the vast chemical space. To overcome these limitations, we propose a novel Bert and GCN-based multimodal fusion method, called BGMF, to predict molecular properties.
Methods BGMF extracts comprehensive molecular representations from atomic sequences, molecular fingerprint sequences, and molecular graph data, and combines them through pre-training and fine-tuning. The method consists of three main parts: (1) molecular feature extraction, (2) Bert-GCN-based pre-training, and (3) fine-tuning. During molecular feature extraction, the Morgan algorithm is used to generate molecular fingerprints, transforming the input SMILES strings of drugs into molecular fingerprint sentences. At the same time, atom sentences are created based on the atom indices within the molecule. Each drug molecule is therefore represented as both a molecular fingerprint sentence and an atom sentence. In the pre-training stage, BGMF applies a self-supervised learning strategy, namely masked molecular fingerprint recovery and masked atom recovery, to a large set of unlabeled molecules using the Bert model. Molecular graph data is incorporated by merging graph convolutional neural networks with the Bert model, which combines the global “word” features of drug molecules with the local topological features of molecular graphs.
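The masked recovery objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_sentence`, the 15% masking rate (borrowed from the original BERT setup), and the fingerprint tokens such as `fp_17` are assumptions made for illustration, and the random-replacement and keep-unchanged variants used in standard BERT masking are omitted. The sketch only shows how a token sentence is turned into a masked input plus the recovery targets.

```python
import random

MASK = "[MASK]"  # placeholder token; the actual vocabulary is not given in the abstract


def mask_sentence(tokens, mask_prob=0.15, seed=None):
    """Mask tokens at random and return (masked_tokens, labels).

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a model is trained to recover only masked positions.
    mask_prob=0.15 is an assumed value, not taken from the paper.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    # Short sentences may end up with nothing masked; force one position
    # so that every non-empty sentence contributes a training signal.
    if tokens and all(label is None for label in labels):
        i = rng.randrange(len(tokens))
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


# Atom sentence for ethanol (SMILES "CCO"), atoms listed in index order.
atom_sentence = ["C", "C", "O"]
# Hypothetical fingerprint sentence; real tokens would be Morgan identifiers.
fingerprint_sentence = ["fp_17", "fp_204", "fp_9", "fp_88", "fp_351"]

masked_atoms, atom_labels = mask_sentence(atom_sentence, seed=0)
masked_fp, fp_labels = mask_sentence(fingerprint_sentence, seed=0)
print(masked_atoms, atom_labels)
print(masked_fp, fp_labels)
```

In a full pipeline the masked sentences would be fed to the encoder and the labels used as targets for the atom and fingerprint decoders; here they are simply printed.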
We have also developed a dual decoder for atoms and molecular fingerprints to strengthen molecular feature expression. Finally, in the fine-tuning stage, a pooling layer and task-specific fully connected neural networks are added so that the pre-trained module can be applied to a variety of downstream molecular property prediction tasks.
Results To validate the effectiveness of BGMF, we conduct experiments on 43 molecular attribute prediction tasks across 5 datasets. Compared with other recent state-of-the-art methods, BGMF achieves the best results in terms of area under the ROC curve (AUC). We also verify the generalization performance of BGMF by constructing an independent test dataset, on which BGMF shows the best generalization performance. Additionally, we conduct ablation studies to demonstrate the effect of the atomic sequence, the molecular fingerprint sequence, the GCN-based molecular graph module, and the pre-training module on the overall performance of the model.
Conclusion In this paper, we propose a novel method for drug molecular attribute prediction, named BGMF, which integrates molecular graph data into the tasks of masked molecular fingerprint recovery and masked atom recovery by combining a graph convolutional neural network with the Bert model. A t-SNE visualization of the molecular fingerprint representations generated by BGMF shows that the model effectively captures the intrinsic structure and features of molecular fingerprints.
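For readers unfamiliar with the AUC metric used in the Results, the following minimal sketch computes it from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are unrelated to the paper's data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 class labels.
    scores: predicted scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive-negative pairs are ranked correctly.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

The quadratic loop is chosen for clarity; for large test sets a rank-based computation gives the same value more efficiently.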
YAN Xiao-Ying, JIN Yan-Chun, FENG Yue-Hua, ZHANG Shao-Wu. A Multimodal Fusion Drug Molecular Attribute Prediction Method Based on Bert and GCN[J]. Progress in Biochemistry and Biophysics, 2025, 52(3): 783-794