Lang2Mol-Diff: A Diffusion-Based Generative Model for Language-to-Molecule Translation Leveraging SELFIES Molecular String Representation

Nguyen Doan Hieu Nguyen, Nhat Truong Pham, Duong Thanh Tran, Balachandran Manavalan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Generating de novo molecules from textual descriptions is challenging due to potential issues with molecule validity in SMILES representation and limitations of autoregressive models. This work introduces Lang2Mol-Diff, a diffusion-based language-to-molecule generative model using the SELFIES representation. Specifically, Lang2Mol-Diff leverages the strengths of two state-of-the-art molecular generative models: BioT5 and TGM-DLM. By employing BioT5 to tokenize the SELFIES representation, Lang2Mol-Diff addresses the validity issues associated with SMILES strings. Additionally, it incorporates a text diffusion mechanism from TGM-DLM to overcome the limitations of autoregressive models in this domain. To the best of our knowledge, this is the first study to leverage the diffusion mechanism for text-based de novo molecule generation using the SELFIES molecular string representation. Performance evaluation on the L+M-24 benchmark dataset shows that Lang2Mol-Diff outperforms all existing methods for molecule generation in terms of validity. Our code and pre-processed data are available at https://github.com/nhattruongpham/mollang-bridge/tree/lang2mol/.
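
The text diffusion mechanism mentioned in the abstract can be illustrated with a minimal sketch of a generic Gaussian forward-noising step of the kind used in continuous diffusion language models. It is not taken from the authors' code. The linear noise schedule, its default values, the number of steps, the toy four-dimensional embedding, and all function names are assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noise_embedding(x0, alpha_bar_t, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# Hypothetical embedding of a single SELFIES token (toy values).
token_embedding = [0.5, -1.0, 0.25, 2.0]

slightly_noised = noise_embedding(token_embedding, alpha_bars[0], rng)
heavily_noised = noise_embedding(token_embedding, alpha_bars[-1], rng)
```

Under these assumptions, an early step leaves the embedding almost unchanged, while the final step is close to pure Gaussian noise. A diffusion model is trained to reverse this corruption, conditioned in this setting on the textual description, so that all token positions are refined jointly rather than generated left to right as in an autoregressive model.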

Original language: English
Title of host publication: Lang + Mol 2024 - 1st Workshop on Language + Molecules, Proceedings of the Workshop
Editors: Carl Edwards, Qingyun Wang, Manling Li, Lawrence Zhao, Tom Hope, Heng Ji
Publisher: Association for Computational Linguistics (ACL)
Pages: 129-135
Number of pages: 7
ISBN (Electronic): 9798891761483
State: Published - 2024
Event: 1st Workshop on Language + Molecules, Lang + Mol 2024 - co-located with ACL 2024 - Bangkok, Thailand
Duration: 15 Aug 2024 → …

Publication series

Name: Lang + Mol 2024 - 1st Workshop on Language + Molecules, Proceedings of the Workshop

Conference

Conference: 1st Workshop on Language + Molecules, Lang + Mol 2024 - co-located with ACL 2024
Country/Territory: Thailand
City: Bangkok
Period: 15/08/24 → …
