
Explainable Molecular Property Prediction: Aligning Chemical Concepts with Predictions via Language Models

Zhenzhong Wang, Zehui Lin, Wanyu Lin, Ming Yang, Minggang Zeng, Kay Chen Tan
May 25, 2024

Abstract

Providing explainable molecular property predictions is critical for many scientific domains, such as drug discovery and materials science. Although transformer-based language models have shown great potential for accurate molecular property prediction, they neither provide chemically meaningful explanations nor faithfully reveal molecular structure-property relationships. In this work, we develop a language-model-based framework for explainable molecular property prediction, dubbed Lamole, which provides chemical-concept-aligned explanations. We take a string-based molecular representation, Group SELFIES, as input tokens to pretrain and fine-tune Lamole, as it provides chemically meaningful semantics. By disentangling Lamole's information flows, we propose combining self-attention weights and gradients to better quantify each chemically meaningful substructure's impact on the model's output. To make the explanations more faithfully respect the structure-property relationship, we then carefully craft a marginal loss that explicitly optimizes the explanations to align with chemists' annotations. Bridging the manifold hypothesis with this marginal loss, we prove that the loss aligns the explanations with the tangent space of the data manifold, leading to concept-aligned explanations. Experimental results on six mutagenicity datasets and one hepatotoxicity dataset demonstrate that Lamole achieves comparable classification accuracy while boosting explanation accuracy by up to 14.3%, establishing the state of the art in explainable molecular property prediction.
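To make the attribution idea in the abstract concrete, here is a minimal sketch of grad-weighted attention: self-attention weights are combined with their gradients so that each input token (each Group SELFIES substructure) receives an importance score. The toy model, the relu(gradient * attention) fusion rule, and every name below are illustrative assumptions in the spirit of the transformer-explainability literature, not Lamole's actual architecture or implementation.

    import torch
    import torch.nn as nn

    class TinyAttentionClassifier(nn.Module):
        # Single-head self-attention classifier; a hypothetical stand-in
        # for a pretrained language model such as Lamole.
        def __init__(self, vocab_size=32, d_model=16, n_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)
            self.classifier = nn.Linear(d_model, n_classes)
            self.last_attn = None  # attention map saved for attribution

        def forward(self, token_ids):
            x = self.embed(token_ids)                           # (B, L, D)
            q, k, v = self.q(x), self.k(x), self.v(x)
            scores = q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5
            attn = torch.softmax(scores, dim=-1)                # (B, L, L)
            attn.retain_grad()   # keep d(logit)/d(attention) after backward
            self.last_attn = attn
            return self.classifier((attn @ v).mean(dim=1))

    def substructure_importance(model, token_ids, target_class):
        # Score each input token by relu(gradient * attention), summed over
        # the query axis, leaving one relevance value per substructure token.
        logits = model(token_ids)
        logits[0, target_class].backward()
        attn, grad = model.last_attn, model.last_attn.grad
        relevance = torch.relu(grad * attn)                     # (1, L, L)
        return relevance.sum(dim=1).squeeze(0).detach()         # (L,)

    model = TinyAttentionClassifier()
    tokens = torch.randint(0, 32, (1, 6))  # 6 toy Group-SELFIES-like token ids
    print(substructure_importance(model, tokens, target_class=1))

Higher scores indicate substructures the model leans on more for the predicted class; weighting attention by its gradient suppresses attention that the prediction is in fact insensitive to.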

AI-Generated Overview

  • Research Focus: The study proposes Lamole, a framework for explainable molecular property prediction built on transformer-based language models, with a focus on producing chemically meaningful explanations that faithfully reflect molecular structure-property relationships.

  • Methodology: The approach pretrains and fine-tunes language models on a string-based molecular representation (Group SELFIES), disentangles the model's information flows, combines self-attention weights with their gradients to score substructure importance, and applies a marginal loss to improve explanation fidelity (an illustrative sketch of such a loss follows this list).

  • Results: Experimental results indicate that Lamole achieves comparable classification accuracy while improving explanation accuracy by up to 14.3% across six mutagenicity datasets and one hepatotoxicity dataset, outperforming existing methods in explanation plausibility.

  • Key Contribution(s): The paper introduces a framework that (1) uses Group SELFIES for chemically meaningful input tokens, (2) combines attention weights and gradients to generate substructure-level explanations, and (3) applies a marginal loss that aligns explanations with chemists' intuitions and annotations.

  • Significance: This work addresses the critical need for explainability in molecular property prediction models, offering insights that can support the validation of scientific hypotheses and accelerate applications in drug discovery and materials science.

  • Broader Applications: The methodology and insights apply to fields such as drug discovery, materials science, and toxicology, and to any domain where understanding molecular structure-property relationships is essential for advancing research and practical applications.
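The marginal loss mentioned in the Methodology bullet is described only at a high level here; the paper derives its exact form via the tangent space of the data manifold. As a hedged illustration only, the sketch below assumes a simple hinge formulation: importance scores of chemist-annotated substructures should exceed those of unannotated ones by a margin. The function name, the hinge form, and the margin value are hypothetical.

    import torch

    def explanation_margin_loss(scores, annotated_mask, margin=0.5):
        # Hypothetical margin loss on explanation scores.
        #   scores:         (L,) importance score per substructure token
        #   annotated_mask: (L,) bool, True where chemists annotated
        # Assumes both the annotated and unannotated sets are non-empty.
        pos = scores[annotated_mask].mean()    # annotated substructures
        neg = scores[~annotated_mask].mean()   # remaining substructures
        return torch.relu(margin - (pos - neg))

    # Toy check: annotated tokens already dominate, so the hinge is inactive.
    scores = torch.tensor([0.9, 0.1, 0.7, 0.05])
    mask = torch.tensor([True, False, True, False])
    print(explanation_margin_loss(scores, mask))  # tensor(0.)

During fine-tuning, such a term would be added to the classification loss with a weighting coefficient, so that prediction accuracy and explanation alignment are optimized jointly.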

