Are large language models superhuman chemists?
Abstract
Large language models (LLMs) have gained widespread interest due to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here, we introduce ChemBench, an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question-answer pairs, evaluated leading open- and closed-source LLMs, and found that the best models outperformed the best human chemists in our study on average. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs' impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.
AI-Generated Overview
Here is an overview of the paper:
- Research Focus: The study evaluates the chemical knowledge and reasoning capabilities of large language models (LLMs) against those of human chemists, using a purpose-built benchmarking framework, ChemBench.
- Methodology: The researchers curated a benchmark of more than 2,700 question-answer pairs and used it to assess leading open- and closed-source LLMs. Expert chemists answered questions from the benchmark to provide a human baseline for the models' performance (a minimal sketch of such an automated evaluation loop appears after this list).
- Results: The best LLMs outperformed the best human chemists in the study on average. However, the models exhibited significant limitations on knowledge-intensive tasks and tended to give overconfident predictions.
- Key Contribution(s): The paper introduces ChemBench, an automated framework for benchmarking LLMs in chemistry, and emphasizes the importance of refining evaluation methods for chemical applications.
- Significance: The results highlight the impressive potential of LLMs in the chemical sciences while also pointing out critical areas for improvement, particularly concerning their reliability and safety in practical applications.
- Broader Applications: The insights gained from this research could inform the development of next-generation AI tools for chemists, improve chemistry education, and facilitate safer use of AI technologies in predicting chemical properties and reactions.
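
To make the Methodology concrete, the sketch below scores a model against a file of question-answer pairs by exact-match accuracy. It is a minimal illustration under stated assumptions, not the actual ChemBench API: the file name, data layout, and function names are placeholders invented for this example.

```python
# Minimal sketch of an automated benchmark-evaluation loop.
# All names here (load_questions, evaluate, chembench_questions.json)
# are hypothetical placeholders, not the real ChemBench interface.
import json
from typing import Callable


def load_questions(path: str) -> list[dict]:
    """Load question-answer pairs; each record is assumed to contain
    a 'question' prompt and the expected 'answer' string."""
    with open(path) as f:
        return json.load(f)


def evaluate(model: Callable[[str], str], questions: list[dict]) -> float:
    """Score a model by exact-match accuracy over the benchmark."""
    correct = 0
    for q in questions:
        prediction = model(q["question"])
        # Normalize whitespace and case before comparing answers.
        correct += prediction.strip().lower() == q["answer"].strip().lower()
    return correct / len(questions)


if __name__ == "__main__":
    questions = load_questions("chembench_questions.json")  # hypothetical file
    baseline = lambda prompt: "A"  # trivial model that always answers "A"
    print(f"Accuracy: {evaluate(baseline, questions):.3f}")
```

Exact-match scoring is the simplest possible metric; a real benchmark of this kind typically also has to parse free-form model output and tolerate equivalent numeric or chemical representations.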