A survey and benchmark of high-dimensional Bayesian optimization of discrete sequences
Abstract
Optimizing discrete black-box functions is key in several domains, e.g., protein engineering and drug design. Due to the lack of gradient information and the need for sample efficiency, Bayesian optimization is an ideal candidate for these tasks. Several methods for high-dimensional continuous and categorical Bayesian optimization have been proposed recently. However, our survey of the field reveals highly heterogeneous experimental set-ups across methods, as well as technical barriers to replicating published algorithms and applying them to real-world tasks. To address these issues, we develop a unified framework to test a vast array of high-dimensional Bayesian optimization methods and a collection of standardized black-box functions representing real-world application domains in chemistry and biology. These two components of the benchmark are each supported by flexible, scalable, and easily extendable software libraries (poli and poli-baselines), allowing practitioners to readily incorporate new optimization objectives or discrete optimizers. Project website: https://machinelearninglifescience.github.io/hdbo_benchmark
AI-Generated Overview
- Research Focus: The paper surveys high-dimensional Bayesian optimization (HDBO) methods for optimizing discrete black-box functions, particularly in fields like protein engineering and drug design.
- Methodology: The authors developed a unified framework comprising standardized black-box functions and a collection of high-dimensional Bayesian optimization solvers, implemented within the software libraries poli and poli-baselines.
- Results: The benchmarking reveals that solvers operating directly in discrete sequence space perform adequately on simpler tasks but do not scale well to more complex problems, whereas optimizers that leverage pre-trained latent-variable models exhibit superior performance in high-dimensional scenarios.
- Key Contribution(s): This research proposes a refined taxonomy of HDBO methods that emphasizes differences between those optimizing directly in discrete sequence space and those utilizing latent representations, along with an open-source framework for consistent benchmarking.
- Significance: By addressing inconsistencies in experimental setups and promoting replicability of results, this work aims to make Bayesian optimization easier to apply in areas such as protein engineering and drug design.
- Broader Applications: The findings and methodologies can potentially influence a wide range of applications, including automated drug discovery and protein design, supporting broader research efforts in computational biology and pharmaceutical development.