Mardochée Réveil, PhD

RLSF: Reinforcement Learning via Symbolic Feedback

Piyush Jha, Prithwish Jana, Pranavkrishna Suresh, Arnav Arora, Vijay Ganesh
5/26/2024

Abstract

Reinforcement Learning with Human Feedback (RLHF) is considered a standard approach to fine-tuning Large Language Models (LLMs). However, such methods often face limitations such as unsound black-box reward models, difficulties in collecting human preference data, and reliance on sparse scalar rewards, and they often fall short when applied to tasks that require complex domain-specific understanding. To address these challenges, we propose a new fine-tuning paradigm we refer to as Reinforcement Learning via Symbolic Feedback (RLSF), which aims to improve domain-specific understanding of LLMs more effectively than traditional reward signals. In the RLSF setting, the LLM being fine-tuned is considered an RL agent, while the environment is allowed access to reasoning or domain knowledge tools (e.g., solvers, provers, algebra systems, or knowledge bases). Crucially, in RLSF, these reasoning tools can provide feedback to the LLM via poly-sized certificates (e.g., proofs) that characterize errors in the LLM-generated object with respect to some correctness specification. As a bonus, our RLSF approach does not require the reasoning systems we use to be differentiable. The ability of RLSF-based fine-tuning to leverage certificate-generating symbolic tools enables sound, fine-grained (token-level) reward signals to LLMs, and thus addresses the limitations of traditional reward models mentioned above. Via extensive evaluations, we show that our RLSF-based fine-tuning of LLMs outperforms traditional approaches on five different applications, namely program synthesis from natural-language pseudo-code to a programming language, three chemistry tasks, and solving the Game of 24. A takeaway is that fine-tuning via RLSF enables relatively smaller LLMs to significantly outperform closed-source models that are orders of magnitude larger (e.g., GPT-4).
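
To make the mechanism concrete, below is a minimal sketch of how certificate feedback could be turned into token-level rewards. It is an illustration under stated assumptions, not the authors' implementation: the symbolic_check routine, the Certificate fields, and the +1/-1 reward scheme are hypothetical stand-ins for a real reasoning tool (compiler, SMT solver, algebra system) and its reward shaping.

# Minimal sketch (Python) of turning a symbolic certificate into token-level rewards.
# Names such as symbolic_check, Certificate, and the reward values are illustrative
# assumptions, not the paper's actual code.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Certificate:
    """Feedback from a symbolic tool: which generated token spans violate the spec, and why."""
    ok: bool
    bad_token_spans: List[Tuple[int, int]]  # (start, end) indices into the generated tokens
    message: str                            # the explanation/proof accompanying the verdict


def symbolic_check(tokens: List[str]) -> Certificate:
    """Placeholder for a real reasoning tool (compiler, SMT solver, algebra system, ...)."""
    raise NotImplementedError


def token_level_rewards(tokens: List[str], cert: Certificate,
                        pos: float = 1.0, neg: float = -1.0) -> List[float]:
    """Convert a certificate into a dense per-token reward vector for an RL update (e.g., PPO)."""
    rewards = [pos] * len(tokens)
    if not cert.ok:
        for start, end in cert.bad_token_spans:
            for i in range(start, min(end, len(tokens))):
                rewards[i] = neg  # penalize exactly the tokens the certificate blames
    return rewards

The resulting reward vector would feed a standard policy-gradient update, with the LLM as the policy and the symbolic tool sitting inside the environment; because the feedback arrives as a certificate rather than a gradient, the tool itself never needs to be differentiable.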

AI-Generated Overview

  • Research Focus: The paper proposes a new fine-tuning paradigm called Reinforcement Learning via Symbolic Feedback (RLSF) to enhance domain-specific reasoning capabilities in Large Language Models (LLMs) beyond traditional methods reliant on black-box reward mechanisms.

  • Methodology: RLSF casts the LLM being fine-tuned as an RL agent whose environment incorporates symbolic reasoning tools; these tools provide fine-grained, certificate-based feedback at the token level instead of sparse scalar rewards derived from human feedback (a toy sketch of such certificate-producing feedback appears after this list).

  • Results: Extensive evaluations show that RLSF-tuned models outperform standard fine-tuning methods across various tasks, such as program synthesis, chemistry applications, and solving the Game of 24, demonstrating significant improvements in functional correctness and success rates.

  • Key Contribution(s): The core innovation is the integration of token-level symbolic feedback into the fine-tuning process, enabling more accurate error correction and domain-specific understanding in LLMs, which significantly improves performance compared to traditional methods.

  • Significance: The findings highlight the potential of RLSF to enable relatively smaller LLMs to achieve competitive performance with much larger models, thereby suggesting a more efficient path for model enhancement in specialized applications.

  • Broader Applications: The RLSF paradigm could be applied across various reasoning tasks in artificial intelligence, especially in areas requiring deeper logical comprehension, such as scientific research, complex programming tasks, and other domains involving structured knowledge.
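
As a concrete instance of the kind of symbolic feedback described above, the toy checker below verifies a candidate Game of 24 expression and reports why it fails rather than returning a bare pass/fail bit. The function name check_24 and the message format are hypothetical; any verifier that explains its verdict would serve the same role.

# Toy Game of 24 verifier (Python); an assumed illustration of certificate-style
# feedback, not code from the paper.

import ast
from typing import List, Tuple


def check_24(expr: str, numbers: List[int]) -> Tuple[bool, str]:
    """Return (ok, explanation) for a candidate arithmetic expression over the given numbers."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError as e:
        return False, f"syntax error: {e.msg}"

    # The expression must use exactly the given numbers.
    used = sorted(node.value for node in ast.walk(tree) if isinstance(node, ast.Constant))
    if used != sorted(numbers):
        return False, f"wrong operands: used {used}, expected {sorted(numbers)}"

    value = eval(compile(tree, "<expr>", "eval"))  # acceptable in this toy, trusted setting
    if abs(value - 24) > 1e-9:
        return False, f"expression evaluates to {value}, not 24"
    return True, "valid solution"


print(check_24("(8 - 4) * (7 - 1)", [1, 4, 7, 8]))  # (True, 'valid solution')
print(check_24("(8 + 4) * (7 - 1)", [1, 4, 7, 8]))  # explains that the value is 72, not 24

In RLSF, explanations of this form are what allow the reward signal to be grounded in a correctness specification rather than in a learned, black-box reward model.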
