Google Research has introduced a new technique called Parameter Efficient Reinforcement Learning (PERL), which aims to make the process of aligning large language models (LLMs) with human preferences more efficient and accessible.
The research paper is available here.
The researchers propose using a parameter-efficient method called Low-Rank Adaptation (LoRA) to fine-tune the reward model and reinforcement learning policy in the Reinforcement Learning from Human Feedback (RLHF) process.
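To make the parameter savings concrete, the sketch below implements the standard LoRA reparameterization around a single linear layer in PyTorch. It illustrates the general technique rather than the paper's code; the layer size and rank are arbitrary values chosen for the example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with a frozen base weight and a trainable low-rank update.

    The effective weight is W + (alpha / r) * B @ A, where only A and B are trained.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Example: a 4096x4096 projection has ~16.8M weights; with r=8, LoRA trains only 65,536.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536
```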
In PERL, LoRA adapters are attached to specific layers of both the reward model and the reinforcement learning (RL) policy.
During training, only these adapter parameters are updated while the underlying pretrained model remains frozen. This sharply reduces the number of trainable parameters and the memory footprint, speeding up training and making it possible to align models with less computational power.
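In practice, this kind of adapter attachment is often done with the Hugging Face PEFT library. The sketch below shows roughly how LoRA adapters could be attached to a reward model (a sequence classifier producing a scalar score) and to an RL policy (a causal language model). The base checkpoint, rank, and target module names are placeholders for illustration; this is not the authors' implementation.

```python
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

BASE = "gpt2"  # placeholder base checkpoint; PERL itself used much larger models

# Reward model: scores a response with a single scalar (num_labels=1).
reward_base = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=1)
reward_lora = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projections; names depend on the architecture
)
reward_model = get_peft_model(reward_base, reward_lora)

# RL policy: a causal language model that generates responses during RLHF.
policy_base = AutoModelForCausalLM.from_pretrained(BASE)
policy_lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],
)
policy = get_peft_model(policy_base, policy_lora)

# Only the adapter weights are trainable; the pretrained backbone stays frozen.
reward_model.print_trainable_parameters()
policy.print_trainable_parameters()
```

Because the backbone weights never change, checkpoints consist only of the small adapter matrices, which is one reason the memory and storage savings are substantial.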
The team conducted extensive experiments on seven datasets, including two novel datasets called ‘Taskmaster Coffee’ and ‘Taskmaster Ticketing,’ which they released as part of this work.
The results showed that PERL performed on par with conventional RLHF while training faster and using less memory. This is significant because the computational cost and complexity of RLHF have hindered its adoption as an alignment technique for large language models; lowering that barrier could lead to wider adoption of RLHF, potentially improving the quality and safety of LLMs.