How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Technical Screen rounds at Anthropic.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Anthropic during technical interviews.

Debug a GRPO training loop and explain ratios | Anthropic Interview Question

Q: Debug a GRPO training loop and explain ratios

This question evaluates debugging and implementation knowledge for on-policy reinforcement learning, focusing on GRPO/PPO-style training loops, importance-sampling ratios, log-prob computations, masking, and advantage normalization.

You are given a simplified implementation of a GRPO (Group Relative Policy Optimization) training step for an RLHF-style policy model. The training is supposed to be strictly on-policy, meaning rollouts are generated by the same policy that is being updated.

Tasks:

Walk through the end-to-end GRPO training flow: sampling prompts, generating rollouts, computing group-based advantages (relative within a group of completions), computing the policy gradient loss, and updating the policy.
You find that training is unstable due to several straightforward implementation issues. Describe three common, easy-to-miss bugs in a GRPO/PPO-like training loop that would cause incorrect learning (e.g., wrong log-prob computation, incorrect masking, advantage normalization mistakes, mixing policies, etc.). For each, explain how you would detect it and how to fix it.
During debugging you notice the importance-sampling ratio

$\text{ratio} = \exp(\log \pi_{\theta}(a|s) - \log \pi_{\text{old}}(a|s))$

is not always 1, even though you expected the method to be strictly on-policy. Explain why the ratio might deviate from 1 in practice. List the most likely causes and what to check in the training pipeline to confirm each cause.
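
For orientation, the quantities named in the tasks can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation the question refers to: the function names (group_relative_advantages, max_ratio_deviation, clipped_surrogate_loss), the clip value of 0.2, the use of a population standard deviation, and the toy numbers are all hypothetical, and the KL penalty that GRPO setups often add is left out.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within ONE group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


def max_ratio_deviation(new_logps, old_logps, mask):
    """Largest |ratio - 1| over unmasked (response) tokens; a quick on-policy check."""
    devs = [abs(math.exp(n - o) - 1.0)
            for n, o, m in zip(new_logps, old_logps, mask) if m]
    return max(devs) if devs else 0.0


def clipped_surrogate_loss(new_logps, old_logps, mask, advantage, clip_eps=0.2):
    """PPO-style clipped objective, averaged over response tokens only."""
    total, count = 0.0, 0
    for n, o, m in zip(new_logps, old_logps, mask):
        if not m:
            continue  # prompt/padding tokens must not contribute to the loss
        ratio = math.exp(n - o)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += -min(ratio * advantage, clipped * advantage)
        count += 1
    return total / max(count, 1)


# Toy group of 4 completions for one prompt (made-up rewards).
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)

# Toy per-token log-probs for the first completion; token 0 is a prompt token.
old_logps = [-1.2, -0.7, -2.0, -0.3]
mask = [0, 1, 1, 1]

# Strictly on-policy: log-probs recomputed with identical weights and settings.
on_policy_dev = max_ratio_deviation(old_logps, old_logps, mask)
on_policy_loss = clipped_surrogate_loss(old_logps, old_logps, mask, advs[0])

# Any mismatch in the recomputed log-probs moves the ratio away from 1.
drifted_logps = [-1.2, -0.68, -2.05, -0.3]
drifted_dev = max_ratio_deviation(drifted_logps, old_logps, mask)
```

When the new and old log-probs are identical, every ratio is exactly 1 and the loss reduces to the negative advantage. Logging a statistic like max_ratio_deviation on the first gradient step after a rollout is one way to notice that the two log-prob computations disagree before looking for the specific cause.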
