A paper posted this week tested the robustness of RL-based alignment by combining multiple jailbreak techniques simultaneously rather than applying them individually. The result: attack success rate on OpenAI’s gpt-oss-20b jumped from 14.3% with isolated methods to 71.4% with compound attacks. The model had individual defences against each technique. It did not have defences against combinations of them. The mechanism the researchers targeted is the “instruction hierarchy maintenance” process, the safety layer responsible for keeping behavioural constraints in place when the model is under pressure.
The theoretical framing matters as much as the number. The paper argues that RL-based training “does not acquire new capabilities but merely redistributes the utilisation probabilities of existing ones.” Safety training, under this view, is not teaching the model new knowledge about what not to do. It is adjusting probability weights over pre-existing behaviours. When an attacker combines vectors that each individually shift the weight distribution in small ways, the compound effect can overwhelm the adjustment. Capability training generalises robustly because capabilities build on each other; safety constraints, as currently implemented, do not have the same self-reinforcing structure.
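One way to see why small individual shifts can compound into a large one is a toy calculation. This is not the paper's model; the numbers and the independence assumption are invented purely to illustrate the multiplicative intuition:

```python
# Toy illustration of compounding probability shifts.
# All figures are hypothetical; this is not the paper's model.

def refusal_probability(base: float, retention_factors: list[float]) -> float:
    """Probability the safety behaviour survives, assuming each attack
    vector independently scales the refusal weight by a retention factor."""
    p = base
    for f in retention_factors:
        p *= f
    return p

base = 0.95  # hypothetical refusal rate with no attack applied
# Each vector alone only nudges the weights (hypothetical factors):
vectors = {"roleplay": 0.90, "encoding": 0.85, "override": 0.80}

for name, f in vectors.items():
    print(f"{name} alone: refusal ~ {refusal_probability(base, [f]):.2f}")

combined = refusal_probability(base, list(vectors.values()))
print(f"all three combined: refusal ~ {combined:.2f}")
```

Each vector on its own leaves the refusal rate near 0.8 or above, a result an isolated test would likely record as a pass; multiplied together, the same factors drop it below 0.6. A real model's behaviour is not this tidy, but the sketch captures why independent per-vector defences do not imply joint robustness.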
This has a direct implication for how safety evaluations are designed. Standard red-team evaluations test individual attack vectors: prompt injection, roleplay framing, instruction override, encoding tricks. A model that passes each individual test in isolation receives a clean report. The compound attack finding suggests that passing individual tests is not sufficient evidence of robustness. Safety teams evaluating production deployments need evaluation suites that systematically test combinations of known attack vectors, not just each vector in isolation.
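Extending a suite from isolated vectors to combinations is mechanically simple, even if covering the combinatorial space is not. A minimal sketch, with placeholder vector names standing in for whatever taxonomy a team already uses:

```python
from itertools import combinations

# Illustrative attack-vector labels; substitute a team's own taxonomy.
vectors = ["prompt_injection", "roleplay_framing",
           "instruction_override", "encoding_trick"]

def compound_cases(vectors, max_size):
    """Yield every combination of attack vectors up to max_size,
    including the singletons a standard evaluation already covers."""
    for size in range(1, max_size + 1):
        yield from combinations(vectors, size)

cases = list(compound_cases(vectors, max_size=3))
print(len(cases))  # 4 singles + 6 pairs + 4 triples = 14
```

Even four vectors yield ten multi-vector cases a per-vector suite never exercises, and the count grows combinatorially with the taxonomy, which is why prioritising combinations (rather than exhaustively enumerating them) becomes its own evaluation-design problem.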
The 71.4% figure applies to one specific model under one experimental setting. It does not mean all models fail at this rate, and the paper is not a comprehensive survey of the frontier. What it demonstrates is a structural weakness in how RL alignment is typically framed: as a set of independent constraints rather than a jointly robust policy. Teams building fine-tuned or RLHF-trained models for production use cases where misuse could cause real harm should treat this as a prompt to test compound scenarios, not a reason to stop deploying.
The practical path forward is not obvious. Fine-tuning against compound attacks risks overfitting to the specific combinations that were tested. Constitutional AI approaches and process-level supervision may generalise better precisely because they operate on reasoning, not on output classification. The paper does not solve the problem, but it specifies it more precisely than most prior work: alignment training and capability training have different generalisation properties, and that gap is exploitable.