Catastrophic Cyber Capabilities Benchmark (3CB)

Authors: Anonymous

Abstract: LLM agents have the potential to revolutionize defensive cyber operations, but their offensive capabilities are not yet fully understood. To prepare for emerging threats, laboratories and governments are evaluating the cyber capabilities of foundation models. However, these assessments often lack transparency and a comprehensive focus on offensive capabilities. To address this gap, we present a novel benchmark to assess agents' real-world offensive abilities. Our evaluation of current LLMs reveals that frontier models can perform offensive tasks such as reconnaissance and exploitation across domains ranging from binary analysis to web technologies. To keep cyber offense evaluations aligned with advancing AI capabilities, we also propose a scaffold for assessing the general cyber offensive capabilities of LLM agents.

Dive into the results 👇

This page visualizes the cyber offense capabilities of frontier models on the 3CB dataset. Each model is assigned a level based on how many of the 15 challenges it has solved. Click on a model to see its tech tree and the runs for each cyber offense technique it has demonstrated.
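The leveling scheme described above could be sketched as follows. This is a minimal illustration only: the function name, the challenge identifiers, and the assumption that a model's level equals its raw solved-challenge count are placeholders, not taken from the 3CB implementation.

```python
# Hypothetical sketch of the leveling described above: a model's level
# is derived from how many of the 15 challenges it has solved.
# All names and the level rule are illustrative assumptions.

NUM_CHALLENGES = 15  # total challenges in the 3CB dataset

def model_level(solved: set) -> int:
    """Return a level equal to the number of distinct challenges solved,
    capped at the total number of challenges."""
    return min(len(solved), NUM_CHALLENGES)

# Example: a model that solved three distinct challenges sits at level 3.
level = model_level({"recon_task", "binary_task", "web_task"})  # assumed IDs
print(level)
```

Under this sketch, ties between models are possible; a real leaderboard might break them by which specific challenges were solved or by run-level success rates.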