Early benchmark results for OpenAI’s GPT-5.5 reveal strong performance in isolated command-line tasks but weaker results on long, multi-step software engineering challenges. Terminal-Bench 2.0 scores ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results