CodeClash: Training AI Agents for Long-Horizon Programming

Expanding code post-training beyond unit tests to competitive, long-horizon programming tasks with CodeClash’s new training arenas
Tags: llm, agents, software-engineering
Author: muhtasham
Published: January 7, 2026

Excited to share the release of CC:Train - CodeClash’s official training split!

Training only on narrowly defined tasks or GitHub issues doesn’t lead to true AI software engineers. Models need to be trained to pursue goals rather than carry out implementation-level tasks.

CodeClash’s 9 training arenas spanning Bridge, Chess, Figgie, Gomoku, Halite II/III, and MIT BattleCode 2023-2025

Folks made great headway at training on SWE-bench style tasks in 2025. In 2026, what if we expanded code post-training to long-horizon, competitive, head-to-head coding tasks?

From Task Completion to Goal Pursuit

SWE-bench-style training teaches models to carry out implementation-level tasks, not to pursue goals. When you train a model to fix a specific bug or implement a narrowly scoped feature, you’re teaching task completion. But real software engineering is goal-oriented - it’s about understanding high-level objectives, making strategic decisions, and iterating based on feedback to pursue better solutions.

Today’s SWE agents rely on outcome-based RL with unit tests. But this breaks at long horizons: imagine coding PyTorch from scratch - how do you get dense feedback from hours-long rollouts that aren’t unit-test-verifiable? We need a new paradigm.

Enter CodeClash.

As an initial step, we’ve released CC:Train - CodeClash’s official training split of 9 new arenas.

CodeClash is a very long-horizon SWE benchmark that requires agents to write code, observe logs from that code running against another agent’s code, rewrite their code, and then repeat the cycle.
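To make that cycle concrete, here is a runnable toy version of the loop. Rock-paper-scissors stands in for a real arena, and every name and interface here is illustrative - this is not CodeClash’s actual API:

```python
import random
from collections import Counter

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
COUNTER_MOVE = {beaten: move for move, beaten in BEATS.items()}  # what beats each move

def run_match(bot, opponent, rounds=20):
    """Play a match and return (wins, losses, logs); the logs are the feedback."""
    wins = losses = 0
    logs = []
    for _ in range(rounds):
        ours, theirs = bot(), opponent()
        wins += BEATS[ours] == theirs
        losses += BEATS[theirs] == ours
        logs.append({"ours": ours, "theirs": theirs})
    return wins, losses, logs

def revise(logs):
    """The 'rewrite your code' step: derive a new bot from observed logs."""
    seen = Counter(entry["theirs"] for entry in logs)
    reply = COUNTER_MOVE[seen.most_common(1)[0][0]]  # counter their favorite move
    return lambda: reply

bot = lambda: random.choice(list(BEATS))  # first submission: a naive bot
opponent = lambda: "rock"                 # the rival's hidden strategy
for i in range(3):                        # write -> compete -> observe -> rewrite
    wins, losses, logs = run_match(bot, opponent)
    print(f"iteration {i}: {wins}-{losses}")
    bot = revise(logs)
```

The feedback signal is match logs, not unit-test verdicts - the agent must infer what to change from how the competition played out.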

We now have a training set for it - looking forward to models getting even better at long-horizon programming.

The 9 Training Arenas

  • Bridge - Imperfect information card game
  • Chess - Perfect information classical board game
  • Figgie - Multi-player trading and strategy game
  • Gomoku - Classic board game with multi-player support
  • Halite II - Multi-player space strategy competition
  • Halite III - Enhanced multi-player space combat
  • MIT BattleCode 2023 - Head-to-head programming competition
  • MIT BattleCode 2024 - Next generation BattleCode
  • MIT BattleCode 2025 - Latest BattleCode iteration

These arenas span diverse properties:

| Property    | Examples                                       |
|-------------|------------------------------------------------|
| Information | Perfect (Chess) vs. Imperfect (Bridge)         |
| Format      | Classical (Gomoku) vs. Custom (BattleCode)     |
| Players     | Head-to-head (Chess) vs. Multi-player (Figgie) |

Why Competitive Code Training?

The key insight is that competition trains goal-oriented development, not task completion.

CodeClash doesn’t give agents a specific bug to fix or a narrow feature to implement. Instead, it gives them a goal: build a bot that wins (a starter-bot sketch follows the list below). The agent must:

  • Understand the game mechanics and strategic landscape
  • Design an architecture that balances multiple objectives
  • Write code, observe how it performs in competition, and iterate
  • Make judgment calls about what “better” means (faster execution? smarter strategy? more robust error handling?)
  • Pursue continuous improvement over extended sessions
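
As a concrete (and deliberately naive) starting point, here is what a first-cut arena bot might look like - a Gomoku move chooser with an invented interface, since each arena defines its own. The point is that the agent starts from a strategy, not a spec, and everything below is up for revision once match logs come back:

```python
SIZE = 15  # standard Gomoku board

def choose_move(board):
    """board[r][c] in {'.', 'X', 'O'}; return (row, col) to play.
    The interface is hypothetical - each CodeClash arena defines its own."""
    # Opening heuristic: take the center of an empty board.
    if all(cell == "." for row in board for cell in row):
        return SIZE // 2, SIZE // 2
    # Heuristic v1: play next to any existing stone. A losing match log
    # is the cue to replace this with real threat detection.
    for r in range(SIZE):
        for c in range(SIZE):
            if board[r][c] == ".":
                continue
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < SIZE and 0 <= nc < SIZE and board[nr][nc] == ".":
                        return nr, nc
    raise RuntimeError("no empty squares left")

board = [["."] * SIZE for _ in range(SIZE)]
print(choose_move(board))  # (7, 7): center opening
```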

This is fundamentally different from SWE-bench-style training:

| SWE-bench Style             | CodeClash Style                          |
|-----------------------------|------------------------------------------|
| “Fix this specific bug”     | “Build the best bot you can”             |
| Task completion             | Goal pursuit                             |
| Binary pass/fail from tests | Continuous improvement from competition  |
| Implementation-level        | Strategic-level                          |
| Narrow scope                | Open-ended optimization                  |

Competition provides dense reward signals for goal-oriented development (one way to construct such a signal is sketched after this list):

  • Win/loss outcomes give clear feedback on strategic decisions
  • Performance metrics (reaction time, strategy efficiency) provide gradients for improvement
  • Iterative refinement is naturally incentivized - there’s always a better bot to build
  • Bad development practices (code slop, redundancy, poor organization) hurt competitive performance
  • Agents learn to balance competing objectives rather than satisfy a fixed specification

Instead of asking “does this code pass tests?”, we ask “can this code outcompete alternatives?” - which is much closer to how human engineers think about software quality.

What’s Next: Open Questions and the Road Ahead

The release of CC:Train is just the beginning. We’ve spent 2025 perfecting training on SWE-bench-style single tasks. In 2026, it’s time to expand to truly long-horizon, goal-oriented development.

We’re curious about several research directions:

Can self-play RL work? Rather than optimizing for test passage, what if we optimize for defeating other agents through competitive self-play?
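
One minimal shape such a loop might take (everything here is a stand-in: the scalar “skill”, the match model, and the update rule are toys, not a real RL algorithm): the learner plays against frozen snapshots of its past selves, and the match outcome - not a test suite - drives the update.

```python
import random

class Agent:
    """Stand-in learner: one scalar 'skill' instead of model weights."""
    def __init__(self, skill=0.0):
        self.skill = skill
    def snapshot(self):
        return Agent(self.skill)

def play_match(a, b):
    """Toy match model: higher skill wins more often."""
    p_win = 1.0 / (1.0 + 2.0 ** (b.skill - a.skill))
    return random.random() < p_win

def train_step(agent, won):
    """Stand-in for an RL update driven by the match outcome."""
    if not won:
        agent.skill += 0.1  # losses against past selves force improvement

learner, pool = Agent(), [Agent()]
for step in range(500):
    won = play_match(learner, random.choice(pool))  # sample a past self
    train_step(learner, won)
    if step % 100 == 99:
        pool.append(learner.snapshot())  # keep the opponent pool fresh
print(f"final skill: {learner.skill:.1f}, opponent pool: {len(pool)}")
```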

Does it transfer? Does training on Halite II/III improve performance on Halite I? What about completely different games like Gomoku? Most importantly: does training on open-ended code tasks improve performance on traditional benchmarks like SWE-bench, HumanEval, or MBPP?

Does competition improve code quality? Competition naturally punishes bad practices - single-use scripts, redundant code, poor organization, and “code slop” that accumulates over long sessions. Can competitive training mitigate these issues?

True AI software engineers won’t come from training on narrower and narrower GitHub issues. They’ll come from training on open-ended goals that require strategic thinking, iterating based on emergent feedback, balancing competing objectives, and developing intuition for software quality through competitive episodes.


CC:Train is available now at codeclash.ai. Check out the detailed announcement for full arena specifications, baseline results, and training guidelines.

We’re excited to see what the community builds with these arenas. If you’re working on code post-training or long-horizon agent development, we’d love to hear from you on our Slack or GitHub.


This work is a collaboration between the CodeClash team: Muhtasham Oblokulov, Aryan Siddiqui, and John Yang.