Notes on Anthropic's Original Take-Home

Tech

Anthropic, the company behind Claude Code, currently the most powerful AI software development agent, recently had to retire its take-home job interview assignment. The reason? They

“were about to release a[n Opus] model where the best strategy on take-home would be delegating to Claude Code.”

Tristan Hume explained in great detail the origins of the assignment, how it was designed and redesigned, and how it worked.

The assignment has since been released publicly, and anyone can try to beat the AI with unlimited time. I tried it yesterday, and I need to share my first impressions.

The assignment

Nerdy details: the assignment provides you with a virtual machine and code that performs a tree traversal algorithm using the simulated hardware. Your mission, should you choose to accept it, is to optimize the algorithm and make it run as fast as possible within the constraints of the virtual machine.1
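For a mental model of the kind of code involved, here is a toy sketch in plain Python (not the assignment's actual code, and not its VM language) of a straightforward tree traversal of the sort you would then try to speed up:

    # Toy illustration only: a plain tree-sum via recursive depth-first traversal.
    # The real assignment runs an analogous computation on a simulated machine,
    # where you optimize for VM cycles rather than Python wall-clock time.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        value: int
        left: Optional["Node"] = None
        right: Optional["Node"] = None

    def tree_sum(node: Optional[Node]) -> int:
        """Recursively sum all values in the tree."""
        if node is None:
            return 0
        return node.value + tree_sum(node.left) + tree_sum(node.right)

    root = Node(1, Node(2), Node(3))
    print(tree_sum(root))  # 1 + 2 + 3 = 6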

In layman’s terms, this assignment is a technical task the candidate needs to solve. The task:

  • Has a complex setup, as real-world problems do
  • Is open-ended and larger than can be completed within 2-4 hours, which allows exploring a candidate’s abilities at different levels
  • Provides debugging, logging, tracing, and testing harnesses to help the candidate quickly understand the task within the interview period
  • Has a high-quality codebase

Is it just hype?

According to Anthropic’s benchmarks, their Opus 4.5 model can, in 2 hours, beat what a human can do in 2 hours. Given 12 hours to work, Opus 4.5 would significantly beat most humans’ results. Is this outcome an incredible feat, or is it just hype?

There are several reasons why this task was easier for Claude than other tasks might have been:

  • A large codebase and a short assignment time reward quick “onboarding”
    • Humans had to spend 30+ minutes just to understand the repo. Claude had a head start, as it could load the entire codebase into its context within a couple of minutes.
  • Extensive built-in logging, debugging, and tracing help, if you can parse their output quickly
    • The harness lets you see everything happening inside the virtual machine. Claude can search through kilobytes of logs quickly, which helps it fix bugs and improve its solution. Humans struggle to process that amount of information quickly.
  • Built-in testing helps LLMs a lot
    • The codebase provides a great testing setup. LLMs thrive when they can test their results quickly. This helps them avoid hallucinations and move fast, skipping the slow human-in-the-loop testing.
  • Built-in reference solutions help LLMs even more
    • The assignment also provides a reference implementation of the algorithm three times over: a straightforward Python solution, a Python solution that uses a flat-memory data representation (see the sketch after this list), and a solution written in the target VM’s language. LLMs benefit immensely from examples. Here, the reference implementation in Python surely helped Claude find analogous, simpler solutions, and the reference “translation” between Python and the VM’s language no doubt helped it translate its own code into a language it had never seen before.
  • A known class of task lets the agent borrow pre-existing solutions
    • Optimizations of this sort are a well-known engineering task. Claude could borrow existing solutions to similar, previously solved problems.
  • Optimization tasks are solved by consistent iterative application of ideas
    • Finally, this task is an open-ended optimization problem. There may be many ways to approach it, and only by trying things out (or by calculating heuristics and estimates) can one improve the solution. Applying many separate approaches pays off here. Claude can likely try all of its ideas, unlike a time-limited human.
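To make the “flat-memory data representation” idea above concrete, here is a minimal sketch of the general technique: the same toy tree from earlier stored as plain parallel arrays instead of linked objects. This is my own illustration, not the assignment’s reference implementation.

    # Illustrative only: a tree flattened into parallel arrays (one slot per
    # node), which maps more directly onto flat machine memory than Python
    # objects and pointers do.
    values = [1, 2, 3]      # node payloads
    left   = [1, -1, -1]    # index of left child, -1 means "no child"
    right  = [2, -1, -1]    # index of right child

    def tree_sum_flat(root_index: int) -> int:
        """Iteratively sum all values reachable from root_index."""
        total = 0
        stack = [root_index]
        while stack:
            i = stack.pop()
            if i == -1:
                continue
            total += values[i]
            stack.append(left[i])
            stack.append(right[i])
        return total

    print(tree_sum_flat(0))  # 1 + 2 + 3 = 6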

Thus, it is an incredible achievement

Surprisingly, none of these make me feel like this is a hyped-up result. In fact, I think these factors are exactly why this outcome is an incredible achievement.

Sure, Claude can be more persistent than a human at chiseling down a diamond mountain with different approaches. But it needs to figure out which approaches to apply and then actually apply them correctly. Without spoilers: this task requires you to think at many different levels of complexity, and the ability to apply correct solutions at all of those levels is normally indicative of senior experience in human engineers. This assignment may have been easier for Claude because it is a gamified optimization task, but it still required Claude to be a good engineer. Iterating quickly (even when you have logging, testing, and examples) is just not enough if you don’t know what you are doing.

Claude benefited from parsing code quickly and then reusing what it already knew about optimization. That’s great, because most of an engineer’s tasks are about applying existing knowledge to a pre-existing codebase. Real-world tasks are often about onboarding into an existing codebase with tests and examples, and then adding pre-existing solutions to it. And humans benefit from tests and reference implementations too! So this success is representative of what the real world needs from software engineers.2

Finally, yes, Claude needed to run for 12 hours where humans needed only two. But businesses don’t care about this difference. If you are a PM or a manager, you don’t think in terms of “2-hour” vs. “12-hour” execution time. It normally takes 1-2 weeks until someone has the capacity to work on a task anyway, so absolute time doesn’t matter much. What does matter are the budget and the outcome. If you run six Claude Maxes, and each can churn through a task in 12 hours, then you too can solve a task once every two hours on average. This comes at a price of $1,200/month, which is substantially below the salary of an equally competent software engineer. So economically, this is also a notable success.3
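As a back-of-the-envelope check of that arithmetic (assuming a Claude Max seat costs roughly $200/month; actual plan prices may differ):

    # Rough estimate only, not a quote of Anthropic's pricing.
    agents = 6                  # concurrent Claude Max subscriptions (assumed)
    hours_per_task = 12         # agent time per task, per the benchmark above
    seat_cost_per_month = 200   # USD per seat (assumed)

    avg_hours_between_tasks = hours_per_task / agents    # 12 / 6 = 2.0 hours
    monthly_cost = agents * seat_cost_per_month          # 6 * 200 = 1,200 USD

    print(f"One finished task every {avg_hours_between_tasks:.1f} hours on average")
    print(f"Total cost: ${monthly_cost:,} per month")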

Conclusion

When writing about all of us becoming team leads of AI, I originally wrote that AI was at a junior developer level, only to correct myself that AI had already made it to mid-level by the time of publishing. That was three months ago. Claude’s result now makes me think that AI models have already reached senior engineer capacity. AI may be an unreliable senior engineer, and you need a lot of effort to make it work well. But it is a senior engineer.

I take AI benchmark results with a grain of salt. They may not be representative of real-world tasks. This particular assignment impressed me exactly because it wasn’t meant as a benchmark for AI: it gave me a single complex task that I could try hands-on. When I got to see Claude’s results, they impressed me far more than percentages on benchmarks, because I had developed a personal understanding of the problem.


  1. It is interestingly complementary to Advent of Code 2019, where you were invited to build a virtual machine yourself.↩︎

  2. Sidenote: you will almost never be optimizing one algorithm for hours in the real world (sadly; I would’ve enjoyed doing that a lot), so maybe this task is a bit unrepresentative of the type of work engineers conduct. But the setup and the needs are representative enough. ↩︎

  3. I wouldn’t encourage you to use six Claude Maxes (or even one) if you have no CS background and have never worked with AI models. You will just create something that appears to work but doesn’t, because the model will cheat you. Anthropic’s repository is a prime example. They invited anyone who could significantly beat Claude’s result to send in their solution as a soft job application. Anthropic then had to update their GitHub to point out that “None of the solutions we received on the first day post-release below 1300 cycles were valid solutions”. Let me repeat: Anthropic put effort into making a good testing harness, and still, no one who sent in a solution realized they were sending bullshit. You will not have such a harness, and you will not be competent enough to notice the bullshit you made with AI. ↩︎


Like my writing? Connect with me on LinkedIn to chat or subscribe to me on Substack to get an email every time I post an essay.