Autoresearch on Steroids with Sandboxes

Apr 3, 2026 | 5 min read

Last month, Andrej Karpathy published autoresearch. The idea is simple: give an LLM agent a training script and a plain-English definition of the search space (the parameters and code changes worth trying). The agent proposes a change, you run it, measure a metric with the proposed change, and repeat.

The idea caught on quickly: Karpathy’s repo hit ~64k GitHub stars within a few weeks.

The missing piece is execution. The agent generates Python scripts, and you should not run them in your host process: they may write to disk with catastrophic consequences for your files, install packages you never intended to install, crash your training loop, or hang and consume resources.

Sandboxes solve this by running each candidate in an isolated environment with explicit time and resource limits. This post shows one way to build an autoresearch loop using Tensorlake Sandboxes as the execution layer.


The autoresearch loop

You can treat autoresearch as a hill‑climbing search over training‑script edits:

  1. Calibrate: run the baseline script and record the metric val_loss.
  2. Propose: ask the agent for N modifications (each returned as a complete Python script).
  3. Race: run all N candidates, record val_loss for each.
  4. Accept: promote the best candidate if it beats the current best.
  5. Repeat: the accepted script becomes the new baseline.
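The five steps above can be sketched as a small driver function. Here, run_candidate and propose_candidates are hypothetical placeholders for the sandbox execution and LLM calls described later in the post:

```python
def autoresearch(baseline_script, run_candidate, propose_candidates,
                 iterations=10, n_candidates=8):
    """Hill-climb over training-script edits, keeping the best val_loss."""
    best_script = baseline_script
    best_loss = run_candidate(baseline_script)                       # 1. Calibrate
    for _ in range(iterations):
        candidates = propose_candidates(best_script, n_candidates)   # 2. Propose
        results = [(run_candidate(c), c) for c in candidates]        # 3. Race
        loss, script = min(results, key=lambda r: r[0])
        if loss < best_loss:                                         # 4. Accept
            best_loss, best_script = loss, script                    # 5. Repeat
    return best_script, best_loss
```

The only state carried between iterations is the current best script and its metric, which is what makes the loop easy to checkpoint and resume.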

Karpathy reports running ~700 experiments overnight and getting an ~11% accuracy improvement from incremental changes. Not bad for a hands-free method.


Tensorlake Sandboxes for autoresearch loops

A sandbox gives you an isolated execution environment per candidate generated program. You start the sandbox, run the script, capture stdout and stderr, and tear it down.

Creating a sandbox and running the Python script looks like:

from tensorlake.sandbox import SandboxClient

sb = SandboxClient()
with sb.create_and_connect(memory_mb=4096, timeout_secs=900) as box:
    result = box.run("python3", ["-c", script], timeout=300)

The host process never executes the candidate code. If the script crashes or exits with an error, you still get result.stdout and result.stderr back. To run the sandboxes in parallel, we used Tensorlake’s map-reduce capabilities.
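As an illustration of the race step (not the Tensorlake map-reduce API itself), one local way to fan out candidates is a thread pool, since each run is mostly I/O-bound waiting on a remote sandbox:

```python
from concurrent.futures import ThreadPoolExecutor

def race(candidates, run_one, max_workers=8):
    """Run each candidate concurrently; run_one returns (val_loss, script)
    for a successful run, or None if the sandbox run failed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, candidates))
    # Drop failed runs; the caller picks min() by val_loss.
    return [r for r in results if r is not None]
```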

The full autoresearch example (baseline script, prompt, runner, and acceptance logic) is in the Tensorlake docs: https://docs.tensorlake.ai/sandboxes/agentic-autoresearch

What the agent needs to remember

The agent does better if it can see what has already been tried and what happened. A lightweight approach is to append a small experiment log to the prompt, for example the last 8 iterations:

[✓ ACCEPTED] iter=2 val=2.8103 Δ-0.0412 — Added LR decay: multiply LR by 0.999 each step
[✗ rejected] iter=3 val=2.8634 Δ+0.0531 — Increased hidden size from 64 to 128
[✓ ACCEPTED] iter=4 val=2.7891 Δ-0.0212 — Replaced tanh with ReLU activation
[✗ rejected] iter=5 val=2.9102 Δ+0.1211 — Added second hidden layer (size 32)
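Building that log fragment for the prompt is a few lines, assuming each experiment is recorded as a small dict (the field names here are illustrative):

```python
def format_log(experiments, last_n=8):
    """Render the last N experiments in the log format shown above."""
    lines = []
    for e in experiments[-last_n:]:
        mark = "✓ ACCEPTED" if e["accepted"] else "✗ rejected"
        lines.append(f"[{mark}] iter={e['iter']} val={e['val']:.4f} "
                     f"Δ{e['delta']:+.4f} — {e['note']}")
    return "\n".join(lines)
```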

You can also vary temperature within a batch: keep the first candidate conservative and make later candidates more exploratory.
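A minimal sketch of such a schedule, spacing temperatures evenly from conservative to exploratory (the bounds are illustrative, not from the original setup):

```python
def candidate_temperatures(n, t_min=0.2, t_max=1.0):
    """Evenly spaced temperatures: first candidate conservative,
    later candidates progressively more exploratory."""
    if n == 1:
        return [t_min]
    step = (t_max - t_min) / (n - 1)
    return [round(t_min + i * step, 2) for i in range(n)]
```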


Preventing reward hacking

One constraint worth enforcing is a fixed training budget per run. If the agent can change STEPS, it can reduce val_loss by training longer, which makes runs incomparable.

In Karpathy’s setup this is handled in the human-controlled guidance:

STEPS  (DO NOT CHANGE — fixed budget)

On the execution side, sandboxes give you a second control lever: each run has a bounded environment and a ceiling on resources.
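A cheap programmatic guard, as a sketch, is to reject any candidate whose script reassigns the budget before it ever reaches a sandbox. The regex below assumes the budget is a top-level `STEPS = <int>` assignment:

```python
import re

def steps_unchanged(script, expected_steps):
    """True only if the script assigns STEPS exactly once at top level,
    and to the expected fixed budget."""
    matches = re.findall(r"^STEPS\s*=\s*(\d+)", script, flags=re.M)
    return len(matches) == 1 and int(matches[0]) == expected_steps
```

Rejected candidates can be logged as such in the experiment log, which also teaches the agent that changing the budget is a dead end.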


What the loop tends to find

For an MLP trained on a small public-domain corpus, a simple hill-climbing loop often accepts a minority of proposals (for example, ~2–4 out of 8). The accepted changes are usually straightforward training tweaks:

  • Learning-rate decay (e.g., multiply LR by 0.999 each step)
  • Activation function (e.g., tanh → ReLU)
  • Initialization scale
  • Momentum (SGD momentum vs plain SGD)

The value of this is not new research outcomes. It is running many small, validated changes without manual iteration.


Running it

pip install tensorlake openai rich python-dotenv

Set your API keys (for example in a .env file, since python-dotenv is included):

TENSORLAKE_API_KEY="your-api-key-here"
OPENAI_API_KEY="your-openai-key-here"

Start with a short smoke test before running a longer sweep, to verify that everything is in place:

  • --smoke (example: 3 iterations, 2 candidates, 150 steps)

Docs and the full example: https://docs.tensorlake.ai/sandboxes/agentic-autoresearch

Pre-warmed environments

If each candidate installs dependencies (for example, numpy) at the start of every run, that overhead can dominate short experiments. One way to avoid repeated installs is:

  1. Install dependencies once.
  2. Snapshot the sandbox.
  3. Start each candidate from that snapshot.

Tensorlake’s snapshot API is designed to restore a known filesystem and memory state quickly, so each run begins from the same baseline.
