Autoresearch on Steroids with Sandboxes

Last month, Andrej Karpathy published autoresearch. The idea is simple: give an LLM agent a training script and a plain-English definition of the search space (the parameters and code changes worth trying). The agent proposes a change, you run it, measure the resulting metric, and repeat.
The idea took off quickly: Karpathy’s repo hit ~64k GitHub stars within a few weeks.
The missing piece is execution. The agent generates Python scripts, and you should not run those in your host process. These scripts may write to disk with catastrophic consequences for your files, install packages you did not intend to install, crash your training loop, or hang and consume resources.
Sandboxes solve this by running each candidate in an isolated environment with explicit time and resource limits. This post shows one way to build an autoresearch loop using Tensorlake Sandboxes as the execution layer.
The autoresearch loop
You can treat autoresearch as a hill‑climbing search over training‑script edits:
- Calibrate: run the baseline script and record the metric val_loss.
- Propose: ask the agent for N modifications (each returned as a complete Python script).
- Race: run all N candidates and record val_loss for each.
- Accept: promote the best candidate if it beats the current best.
- Repeat: the accepted script becomes the new baseline.
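The steps above can be sketched as a minimal loop. This is a sketch, not the repo's actual code: `propose_candidates` and `run_candidate` are hypothetical stand-ins for the LLM call and the sandboxed run.

```python
import random

def propose_candidates(best_script, n):
    # Stand-in for the LLM call: return n modified copies of the script.
    return [best_script + f"\n# variant {i}" for i in range(n)]

def run_candidate(script):
    # Stand-in for a sandboxed run: return the measured val_loss.
    return random.uniform(2.5, 3.0)

def autoresearch(baseline_script, iterations=3, n_candidates=4):
    # Calibrate: run the baseline once and record its metric.
    best_script, best_loss = baseline_script, run_candidate(baseline_script)
    for _ in range(iterations):
        # Propose: ask the agent for N complete scripts.
        candidates = propose_candidates(best_script, n_candidates)
        # Race: run all N candidates and record val_loss for each.
        results = [(run_candidate(c), c) for c in candidates]
        loss, script = min(results, key=lambda r: r[0])
        # Accept: promote the winner only if it beats the current best.
        if loss < best_loss:
            best_loss, best_script = loss, script
    # Repeat happens implicitly: the accepted script is the new baseline.
    return best_script, best_loss

script, loss = autoresearch("print('train')")
```

Because the accept step only ever keeps strict improvements, `best_loss` is monotonically non-increasing across iterations, which is what makes this a hill climb rather than a random walk.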
Karpathy reports running ~700 experiments overnight and getting an ~11% accuracy improvement from incremental changes. Not bad for a hands-free method.
Tensorlake Sandboxes for autoresearch loops
A sandbox gives you an isolated execution environment for each generated candidate program. You start the sandbox, run the script, capture stdout and stderr, and tear it down.
Creating and running the Python script looks like:

```python
from tensorlake.sandbox import SandboxClient

sb = SandboxClient()
with sb.create_and_connect(memory_mb=4096, timeout_secs=900) as box:
    result = box.run("python3", ["-c", script], timeout=300)
```

The host process does not execute the candidate code. If the script crashes or exits with an error, you still get result.stdout and result.stderr back. To run the sandboxes in parallel, the example uses Tensorlake's map-reduce capabilities.
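The race step has the same shape regardless of how you fan out. As a local illustration (not the Tensorlake map-reduce API), a thread pool over a stand-in runner shows the structure; `run_one` is a hypothetical stub for the create-run-teardown sequence above.

```python
from concurrent.futures import ThreadPoolExecutor

def run_one(script):
    # Stand-in for the sandbox call: create a sandbox, run the script,
    # parse val_loss from stdout, tear the sandbox down. Here the loss
    # is faked from the script length so the example is self-contained.
    return {"script": script, "val_loss": float(len(script))}

def race(candidates):
    # Run all candidates concurrently and keep the best one.
    # A failed run would report an infinite loss rather than crash the loop.
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        results = list(pool.map(run_one, candidates))
    return min(results, key=lambda r: r["val_loss"])

best = race(["aaa", "a", "aa"])  # the shortest script wins the fake metric
```

In the real loop, the only thing that changes is the body of `run_one`: it submits the script to a sandbox instead of measuring its length.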
The full autoresearch example (baseline script, prompt, runner, and acceptance logic) is in the Tensorlake docs: https://docs.tensorlake.ai/sandboxes/agentic-autoresearch
What the agent needs to remember
The agent does better if it can see what has already been tried and what happened. A lightweight approach is to append a small experiment log to the prompt, for example the last 8 iterations:
```
[✓ ACCEPTED] iter=2 val=2.8103 Δ-0.0412 — Added LR decay: multiply LR by 0.999 each step
[✗ rejected] iter=3 val=2.8634 Δ+0.0531 — Increased hidden size from 64 to 128
[✓ ACCEPTED] iter=4 val=2.7891 Δ-0.0212 — Replaced tanh with ReLU activation
[✗ rejected] iter=5 val=2.9102 Δ+0.1211 — Added second hidden layer (size 32)
```

You can also vary temperature within a batch: keep the first candidate conservative and make later candidates more exploratory.
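One way to render that prompt section is a small formatter over the experiment history. This is a sketch under assumed record fields (`iter`, `val`, `delta`, `accepted`, `note`) chosen to match the log format shown above, not the example's actual data model.

```python
def format_log(history, last_n=8):
    # Render the last N experiment records in the format the agent sees.
    lines = []
    for rec in history[-last_n:]:
        mark = "✓ ACCEPTED" if rec["accepted"] else "✗ rejected"
        lines.append(
            f"[{mark}] iter={rec['iter']} val={rec['val']:.4f} "
            f"Δ{rec['delta']:+.4f} — {rec['note']}"
        )
    return "\n".join(lines)

history = [
    {"iter": 2, "val": 2.8103, "delta": -0.0412, "accepted": True,
     "note": "Added LR decay: multiply LR by 0.999 each step"},
    {"iter": 3, "val": 2.8634, "delta": 0.0531, "accepted": False,
     "note": "Increased hidden size from 64 to 128"},
]
prompt_section = format_log(history)
```

Keeping only the last few iterations keeps the prompt short while still letting the agent avoid re-proposing changes that were already rejected.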
Preventing reward hacking
One constraint worth enforcing is a fixed training budget per run. If the agent can change STEPS, it can reduce val_loss by training longer, which makes runs incomparable.
In Karpathy’s setup this is handled in the human-controlled guidance:
STEPS (DO NOT CHANGE — fixed budget)
On the execution side, sandboxes give you a second control lever: each run has a bounded environment and a ceiling on resources.
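A third lever is to check candidates before running them at all. As a sketch (not part of Karpathy's setup or Tensorlake), Python's ast module can verify that a candidate's STEPS assignment matches the baseline's, so budget-changing proposals are rejected without spending a run on them.

```python
import ast

def get_steps(script):
    # Find the value assigned to the STEPS constant, if any.
    # (Simple top-level assignments only; augmented assigns not handled.)
    for node in ast.walk(ast.parse(script)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "STEPS":
                    return ast.literal_eval(node.value)
    return None

def respects_budget(candidate, baseline):
    # Reject candidates that change (or drop) the fixed training budget.
    return get_steps(candidate) == get_steps(baseline)

baseline = "STEPS = 2000\nlr = 0.01"
ok = respects_budget("STEPS = 2000\nlr = 0.02", baseline)   # same budget
bad = respects_budget("STEPS = 4000\nlr = 0.01", baseline)  # longer run
```

A static check like this is cheap and runs on the host safely, since it parses the candidate without executing it.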
What the loop tends to find
For an MLP trained on a small public-domain corpus, a simple hill-climbing loop often accepts a minority of proposals (for example, ~2–4 out of 8). The accepted changes are usually straightforward training tweaks:
- Learning-rate decay (e.g., multiply LR by 0.999 each step)
- Activation function (e.g., tanh → ReLU)
- Initialization scale
- Momentum (SGD momentum vs plain SGD)
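The first item on that list really is a one-line change inside the training step. A generic sketch (not the example's actual script), with hypothetical starting values:

```python
lr = 0.1        # initial learning rate (illustrative value)
decay = 0.999   # per-step decay factor from the accepted proposal

for step in range(1000):
    # ... forward pass, loss, backward pass, update parameters at rate lr ...
    lr *= decay  # learning-rate decay: shrink LR a little every step
```

After 1000 steps the rate has dropped to roughly a third of its starting value, which is the kind of small, measurable tweak the loop tends to accept.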
The value of this is not new research outcomes. It is running many small, validated changes without manual iteration.
Running it
```
pip install tensorlake openai rich python-dotenv
```

Set the two API keys the example expects:

```
TENSORLAKE_API_KEY="your-api-key-here"
OPENAI_API_KEY="your-openai-key-here"
```

Before running a longer sweep, start with a short smoke test to verify that everything is in place by passing --smoke (example: 3 iterations, 2 candidates, 150 steps).
Docs and the full example: https://docs.tensorlake.ai/sandboxes/agentic-autoresearch
Pre-warmed environments
If each candidate installs dependencies (for example, numpy) at the start of every run, that overhead can dominate short experiments. One way to avoid repeated installs is:
- Install dependencies once.
- Snapshot the sandbox.
- Start each candidate from that snapshot.
Tensorlake’s snapshot API is designed to restore a known filesystem and memory state quickly, so each run begins from the same baseline.