This is pretty exciting.

I had only vaguely heard about Karpathy's Autoresearch: it's a way of getting an LLM to run experiments on itself(?), running for long periods of time.

It wasn't until I read David Cortés' autoresearch-pi skill that I understood the significance. The skill lets you give your coding agent an optimization target for some or all of your code, then it:

  • comes up with ideas,
  • tries each of them out,
  • keeps the ideas that improve the optimization target,
  • reverts the ones that don't, and
  • loops back to come up with more ideas.
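The loop above is essentially greedy hill-climbing on a metric. Here's a minimal sketch of the idea; the `propose_idea`, `apply`, `revert`, and `measure` callables are hypothetical stand-ins for what the agent actually does (edit code, run the benchmark, `git revert`), not the skill's real API:

```python
def autoresearch_loop(propose_idea, apply, revert, measure, iterations=10):
    """Greedy hill-climb: keep each change only if it improves the metric.

    All four callables are hypothetical stand-ins for the agent's actions.
    `measure` returns the optimization target; lower is better here
    (e.g. test-suite runtime in seconds).
    """
    best = measure()
    kept = []
    for _ in range(iterations):
        idea = propose_idea()
        apply(idea)
        score = measure()
        if score < best:
            best = score        # the idea helped: keep it
            kept.append(idea)
        else:
            revert(idea)        # the idea didn't help: undo it
    return best, kept
```

One consequence of this greedy shape is that single ideas which only pay off in combination get reverted individually — which is why it's notable when the agent manages to string several together (as it did below with parallel test running).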

I tried it on improving the integration test performance of a project I work on. The details are not important, but I include them to give a sense of how many moving parts are involved.

The tests verify the correctness of the FFI generator. It's a complicated test setup:

  1. A Rust test file with a macro, which generates the actual test function.
  2. The generated test function takes the fixture crate, compiles it, then generates the FFI for the crate.
  3. The generated FFI is itself part TypeScript and part C++.
  4. The actual test is a TypeScript file written for the fixture crate.
  5. The test.ts and the TypeScript part of the generated FFI are bundled together with tsc and metro.
  6. The bundle is sent off to a hand-made C++ test runner which runs the TypeScript in a Hermes execution environment.

There is a similarly convoluted setup to run the same tests with the same fixture crates, but in a WASM environment.

The tests are pretty extensive, and on my machine they take about 30-40 minutes. They take so long that I don't run them all at once very often, preferring to run them one at a time, or just JSI, or just WASM.

Over the past couple of years I've tried to improve the performance, but I've always ended up backing away: it's too delicate and too important to break, but not important enough to make faster.

After setting up pi.dev and installing the plugin, I answered a few questions. These were essentially:

  • what do I want to optimize and how do I measure it?
  • how do we know an experiment didn't break anything?
  • are there any ideas that I wanted to try first?

The plugin is a skill, which I guess could be ported to Claude or Codex, or whatever, and a gadget which updates the status bar.

I chose to optimize the JSI test performance, picking out a couple of representative tests to measure. To check that nothing had broken, it would run the entire JSI test suite.

Then I went to bed.

I came back in the morning, and it had got the whole test suite down to 74 seconds.

Its most significant win was to string together three ideas which together unlocked parallel test running.

Since then, I have run it twice more: once for the WASM test suite, and then, once both suites were under two minutes and running in parallel, once more to optimize for a realistic development scenario: mutating a template which changes the generated FFI, so that only code changes which alter the generated code trigger an expensive recompile of the C++. For the second and third runs, I limited the area of code it could change to one crate.

Now, after two nights of optimizing and almost no effort from me, the entire test suite runs in 1m42s.

A powerful new tool for our toolsets has been discovered.