Hill Climbing for Prompts
Automated prompt optimization with keep/revert scoring. Test prompts against gold examples, climb toward better performance across any LLM backend.
pip install promptclimb
Propose, score, then keep or revert, for a configurable number of iterations. Each round builds on the best prompt found so far.
Score against curated input/output pairs for objective evaluation. No more guessing if a prompt is better.
Immutable sections (user templates, format markers) are protected automatically. The proposer only mutates what it's allowed to touch. This alone took the hit rate from 0% to 15%.
Strips proposer meta-commentary ("Here's the improved prompt:") that corrupts mutations. Structural validation rejects broken proposals before scoring.
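A minimal sketch of this kind of sanitation, assuming preamble patterns like the one quoted above (the regex and the marker check are illustrative, not promptclimb's actual implementation):

```python
import re

# Hypothetical preamble patterns a proposer model tends to emit
# before the actual prompt text.
PREAMBLE = re.compile(
    r"^\s*(here'?s\b|sure[,!]|certainly[,!]|below is\b)[^\n]*\n",
    re.IGNORECASE,
)

def strip_meta_commentary(proposal: str) -> str:
    """Drop a leading 'Here's the improved prompt:'-style line."""
    return PREAMBLE.sub("", proposal, count=1).strip()

def is_structurally_valid(proposal: str, required_markers: list[str]) -> bool:
    """Reject proposals that dropped any required structural marker
    before they ever reach the scorer."""
    return all(marker in proposal for marker in required_markers)
```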
Anthropic (Claude), Ollama (local), OpenAI (GPT), and LMStudio backends. Use a cheap model to execute, a smart model to propose.
Haiku proposes first (cheap). If it fails structurally, Sonnet repairs (expensive). Pay for intelligence only when needed.
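The escalation logic amounts to a simple fallback: validate the cheap proposal, and only call the expensive model when validation fails. A sketch (the four callables are hypothetical stand-ins for the real backends and validator):

```python
def propose_with_fallback(prompt, cheap_propose, smart_repair, validate):
    """Try the cheap proposer first; escalate to the expensive model
    only when the cheap proposal fails structural validation."""
    candidate = cheap_propose(prompt)
    if validate(candidate):
        return candidate          # cheap model succeeded: no extra cost
    return smart_repair(candidate)  # pay for intelligence only when needed
```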
Identifies specific cases where the prompt underperforms. Focus improvement where it matters most.
Plateau detection stops wasting iterations when convergence is reached. Save tokens and time automatically.
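One common way to detect a plateau is a sliding-window check on the score history; this sketch (with assumed window and threshold defaults, not promptclimb's actual values) stops when the recent best has not beaten the earlier best by a minimum delta:

```python
def plateaued(score_history, window=5, min_delta=0.005):
    """True when the best score of the last `window` iterations
    improves on the earlier best by less than `min_delta`."""
    if len(score_history) <= window:
        return False                      # not enough history yet
    recent_best = max(score_history[-window:])
    earlier_best = max(score_history[:-window])
    return recent_best - earlier_best < min_delta
```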
Store winning prompts and their scores in memoak for fleet-wide reuse.
promptclimb best | memoak store --key "system-prompt-v3"
Watsan runs promptclimb experiments and analyzes thesis results.
watsan experiment "climb system prompt on sonnet"
API keys for Anthropic/OpenAI backends stored securely in envoak.
envoak push # includes ANTHROPIC_API_KEY, OPENAI_API_KEY
promptclimb run --model <m> --iterations <n>
promptclimb score --prompt <file>
promptclimb propose --from <file>
promptclimb compare <a> <b>
promptclimb history
promptclimb best
Climb from 0.62 to 0.83 on your agent's system prompt. Objective, repeatable improvement with every iteration.
Optimize the query rewriting prompt for better retrieval. Score against known-good query/document pairs.
Run the same climb on Sonnet, Haiku, and GPT to compare ceilings. Find the best model-prompt combination.
Gold examples catch regressions that manual testing misses. Never silently break a working prompt again.
Promptclimb emerged from 354 iterations of automated prompt optimization across 3 machines and 8 experimental arms. The key finding: the optimization ceiling is set by the evaluation signal, not the optimizer.
Template splicing was discovered when Haiku dropped structural markers 100% of the time across 50+ attempts. The fix: don't ask the LLM to preserve structure — remove it from the input and reattach it after. This took the hit rate from 0% to 15%.
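The splice can be sketched in a few lines, assuming the immutable structure sits at a known position (here a suffix; the function and its arguments are illustrative, not promptclimb's API):

```python
def splice_and_mutate(prompt, immutable_suffix, mutate):
    """Template splicing: strip the immutable section before mutation
    and reattach it verbatim after, instead of asking the proposer
    (which dropped it 100% of the time) to preserve it."""
    assert prompt.endswith(immutable_suffix)
    body = prompt[: -len(immutable_suffix)]  # only the mutable part is shown to the LLM
    return mutate(body) + immutable_suffix   # structure reattached untouched
```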
Prompts optimized by the loop transferred across model families: +10% to Claude Sonnet, +8% to Haiku. The loop finds task principles, not model tricks.
"The loop doesn't learn to extract better. It learns to extract better from what it's shown."
Iteratively improve a prompt by proposing changes, scoring them against gold examples, and keeping only improvements. Each iteration builds on the best prompt found so far, climbing toward higher performance.
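Stripped to its core, the loop can be sketched in a few lines of Python (the `propose` and `score` callables are hypothetical stand-ins for the real proposer and gold-example scorer):

```python
def hill_climb(prompt, propose, score, iterations=20):
    """Greedy keep/revert loop: keep a proposal only if it scores higher."""
    best, best_score = prompt, score(prompt)
    for _ in range(iterations):
        candidate = propose(best)            # mutate the current best prompt
        candidate_score = score(candidate)   # evaluate against gold examples
        if candidate_score > best_score:     # keep improvements...
            best, best_score = candidate, candidate_score
        # ...otherwise revert: the next round starts from `best` again
    return best, best_score
```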
Each prompt is evaluated against gold input/output pairs using configurable similarity metrics. The score represents how well the prompt's outputs match the expected gold outputs across your entire test set.
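As a rough sketch of this kind of scoring, using `difflib.SequenceMatcher` as a stand-in similarity metric (`run` is an assumed function that executes the prompt on one input; the real metrics are configurable):

```python
from difflib import SequenceMatcher

def score_prompt(run, prompt, gold_pairs):
    """Mean output similarity across the gold input/output test set."""
    total = 0.0
    for gold_input, gold_output in gold_pairs:
        output = run(prompt, gold_input)                     # execute the prompt
        total += SequenceMatcher(None, output, gold_output).ratio()
    return total / len(gold_pairs)
```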
Anthropic (Claude Sonnet/Haiku), Ollama (any local model), OpenAI (GPT-4o/etc.), and LMStudio. You can run the same climb across different backends to compare performance ceilings and find the best model-prompt combination.
Typically 15-30. The early stop feature detects plateaus automatically, so you can set a higher number and let promptclimb decide when to stop. Most significant improvements happen in the first 10-15 iterations.
Stop guessing. Start climbing.
pip install promptclimb