Hill Climbing for Prompts
Automated prompt optimization with keep/revert scoring. Test prompts against gold examples, climb toward better performance across any LLM backend.
pip install promptclimb
Propose, score, then keep or revert, for a configurable number of iterations. Each round builds on the best prompt found so far.
Score against curated input/output pairs for objective evaluation. No more guessing if a prompt is better.
Immutable sections (user templates, format markers) are protected automatically. The proposer only mutates what it's allowed to touch. This alone took the hit rate from 0% to 15%.
Strips proposer meta-commentary ("Here's the improved prompt:") that corrupts mutations. Structural validation rejects broken proposals before scoring.
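A minimal sketch of this kind of sanitation, assuming preamble patterns like the one quoted above (the regex and the marker check are illustrative, not promptclimb's actual implementation):

```python
import re

# Hypothetical preamble patterns a proposer model tends to emit
# before the actual prompt text.
PREAMBLE = re.compile(
    r"^\s*(here'?s\b|sure[,!]|certainly[,!]|below is\b)[^\n]*\n",
    re.IGNORECASE,
)

def strip_meta_commentary(proposal: str) -> str:
    """Drop a leading 'Here's the improved prompt:'-style line."""
    return PREAMBLE.sub("", proposal, count=1).strip()

def is_structurally_valid(proposal: str, required_markers: list[str]) -> bool:
    """Reject proposals that dropped any required structural marker
    before they ever reach the scorer."""
    return all(marker in proposal for marker in required_markers)
```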
Anthropic (Claude), Ollama (local), OpenAI (GPT), and LMStudio backends. Use a cheap model to execute, a smart model to propose.
Haiku proposes first (cheap). If it fails structurally, Sonnet repairs (expensive). Pay for intelligence only when needed.
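The escalation logic amounts to a simple fallback: validate the cheap proposal, and only call the expensive model when validation fails. A sketch (the four callables are hypothetical stand-ins for the real backends and validator):

```python
def propose_with_fallback(prompt, cheap_propose, smart_repair, validate):
    """Try the cheap proposer first; escalate to the expensive model
    only when the cheap proposal fails structural validation."""
    candidate = cheap_propose(prompt)
    if validate(candidate):
        return candidate          # cheap model succeeded: no extra cost
    return smart_repair(candidate)  # pay for intelligence only when needed
```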
Identifies specific cases where the prompt underperforms. Focus improvement where it matters most.
Plateau detection stops wasting iterations when convergence is reached. Save tokens and time automatically.
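One common way to detect a plateau is a sliding-window check on the score history; this sketch (with assumed window and threshold defaults, not promptclimb's actual values) stops when the recent best has not beaten the earlier best by a minimum delta:

```python
def plateaued(score_history, window=5, min_delta=0.005):
    """True when the best score of the last `window` iterations
    improves on the earlier best by less than `min_delta`."""
    if len(score_history) <= window:
        return False                      # not enough history yet
    recent_best = max(score_history[-window:])
    earlier_best = max(score_history[:-window])
    return recent_best - earlier_best < min_delta
```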
Store winning prompts and their scores in memoak for fleet-wide reuse.
promptclimb best | memoak store --key "system-prompt-v3"
Watsan runs promptclimb experiments and analyzes thesis results.
watsan experiment "climb system prompt on sonnet"
API keys for Anthropic/OpenAI backends stored securely in envoak.
envoak push # includes ANTHROPIC_API_KEY, OPENAI_API_KEY
promptclimb run --model <m> --iterations <n>
promptclimb score --prompt <file>
promptclimb propose --from <file>
promptclimb compare <a> <b>
promptclimb history
promptclimb best
Climb from 0.62 to 0.83 on your agent's system prompt. Objective, repeatable improvement with every iteration.
Optimize the query rewriting prompt for better retrieval. Score against known-good query/document pairs.
Run the same climb on Sonnet, Haiku, and GPT to compare ceilings. Find the best model-prompt combination.
Gold examples catch regressions that manual testing misses. Never silently break a working prompt again.
Promptclimb emerged from 354 iterations of automated prompt optimization across 3 machines and 8 experimental arms. The key finding: the optimization ceiling is set by the evaluation signal, not the optimizer.
Template splicing was discovered when Haiku dropped structural markers 100% of the time across 50+ attempts. The fix: don't ask the LLM to preserve structure — remove it from the input and reattach it after. This took the hit rate from 0% to 15%.
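The splice can be sketched in a few lines, assuming the immutable structure sits at a known position (here a suffix; the function and its arguments are illustrative, not promptclimb's API):

```python
def splice_and_mutate(prompt, immutable_suffix, mutate):
    """Template splicing: strip the immutable section before mutation
    and reattach it verbatim after, instead of asking the proposer
    (which dropped it 100% of the time) to preserve it."""
    assert prompt.endswith(immutable_suffix)
    body = prompt[: -len(immutable_suffix)]  # only the mutable part is shown to the LLM
    return mutate(body) + immutable_suffix   # structure reattached untouched
```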
Prompts optimized by the loop transferred across model families: +10% to Claude Sonnet, +8% to Haiku. The loop finds task principles, not model tricks.
"The loop doesn't learn to extract better. It learns to extract better from what it's shown."
Iteratively improve a prompt by proposing changes, scoring them against gold examples, and keeping only improvements. Each iteration builds on the best prompt found so far, climbing toward higher performance.
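Stripped to its core, the loop can be sketched in a few lines of Python (the `propose` and `score` callables are hypothetical stand-ins for the real proposer and gold-example scorer):

```python
def hill_climb(prompt, propose, score, iterations=20):
    """Greedy keep/revert loop: keep a proposal only if it scores higher."""
    best, best_score = prompt, score(prompt)
    for _ in range(iterations):
        candidate = propose(best)            # mutate the current best prompt
        candidate_score = score(candidate)   # evaluate against gold examples
        if candidate_score > best_score:     # keep improvements...
            best, best_score = candidate, candidate_score
        # ...otherwise revert: the next round starts from `best` again
    return best, best_score
```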
Each prompt is evaluated against gold input/output pairs using configurable similarity metrics. The score represents how well the prompt's outputs match the expected gold outputs across your entire test set.
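As a rough sketch of this kind of scoring, using `difflib.SequenceMatcher` as a stand-in similarity metric (`run` is an assumed function that executes the prompt on one input; the real metrics are configurable):

```python
from difflib import SequenceMatcher

def score_prompt(run, prompt, gold_pairs):
    """Mean output similarity across the gold input/output test set."""
    total = 0.0
    for gold_input, gold_output in gold_pairs:
        output = run(prompt, gold_input)                     # execute the prompt
        total += SequenceMatcher(None, output, gold_output).ratio()
    return total / len(gold_pairs)
```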
Anthropic (Claude Sonnet/Haiku), Ollama (any local model), OpenAI (GPT-4o/etc.), and LMStudio. You can run the same climb across different backends to compare performance ceilings and find the best model-prompt combination.
Typically 15-30. The early stop feature detects plateaus automatically, so you can set a higher number and let promptclimb decide when to stop. Most significant improvements happen in the first 10-15 iterations.
Stop guessing. Start climbing.
pip install promptclimb