Recursive Reflector (RR) Design¶

Design document for the Recursive Reflector module (ace_next/rr/). The RR is a REPL-based trace analyser that iteratively calls an LLM to generate Python code, executes it in a sandbox, and builds structured reflections from agent execution traces.

Overview¶

The Recursive Reflector replaces the single-pass Reflector with an iterative code-execution loop. Instead of asking the LLM for a one-shot analysis, RR gives the LLM a Python REPL with pre-loaded trace data and lets it explore, query a sub-agent, and submit findings when ready.

Key properties:

Satisfies both StepProtocol and ReflectorLike — usable as a pipeline step or a drop-in reflector replacement.
Extends SubRunner (from ace_next/core/sub_runner.py) — runs an inner Pipeline in a loop.
Single shared CallBudget enforces combined LLM call limit across main calls and sub-agent calls.
Produces ReflectorOutput with an enriched raw["rr_trace"] dict for downstream observability.

from ace_next.rr import RRStep, RRConfig

# Drop-in replacement for Reflector
ace = ACELiteLLM(llm, reflector=RRStep(llm, config=RRConfig(max_iterations=10)))

# Or as a pipeline step
pipe = Pipeline([..., RRStep(llm), ...])

Architecture¶

REPL Loop¶

Each invocation of RRStep runs an iterative loop:

┌─────────────────────────────────────────────────────────┐
│  RRStep.run_loop()                                      │
│                                                         │
│  for each iteration (up to max_iterations):             │
│    ┌──────────┐   ┌──────────────┐   ┌──────────────┐  │
│    │LLMCallStep│ → │ExtractCodeStep│ → │SandboxExecStep│ │
│    └──────────┘   └──────────────┘   └──────────────┘  │
│         │                                    │          │
│         │              ┌──────────────┐      │          │
│         └──────────────│CheckResultStep│←─────┘          │
│                        └──────────────┘                 │
│                              │                          │
│                    ┌─────────┴──────────┐               │
│                    │                    │               │
│              FINAL() called?      Build feedback        │
│              ↓ yes                ↓ no                  │
│         Return result      Next iteration               │
└─────────────────────────────────────────────────────────┘

Inner Pipeline Steps¶

Each iteration runs four steps sequentially:

Step	Requires	Provides	Description
`LLMCallStep`	`messages`	`llm_response`	Trims message history, calls LLM (respects shared budget)
`ExtractCodeStep`	`llm_response`	`code`, `direct_response`	Extracts Python from response (3-layer fallback)
`SandboxExecStep`	`code`	`exec_result`	Executes code in `TraceSandbox` with timeout
`CheckResultStep`	`exec_result`, `messages`, `llm_response`	`terminated`, `reflection`, `feedback_messages`	Validates result, parses FINAL(), builds feedback

Dual Protocol Support¶

RRStep satisfies two protocols simultaneously:

class RRStep(SubRunner):
    # StepProtocol — place in any Pipeline
    requires = frozenset({"trace", "skillbook"})
    provides = frozenset({"reflection"})

    def __call__(self, ctx: ACEStepContext) -> ACEStepContext: ...

    # ReflectorLike — use as drop-in reflector in runners
    def reflect(self, *, question, agent_output, skillbook, ...) -> ReflectorOutput: ...

RRStep¶

Constructor¶

RRStep(
    llm: Any,                              # LLM client (must have complete_messages)
    config: Optional[RRConfig] = None,     # Configuration (defaults to RRConfig())
    prompt_template: str = REFLECTOR_RECURSIVE_PROMPT,  # Customisable prompt
    subagent_llm: Any = None,              # Optional separate LLM for sub-agent
)

Parameter	Description
`llm`	Main LLM client. Must expose `complete_messages(messages) -> response` where `response.text` is the text.
`config`	`RRConfig` instance controlling iteration limits, timeouts, budgets, and sub-agent settings.
`prompt_template`	The initial prompt sent to the LLM. Must contain 12 format variables (see Prompt Template Variables). Default is v5.6.
`subagent_llm`	Optional separate LLM for `ask_llm()` sub-agent calls. If `None`, uses the main `llm`. Useful for routing sub-agent calls to a smaller/faster model.

SubRunner Template Methods¶

RRStep extends SubRunner and overrides these template methods:

Method	Description
`_build_inner_pipeline(**kwargs)`	Creates `Pipeline([LLMCallStep, ExtractCodeStep, SandboxExecStep, CheckResultStep])`. Fresh pipeline per `run_loop()` call (steps hold mutable state).
`_build_initial_context(**kwargs)`	Creates `RRIterationContext(messages=(initial_prompt,), iteration=0)`.
`_is_done(ctx)`	Returns `ctx.terminated` (set by `CheckResultStep` when `FINAL()` is accepted).
`_extract_result(ctx)`	Returns `ctx.reflection` (the parsed `ReflectorOutput`).
`_accumulate(ctx)`	Appends feedback messages to history, increments iteration counter.
`_on_timeout(last_ctx, iteration, **kwargs)`	Builds a fallback `ReflectorOutput`. Optionally attempts fallback synthesis (see Fallback Synthesis).
`run_loop(**kwargs)`	Overrides base to collect per-iteration data into `iteration_log` for observability.

Prompt Template Variables¶

The prompt_template is formatted with these variables:

Variable	Type	Description
`{question_length}`	`int`	Character count of the question
`{question_preview}`	`str`	Truncated preview (150 chars max)
`{reasoning_length}`	`int`	Character count of agent reasoning
`{reasoning_preview}`	`str`	Truncated preview
`{answer_length}`	`int`	Character count of agent answer
`{answer_preview}`	`str`	Truncated preview
`{ground_truth_length}`	`int`	Character count of ground truth
`{ground_truth_preview}`	`str`	Truncated preview
`{feedback_length}`	`int`	Character count of feedback
`{feedback_preview}`	`str`	Truncated preview
`{skillbook_length}`	`int`	Character count of skillbook text
`{step_count}`	`int`	Number of trace steps

RRConfig¶

Exported as RRConfig (aliased from RecursiveConfig).

from ace_next.rr import RRConfig

config = RRConfig(
    max_iterations=20,           # Max REPL iterations before timeout
    timeout=30.0,                # Per-execution timeout in seconds (Unix only)
    enable_llm_query=True,       # Enable llm_query() in sandbox
    max_llm_calls=30,            # Combined budget for main LLM + sub-agent calls
    max_context_chars=50_000,    # Message history trim threshold
    max_output_chars=20_000,     # Per-execution output truncation limit
    enable_subagent=True,        # Enable ask_llm() sub-agent function
    subagent_model=None,         # Sub-agent model (None = same as main)
    subagent_max_tokens=8192,    # Max tokens for sub-agent responses
    subagent_temperature=0.3,    # Temperature for sub-agent responses
    subagent_system_prompt=None, # Custom sub-agent system prompt (None = default)
    enable_fallback_synthesis=True,  # Attempt LLM synthesis on timeout
)

Parameter	Default	Description
`max_iterations`	`20`	Maximum REPL loop iterations. When reached, `_on_timeout` fires.
`timeout`	`30.0`	Seconds per sandbox `execute()` call. Uses `signal.SIGALRM` on Unix; not enforced on Windows or non-main threads.
`enable_llm_query`	`True`	Whether `llm_query()` is available in the sandbox.
`max_llm_calls`	`30`	Single shared budget across main LLM calls and sub-agent calls. Prevents effective budget from being 2x the configured value.
`max_context_chars`	`50_000`	When message history exceeds this, low-value iterations are trimmed (see Message Trimming).
`max_output_chars`	`20_000`	Per-execution stdout/stderr is truncated at this limit with a `[TRUNCATED: N chars remaining]` suffix.
`enable_subagent`	`True`	Whether `ask_llm()` is available in the sandbox. When `False`, `ask_llm()` returns a stub message.
`subagent_model`	`None`	Model for sub-agent calls. `None` means use the main reflector's model.
`subagent_max_tokens`	`8192`	Max tokens for sub-agent responses.
`subagent_temperature`	`0.3`	Temperature for sub-agent responses.
`subagent_system_prompt`	`None`	Custom system prompt for sub-agent. `None` uses the default analysis prompt.
`enable_fallback_synthesis`	`True`	When `True` and max iterations is reached, attempts one more LLM call to synthesise a FINAL() from the conversation history.

RRIterationContext¶

Frozen dataclass carrying state through the four inner steps of each REPL iteration. Extends StepContext.

@dataclass(frozen=True)
class RRIterationContext(StepContext):
    # Input for this iteration
    messages: tuple[dict[str, str], ...] = ()
    iteration: int = 0

    # LLMCallStep output
    llm_response: str | None = None

    # ExtractCodeStep output
    code: str | None = None
    direct_response: str | None = None

    # SandboxExecStep output
    exec_result: Any | None = None  # ExecutionResult

    # CheckResultStep output
    terminated: bool = False
    reflection: Any | None = None  # ReflectorOutput when FINAL() accepted
    feedback_messages: tuple[dict[str, str], ...] = ()

Each iteration creates a fresh context via .replace(). The _accumulate method appends feedback_messages to messages for the next iteration.

TraceSandbox¶

Lightweight exec()-based sandbox for running LLM-generated Python code. Located in ace_next/rr/sandbox.py.

Not a security sandbox. Restricts builtins as defence-in-depth but relies on trusting the LLM not to generate malicious code. Do not use for untrusted code.

Pre-loaded Namespace¶

Variable	Type	Description
`trace`	`TraceContext \\| None`	The agent execution trace
`traces`	`dict`	Canonical traces dict (question, ground_truth, feedback, steps)
`skillbook`	`str`	Skillbook text via `as_prompt()`
`ask_llm`	`Callable`	Sub-agent query function (see Sub-Agent)
`llm_query`	`Callable`	Alias for `ask_llm(prompt, "")` (backward compat)
`FINAL`	`Callable`	Submit final result dict
`FINAL_VAR`	`Callable`	Submit a named variable as final result
`SHOW_VARS`	`Callable`	Print available variables (debugging)
`json`	module	`json` standard library
`re`	module	`re` standard library
`collections`	module	`collections` standard library
`datetime`	class	`datetime.datetime`
`timedelta`	class	`datetime.timedelta`
`date`	class	`datetime.date`
`time`	class	`datetime.time`
`timezone`	class	`datetime.timezone`

Blocked Builtins¶

open, __import__, eval, exec, compile, input, globals, locals, breakpoint, memoryview — all set to None.

safe_getattr¶

The builtin getattr is replaced with a safe version that blocks access to names starting with _:

def safe_getattr(obj, name, *default):
    if name.startswith("_"):
        raise AttributeError(f"Access to '{name}' blocked")
    return getattr(obj, name, *default)

Available as both the builtin getattr and safe_getattr in the namespace.

FINAL(value)¶

Submits the analysis result. value should be a dict matching ReflectorOutput fields:

FINAL({
    "reasoning": "...",
    "error_identification": "...",
    "root_cause_analysis": "...",
    "correct_approach": "...",
    "key_insight": "...",
    "extracted_learnings": [
        {"learning": "...", "atomicity_score": 0.8, "evidence": "..."},
    ],
    "skill_tags": [
        {"id": "section-00001", "tag": "helpful"},
    ],
})

Raises StopIteration internally to exit the exec() call. CheckResultStep catches this and parses the value into a ReflectorOutput.

FINAL_VAR(name)¶

Convenience function to submit a pre-built variable:

result = {"reasoning": "...", "extracted_learnings": [...]}
# ... build result across multiple code blocks ...
FINAL_VAR("result")  # equivalent to FINAL(result)

Raises ValueError if the variable doesn't exist in the namespace.

SHOW_VARS()¶

Debug function that prints available user variables (excludes builtins, modules, and internal names).

ExecutionResult¶

Return type of sandbox.execute():

@dataclass
class ExecutionResult:
    stdout: str = ""
    stderr: str = ""
    final_value: Any = None
    exception: Optional[Exception] = None

    @property
    def success(self) -> bool:
        return self.exception is None

Timeout Behaviour¶

Unix (main thread): Uses signal.SIGALRM. Raises ExecutionTimeoutError after config.timeout seconds.
Windows / non-main thread: No timeout enforcement. Code runs to completion.

inject(name, value)¶

Add or override a variable in the sandbox namespace after construction.

reset()¶

Clear final_value and final_called state. Used by CheckResultStep when rejecting premature or errored FINAL() calls.

Sub-Agent¶

The sub-agent system provides an LLM-callable function (ask_llm) inside the sandbox, enabling the main reflector's code to delegate semantic analysis to a secondary LLM call.

ask_llm(question, context="", mode="analysis")¶

Available in the sandbox when config.enable_subagent=True. Calls the sub-agent LLM with a formatted prompt.

Parameter	Type	Default	Description
`question`	`str`	required	The question to ask
`context`	`str`	`""`	Data to analyse (trace excerpt, code output, etc.)
`mode`	`str`	`"analysis"`	Prompt protocol: `"analysis"` for survey, `"deep_dive"` for investigation

When config.enable_subagent=False, returns "(ask_llm disabled - analyze with code)".

llm_query(prompt)¶

Backward-compatible alias: llm_query(prompt) calls ask_llm(prompt, "").

Modes and System Prompts¶

Mode	Prompt	Purpose
`"analysis"`	`SUBAGENT_ANALYSIS_PROMPT`	Survey/categorisation pass — descriptive summaries for downstream categorisation
`"deep_dive"`	`SUBAGENT_DEEPDIVE_PROMPT`	Investigation pass — evidence-rich analysis with root cause identification
unknown	`config.system_prompt`	Falls back to the configured system prompt

CallBudget¶

Shared budget enforcing a single limit across main LLM calls and sub-agent calls:

budget = CallBudget(max_calls=30)
budget.consume()    # True (29 remaining)
budget.count        # 1
budget.exhausted    # False

When the budget is exhausted: - LLMCallStep returns an empty response and logs a warning. - ask_llm returns a limit message: "(Max N LLM calls exceeded - continue with available data)".

The budget is shared — config.max_llm_calls=30 means 30 total calls, not 30 main + 30 sub-agent.

SubAgentConfig¶

@dataclass
class SubAgentConfig:
    model: Optional[str] = None       # None = same model as main reflector
    max_tokens: int = 8192
    temperature: float = 0.3
    system_prompt: str = DEFAULT_SUBAGENT_SYSTEM_PROMPT

SubAgentLLM¶

Wrapper class that tracks call history and provides the ask() method:

subagent = SubAgentLLM(llm, config=SubAgentConfig(), subagent_llm=separate_llm)
subagent.ask("What pattern do you see?", context="...", mode="deep_dive")
subagent.call_count     # 1
subagent.call_history   # [{"call_number": 1, "question": "...", ...}]
subagent.reset()        # Clear count and history

create_ask_llm_function¶

Factory that creates the bounded ask_llm callable injected into the sandbox:

ask_llm_fn = create_ask_llm_function(
    llm=llm,                    # Main LLM client
    config=SubAgentConfig(),    # Sub-agent configuration
    subagent_llm=None,          # Optional separate LLM
    max_calls=20,               # Standalone limit (when no budget)
    budget=CallBudget(30),      # Shared budget (overrides max_calls)
)

When budget is provided, it takes precedence over max_calls. The returned callable has .subagent and .max_calls attributes for introspection.

TraceContext¶

Structured trace wrapper for programmatic exploration in the sandbox. Located in ace_next/rr/trace_context.py.

TraceStep¶

@dataclass
class TraceStep:
    index: int
    action: str               # e.g. "reasoning", "tool_call:search", "user_message"
    thought: str              # Main content (reasoning, user text, tool args)
    observation: str          # Tool result or answer
    timestamp: Optional[float] = None
    metadata: Optional[Dict[str, Any]] = None

Method/Property	Description
`content`	Combined `thought + observation`
`preview(max_len=300)`	Truncated preview with char count
`__repr__()`	Short format: `TraceStep(0: reasoning...)`
`__str__()`	Detailed multi-line format

TraceContext Methods¶

Method	Description
`steps`	Property returning all `TraceStep` objects
`raw_reasoning`	Property returning the raw reasoning text
`get_step(index)`	Get step by index (returns `None` if out of bounds)
`find_steps(pattern, case_sensitive=False)`	Find steps matching a string pattern
`find_steps_regex(pattern, flags=0)`	Find steps matching a regex pattern
`get_errors()`	Find steps containing error indicators (`error`, `exception`, `failed`, `traceback`)
`get_actions(action_type)`	Get steps with a specific action type
`summary()`	Brief summary string
`to_markdown()`	Render as markdown conversation trace
`search_raw(pattern)`	Search steps, return matching indices
`search_raw_text(pattern)`	Search raw reasoning, return matched substrings
`__len__()`, `__iter__()`, `__getitem__()`	Standard container protocol

Factory Methods¶

Method	Input	Description
`from_agent_output(agent_output)`	`AgentOutput`	Auto-detects `[assistant]/[user]` markers for multi-step traces
`from_reasoning_string(reasoning)`	`str`	Parses numbered steps or falls back to single-step
`from_browser_use(history)`	browser-use `AgentHistory`	Converts browser automation history
`from_langchain(intermediate_steps)`	`list[tuple]`	Converts LangChain `(AgentAction, observation)` tuples
`from_conversation_history(messages, max_text_len=1000)`	`list[dict]`	Parses `{"role": ..., "content": ...}` message lists
`from_tau_simulation(messages, system_prompt="")`	TAU-bench messages	Handles `AssistantMessage`, `ToolMessage` with tool calls
`combine(traces)`	`list[TraceContext]`	Merge multiple traces with re-indexing

Code Extraction¶

Three-layer fallback chain for extracting Python code from LLM responses. Located in ace_next/rr/code_extraction.py.

Layer	Function	Strategy
1. Fenced	`extract_fenced_blocks()`	```python, `~~~python`, bare ``` (validated)
2. Indented	`extract_indented_block()`	4-space or tab indentation
3. FINAL	`extract_final_call()`	Balanced parenthesis extraction of `FINAL(...)`

Batch Mode¶

When the first fenced block starts with # BATCH, all fenced blocks in the response are concatenated into a single script:

# In LLM response:
# ```python
# # BATCH
# result_a = analyze_part_a()
# ```
# ```python
# result_b = analyze_part_b()
# ```
# → Both blocks execute as one script

Validation¶

looks_like_python(code) checks for Python indicators (def, import, print(, FINAL(, etc.) to filter false positives from bare code fences.

Message Trimming¶

Semantic importance-based trimming of REPL message history. Located in ace_next/rr/message_trimming.py.

When message history exceeds config.max_context_chars, iterations are scored by importance and the lowest-value ones are dropped:

Signal	Score	Rationale
Error indicators (Error, Exception, Traceback, stderr:)	+3.0	Debugging context is high value
Finding indicators (found, pattern, insight, discovered)	+2.0	Analysis progress is valuable
FINAL() in assistant message	+2.0	Near-final attempts are important
ask_llm/llm_query in assistant message	+1.0	Sub-agent calls carry insights
Long output (>500 chars)	+1.0	Substantive output worth keeping
"(no output)" in user message	-1.0	Empty output is low value

Behaviour: - The first message (initial prompt) is always kept. - Dropped iterations are summarised: [N earlier iterations omitted: M error(s), K exploration(s)]. - Kept iterations maintain chronological order.

Guard Logic¶

CheckResultStep implements several guards:

Premature FINAL (Iteration 0)¶

If FINAL() is called on the first iteration, it is rejected. The sandbox is reset and the LLM receives feedback:

"You called FINAL() before exploring the data. Read the actual variables first, then call FINAL() with evidence-based analysis."

FINAL After Error¶

If FINAL() is called but the code execution had an error (result.success == False), it is rejected:

"Your code had an error. Fix the bug and try again. Do NOT call FINAL() until your code executes successfully."

Direct Response Fallback¶

When no code block is extracted, CheckResultStep attempts to parse the LLM response as direct JSON (stripping ```json fences). If valid, it's treated as a FINAL() value. If not, the LLM receives feedback requesting code.

Iteration Progress Header¶

Each feedback message includes an iteration counter: [Iteration N/max]. When approaching the limit (within 2 iterations), an urgency suffix is added: (approaching limit — finalize soon).

Fallback Synthesis¶

When config.enable_fallback_synthesis=True and max_iterations is reached:

A synthesis prompt is appended to the conversation history asking the LLM to call FINAL() with its best assessment.
The response is parsed for code containing FINAL() and executed in a fresh sandbox.
If no code is found, direct JSON parsing is attempted.
If synthesis fails, a basic timeout ReflectorOutput is returned with raw["timeout"] = True.

This is a recovery mechanism — it often salvages partial analysis that would otherwise be lost.

Traces Dict¶

The canonical data structure passed to the sandbox as the traces variable:

{
    "question": str,              # The question/task
    "ground_truth": str | None,   # Expected answer
    "feedback": str | None,       # Environment feedback
    "steps": [                    # Agent execution steps
        {
            "role": "agent",
            "reasoning": str,
            "answer": str,
            "skill_ids": list[str],
        }
    ],
}

rr_trace Output Schema¶

After run_loop() completes, RRStep enriches ReflectorOutput.raw["rr_trace"] with execution metadata:

{
    "iterations": [               # Per-iteration log
        {
            "iteration": int,     # 0-indexed
            "code": str | None,   # Code sent to sandbox
            "stdout": str | None, # Captured stdout
            "stderr": str | None, # Captured stderr
            "terminated": bool,   # Whether FINAL() was accepted
        },
        ...
    ],
    "subagent_calls": [           # Sub-agent call history
        {
            "call_number": int,
            "question": str,
            "context_length": int,
            "response_length": int,
            "mode": str,          # "analysis" or "deep_dive"
        },
        ...
    ],
    "total_iterations": int,
    "timed_out": bool,
}

This structure is consumed by RROpikStep for observability and can be inspected by users for debugging.

RROpikStep¶

Side-effect step for logging RR traces to Opik. Located in ace_next/rr/opik.py.

from ace_next.rr import RROpikStep

# Place after RRStep in the pipeline
steps = [..., rr_step, RROpikStep(project_name="my-project")]

Step Contract¶

requires = frozenset({"reflection"})
provides = frozenset()

Behaviour¶

Reads ctx.reflection.raw["rr_trace"] — the dict populated by RRStep.
Creates one Opik trace per RR invocation with child spans per iteration.
Gracefully degrades to a no-op when Opik is not installed or OPIK_DISABLED=true.
Explicit opt-in only — Opik is never auto-enabled.

Trace Hierarchy¶

rr_reflect (trace)
├── rr_iteration_0 (span)    ← code, stdout, stderr
├── rr_iteration_1 (span)
└── rr_iteration_2 (span)    ← FINAL called here

Environment Variables¶

Variable	Description
`OPIK_API_KEY`	API key for Opik authentication
`OPIK_WORKSPACE`	Opik workspace name
`OPIK_URL_OVERRIDE`	Custom Opik server URL
`OPIK_DISABLED=true`	Disable all Opik tracing
`OPIK_ENABLED=false`	Alternative disable signal

Metadata¶

Parent trace metadata includes:

{
    "total_iterations": int,
    "subagent_call_count": int,   # Only when sub-agent calls exist
    "subagent_calls": list[dict], # Full call history
}

flush()¶

Call flush() after the pipeline finishes to drain buffered traces before the process exits.

Public API¶

All exports from ace_next.rr:

from ace_next.rr import (
    # Core
    RRStep,                  # Main entry point (SubRunner + StepProtocol + ReflectorLike)
    RRConfig,                # Configuration (alias for RecursiveConfig)
    RRIterationContext,      # Per-iteration frozen context
    RROpikStep,              # Opik observability step (lazy-imported)

    # Inner pipeline steps
    LLMCallStep,
    ExtractCodeStep,
    SandboxExecStep,
    CheckResultStep,

    # Sandbox
    TraceSandbox,
    ExecutionResult,
    ExecutionTimeoutError,

    # Sub-agent
    SubAgentLLM,
    SubAgentConfig,
    CallBudget,
    create_ask_llm_function,

    # Trace
    TraceContext,
    TraceStep,
)

RROpikStep is lazy-imported via __getattr__ to avoid pulling in the opik package at module load time.