Data Flow
This page walks through the end-to-end pipeline from maps to outputs and points to the functions that implement each stage. Figure 2 provides a sequence diagram of the interactions between components in a typical `benchmark-llm` run.
Stages
1. Map generation. We parse maps from `config.yml` and render PNGs for quick inspection.
2. Prompt assembly. `llm.prompts.generate_prompts` composes the instruction, contexts, representation, and output into a single prompt, inserting a JSON or text representation created by `core_rust.generate_representations_py` (stages 2–4 are wired together in the first sketch after this list).
3. LLM query. `llm.ollama.query_llm` calls the local model through Ollama, performs rejection sampling, and reprompts to extract JSON-only clusterings.
4. Parsing and validation. `llm.clean.clean_with_regex_and_validate` converts responses into cluster lists, validates coverage, and deduplicates states.
5. Scoring. We score each candidate clustering with `llm.scoring.bisimulation_similarity` against the ideal abstraction returned by `core_rust.generate_mdp`, using the transition (`T`) and reward (`R`) matrices.
6. Selection. We choose the best-scoring abstraction for each map (see the scoring and selection sketch after this list).
7. MCTS evaluation. `evaluation.mcts_llm_evaluation` runs ground, ideal, and LLM-abstraction agents across a grid of budgets and depths and writes results to `outputs/` (the grid construction is sketched in the last example after this list).
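Stages 2–4 chain directly into one another. The sketch below wires them together using the call shapes from the sequence diagram in Figure 2; the wrapper function `propose_clusterings` and any argument names beyond those shown in the diagram are illustrative assumptions, not the actual CLI code.

```python
# Illustrative sketch of stages 2-4 (prompt assembly, LLM query, cleaning).
# Call shapes follow the sequence diagram; the wrapper itself is an assumption.
from llm.prompts import generate_prompts
from llm.ollama import query_llm
from llm.clean import clean_with_regex_and_validate


def propose_clusterings(world, compositions, prompts, num_states, runs, model):
    # Stage 2: compose instruction, contexts, representation, and output
    # into a single prompt for this world.
    prompt = generate_prompts(compositions, prompts, world)

    # Stage 3: query the local model through Ollama; query_llm handles
    # rejection sampling and reprompting and returns raw text responses.
    raw_responses = query_llm(prompt, runs, model, num_states)

    # Stage 4: extract cluster lists, validate that every state is covered,
    # and deduplicate states.
    clusters = clean_with_regex_and_validate(raw_responses, num_states)
    return clusters
```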
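Scoring and selection (stages 5–6) then reduce to comparing each candidate clustering against the ideal abstraction and keeping the best one per map. This is a minimal sketch assuming the return shape shown in the diagram (`T`, `R`, `ideal_abstraction`) and that higher similarity is better; the helper `select_best` is hypothetical.

```python
# Illustrative sketch of stages 5-6 (scoring and selection); select_best is
# hypothetical and only shows the intended control flow.
from llm.scoring import bisimulation_similarity
from core_rust import generate_mdp


def select_best(world, candidate_clusterings):
    # Stage 5: build the ground MDP and the ideal abstraction for this map.
    T, R, ideal_abstraction = generate_mdp(world)

    # Score every candidate clustering against the ideal abstraction.
    scored = [
        (bisimulation_similarity(clusters, ideal_abstraction, T, R), clusters)
        for clusters in candidate_clusterings
    ]

    # Stage 6: keep the best-scoring abstraction (higher is better).
    best_score, best_clusters = max(scored, key=lambda pair: pair[0])
    return best_score, best_clusters
```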
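Stage 7 sweeps a grid of search budgets and depths. A rough sketch of how such a grid might be assembled and handed to `evaluation.mcts_llm_evaluation` follows; the config keys mirror the `PyRunner.run` parameters named in the diagram, while the concrete values, the `run_grid_evaluation` wrapper, and any further keyword arguments elided in the diagram are assumptions.

```python
# Illustrative sketch of stage 7 (MCTS evaluation over a budget x depth grid).
# The config keys echo PyRunner.run(sim_limit, sim_depth, c, gamma, ...);
# the values and the wrapper function are assumptions.
from itertools import product

from evaluation import mcts_llm_evaluation


def run_grid_evaluation(world):
    budgets = [50, 100, 500, 1000]  # simulation limits (sim_limit), assumed values
    depths = [5, 10, 25]            # rollout depths (sim_depth), assumed values

    runner_configs = [
        {"sim_limit": sim_limit, "sim_depth": sim_depth, "c": 1.0, "gamma": 0.95}
        for sim_limit, sim_depth in product(budgets, depths)
    ]

    # Runs ground, ideal, and LLM-abstraction agents for every config and
    # writes CSVs/plots under outputs/ (additional kwargs elided in the diagram).
    mcts_llm_evaluation(world, runner_configs)
```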
```mermaid
sequenceDiagram
    participant User
    participant CLI as CLI (main.py)
    participant PB as generate_prompts
    participant LLM as query_llm (Ollama)
    participant CLEAN as clean_with_regex_and_validate
    participant SCORE as bisimulation_similarity
    participant RUN as mcts_evaluation / mcts_llm_evaluation
    participant RUST as core_rust (PyRunner, generate_mdp)
    User->>CLI: run benchmark-llm
    CLI->>PB: generate_prompts(compositions, prompts, world)
    PB-->>CLI: prompt
    CLI->>LLM: query_llm(prompt, runs, model, num_states)
    LLM-->>CLI: raw_responses
    CLI->>CLEAN: clean_with_regex_and_validate(raw_responses, num_states)
    CLEAN-->>CLI: clusters
    CLI->>RUST: generate_mdp(world)
    RUST-->>CLI: T, R, ideal_abstraction
    CLI->>SCORE: bisimulation_similarity(clusters, ideal_abstraction, T, R)
    SCORE-->>CLI: similarity
    CLI->>RUN: mcts_llm_evaluation(world, runner_configs,...)
    RUN->>RUST: PyRunner.run(sim_limit, sim_depth, c, gamma,...)
    RUST-->>RUN: results
    RUN-->>CLI: CSVs/plots in outputs/
```
Figure 2: Sequence diagram of the `benchmark-llm` command from prompt generation through MCTS evaluation and artifact writing.