
Methodology

This section describes how abstractions are produced, scored, and used in planning. We first describe the abstraction type, then the prompt‑extraction pipeline, the planning setup, and finally the two evaluation metrics.

Abstraction type. We represent an abstraction as a partition of ground states into clusters. Each abstract state corresponds to a set of grid cells, and the simulator can deterministically map an abstract state–action pair back to an equivalent ground action in the current context. This choice keeps the environment stateless and composable, avoids introducing new parameters, and makes abstractions straightforward to visualize and validate. The ideal abstraction is derived from the Rust core’s homomorphism routine (see src/core/abstraction/homomorphism.rs via core_rust.generate_mdp).
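
To make the partition format concrete, here is a minimal sketch of what an abstraction looks like as data, together with a coverage check. The field names and the (row, col) cell encoding are illustrative assumptions, not the exact schema used by core_rust.

```python
# Hypothetical illustration of the abstraction format: a partition of ground
# grid cells into clusters. Field names and cell encoding are illustrative.
abstraction = [
    {"id": 0, "cells": [(0, 0), (0, 1), (1, 0), (1, 1)]},  # e.g. an open room
    {"id": 1, "cells": [(0, 2), (1, 2)]},                  # e.g. a corridor
]

def is_partition(clusters, all_cells):
    """Check that the clusters cover every ground cell exactly once."""
    covered = [cell for c in clusters for cell in c["cells"]]
    return sorted(covered) == sorted(all_cells) and len(covered) == len(set(covered))
```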

Prompt‑extraction pipeline. The Python layer composes prompts from a small library of fragments defined in config_prompts.yml and selected by compositions in config.yml. Each composition specifies an instruction, a required context, optional background contexts, a representation key (text, json, or adj), and an output specification. Given a world grid, llm_abstraction.llm.prompts.generate_prompts inserts the chosen representation, produced by core_rust.generate_representations_py, into the prompt. We query models through Ollama using llm_abstraction.llm.ollama.query_llm, which performs lightweight rejection sampling and a reprompt step to obtain a clean JSON list of clusters. Finally, llm_abstraction.llm.clean.clean_with_regex_and_validate extracts, validates, and deduplicates clusters, ensuring that all states are covered.
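
The three stages chain together roughly as in the sketch below. The exact call signatures, argument names, and the model name are assumptions made for illustration; consult the listed modules for the real interfaces.

```python
# Sketch of the extraction pipeline (assumed signatures, for illustration only).
from llm_abstraction.llm.prompts import generate_prompts
from llm_abstraction.llm.ollama import query_llm
from llm_abstraction.llm.clean import clean_with_regex_and_validate

# 1. Build prompts for one world grid using a composition from config.yml
#    (representation key "text", "json", or "adj").
prompts = generate_prompts(world="example_world", composition="baseline_text")  # assumed signature

# 2. Query the model through Ollama; rejection sampling and reprompting
#    happen inside query_llm.
raw_response = query_llm(prompts[0], model="llama3")  # assumed signature and model name

# 3. Extract, validate, and deduplicate the JSON list of clusters.
clusters = clean_with_regex_and_validate(raw_response)  # assumed signature
```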

Planning with abstractions. To assess planning utility, we compare three agents using the same budgets and parameters: a ground agent, an ideal‑abstraction agent, and an LLM‑abstraction agent. The evaluation code in llm_abstraction/evaluation/mcts.py constructs a PyRunner (exposed in core_rust) and runs sweeps over simulation limits and depths. The runner maps abstract choices back to ground actions on the fly, so the tree search happens in the abstract space while simulations remain faithful to the concrete MDP. All experiments write maps, CSVs, and plots to outputs/ to support reproducibility.
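
The evaluation sweep has a simple shape: construct a runner for a given abstraction, then iterate over simulation budgets and rollout depths. The constructor and method names below are assumptions standing in for the interface used in llm_abstraction/evaluation/mcts.py.

```python
# Sketch of the planning sweep (assumed PyRunner interface, for illustration only).
import core_rust

runner = core_rust.PyRunner(world="example_world", abstraction=clusters)  # assumed constructor

results = []
for n_sims in (50, 100, 500):      # simulation budgets
    for depth in (5, 10, 20):      # rollout depths
        ret = runner.run_mcts(n_simulations=n_sims, max_depth=depth)  # assumed method
        results.append({"sims": n_sims, "depth": depth, "return": ret})
```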

Model‑based similarity. We compute a similarity score that combines differences in abstract rewards with differences in abstract transition distributions, where transitions are compared using the 1‑Wasserstein distance and aggregated over actions with a Hausdorff‑style maximum. The score is 1/(1 + d), where d is the combined distance, so identical abstractions score 1.0. The implementation is in llm_abstraction/llm/scoring.py and requires transition and reward matrices from core_rust.generate_mdp.
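
The following sketch illustrates one plausible reading of this score under two simplifying assumptions: both abstractions share the same abstract state set, and abstract-state indices serve as the 1‑D support for the Wasserstein distance. The exact weighting of reward and transition terms in the real metric lives in llm_abstraction/llm/scoring.py and may differ.

```python
# Minimal illustration of the similarity score s = 1 / (1 + d).
# Assumptions: matched abstract state sets; state indices as 1-D Wasserstein support.
import numpy as np
from scipy.stats import wasserstein_distance

def similarity(T_a, R_a, T_b, R_b):
    """T_*: (actions, states, states) transition tensors; R_*: (actions, states) rewards."""
    n_actions, n_states, _ = T_a.shape
    support = np.arange(n_states)
    per_action = []
    for a in range(n_actions):
        reward_gap = np.abs(R_a[a] - R_b[a]).max()
        transition_gap = max(
            wasserstein_distance(support, support, T_a[a, s], T_b[a, s])
            for s in range(n_states)
        )
        per_action.append(reward_gap + transition_gap)
    d = max(per_action)           # Hausdorff-style maximum over actions
    return 1.0 / (1.0 + d)        # identical abstractions give d = 0 and score 1.0
```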

Performance metric. We run MCTS with a grid of simulation budgets and rollout depths and record average returns (and error estimates) for each agent. The Rust helper core_rust.max_returns provides a reference line for perfect planning under the chosen γ. We report both the absolute return and the gain relative to the ground agent.
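
A short sketch of how the reported quantities could be assembled from per-seed returns. The numbers are placeholders and the core_rust.max_returns call signature is an assumption; it is shown only because its value serves as the horizontal reference line in the plots.

```python
# Sketch of the performance summary: mean return, standard error, and gain
# relative to the ground agent. Placeholder data; assumed max_returns signature.
import numpy as np
import core_rust

returns = {
    "ground": np.array([3.1, 2.9, 3.3]),   # per-seed returns (illustrative values)
    "ideal":  np.array([3.6, 3.4, 3.7]),
    "llm":    np.array([3.2, 3.0, 3.5]),
}

reference = core_rust.max_returns(world="example_world", gamma=0.95)  # assumed signature

summary = {
    name: {
        "mean": vals.mean(),
        "sem": vals.std(ddof=1) / np.sqrt(len(vals)),
        "gain_vs_ground": vals.mean() - returns["ground"].mean(),
    }
    for name, vals in returns.items()
}
```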

Composite score. To compare across models, prompts, and maps, we transform the model‑based scores and planning gains into standardized z‑values and compute a composite statistic. This combination highlights configurations that balance structural faithfulness with planning utility, and it underlies the rankings shown in Thesis → Experiments & Results.
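
One plausible way to form such a composite is to standardize each metric across configurations and average the resulting z-values, as sketched below. The equal-weight average is an assumption made for illustration; the combination used for the reported rankings may differ.

```python
# Illustrative composite: z-standardize each metric across configurations,
# then average. Equal weighting is an assumption, not the confirmed formula.
import numpy as np

def composite(model_scores, planning_gains):
    """Both arguments: 1-D arrays, one entry per (model, prompt, map) configuration."""
    def z(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std(ddof=1)
    return (z(model_scores) + z(planning_gains)) / 2.0
```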