Background
This section provides a brief overview of the concepts needed to understand the rest of the thesis: Markov decision processes, state abstraction with a focus on cluster partitions, similarity metrics based on bisimulation ideas, the basics of Monte Carlo Tree Search, and the LLM families we evaluate. Each topic is covered at greater depth in the Thesis PDF; what follows is a concise orientation.
Markov Decision Processes. We model each grid‑world environment as an MDP defined by a tuple ⟨S, A, P, R, γ⟩, where S is the set of states (grid tiles), A the set of actions (up, down, left, right), P the transition kernel, R the reward function, and γ the discount factor. Policies map states to actions, and the optimal policy maximizes expected discounted return. In our maps, the agent starts in the top‑left and aims to reach the bottom‑right goal, with obstacles that block movement. The Rust implementation in src/core/game/game_logic.rs provides a deterministic forward model and supports fast enumeration of states and transitions (see also core_rust.generate_mdp).
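To make the setup concrete, the following Python sketch builds such a deterministic grid‑world MDP and enumerates its transition table. It is illustrative only and does not reproduce the thesis's Rust code; the function and variable names (make_grid_mdp, ACTIONS, and so on) are hypothetical.

```python
# Illustrative sketch (not the thesis's Rust implementation): a deterministic
# grid-world MDP with states as (row, col) tiles and four actions.
from itertools import product

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def make_grid_mdp(rows, cols, obstacles, goal, gamma=0.95):
    # S: all tiles that are not obstacles
    states = [s for s in product(range(rows), range(cols)) if s not in obstacles]

    def step(state, action):
        dr, dc = ACTIONS[action]
        nxt = (state[0] + dr, state[1] + dc)
        if nxt not in states:            # walls and obstacles block movement
            nxt = state
        reward = 1.0 if nxt == goal else 0.0
        return nxt, reward

    # Enumerate the full deterministic transition table for all (s, a) pairs.
    transitions = {(s, a): step(s, a) for s in states for a in ACTIONS}
    return states, transitions, gamma

states, transitions, gamma = make_grid_mdp(3, 3, obstacles={(1, 1)}, goal=(2, 2))
```

For the 3×3 map above, `transitions[((0, 0), "down")]` evaluates to `((1, 0), 0.0)`: moving down from the top‑left corner succeeds and yields no reward, since the goal has not been reached.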
Abstraction and homomorphism. A state abstraction groups ground states into abstract states so that behavior is preserved as much as possible. Formally, homomorphism‑based approaches require that equivalent states induce similar successor distributions and rewards under corresponding actions. In simple grids, a lax or approximate bisimulation captures the intuition that symmetric positions relative to obstacles and the goal can be merged. We focus on cluster‑based abstractions—partitions of S—because they align with the stateless simulator and allow us to map abstract states back to ground states without ambiguity.
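As a toy illustration of such a partition (the grouping below is hypothetical, chosen only to show the data structure), a cluster abstraction of the 3×3 grid can be written as a map from ground states to abstract‑state ids, with an inverse map that recovers ground states without ambiguity:

```python
# Hypothetical illustration: a cluster partition of a 3x3 grid with an
# obstacle at (1, 1), merging positions symmetric about the main diagonal.
partition = {
    (0, 0): 0,
    (0, 1): 1, (1, 0): 1,   # symmetric about the diagonal
    (0, 2): 2, (2, 0): 2,
    (1, 2): 3, (2, 1): 3,
    (2, 2): 4,              # the goal forms its own cluster
}

def ground_states(cluster_id, partition):
    """Map an abstract state back to its ground states, without ambiguity."""
    return sorted(s for s, c in partition.items() if c == cluster_id)
```

Because the partition covers S exactly once, `ground_states(4, partition)` returns `[(2, 2)]`: each abstract state decodes to a well‑defined set of grid tiles.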
Similarity metrics. To compare an LLM‑proposed clustering with an ideal one, we use a metric that combines two ingredients (Thesis). The first is the absolute difference in expected rewards per action at the abstract level. The second is a distance between abstract transition distributions, computed using the 1‑Wasserstein (Earth Mover's) distance with positions aligned by abstract‑state indices. We aggregate across actions with a Hausdorff‑style lift, taking the worst case, then transform distance into similarity via 1/(1 + d). The Python implementation appears in llm_abstraction/llm/scoring.py.
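A minimal sketch of the metric's shape follows. It is an assumption‑laden illustration, not the actual code in llm_abstraction/llm/scoring.py, which may combine the terms differently; the helper names are hypothetical. For distributions over indices on a line with unit spacing, the 1‑Wasserstein distance reduces to the sum of absolute CDF differences.

```python
# Illustrative sketch of the composite similarity metric (hypothetical
# helpers; the real implementation is in llm_abstraction/llm/scoring.py).

def w1_indexed(p, q):
    """1-Wasserstein distance between two distributions over abstract-state
    indices 0..n-1: sum of absolute CDF differences (unit spacing)."""
    cdf_p = cdf_q = 0.0
    dist = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        dist += abs(cdf_p - cdf_q)
    return dist

def similarity(rewards_a, rewards_b, trans_a, trans_b):
    """rewards_*: {action: expected reward at the abstract level};
    trans_*: {action: probability vector over abstract-state indices}."""
    worst = 0.0
    for action in rewards_a:
        d = (abs(rewards_a[action] - rewards_b[action])
             + w1_indexed(trans_a[action], trans_b[action]))
        worst = max(worst, d)        # Hausdorff-style lift: worst-case action
    return 1.0 / (1.0 + worst)
```

Two identical abstractions give similarity 1; for example, a reward gap of 1 on some action with matching transition distributions gives d = 1 and similarity 1/(1 + 1) = 0.5.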
Monte Carlo Tree Search. MCTS alternates four phases: selection, expansion, simulation, and backpropagation. Abstraction can reduce the branching factor by representing multiple ground states with a single abstract node, leading to fewer expansions and more focused simulations. Our Rust Runner toggles between abstract and ground decisions at every step, which lets the tree operate over cluster identifiers while simulations still happen in the concrete environment (src/core/runner.rs).
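The selection phase over abstract nodes can be sketched with the standard UCT rule. This is an illustrative Python fragment, not the thesis's Rust Runner in src/core/runner.rs; the node representation is a hypothetical simplification.

```python
# Illustrative UCT selection over abstract nodes (hypothetical node layout;
# the thesis's Runner is implemented in Rust in src/core/runner.rs).
import math

def uct_select(children, c=math.sqrt(2)):
    """children: list of {'visits': int, 'value': float} statistics, one per
    abstract child node (i.e., per cluster identifier)."""
    total = sum(ch["visits"] for ch in children)

    def uct(ch):
        if ch["visits"] == 0:
            return math.inf              # expand unvisited children first
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(total) / ch["visits"])
        return exploit + explore

    return max(children, key=uct)
```

Unvisited abstract children score infinitely high, so each cluster is expanded at least once before exploitation takes over; afterwards the exploration term shrinks as a cluster accumulates visits.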
Models evaluated. The experiments compare LLaMA and Deepseek‑R1 families at several sizes under local inference via Ollama. Across runs, Deepseek‑R1 variants tend to outperform LLaMA on the composite metric in these domains, with prompt structure often outweighing raw model size. We return to the details in Thesis → Experiments & Results.