Inference-Time Control for Trustworthy Large Language Models

Introduction

Once a large language model is released, updating its behavior is often slow, costly, or impossible. Training-time alignment encodes safety, privacy, fairness, and factuality objectives into the parameters themselves, but deployment introduces context-specific risks that cannot all be anticipated in advance: changing policies, user-specific constraints, retrieval contamination, adversarial prompting, and jurisdiction-dependent requirements. Inference-time methods address this gap. They regulate model behavior at runtime, after training has concluded, by acting on the generation pipeline rather than on the parameters.

This work reviews the field as a single coherent control plane. Drawing on 230+ papers published between 2020 and 2026, we organize seven categories of inference-time methods into three tiers—External Controls (Context Engineering, Guardrails, Decoding Strategies), Internal Manipulations (Representation Engineering, Unlearning, Pruning), and System-Level Orchestration (Multi-Agent Systems)—and analyze each across four trustworthiness dimensions: safety, privacy, fairness, and factuality. Figure 1 shows the taxonomy and the points at which each tier attaches to the generation pipeline.

Figure 1. Three-tier taxonomy and pipeline attachment points for inference-time control of trustworthy LLMs: taxonomy (top) and attachment points in an inference-time pipeline (bottom). The three tier colors, cyan (Tier 1, black-box), lavender (Tier 2, white-box), and green (Tier 3, system-level), encode the level of model access each tier assumes.

The companion repository at github.com/leopoldwhite/Awesome-Inference-Time-Trustworthiness indexes more than 230 papers organized by these seven categories, with arXiv links and code references where available. The remainder of this page summarizes each tier with the corresponding figures from the paper, the meta-axis evaluation framework we adopt, a comparison with related work, and the open challenges we identify.

Tier 1: External Controls

External controls treat the model as a black box. They shape behavior by modifying inputs, the decoding process, or outputs, without accessing or modifying internal weights or activations. These methods are the most modular and widely applicable: they require no white-box access and can be deployed on proprietary, API-only models. In the pipeline they attach to the context-assembly, input/output policy check, and decoding stages.

The most modular and widely applicable layer—but also the easiest to bypass once an adversary adapts.

Context Engineering

Strategic prompt design, retrieval, and context shaping. Methods in this category guide outputs through rules, instructions, exemplars, retrieval-augmented context, or memory windows—without changing model parameters. The 32 papers we cover include prompt-based safety (e.g., system-prompt jailbreak defenses), retrieval-augmented factuality (Self-RAG, Chain-of-Verification), in-context fairness mitigation, and privacy-preserving inference (InferDPT).
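
As a minimal sketch of context shaping, the snippet below assembles a single prompt from a policy instruction, retrieved evidence, and the user question, under a character budget. The `Passage` type, the bracketed section labels, and the `max_chars` budget are illustrative assumptions, not an interface taken from any of the cited methods.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # hypothetical provenance label, e.g. a document ID
    text: str


def assemble_context(policy: str, passages: list[Passage], question: str,
                     max_chars: int = 2000) -> str:
    """Build one prompt string: policy rules, then evidence, then the question."""
    parts = [f"[POLICY]\n{policy}"]
    used = 0
    for i, p in enumerate(passages, start=1):
        if used + len(p.text) > max_chars:
            break  # stop adding evidence once the budget would be exceeded
        parts.append(f"[EVIDENCE {i} | {p.source}]\n{p.text}")
        used += len(p.text)
    parts.append(f"[QUESTION]\n{question}")
    return "\n\n".join(parts)


prompt = assemble_context(
    policy="Answer only from the evidence. If it is insufficient, say so.",
    passages=[
        Passage("doc-1", "Paris is the capital of France."),
        Passage("doc-2", "x" * 5000),  # exceeds the budget, so it is dropped
    ],
    question="What is the capital of France?",
)
```

Model parameters are untouched here; only the text the model sees changes, which is what makes this category applicable to API-only models.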

Guardrails

External modules that inspect inputs or outputs against safety or policy constraints, blocking, redacting, or regenerating content when violations occur. Figure 2 organizes the design space into four mechanism types: rule-based (NeMo Guardrails, Llama Firewall), model-based classifiers (Perspective, ToxiChat), LLM-based judges—split between discriminative (LlamaGuard, ShieldGemma) and generative reasoning judges (GuardReasoner, R²-Guard)—and hybrid defense-in-depth pipelines that compose the previous three.

Figure 2. A typology of guardrail mechanisms: rule-based, model-based, LLM-based (discriminative and generative), and hybrid. Rule-based and model-based filters are fast but brittle; LLM-based judges are flexible but introduce latency; hybrid pipelines combine them in a defense-in-depth arrangement.
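
A hybrid, defense-in-depth pipeline can be sketched as a list of stages run from cheapest to most expensive, stopping at the first violation. Everything below is a toy under stated assumptions: the SSN regex stands in for a rule-based filter, the keyword score stands in for a trained classifier, and the function names are hypothetical rather than taken from any guardrail library.

```python
import re
from typing import Callable, Optional

# Hypothetical rule: flag anything shaped like a US social security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

Stage = Callable[[str], Optional[str]]


def rule_filter(text: str) -> Optional[str]:
    """Fast rule-based stage: returns a reason on violation, else None."""
    return "rule:ssn" if SSN_PATTERN.search(text) else None


def classifier_filter(text: str, threshold: float = 0.5) -> Optional[str]:
    """Stand-in for a model-based classifier; the keyword score is a toy."""
    toxic_words = {"idiot", "stupid"}
    tokens = text.lower().split()
    hits = sum(t.strip(".,!?") in toxic_words for t in tokens)
    score = hits / max(len(tokens), 1)
    return "classifier:toxicity" if score >= threshold else None


def guard(text: str, stages: list[Stage]) -> tuple[bool, Optional[str]]:
    """Run stages in order and stop at the first one that reports a violation."""
    for stage in stages:
        reason = stage(text)
        if reason is not None:
            return False, reason
    return True, None


stages = [rule_filter, classifier_filter]
blocked_by_rule = guard("My number is 123-45-6789", stages)
blocked_by_model = guard("You stupid idiot", stages)
allowed = guard("What is the weather today?", stages)
```

An LLM-based judge would slot in as a third callable with the same signature; placing it last means its latency is paid only for inputs the cheaper stages let through.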

Decoding Strategies

Manipulation of token-level distributions during generation to promote desired attributes or suppress undesired ones. Figure 3 plots the design axes (contrastive decoding via prompt- or model-pair contrast, dynamic / context-aware decoding, and reference-based decoding) against four trustworthiness targets, with the central pipeline showing how logits are reweighted, combined, and routed to safe or contrastive experts.

Figure 3. Decoding-time control axes for trustworthy generation (contrastive, dynamic, reference-based, and safety-signal decoding) and the central logit-manipulation pipeline.
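
To make the model-pair contrast concrete, the sketch below scores each candidate token by the log-probability under a preferred ("expert") distribution minus a weighted log-probability under a less trusted ("amateur") one, and masks tokens the expert itself considers implausible. This follows the general shape of contrastive decoding; the three-token vocabulary, the logit values, and the `alpha` and `plausibility` parameters are assumptions chosen for illustration.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(expert_logits: list[float], amateur_logits: list[float],
                       alpha: float = 1.0, plausibility: float = 0.1) -> list[float]:
    """Score = expert log-prob - alpha * amateur log-prob.

    Tokens whose expert probability is below plausibility * (max expert
    probability) are masked to -inf so the contrast cannot promote them.
    """
    lp_e = log_softmax(expert_logits)
    lp_a = log_softmax(amateur_logits)
    cutoff = max(lp_e) + math.log(plausibility)
    return [e - alpha * a if e >= cutoff else float("-inf")
            for e, a in zip(lp_e, lp_a)]


def greedy_pick(scores: list[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


vocab = ["safe", "harmful", "rare"]   # toy vocabulary
expert = [2.0, 2.2, -5.0]             # slightly prefers "harmful" on its own
amateur = [0.0, 3.0, 0.0]             # strongly prefers "harmful"

scores = contrastive_scores(expert, amateur)
plain_choice = vocab[greedy_pick(expert)]
contrast_choice = vocab[greedy_pick(scores)]
```

In this toy case greedy decoding on the expert alone selects "harmful", while the contrast selects "safe" because the amateur's strong preference for "harmful" is subtracted out.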

Tier 2: Internal Manipulations

Internal manipulations require white-box access. They intervene directly in the model's computation—modifying activations during a forward pass, suppressing targeted behavior or knowledge, or removing architectural components. Compared with external controls they offer finer-grained, more persistent behavioral changes, at the cost of needing access to representations or weights.

Finer behavioral control at the cost of requiring access to weights and activations, and a verification problem we cannot yet solve.

Three categories share this tier (Figure 4): Representation Engineering shifts hidden states along pre-computed concept vectors, treating activations as controllable objects; Inference-Time Unlearning suppresses target knowledge or behavior at generation time through attention masking, activation gating, or in-context refusal triggers; Structural Pruning removes weights, neurons, or attention heads associated with unsafe or biased behavior, exposing the alignment fragility of the underlying network.

Figure 4. Three internal-manipulation paradigms. Representation Engineering shifts activations along concept directions; Inference-Time Unlearning suppresses target knowledge; Structural Pruning removes the components that support unwanted behavior.
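
The core operation of representation engineering, shifting a hidden state along a pre-computed concept vector, can be written as h' = h + alpha * v. The sketch below derives v as a normalized difference of class means, which is one common choice and an assumption here; the two-dimensional activations and the function names are purely illustrative, and a real system would apply the shift inside the forward pass of a chosen layer.

```python
def mean_vector(rows: list[list[float]]) -> list[float]:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(positive: list[list[float]],
                      negative: list[list[float]]) -> list[float]:
    """Unit vector from the negative-class mean to the positive-class mean.

    Assumes the two means differ; otherwise the norm is zero.
    """
    diff = [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def steer(hidden: list[float], direction: list[float], alpha: float) -> list[float]:
    """Shift a hidden state along the concept direction: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, direction)]


# Toy activations: examples exhibiting the concept vs. examples that do not.
with_concept = [[1.0, 0.0], [3.0, 0.0]]
without_concept = [[0.0, 0.0], [0.0, 0.0]]

v = concept_direction(with_concept, without_concept)
steered = steer([0.5, 0.5], v, alpha=2.0)
```

A negative `alpha` moves the state away from the concept instead of toward it, which is how the same vector can be used to suppress a behavior.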

Tier 3: System-Level Orchestration

At the system level, trustworthiness becomes an emergent property of structured interaction among multiple agents rather than a property of any single model. Multi-agent systems coordinate LLMs through debate, cross-verification, role specialization, and iterative self-correction. The tool-use / agents loop spans context assembly, generation, and output checking, creating feedback cycles that enable collective reliability.

Trustworthiness as an emergent property of structured interaction, not a property of any single model.

Figure 5 cuts the design space by trustworthiness dimension: safety via role-diverse defenses and adversarial red-teaming (AutoDefense, PeerGuard, RedDebate, MASTER); factuality and faithfulness via structured debate with cross-verification and calibrated uncertainty (CFMAD, CausalGPT, COOPERATE/COMPETE); fairness via centralized debiasing with self-reflection (MSR, Multi-LLM); and privacy via specialized agents (Extractor, Executor, Checker; MASLEAK).

Figure 5. Multi-agent architectures for LLM trustworthiness across the four dimensions: safety, factuality, fairness, and privacy.
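
The feedback cycle between generation and output checking can be sketched as a proposer and a checker that iterate until the checker raises no objection or a round limit is reached. The agents below are plain Python stubs standing in for LLM calls, and the names `orchestrate`, `proposer`, and `checker` are hypothetical; none of the systems named above is being reproduced.

```python
from typing import Callable, Optional

# A proposer maps (question, feedback) to a draft answer.
Proposer = Callable[[str, Optional[str]], str]
# A checker maps (question, draft) to an objection, or None if it accepts.
Checker = Callable[[str, str], Optional[str]]


def orchestrate(question: str, proposer: Proposer, checker: Checker,
                max_rounds: int = 3) -> tuple[Optional[str], int]:
    """Iterate propose -> check until accepted; abstain (None) after max_rounds."""
    feedback: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        draft = proposer(question, feedback)
        feedback = checker(question, draft)
        if feedback is None:
            return draft, round_idx
    return None, max_rounds


def stub_proposer(question: str, feedback: Optional[str]) -> str:
    # First attempt is wrong; the revision after feedback is correct.
    return "Lyon" if feedback is None else "Paris"


def stub_checker(question: str, draft: str) -> Optional[str]:
    # Stand-in for cross-verification against a trusted reference.
    return None if draft == "Paris" else "Draft conflicts with the reference."


def stubborn_proposer(question: str, feedback: Optional[str]) -> str:
    return "Lyon"  # never revises, so the loop should abstain


accepted = orchestrate("Capital of France?", stub_proposer, stub_checker)
abstained = orchestrate("Capital of France?", stubborn_proposer, stub_checker)
```

Returning `None` after the round limit makes abstention explicit, so a caller can escalate rather than emit an unverified answer.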

Evaluation

Inference-time methods differ greatly in where and how they intervene: in prompts, in logits, in activations, in external filters, or in the surrounding workflow. As a result, evaluation cannot be unified by one benchmark or one metric family. We adopt a meta-axis evaluation view (Figure 6): rows are the four trustworthiness dimensions (Safety, Privacy, Fairness, Factuality), columns are five complementary evaluation axes (Effectiveness / Behavioral Accuracy, Locality / Utility Preservation, Generality / Robustness / Adaptivity, Interpretability / Transparency, Efficiency / Latency / Cost), and each cell lists representative metrics specific to the dimension–axis intersection.
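
One way to work with this view in practice is as a 4 x 5 grid keyed by (dimension, axis), which makes coverage gaps easy to list. The sketch below builds the empty grid from the row and column names given above; the cells are deliberately left empty because the representative metrics live in Figure 6, and the single registered metric is a labeled placeholder, not a value from the paper.

```python
from itertools import product

DIMENSIONS = ["Safety", "Privacy", "Fairness", "Factuality"]
AXES = [
    "Effectiveness / Behavioral Accuracy",
    "Locality / Utility Preservation",
    "Generality / Robustness / Adaptivity",
    "Interpretability / Transparency",
    "Efficiency / Latency / Cost",
]

# One list of metric names per (dimension, axis) cell, initially empty.
grid: dict[tuple[str, str], list[str]] = {
    cell: [] for cell in product(DIMENSIONS, AXES)
}


def register_metric(dimension: str, axis: str, metric: str) -> None:
    """Attach a metric name to one cell; unknown cells are rejected."""
    if (dimension, axis) not in grid:
        raise KeyError(f"unknown cell: {(dimension, axis)}")
    grid[(dimension, axis)].append(metric)


def uncovered() -> list[tuple[str, str]]:
    """Cells for which no metric has been registered yet."""
    return [cell for cell, metrics in grid.items() if not metrics]


# Placeholder entry for illustration only.
register_metric("Safety", "Efficiency / Latency / Cost",
                "added latency per request (placeholder)")
```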

Related Work

Prior work organizes the trustworthy-LLM literature by harm type, by a single solution family, or by efficiency considerations. To our knowledge, no prior work unifies inference-time methods as a control plane across all four dimensions and all seven categories considered here. The matrix below mirrors Table 1 of the paper. Survey names are anonymized while the paper is under review.

Work	Trustworthiness Dimensions				Method Categories
Work	Safety	Privacy	Fairness	Factuality	Context Eng.	Guardrails	Decoding	Repr. Eng.	Unlearning	Pruning	Multi-Agent
Ours (this work)	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
Inference Efficiency Survey	✕	✕	✕	✕	✕	✕	✓	✕	✕	✓	✕
Post-Training Capability Survey #1	✓	✓	✕	✓	✕	✓	✓	✓	✕	✓	✓
Post-Training Capability Survey #2	✓	✕	✓	✓	✕	✕	✓	✓	✕	✓	✓
Trustworthy LLM Survey #1	✓	✓	✓	✓	✓	✓	✓	✓	✕	✓	✕
Trustworthy LLM Survey #2	✓	✓	✓	✓	✓	✓	✕	✓	✓	✕	✕
Trustworthy Text-to-Image Survey	✓	✕	✓	✓	✓	✕	✕	✓	✕	✕	✕
Trustworthy LLM Agents Survey	✓	✓	✓	✓	✓	✕	✕	✕	✕	✕	✓
Unlearning Survey #1	✓	✓	✕	✕	✓	✓	✓	✓	✓	✕	✕
Unlearning Survey #2	✓	✓	✓	✓	✓	✕	✕	✓	✓	✓	✕

Open Challenges

Inference-time methods make trustworthiness modular and updatable, but each tier still leaves open problems. The paper discusses these and outlines directions for future work; we group them into four cross-cutting themes.

The frontier is not the next defense—it is making any defense verifiable, composable, and stable under adversaries we have not yet seen.

Brittleness under adaptive adversaries. Static guardrails and decoding constraints fail against adaptive jailbreaks; representation steering and unlearning can be reversed by adversarial probing. Robust deployment will require periodically re-evaluated, adaptive defenses rather than fixed thresholds.
The control–utility tradeoff. Stricter inference-time controls reduce helpfulness, fluency, or accuracy. Standardized benchmarks that explicitly trace the trust–utility Pareto frontier—rather than reporting a single operating point—are still missing.
Verification of removal. Unlearning, pruning, and editing can suppress information without truly erasing it, leaving models vulnerable to recovery under adversarial probing. The lack of formal removal guarantees blocks deployment in privacy-critical and regulatory settings.
Composition and emergent risk. How interventions interact when layered together is poorly understood. Multi-agent orchestration introduces novel failure modes such as group-think, cascading hallucinations, and amplified jailbreaks via the very debate channels intended to improve reliability.
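
Tracing a trust–utility Pareto frontier, as opposed to reporting one operating point, amounts to keeping every configuration that no other configuration beats on both axes. The sketch below shows that filter; the configuration names and scores are made up for illustration and are not results from the paper.

```python
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """Return names of points not dominated by any other point.

    Each point is (name, trust, utility), with higher being better on both.
    A point is dominated if another is at least as good on both axes and
    strictly better on at least one.
    """
    frontier = []
    for name, t, u in points:
        dominated = any(
            t2 >= t and u2 >= u and (t2 > t or u2 > u)
            for _, t2, u2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical operating points: (configuration, trust score, utility score).
operating_points = [
    ("no_guard", 0.40, 0.95),
    ("lenient", 0.70, 0.90),
    ("strict", 0.95, 0.60),
    ("misconfigured", 0.60, 0.70),  # worse than "lenient" on both axes
]

frontier = pareto_frontier(operating_points)
```

Here "misconfigured" drops out because "lenient" is better on both axes, while the other three remain as distinct tradeoffs a benchmark would need to report.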

🔔 News

2026-05 Paper available on preprints.org (DOI: 10.20944/preprints202605.1041.v1).

📚 Cite

BibTeX entry for the preprint version (DOI: 10.20944/preprints202605.1041.v1).

@article{bai2026inferencetime,
  title     = {Inference-Time Control for Trustworthy Large Language Models},
  author    = {Bai, Yuyang and Liu, Zheyuan and Yan, Han and Xu, Zhangchen and Wan, Yixin and Chen, Canyu and Wang, Zehong and Yuan, Xiangchi and Huang, Yue and Dou, Guangyao and Zhang, Yuji and Zhu, Hangxiao and Li, Zhuofeng and Li, Manling and Zhang, Xiangliang and Bansal, Mohit and Koyejo, Sanmi and Chang, Kai-Wei and Zhang, Yu and Jiang, Meng},
  journal   = {Preprints},
  year      = {2026},
  month     = {May},
  publisher = {Preprints},
  doi       = {10.20944/preprints202605.1041.v1},
  url       = {https://doi.org/10.20944/preprints202605.1041.v1}
}

📁 Repository

The companion repository hosts the full paper list (over 230 entries across the seven method categories), with arXiv links and code references where available. We welcome issues and pull requests for omissions.

