AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

A lightweight, taxonomy-guided framework for real-time safety monitoring and alignment of autonomous AI agents.

How can we build a lightweight, scalable diagnostic model that accurately detects and classifies unsafe agentic behaviors across diverse execution environments?

Modern autonomous agents like OpenClaw interact with complex, open-world environments, creating diverse safety risks that static, prompt-level guardrails cannot detect. The authors introduce AgentDoG 1.5: a lightweight guardrail model trained on a taxonomy-guided data engine that uses influence-function purification to distill high-quality safety signals from minimal data. This framework achieves performance comparable to frontier models while reducing deployment overhead in simulated environments by two orders of magnitude.

Paper Primer

The core move is a three-dimensional safety taxonomy that decomposes agent trajectories into risk source, failure mode, and real-world harm. By keeping these high-level dimensions fixed while customizing leaf categories for specific execution settings (like Codex or OpenClaw), the framework maintains cross-setting diagnostic comparability without needing to rebuild the guardrail from scratch.

AgentDoG 1.5 achieves state-of-the-art safety moderation performance across diverse interactive agentic scenarios.

Evaluations on the R-Judge and ATBench family benchmarks show performance comparable to frontier closed-source models like GPT-5.4 and Gemini-3.1-Pro. The model requires only ~1k training samples to reach this performance level.

Why is a new taxonomy necessary for agent safety?

Existing flat label spaces conflate the origin of a risk, the failure mode, and the resulting harm. The AgentDoG 1.5 taxonomy separates these dimensions to enable interpretable, multi-faceted diagnosis rather than a simple binary safe/unsafe verdict.

What is the advantage of the influence-function-based data purification?

It identifies and retains only the training examples that most directly align with the desired guardrail behavior, allowing the model to learn effective safety judgment from a compact 1k-sample dataset while avoiding overfitting to noisy or irrelevant patterns.

Researchers and engineers can now deploy lightweight, training-free guardrails that provide fine-grained safety diagnostics for autonomous agents without the massive compute overhead of frontier-scale models.

Introduction

We expose the emerging safety risks of open-world agents and outline a lightweight, scalable alignment framework.

Open‑world agents such as OpenClaw can execute across many environments, but this flexibility opens a wide surface of safety hazards. At the same time, frontier models dramatically lower the cost of adversarial attacks, exposing the fragility of existing alignment approaches.

These are the ways an autonomous agent can cause undesirable outcomes while interacting with its environment, ranging from unintended actions to covert manipulation of external systems.

Step 1 produces vector $v_1$, step 2 produces $v_2$, …, step 8 produces $v_8$.

The naive safety module builds an $8\times8$ similarity matrix, then multiplies each entry by the 4‑dimensional feature size, yielding 256 values.

With 32‑bit floats, the memory cost is $256 \times 4\text{ bytes} = 1{,}024$ bytes, i.e., just over 1 KB.

This toy calculation shows that even modest trajectory lengths already demand non‑trivial memory, motivating the need for a lightweight, scalable diagnosis approach.

**Table 1.** Comparison of different models across accessibility, capabilities, and applications.

The core shift is from protecting content to safeguarding agent behavior throughout its execution trajectory.

Framework Overview

A lightweight, taxonomy‑guided framework aligns agents safely with minimal data.

Current safety‑alignment pipelines consume millions of trajectories and heavyweight models, making deployment on modest hardware impractical. Our framework replaces that bulk with a taxonomy‑guided data engine that learns from a few thousand curated samples while preserving alignment quality. The design also supports a scalable training pipeline and a runtime guardrail for real‑world agents.

Instead of feeding the model raw, unstructured trajectories, we first organize safety risks into a concise taxonomy and then synthesize a tiny, focused dataset that directly teaches the model to recognize and avoid those risks.

Step 1: Sample generation – create two trajectories illustrating misuse, one showing information leakage, and one depicting self‑harm.

Step 2: Label each trajectory with its taxonomy tag (M, L, or S).

Step 3: Train AgentDoG 1.5 for one epoch over the four labeled trajectories.

Step 4: Evaluate on a held‑out test set; the model correctly flags all three risk types despite having seen only four examples.

Even a handful of taxonomy‑aligned examples suffices for the model to internalize distinct safety concepts, illustrating the efficiency of the data engine.

How does this data engine differ from standard data augmentation?

Standard augmentation merely perturbs existing trajectories, leaving the underlying risk distribution unchanged. In contrast, the taxonomy‑guided engine synthesizes new trajectories that explicitly cover each taxonomy node, guaranteeing exposure to rare safety scenarios that augmentation would never generate.

**Figure 2.** A lightweight and scalable alignment framework for AI agent safety and security.

Safety Taxonomy and Benchmarks

We define a three‑dimensional safety taxonomy and a flexible benchmark family for trajectory‑level diagnosis.

The taxonomy splits safety diagnosis into three orthogonal dimensions—risk source, failure mode, and real‑world harm—so a guard model can pinpoint where a problem originates, how it manifests, and what damage it could cause.

How does this three‑dimensional taxonomy differ from a simple safe/unsafe label?

Instead of a binary verdict, the taxonomy yields a triplet of labels—one for where the risk entered, one for how the agent failed, and one for what harm could result—so a guard model can explain the failure and suggest targeted mitigations.

ATBench defines a set of trajectory‑level benchmarks that all share the same three‑dimensional taxonomy but differ in execution setting, evidence format, and leaf‑category specializations, enabling consistent evaluation across diverse agent environments.

Why keep the three high‑level dimensions fixed if leaf categories change?

Fixing the dimensions preserves a common evaluation language; changing them would fragment results, making it impossible to compare safety performance across settings.

To adapt the fixed three‑dimensional taxonomy to a new agent setting, we either add fresh leaf categories for novel risks or tighten the scope of existing inherited categories, preserving the high‑level structure while capturing setting‑specific details.

Step 1: Start with the base leaf list for each dimension (four leaves total).

Step 2: Detect that “External API Misuse” is not represented in the Risk Source dimension.

Step 3: Add a new leaf category “External API Misuse” under Risk Source.

Step 4: Strengthen the inherited Failure Mode “Invalid Tool Call” to explicitly cover malformed API requests.

Step 5: Update the annotation schema so any trajectory exhibiting the new risk receives the leaf “External API Misuse” and the refined failure mode.

Adding a leaf captures a genuinely new risk, while strengthening an inherited leaf preserves continuity for models trained on the original taxonomy.

Could we simply append new leaf categories without adjusting inherited ones?

No—without sharpening inherited categories the model would treat the refined risk as a completely new concept, losing the ability to transfer knowledge from previously seen examples.

**Figure 3.** AgentDoG 1.5 uses the original three-dimensional agentic safety taxonomy as a shared diagnostic structure spanning risk source, failure mode, and real-world harm. Setting-specific customizations for ATBench-Claw and ATBench-Codex are organized on top of this shared structure, illustrating how new execution settings can introduce additional leaf categories while retaining compatibility with the original taxonomy dimensions.

**Figure 4.** ATBench family used to evaluate AgentDoG 1.5. All benchmark instances share the same three-dimensional safety taxonomy and trajectory-level diagnosis task, while ATBench-Claw and ATBench-Codex customize the execution setting, trajectory evidence, and leaf categories for their target agent environments. Complete customized category definitions are provided in Appendix A.

The AgentDoG 1.5 Model

Define the safety and risk diagnosis tasks that AgentDoG 1.5 must solve on full agent trajectories.

Evaluating an agent’s full execution trace is costly, yet missing a single unsafe step can cause severe harm. AgentDoG 1.5 tackles this by diagnosing safety at the trajectory level, providing both a binary safety verdict and a fine‑grained risk breakdown.

The model scans the entire action‑observation sequence, flags any unsafe step, and then reports the underlying risk source, failure mode, and real‑world harm — like scanning a log file for any error line and then classifying the error type.

Step 1 is safe: $\text{Unsafe}(t_1)=\text{False}$.

Step 2 triggers the unsafe predicate because the action “execute” with payload “sudo rm -rf /” is dangerous, so $\text{Unsafe}(t_2)=\text{True}$.

The existence of an unsafe step makes the overall label $y=\text{unsafe}$.

Given $y$ is unsafe, the model emits $y_{\text{fine}} = (\text{Over-privileged Action},\text{Improper Tool Use},\text{Security \& System Integrity})$.

The model can stop scanning as soon as any step is flagged unsafe, yet it still produces a full risk diagnosis for that trajectory.

How does trajectory‑level diagnosis differ from a standard sequence‑classification model?

Standard classifiers assign a single label based on the whole sequence without inspecting individual steps; the trajectory‑level approach explicitly checks each step for the $\text{Unsafe}$ predicate and, when unsafe, additionally outputs a structured risk triple, which ordinary classifiers do not provide.

The image displays two side-by-side task descriptions for evaluating agent behavior. The left panel defines the task of evaluating whether an agent's trajectory contains unsafe actions, providing definitions for "safe" and "unsafe," a trajectory template, and an output format requiring an analysis section. The right panel defines the task of analyzing an agent's trajectory to identify failure modes, risk consequences, and risk sources, providing a categorization framework and an output format that concludes with specific risk classifications (Risk Source, Failure Mode, Real World Harm).

Data Collection

Planner-driven pipeline creates risk‑aware trajectories with explicit safety labels.

Raw synthesized trajectories are noisy and often lack the reasoning traces needed to train a safety‑aware judge, which stalls progress on reliable risk diagnosis.

The pipeline first sketches a trajectory by sampling a risk tuple, then fleshes it out into a full multi‑turn interaction, and finally discards low‑value examples – much like a storyboard writer drafts a plot before the actors perform.

Step 1: Planning – the tuple is recorded and the sketch defines a user request to delete a file, followed by a network call that could exfiltrate data.

Step 2: Synthesis – the agent replies “Sure, deleting the file now”, then invokes the file‑write tool, and finally calls the network API, completing the unsafe trajectory.

Step 3: Parallel safe variant – using the same sketch, the agent detects the risk and refuses the file‑write, issuing a warning instead.

Step 4: Validation – the influence‑function score flags the unsafe variant as high‑risk, while the safe variant passes the filter.

Even with a tiny configuration, the planner guarantees that the unsafe and safe versions share the exact same risk injection point, giving the model a clear contrast for learning risk detection.

How does this planner differ from a generic trajectory generator that simply samples actions?

The planner explicitly conditions the sketch on a three‑dimensional risk tuple and a designated injection point, then produces paired safe/unsafe instantiations. A generic generator lacks that structured risk control and therefore cannot guarantee systematic coverage of the taxonomy or provide the paired contrast needed for fine‑grained safety training.

Data Purification

We prune the raw SFT pool to keep only examples that steer the model toward safe guardrail behavior.

The raw SFT pool mixes useful safety signals with redundant or noisy examples, which wastes fine‑tuning budget and can cause over‑fitting to spurious patterns.

We keep only those training examples whose gradient points align with a guardrail direction that explicitly encourages correct safety judgments.

Compute preference weights: $\hat{\pi}_{q_1}=0.7/(0.7+0.3)=0.7$, $\hat{\pi}_{q_2}=0.4/(0.4+0.6)=0.4$.

Assume target‑response gradients (simplified to scalars) are $\hat{\bar{g}}(q_1,y^+)=+2$, $\hat{\bar{g}}(q_1,y^-)= -1$, $\hat{\bar{g}}(q_2,y^+)=+1$, $\hat{\bar{g}}(q_2,y^-)= -2$.

Guardrail direction: $\hat{g}_{\text{guard}}=\frac{1}{2}\big[0.7\,(2-(-1)) + 0.4\,(1-(-2))\big]=\frac{1}{2}\big[0.7\cdot3 + 0.4\cdot3\big]=\frac{1}{2}(2.1+1.2)=1.65$.

Raw example gradients (again scalars) are $\hat{g}_{z_1}=+1$, $\hat{g}_{z_2}=+0.5$, $\hat{g}_{z_3}=-0.3$, $\hat{g}_{z_4}=+0.2$.

Purification scores: $s_\pi(z_1)=1\times1.65=1.65$, $s_\pi(z_2)=0.5\times1.65=0.825$, $s_\pi(z_3)=-0.3\times1.65=-0.495$, $s_\pi(z_4)=0.2\times1.65=0.33$.

We keep $z_1$ and $z_2$ (positive scores) and discard $z_3$, $z_4$ (negative or low scores), yielding $D_{\text{keep}}=\{z_1,z_2\}$.

Even a tiny scalar illustration shows how the guardrail direction amplifies examples that reinforce safety while suppressing those that would push the model away.

How does this selection differ from a naïve frequency‑based down‑sampling of the raw pool?

Frequency‑based down‑sampling ignores how each example influences the model’s parameters; our method scores examples by the alignment of their gradient with a safety‑oriented direction, so we keep rare but highly informative examples and drop frequent but unhelpful ones.

After purification the dataset shrinks to roughly 1 k high‑impact examples, cutting fine‑tuning cost while preserving or improving guardrail performance.

Training Pipeline

To enhance the model’s judgment capability and rationale generation ability, we follow DeepSeek’s (Guo et al., 2025b) training recipe and adopt a two‑stage training pipeline. We first apply Supervised Fine‑Tuning (SFT) to the base model to obtain a coarse‑grained judgment model and initialize a fine‑grained judgment model. We then use Reinforcement Learning (RL) to further optimize the fine‑grained model. In the SFT stage, the model is trained on the purified CoT‑augmented dataset to acquire a solid foundation of reasoning patterns and fine‑grained discriminative knowledge. Subsequently, the RL stage further refines the model’s decision boundaries by optimizing directly toward reward signals that reflect fine‑grained evaluation criteria, encouraging the model to produce more nuanced and precise judgments beyond what supervised learning alone can achieve.

3.3.1 Supervised Fine‑Tuning

To start with, we train the model with standard SFT on either the coarse‑grained or the fine‑grained dataset $D$ of input‑output demonstrations $(x, y)$. Given an input context $x$, the model is optimized to generate the target response $y$ autoregressively by maximizing the conditional likelihood of each target token. Equivalently, we minimize the negative log‑likelihood objective:

$$ L_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y)\sim D}\frac{1}{|y|}\sum_{t=1}^{|y|}\log \pi_\theta(y_t \mid x, y_{<t}). $$

Here, $\pi_\theta$ denotes the model policy parameterized by $\theta$, $y_t$ is the $t$-th token of the target output, and $y_{<t}$ denotes the preceding target tokens. This objective encourages the model to imitate the reference demonstrations in $D$ by assigning high probability to the annotated outputs conditioned on the input context. We fine‑tuned Qwen3.5‑0.8B, Qwen3.5‑2B, Qwen3.5‑4B (Qwen Team, 2026), and Llama‑3.1‑8B‑Instruct (Dubey et al., 2024) with a learning rate of $1\times10^{-5}$.

3.3.2 Reinforcement Learning

The RL stage refines the SFT policy toward more accurate fine‑grained judgment via reinforcement learning with verifiable rewards, using Group Reward‑Decoupled Normalization Policy Optimization (GDPO; Liu et al., 2026b) to preserve the multi‑dimensional reward signal. For each query $q_i$, the rollout policy samples $G$ responses, and a deterministic verifier scores each along three dimensions (failure mode, real‑world harm, risk source), yielding a binary reward vector $(r_1, r_2, r_3)$; an upstream reasoning‑block gate zeros all three if the response omits a non‑trivial analysis span. We prefer GDPO over scalar GRPO (Shao et al., 2024) because fine‑grained judgment contains many partial‑satisfaction cases, where summing rewards into one scalar makes a rollout correct on failure mode but wrong elsewhere, indistinguishable from qualitatively different patterns after group‑relative normalization. GDPO instead normalizes advantages per dimension, combines them with weights $(w_1, w_2, w_3) = (0.3, 0.4, 0.3)$, applies batch‑level normalization, and we retain any rollout group with non‑zero variance in any dimension, so the per‑dimension signal is not discarded.

The resulting normalized advantage $\hat{A}_{i,j}$ serves as the response‑level learning signal for rollout $o_{i,j}$; it is obtained by batch‑normalizing the weighted sum of the dimension‑wise advantages. This response‑level advantage is shared by all tokens in the same rollout. The token‑level policy ratio is defined as

$$ s_{i,j,t}(\theta) = \frac{\pi_\theta(o_{i,j,t} \mid q_i, o_{i,j,<t})}{\pi_{\theta_{\text{old}}}(o_{i,j,t} \mid q_i, o_{i,j,<t})}, $$

where $q_i$ denotes a query sampled from the data $D$, $t$ indexes tokens in rollout $o_{i,j}$, and $\{o_{i,j}\}_{j=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q_i)$.

For compactness, we define the token‑level clipped surrogate term as

$$ \text{clip}_{i,j,t}(\theta) = \min\!\left( s_{i,j,t}(\theta)\,\hat{A}_{i,j},\; \text{clip}\big(s_{i,j,t}(\theta),\,1-\epsilon_{\text{low}},\,1+\epsilon_{\text{high}}\big)\,\hat{A}_{i,j} \right). $$

We optimize the policy with the following KL‑regularized clipped surrogate objective:

$$ J_{\text{GDPO}}(\theta) = \mathbb{E}_{q_i\sim D,\; \{o_{i,j}\}_{j=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot\mid q_i)}\!\left[ \frac{1}{G}\sum_{j=1}^G \frac{1}{T_{i,j}} \sum_{t=1}^{T_{i,j}} \text{clip}_{i,j,t}(\theta) - \beta D_{\text{KL}}\!\left(\pi_\theta(\cdot \mid q_i, o_{i,j,<t}) \,\|\, \pi_{\text{ref}}(\cdot \mid q_i, o_{i,j,<t})\right) \right], $$

where $T_{i,j}$ is the length of rollout $o_{i,j}$, $\epsilon_{\text{low}} = 0.2$, $\epsilon_{\text{high}} = 0.28$, $\beta = 0.001$, learning rate $= 1\times10^{-6}$, and $G = 8$.

3.4 Evaluation

We provide a comprehensive evaluation of AgentDoG 1.5’s capability in agentic safety diagnosis. Our experiments are designed to assess the model along three critical dimensions: (1) Trajectory‑level safety evaluation, which identifies unsafe behaviors in multi‑step interactions; (2) Fine‑grained risk diagnosis, which categorizes specific risk sources and failure modes; and (3) Across prominent agentic execution environments, which assesses safety judgment capabilities in widely adopted agentic scenarios.

3.4.1 Experimental Setup

Benchmarks and metrics: We utilized R‑Judge (Yuan et al., 2024), ATBench (Li et al., 2026b), ATBench‑Claw (Yang et al., 2026b) and ATBench‑Codex (Yang et al., 2026b) to evaluate the performance of our AgentDoG 1.5. Each dataset consists of complete agent trajectories, where each trajectory is classified as either safe or unsafe.

The evaluation is structured as two complementary tasks:

- Trajectory‑level safety evaluation: The classification of each trajectory as safe or unsafe, utilizing standard metrics such as Accuracy, Precision, Recall, and F1‑score. Specifically, we assess AgentDoG 1.5 on R‑Judge and ATBench.

Table 2: Performance comparison across R‑Judge and ATBench using Accuracy, Precision, Recall, and F1‑score.

Performance Evaluation

AgentDoG 1.5‑4B sets new safety benchmarks across trajectory and fine‑grained risk tasks.

AgentDoG 1.5‑4B outperforms all open‑source and guard baselines, reaching 92.2 % accuracy on R‑Judge and 72.4 % accuracy on ATBench while preserving the same 92.7 % F1 on R‑Judge as its predecessor.

R‑Judge: 92.2 % Acc, 92.7 % F1; ATBench: 72.4 % Acc, 74.3 % F1. Compared to AgentDoG 1.0, ATBench accuracy rises from 64.0 % to 72.4 % (+8.4 pts) and F1 from 71.0 % to 74.3 % (+3.2 pts).

We compare against three baseline families: closed‑source frontier models (GPT‑5.4, GPT‑5.2, Gemini‑3‑Flash, Gemini‑3.1‑Pro), a wide range of open‑source LLMs (Qwen3‑series, Llama‑3.1‑8B), and specialized guard models (LlamaGuard, Qwen3‑Guard, ShieldAgent, JoySafety, NemoGuard).

Beyond binary safety, we report fine‑grained risk diagnosis on ATBench, measuring separate accuracies for Risk Source, Failure Mode, and Real‑world Harm categories.

**Figure 1.** Accuracy(%) of AgentDoG 1.5 and existing frontier and guardrail models. The first row reports binary safety classification results on four benchmark datasets, while the second row shows results on the fine-grained safety classification ATBench.

Benchmark Comparison

AgentDoG 1.5‑4B attains a 68.0 ATBench average, beating larger models while using only 4 B parameters.

AgentDoG 1.5‑4B reaches an ATBench average score of 68.0, surpassing the 65.7 of the much larger Qwen3.5‑397B‑A17B while using only 4 B parameters.

Table 3 shows AgentDoG 1.5‑4B scoring 68.0 versus 65.7 for Qwen3.5‑397B‑A17B on the ATBench benchmark.

Efficiency Analysis

AgentDoG 1.5 variants deliver high safety accuracy with tiny models.

AgentDoG 1.5‑4B‑U sets the state‑of‑the‑art on ATBench‑Claw with 87.6 % accuracy, surpassing both open‑source and closed‑source frontier models.

Compared to the high‑reference line for closed‑source models, the 4B‑U variant exceeds it; it also outperforms Qwen3.5‑397B (≈85 % reported) while using < 5 % of the parameters.

Even the smallest 0.8 B variant attains 75.7 % accuracy on R‑Judge and 60.3 % accuracy on ATBench, beating several larger general‑purpose and guard models; the 2 B version matches the ATBench F1 of a 397 B parameter model while using a fraction of the compute.

Fine‑grained risk diagnosis further improves safety judgment: the 4 B model reaches 75.2 % on Risk Source, 27.5 % on Failure Mode, and 62.9 % on Real‑world Harm (average 55.2 %), a 20.6‑point lift over the 1.0 baseline.

Across execution environments, the 4 B variant maintains robustness—80.0 % accuracy on ATBench‑Codex and 84.0 % on ATBench‑Claw—while the 0.8 B model still outperforms many larger baselines with 70.2 % and 78.4 % accuracy respectively.

**Figure 7.** Accuracy on ATBench-Codex and ATBench-Claw across model sizes. The x-axis uses dense model size and active parameters for MoE models. Closed-source models are represented by the highest and lowest closed-source reference lines because their parameter sizes are not publicly available. Guard models without explicit size in the model name are placed using approximate backbone sizes, with slight horizontal jitter for readability. Qwen3.5-0.8B and Qwen3.5-2B are not reported due to low strict-parser validity for presentation.

Application: Training-based Safety

We evaluate how AgentDoG 1.5 improves safety when used for data filtering and reward shaping in SFT and RL.

This section evaluates the impact of using AgentDoG 1.5 as a trajectory‑level filter for supervised fine‑tuning (SFT) and as a safety reward model for reinforcement learning (RL). We report ablation results that isolate the contribution of the filter and the reward signal.

**Figure 8.** Taxonomy distribution of the filtered agentic safety SFT data by AgentDoG 1.5. The resulting dataset contains 28,705 high-quality trajectories selected by AgentDoG 1.5, categorized by failure mode, real-world harm, and risk source.

Removing AgentDoG 1.5 filtering dramatically raises the harmful‑request score.

Table 4 shows the harm score drops from 57.49 % (unfiltered) to 20.32 % (filtered).

Filtering also boosts the refusal rate on harmful requests.

Table 4 records a rise from 28.41 % to 75.00 %.

Safe rate on tool‑use tasks improves with filtering.

Table 4 reports an increase from 34.37 % to 53.23 %.

Attack success rate on security benchmarks falls when using the filter.

Table 4 shows ASR drops from 34.72 % to 23.82 %.

The table evaluates the performance of Qwen3.5-4B and its variants (+ Util, + Unfilt-Safe, + AgentDoG 1.5-Filt) across several benchmarks: AgentHarm (BS, HS, RR), AgentSafetyBench (SR), AgentSecurityBench (ASR), AgentDojo (BU, UA, ASR), AgentDyn (BU, UA, ASR), and BFCL (Acc.).

Read the original paper

Open the simplified reader on Paperglide