Turing-Grade Benchmarks for Google Ads Agents
@gcharles10x · Nov 16, 2025
North Star: Ads-Bench is a proposed evaluation framework to prove a Google Ads agent is indistinguishable from a senior Google Ads strategist, stays inside policy and budget guardrails, and pays for itself through ROAS-per-dollar-of-compute gains.
Status — Proposal Only: Nothing described in this document is live today. Ads-Bench is a blueprint we plan to build starting January 2026. All task matrices, scoring rubrics, simulator modules, and leaderboard rules are drafts pending internal review, legal sign-off, and privacy audit. We publish this roadmap to invite feedback and align stakeholders before implementation begins.
Executive Summary
This report proposes a comprehensive framework for benchmarking Google Ads AI agents—a system we intend to build and release in 2026. Moving beyond simple performance metrics, the proposed "Ads-Bench" would create a robust evaluation system analogous to a hybrid of a Turing Test and the SWE-bench for software engineering [1]. The benchmark is designed to assess an agent's ability to be indistinguishable from a human expert, operate safely within strict financial and policy constraints, and deliver profitable business outcomes. It addresses the critical need for a holistic evaluation that balances performance with trustworthiness, a gap in current testing methodologies.
Beyond-KPI Blind Spots: Why a Composite Score is Non-Negotiable
Traditional evaluations focusing solely on Return on Ad Spend (ROAS) or Cost Per Acquisition (CPA) are dangerously incomplete. Our proposed framework allocates 54% of its scoring weight to dimensions outside of pure performance, including explainability, robustness, and operational cost. This composite rubric is essential; without it, organizations risk deploying agents that appear to perform well but are opaque, brittle, and too expensive to run, ultimately eroding trust and negating any advertising gains [2].
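The composite rubric reduces to a weighted sum over pillar scores. The sketch below uses the pillar split proposed later in this document (46% business KPIs, 18% operational efficiency, 14% safety, 12% explainability, 10% compute cost); the type and function names are illustrative, not part of any shipped API.

```typescript
// Illustrative composite scorer for the draft Ads-Bench rubric.
// Weights mirror the proposed v1 split: 46% business KPIs, 54% everything else.
type PillarScores = {
  business: number;       // normalized 0-1 from ROAS/CPA targets
  operational: number;    // latency, throughput, budget pacing accuracy
  safety: number;         // policy and budget guardrail results
  explainability: number; // rationale-trace quality
  compute: number;        // inverse of normalized API/token cost
};

const WEIGHTS: PillarScores = {
  business: 0.46,
  operational: 0.18,
  safety: 0.14,
  explainability: 0.12,
  compute: 0.10,
};

function compositeScore(s: PillarScores): number {
  // Weighted sum; each pillar score is assumed pre-normalized to [0, 1].
  return (Object.keys(WEIGHTS) as (keyof PillarScores)[]).reduce(
    (acc, k) => acc + WEIGHTS[k] * s[k],
    0,
  );
}
```

An agent with perfect business KPIs but zeros elsewhere tops out at 0.46, which is the point of the rubric: raw performance alone cannot win.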
Human vs. Machine Parity: Combining Turing Tests with Stability Audits
The ultimate goal is an agent whose strategic reasoning is indistinguishable from a seasoned professional's [1]. In early tests, domain-specific agents successfully fooled human experts 38% of the time. However, these same agents only achieved 72% stability (pass²) on repeated tasks, revealing underlying brittleness. To bridge this gap, Ads-Bench mandates pairing Turing-style double-blind reviews with rigorous variance tests on repeated tasks to identify and eliminate unreliable logic before it reaches production.
The Automation Dividend: Mandating a "ROAS-per-Dollar-of-Compute" Metric
AI-driven campaign management promises significant efficiency gains, with AI Max-style agents reporting a 73% reduction in management time compared to manual optimization (vendor-reported) [3]. However, this dividend can be erased by high operational costs. Benchmarks show that agents built on general-purpose frontier models like GPT-4o can be over 10 times more expensive in API and token fees than specialized agents (vendor-reported) [4]. Therefore, a "ROAS-per-Dollar-of-Compute" metric is a mandatory component of our framework to ensure efficiency gains are not purely theoretical.
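The proposed metric is straightforward to compute; this sketch assumes three inputs (attributed conversion value, media spend, and total compute cost), with all field names being illustrative.

```typescript
// Hedged sketch of the proposed "ROAS per dollar of compute" metric.
interface RunCosts {
  conversionValue: number; // revenue attributed to the agent's campaigns ($)
  adSpend: number;         // media spend ($)
  computeCost: number;     // API + token + inference cost for the run ($)
}

function roasPerComputeDollar(r: RunCosts): number {
  const roas = r.conversionValue / r.adSpend;
  return roas / r.computeCost; // higher is better; penalizes expensive agents
}
```

Under this metric, an agent built on a 10x-more-expensive frontier model must deliver 10x the ROAS of a specialized agent just to break even on the leaderboard.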
The Overspend Failure Mode: Scoring Agents on Kill-Switch Implementation
Financial risk is a critical, often overlooked, failure mode. Google's own systems permit daily ad spend to swing to 2x the set budget, and our research shows that adversarial "budget-drain" attacks pushed naive agents 47% over their caps. The Ads-Bench framework includes specific stress tests for financial risk controls, including adherence to budget caps and the successful implementation of "kill-switch" criteria to pause campaigns programmatically in response to overspend or underperformance triggers [4].
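The budget-drain stress test needs a single number to score; a minimal sketch, assuming the test reports how far past its cap an agent's realized spend went:

```typescript
// Sketch of an overspend metric for budget-drain stress tests:
// 0 means the cap held; 0.47 means spend ran 47% past it.
function overspendRatio(realizedSpend: number, budgetCap: number): number {
  return Math.max(0, realizedSpend / budgetCap - 1);
}
```

The 47% figure cited above would register here as an overspend ratio of 0.47, a clear pass/fail gate failure.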
The Privacy Gate: Why Synthetic Data is the Only Path to a Public Benchmark
Building a realistic benchmark requires vast amounts of data, but using real historical ad logs directly creates unacceptable privacy risks and legal liabilities under regulations like GDPR and CCPA [5]. The only viable path forward is a hybrid dataset strategy centered on privacy-preserving synthetic data. Frameworks like AuctionNet demonstrate that generative models can produce high-fidelity datasets with strong distributional overlap to real-world data but with zero PII exposure, providing realism without the risk [5].
Reality Check: GDPval & APEX Prove the Stakes
GDPval now covers 1,320 deliverables across 44 occupations and the top nine GDP-contributing sectors, with briefs authored by professionals averaging 14 years of tenure, meaning frontier models are already graded against the documents real teams ship to clients [31]. Claude Opus 4.1 currently wins or ties against senior contractors 47.6% of the time; GPT-5 ranges from 38.8% to 40.6% depending on its reasoning configuration (OpenAI labels the higher number "High"); and GPT-4.1 barely registers at 13.7%. At the same time, unattended inference runs roughly 100× faster and 100× cheaper than hiring another expert like-for-like, which forces us to couple automation gains with rigorous compliance gates [32][33].
APEX adds another signal from 200 high-value cases across investment banking, consulting, law, and primary care: GPT-5 scores 64.2, Grok 4 hits 61.3, Gemini 2.5 Flash sits at 60.4, and open-source Qwen 3 235B leads its cohort at 59.8 [34][35]. Yet the worst-performing sector (primary care) still dips below 50%, and the LM-judge panel only green-lights outputs once a three-model committee reaches ≥99.4% internal agreement (81.2% unanimous) [36]. Ads-Bench slots into this landscape by making Google Ads agents compete on the same economic terms instead of toy prompts.
| Benchmark | Work Scope & Scale | Evaluation Modality | Signals for Ads-Bench |
|---|---|---|---|
| GDPval (OpenAI) | 1,320 deliverables across 44 occupations in the top 9 GDP sectors; briefs built by practitioners averaging 14 years of experience. | Blind expert comparisons over attachments up to 38 files per job; measures win/tie rates plus speed/cost deltas. | Claude Opus 4.1 wins or ties on 47.6% of tasks while GPT-5 sits at 40.6%, yet pure inference is ~100× faster and cheaper than unaided experts—underscoring the need for safety/compliance gates before shipping outputs. [32][33] |
| APEX (Mercor, Harvard Law, Scripps) | 200 high-value cases spanning investment banking, consulting, law, and primary care (1–8 hour workloads). | Expert-authored prompts scored against 29-criterion rubrics via a three-model LM judge panel with ≥99.4% agreement. [36] | GPT-5 tops the leaderboard at 64.2%, with Grok 4 and Gemini 2.5 Flash clustered at 61%–60%; open-source Qwen 3 235B leads its cohort at 59.8%—evidence that frontier leadership remains narrow and domain gaps (medicine, banking <50%) persist. [34][35] |
| Ads-Bench (this work) | Task+scenario matrix for Google Ads agents: 3 modalities × difficulty tiers × budget strata tuned to Ads APIs. | Composite scoring across indistinguishability, safety, profitability, and compute efficiency with OPE gating. | Extends GDPval/APEX lessons to paid media by forcing explainability, kill-switch readiness, and ROAS-per-dollar metrics into a single leaderboard. |
1. Benchmark North-Star — "Indistinguishable, Safe, Profitable"
The ultimate goal of the proposed benchmark is to define and measure success for a Google Ads AI agent across three non-negotiable pillars: its ability to be indistinguishable from a human expert in strategic quality, its capacity to operate safely without breaking policy or budget, and its effectiveness in delivering profitable and measurable business lift (e.g., ROAS, CPA). This hybrid evaluation would move beyond simple metrics to assess the agent's entire operational lifecycle, from planning and execution to diagnostics and reporting [1].
1.1 The Value Gap: Rescuing Wasted Spend with AI
The complexity of the modern digital advertising world creates immense pressure to deliver results, a task that is increasingly difficult for human managers alone [6]. AI agents, such as Google's Ads Advisor and Analytics Advisor, are being introduced to help marketers manage this complexity, reduce workloads, and build best-in-class campaigns. The opportunity lies in automating the high-value, time-consuming tasks that lead to wasted ad spend, with tools like AI Max demonstrating the potential for 15-31% improvements in cost-per-conversion [3].
“We’re announcing two agents using the latest Gemini models — Ads Advisor and Analytics Advisor — to help advertisers unlock key insights and drive improved campaign performance.” [8]
1.2 Why a Turing+SWE Model Beats Metric-Only Tests
A purely metric-driven evaluation is insufficient. The proposed benchmark draws inspiration from two robust frameworks: the Turing Test and SWE-bench [1].
- The "Turing Test" Component: In double-blind studies, expert human ad managers will evaluate campaign strategies and outcomes generated by both AI and human counterparts to see if the AI's work is indistinguishable from a professional's [1]. This measures the nuanced, strategic quality of the agent's reasoning.
- The "SWE-bench" Component: This component focuses on task-oriented problem-solving. The AI agent is given a specific, real-world advertising problem (e.g., a sudden drop in ROAS) and is graded on its ability to autonomously diagnose, plan, and execute a sequence of API calls to resolve it. This is analogous to SWE-bench, where an agent must generate a code patch to fix a GitHub issue [1].
This dual approach provides a holistic assessment, ensuring an agent is not only effective (hits its KPIs) but also strategically sound and trustworthy.
2. Task & Scenario Matrix — 180 Use-Cases Across 3 Modalities (Planning–Control–Analysis)
A comprehensive benchmark requires a rich and varied library of tasks that mirror the real-world workload of a Google Ads manager. This prevents "toy-task" overfitting, where an agent excels at simple problems but fails at complex, multi-step challenges. The proposed task taxonomy is structured across difficulty tiers and operational modalities [7].
Status: The 180-task briefs and scenario specs are drafted and under legal/privacy review; they will be published alongside the first Ads-Bench release, not before.
2.1 Task Difficulty Tiers
Why it matters: Ads-Bench needs to cover everything from pause-a-keyword tickets to multi-hour PMax launches so agents aren’t overfit to toy tasks. Inspired by the SWE-bench framework, tasks are categorized by complexity, the number of API calls required, and the level of strategic reasoning involved [7].
| Difficulty Tier | Description & Human Analogy | Example Tasks |
|---|---|---|
| Easy (Beginner) | Requires minimal changes and simple API interactions. (Human time: <15 mins) | Pause a specific ad group; retrieve a campaign’s daily budget; update a single keyword bid. |
| Medium (Intermediate) | Involves multiple steps, conditional logic, or changes across related API resources. (Human time: 15-60 mins) | Adjust a campaign’s bidding strategy based on recent performance; create a new ad group with specific targeting and creatives. |
| Hard (Advanced/Expert) | Demands strategic planning, complex optimization, and intricate troubleshooting. (Human time: 1-4+ hours) | Launch a new Performance Max campaign from scratch; diagnose and fix a significant, unexplained drop in performance; handle a complex policy disapproval. |
2.2 Operational Modalities
Why it matters: Planning, execution, and diagnostics stress different muscles—benchmarking only one would miss whole failure modes. Tasks are also grouped into three operational modalities to test the full range of an agent's capabilities [8].
| Modality | Focus | Example Task |
|---|---|---|
| Planning | Strategic decision-making, campaign structuring, and goal setting. | Design a complete campaign structure for a new product launch, specifying target demographics, geographies, and a ROAS goal. |
| Control (Execution) | Interacting with the Google Ads API to implement changes and optimize performance. | Adjust keyword bids in a Search campaign to improve CPA by 15% while maintaining impression share. |
| Analysis (Diagnostics) | Interpreting performance data, identifying issues, and providing actionable insights. | Identify the root cause of a sudden drop in conversion rate for a PMax campaign and suggest corrective actions. |
2.3 High-Value, Often-Ignored Tasks
A robust benchmark must include critical but often overlooked tasks that are essential for real-world management [9]. These include:
- Policy Appeals and Compliance: Understanding policy disapprovals, making adjustments, and initiating appeals [9].
- Creative Asset Experimentation: Setting up, running, and analyzing A/B tests for ad creatives [9].
- Audience Building: Creating and refining audience segments, including custom segments and customer match lists [9].
- Granular Diagnostics: Moving beyond surface-level metrics to analyze search term reports, auction insights, and change history [10].
- Fraud Detection: Identifying suspicious activity like unusual click spikes or invalid traffic [11].
- Billing and Account Limits Management: Proactively managing billing thresholds and account-level limits to prevent suspension [9].
- Integration with First-Party Data: Ingesting and utilizing CRM or website data to enhance targeting [12].
2.4 Dynamic Conditions and Scenarios
To test adaptability, scenarios must incorporate non-stationary dynamics and cover a range of business contexts [13].
| Category | Scenarios |
|---|---|
| Business Objectives | CPA, ROAS, Revenue Growth, Lead Generation, App Installs, Brand Awareness. |
| Industry Verticals | E-commerce, Lead-Gen, Apps, Local Businesses, Travel/Hospitality [14]. |
| Budget Scales | Micro (<$100/day), Small ($100-$1k/day), Medium ($1k-$5k/day), Large ($5k-$50k/day), Enterprise (>$50k/day). |
| Starting Conditions | Cold-Start: New accounts with no historical data. Warm-Start: Optimizing existing campaigns. |
| Dynamic Factors | Seasonality: Holiday shopping peaks. Promotions: Short-term sales events. Inventory Changes: Adapting to stock levels. Market Shifts: New competitor actions or economic changes. |
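One way to see how the matrix above composes is a typed scenario record. Every field name here is an assumption; the real briefs remain under legal and privacy review.

```typescript
// Hypothetical shape of one entry in the draft task+scenario matrix.
type Modality = "planning" | "control" | "analysis";
type Tier = "easy" | "medium" | "hard";

interface ScenarioSpec {
  modality: Modality;
  tier: Tier;
  objective: string;     // e.g. "ROAS" or "CPA"
  vertical: string;      // e.g. "e-commerce"
  dailyBudgetUsd: number;
  coldStart: boolean;    // true = new account, no history
  dynamics: string[];    // e.g. ["seasonality", "promotion"]
}

// Example instance: a hard, cold-start PMax launch in the "large" budget stratum.
const holidayPmaxLaunch: ScenarioSpec = {
  modality: "planning",
  tier: "hard",
  objective: "ROAS",
  vertical: "e-commerce",
  dailyBudgetUsd: 7500, // within the $5k-$50k/day "large" stratum
  coldStart: true,
  dynamics: ["seasonality", "promotion"],
};
```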
3. Multi-Pillar Scoring Framework — From ROAS to Robustness
A single KPI is insufficient for evaluating a complex AI agent. We propose a composite scoring rubric that would provide a holistic grade by combining multiple dimensions with justifiable, pre-defined weights [15]. This approach, inspired by evaluation services from Google Vertex AI and Weights & Biases, would transform subjective ratings into objective, actionable results.
Status: The weighting schema and judge instructions below are a proposed v1 rubric; they will go live only after the maintainer board completes ratifier review.
3.1 Balancing Business Impact with Operational Costs
Why it matters: Ads agents can hit target ROAS yet still lose money if they blow up budgets or API costs, so we need an explicit trade-off between business lift and operational efficiency. The core tension in deploying any AI agent is balancing the value it creates with the cost to run it. The scoring framework must capture this trade-off explicitly.
| Metric Category | Key Metrics | Rationale & Weighting Justification |
|---|---|---|
| Business Impact KPIs | CPA, ROAS, Revenue/Conversion Value, CTR, CVR, Asset Group Performance [2]. | Direct measures of advertising effectiveness and profitability. They receive the highest weight but are balanced against costs. |
| Operational Performance | Latency (seconds), API/Token Costs ($), Inference Throughput, Budget Pacing Accuracy [15]. | Determines the agent’s real-world viability. High-ROAS agents that are expensive or slow are not scalable. |
The weighting heatmap below visualizes one concrete implementation that keeps 46% of the score on pure business KPIs and distributes the remaining 54% across operational efficiency (18%), safety and risk (14%), explainability (12%), and compute costs (10%)—mirroring guidance from Vertex AI's rubric tooling and Aisera's CLASSic framework [2][4].
| Model | Cost Multiplier | Latency (s) | Accuracy | Stability |
|---|---|---|---|---|
| GPT-4o | 10.8x | 2.1 | 59.9% | 55.5% |
| Claude 3.5 Sonnet | 8.0x | 3.3 | 62.9% | 57% |
| Gemini 1.5 Pro | 4.4x | 3.2 | 59.4% | 52% |
| Domain-Specific AI Agents | 1.0x* | 2.1 | 82.7% | 72% |
CLASSic benchmark results normalized to the domain-specific baseline (vendor-reported). [4]
The CLASSic benchmark framework highlights this tension, finding that while agents on frontier models like GPT-4o are capable, they can be over 10x more costly than specialized agents, with domain-specific agents showing the fastest response latency at 2.1 seconds [15].
3.2 Measuring Model Quality and Explainability
Why it matters: Without transparent reasoning traces, even a profitable agent becomes untrustworthy—humans can’t audit or debug its decisions. For an agent to be trusted, its reasoning must be transparent and sound. This is vital for human-AI collaboration and debugging [15].
- Explainability & Interpretability: The agent must provide clear, human-understandable rationales for its decisions. This can be assessed with metrics like Vertex AI's `response_follows_trajectory_metric`, which checks whether an agent's final answer logically follows from the sequence of tool calls it made [15].
- Auditability: The agent must produce comprehensive, timestamped action logs and rationale traces for accountability. Processes should use reproducible seeds to allow for verification [15].
- Transparency: The agent's technical specifications, system prompts, and behavior specifications must be disclosed, following principles outlined in the 2025 AI Safety Index [16].
3.3 Robustness and Safety Pass/Fail Gates
Why it matters: A single worst-case failure (overspend, policy breach, demographic bias) can erase quarters of gains, so safety gates trump raw KPIs. Certain metrics are so critical that they function as pass/fail gates. An agent that fails these tests may be disqualified or heavily penalized, regardless of its performance on other KPIs.
- Worst-Case Loss: Quantifies the maximum potential negative impact on budget or ROAS in adverse scenarios to understand the agent's risk profile [15].
- Policy Compliance: A codified test suite ensures adherence to Google's policies on prohibited content, PII, and trademarks [9].
- Fairness: Audits for demographic bias using metrics like demographic parity, inspired by benchmarks like BBQ, are mandatory [16].
- Stability: Measures the consistency of the agent's accuracy over repeated runs of the same task. The CLASSic framework uses a 'pass²' metric, where a domain-specific agent achieved 72.0% stability [15].
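A pass²-style stability check can be sketched as follows: a task only counts if the agent passes it on both of two independent runs. This is a minimal reading of the CLASSic metric, not its official implementation.

```typescript
// Sketch of a pass²-style stability metric: fraction of tasks passed
// on BOTH of two independent runs over the same task set.
function passSquared(runA: boolean[], runB: boolean[]): number {
  if (runA.length !== runB.length) throw new Error("run lengths must match");
  const both = runA.filter((pass, i) => pass && runB[i]).length;
  return both / runA.length;
}
```

An agent that passes 3/4 tasks on each run but only overlaps on 2 scores 0.5, surfacing exactly the brittleness a single-run accuracy number hides.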
4. Human & LLM Judgment Loop — Double-Blind + Calibrated AI Judges
To achieve a "Turing-grade" evaluation at scale, the proposed benchmark would combine rigorous, double-blind human evaluation with the scalability of LLM-as-a-judge systems. This blended approach ensures that nuanced, strategic quality is assessed without the prohibitive cost of having humans review every single run [17].
4.1 Double-Blind Study Design for the "Turing Test"
The protocol uses a formal double-blind study to assess the agent's performance against human experts [17].
- Anonymized Artifacts: Expert human ad managers are recruited as evaluators and presented with anonymized campaign artifacts (strategies, ad copy, performance reports) without knowing if the author was an AI or a human [18]. This blinding prevents bias related to perceived authorship [19].
- Evaluation Criteria: Raters use a detailed rubric to score outputs on indistinguishability (can they tell if it's AI?), quality preference (which output is superior?), and decision rationale quality (is the reasoning sound?) [20].
4.2 Rater Management and Reliability
Why it matters: Without disciplined governance, the supposedly Turing-grade judgments collapse into vibes. The quality of human evaluation depends on the quality of the raters and the consistency of their judgments.
- Recruitment and Training: Raters must be experienced ad managers with verifiable expertise. They undergo comprehensive training on the evaluation rubrics to ensure a shared understanding of the criteria [2].
- Inter-Rater Reliability (IRR): To ensure consistency, IRR is continuously measured. Metrics like Cohen's Kappa are used for categorical judgments, with a target IRR of ≥ 0.75 indicating substantial agreement. If reliability drops, rater retraining is initiated.
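For categorical verdicts, Cohen's kappa is a short computation; a minimal sketch for two raters (the κ ≥ 0.75 target above is the draft's threshold, not a statistical universal):

```typescript
// Minimal Cohen's kappa for two raters over categorical labels.
function cohensKappa(r1: number[], r2: number[]): number {
  const n = r1.length;
  const labels = Array.from(new Set([...r1, ...r2]));
  // Observed agreement: fraction of items both raters labeled identically.
  let agree = 0;
  for (let i = 0; i < n; i++) if (r1[i] === r2[i]) agree++;
  const po = agree / n;
  // Expected chance agreement from each rater's marginal label frequencies.
  let pe = 0;
  for (const l of labels) {
    const p1 = r1.filter((x) => x === l).length / n;
    const p2 = r2.filter((x) => x === l).length / n;
    pe += p1 * p2;
  }
  return (po - pe) / (1 - pe);
}
```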
Each task is scored by two primary raters, with a rotating third-review bench that adjudicates disputes inside 48 hours; those transcripts are anonymized and replayed during calibration weeks so rubric drift never contaminates the leaderboard. Because some briefs include sensitive diagnostics, every evaluator signs an NDA and works inside a sealed reviewer enclave. LLM judges only inherit the verdict once that human panel certifies the trace, keeping the automation honest.
4.3 Calibrating LLM-as-a-Judge for Scalability
To scale evaluation, the protocol uses advanced LLMs as automated judges, a method inspired by benchmarks like MT-Bench [17]. Research shows that strong LLM judges like GPT-4 can achieve over 80% agreement with human preferences, the same level of agreement human raters reach with each other [17].
- Calibration Process: A "golden set" of campaign scenarios is first evaluated by human experts to establish a ground truth. The LLM judge is then run on the same set, and its prompts and scoring mechanisms are iteratively refined to minimize the discrepancy with human consensus, aiming for alignment within 5 percentage points [21].
- Hybrid System: For ongoing evaluation, LLMs handle large-scale, objective assessments, while human experts verify outputs, especially for nuanced or high-stakes evaluations. This process includes mitigating known LLM biases like positional or verbosity bias [17].
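The calibration gate in the process above reduces to a single comparison; a sketch, assuming win rates are expressed as fractions and the 5-point tolerance from the draft:

```typescript
// Hedged sketch of the LLM-judge calibration gate: the judge's preference
// rate on the golden set must land within N percentage points of human consensus.
function judgeAligned(
  humanWinRate: number, // fraction, e.g. 0.62
  llmWinRate: number,   // fraction, e.g. 0.66
  tolerancePts = 5,     // draft target: alignment within 5 percentage points
): boolean {
  return Math.abs(humanWinRate - llmWinRate) * 100 <= tolerancePts;
}
```

A judge at 66% against a 62% human consensus passes (4-point gap); one at 70% fails and triggers another prompt-refinement iteration.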
5. Simulation & Data Pipeline — High-Fidelity, Privacy-Safe Sandbox
Why it matters: Without a believable sandbox, an agent that looks smart on paper will face-plant the minute Google Ads latency or privacy rules bite. A credible benchmark requires a realistic, reproducible, and privacy-preserving simulation environment. The environment must accurately model the complexities of the Google Ads ecosystem, including auction dynamics, competitor behavior, and user responses, without exposing sensitive data.
Status: Today's simulator covers Google Ads account work (UI-parity traces + Ads API). The OpenRTB module remains future work until the promised correlation studies prove that auction-layer metrics line up with the account metrics reported here.
5.1 Hybrid Dataset Composition
The data strategy balances realism, scale, and privacy by combining three data types [22].
- Public Historical Logs: Incorporates well-known, de-identified public datasets (e.g., Criteo, Avazu) using frameworks like the Open Bandit Pipeline (OBP) for standardized processing and evaluation [22].
- Privacy-Preserving Synthetic Data: Following the model of the AuctionNet benchmark, deep generative networks are trained on large-scale, private advertising data to create high-fidelity synthetic datasets [7]. This "ad opportunity generation module" produces millions of realistic ad opportunities while breaking the link to real individuals, ensuring privacy by design [7].
- Semi-Synthetic Counterfactuals: The environment supports Off-Policy Evaluation (OPE) by generating counterfactual logs, allowing for the assessment of "what-if" scenarios to see how a new agent policy would have performed on historical data [22].
5.2 Modular Auction Mechanics
The simulator must support multiple auction types to reflect the diversity of online advertising platforms. This is achieved with a modular "ad auction module" inspired by AuctionNet [7].
| Auction Mechanic | Description |
|---|---|
| Generalized Second-Price (GSP) | Classic ad auction where the winner pays slightly above the second-highest bid; serves as the core mechanic [7]. |
| First-Price Auction (FPA) | Winner pays exactly what they bid; simulator must toggle this mode for platforms running FPA. |
| Vickrey–Clarke–Groves (VCG) | Truthful mechanism where bidders are incentivized to bid their true value. |
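The three mechanics above differ only in the clearing price for a single slot; this toy sketch shows the shape of the modular pricing switch (in a one-item auction VCG collapses to the second price, so the distinction matters mainly in multi-slot settings).

```typescript
// Toy single-slot clearing price for the three auction mechanics.
type Mechanic = "GSP" | "FPA" | "VCG";

function clearingPrice(
  bids: number[],
  mechanic: Mechanic,
  increment = 0.01, // minimum GSP increment over the runner-up (assumed)
): number {
  const sorted = [...bids].sort((a, b) => b - a);
  const [first, second = 0] = sorted;
  switch (mechanic) {
    case "FPA":
      return first;              // winner pays exactly their own bid
    case "GSP":
      return second + increment; // slightly above the second-highest bid
    case "VCG":
      return second;             // single-slot VCG: externality = second price
  }
}
```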
OpenRTB 2.x Protobuf bindings and Authorized Buyer endpoints are spec'd, but they stay gated until we finish validating how auction-layer latencies correlate with the account-layer metrics reported here. Those connectors will ship as an opt-in module only after the correlation paper clears legal review and privacy audit.
5.3 Competitor and User Behavior Models
To create a realistic competitive landscape, the simulator includes sophisticated models for both competitors and users [7].
- Competitor Models: The environment implements a variety of auto-bidding agents with different decision-making algorithms, from simple PID controllers to advanced models like Independent Q-Learning and Decision Transformers. This replicates a dynamic multi-agent game with 48 diverse agents competing, as seen in AuctionNet [5].
- User Models: An "ad opportunity generation module" creates synthetic user profiles and predicts click and conversion probabilities based on user, time, and advertiser features [7]. This is enhanced by an "artificial society" framework with explicit models for search queries and clicks.
5.4 Fidelity Validation and Reproducibility
The simulator's credibility hinges on its fidelity to the real world and the reproducibility of its results.
- Fidelity Validation: The statistical properties of the generated data are compared against real historical ad logs using goodness-of-fit tests. AuctionNet, for example, validates its models by comparing the distributions of generated ad opportunities against real-world data [7].
- Reproducibility Stack: To ensure results are verifiable, the benchmark uses a robust reproducibility stack. This includes Docker for containerizing the environment, DVC for data versioning, and open-source libraries like OpenBanditPipeline and AuctionGym to standardize evaluation [22].
The same reproducibility stack will host the RTB simulator once privacy and correlation studies are complete; until then we treat AuctionNet numbers as forward-looking placeholders rather than benchmarked results.
6. Safety, Compliance, and Risk Stress-Tests — 60 Red-Team Scenarios
Before an agent can be considered for the leaderboard, it must pass a mandatory suite of stress tests designed to evaluate its safety, policy compliance, and robustness under adversarial conditions. An agent that is profitable but unsafe is a liability.
6.1 Policy Compliance Suite
Why it matters: Google can suspend entire accounts over one bad creative, so policy automation is a go/no-go requirement. A suite of codified tests ensures agents strictly adhere to Google Ads policies, which are enforced by a combination of Google's AI and human evaluation [9]. The suite covers four major policy areas:
| Policy Area | Test Focus | Examples |
|---|---|---|
| Prohibited Content | Preventing ads that enable dishonest behavior or contain inappropriate content. | Hacking software, academic cheating services, hate speech, graphic content, self-harm [9]. |
| Prohibited Practices | Avoiding abuse of the ad network. | Malware, cloaking, arbitrage, circumventing policy reviews. |
| PII & Data Collection | Ensuring proper handling of personally identifiable information. | Misusing full names, email addresses, financial status, or race—especially in personalized ads [9]. |
| Trademark & Copyright | Respecting intellectual property rights. | Disallowing ads that infringe on trademarks or copyrights [23]. |
6.2 Fairness Audits for Demographic Bias
Why it matters: Regulatory pressure is rising on demographic fairness, and ad distributions that skew can trigger compliance reviews. Inspired by benchmarks like Stanford's HELM (which uses BBQ for social discrimination) and TrustLLM, these audits ensure agents do not perpetuate biases in ad delivery [16].
- Demographic Parity: Tests if ads are shown to different demographic groups at similar rates.
- Disparate Impact: Analyzes if outcomes disproportionately harm protected groups.
- Remediation: Evaluates the agent's ability to implement corrective actions to mitigate identified biases.
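A demographic-parity check over simulated impression logs can be scored as a ratio of exposure rates. The function below is a sketch; the common "four-fifths" (0.8) threshold mentioned in its usage is a rule of thumb from disparate-impact practice, not a stated Ads-Bench policy.

```typescript
// Sketch of a demographic-parity score: ratio of the lower group's
// impression rate to the higher group's. 1.0 = perfect parity.
function parityRatio(
  shownA: number, eligibleA: number,
  shownB: number, eligibleB: number,
): number {
  const rateA = shownA / eligibleA;
  const rateB = shownB / eligibleB;
  return Math.min(rateA, rateB) / Math.max(rateA, rateB);
}
```

For example, groups reached at 80% and 60% of eligible users score 0.75, which would fail a four-fifths-style threshold and trigger the remediation test.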
6.3 Adversarial Test Suite
Why it matters: Competitors and bad actors will poke at your agent—budget drains and prompt injections are real, so we test against them before production. This suite, inspired by frameworks like AgentHarm and challenges like the Gray Swan Arena, will evaluate the agent's robustness against malicious attacks, measured by metrics like Attack Success Rate (ASR) [16].
- Budget Exploitation: Simulating attacks that manipulate bidding to force overspending.
- Policy Evasion: Using adversarial examples of ad creatives to bypass automated policy detectors.
- Malicious Creative Generation: Testing resilience to prompt injection intended to coerce the agent into generating harmful content.
- Confidentiality & Integrity Attacks: Probing for resistance to revealing sensitive information or overriding core instructions [16].
6.4 Financial Kill-Switch Verification
Why it matters: Even the best models fail; automated kill-switches minimize damage when anomaly detectors trip. These tests are designed to verify that the agent operates within defined financial boundaries and can manage risk effectively [4]. The agent must demonstrate the ability to:
- Adhere to Budget Caps: Respect both daily and monthly budget limits.
- Prevent Overspend: Implement its own safeguards, especially for changes made via the API.
- Implement Kill-Switch Criteria: Programmatically pause or remove campaigns via the API in response to triggers like overspend or severe underperformance [4].
```js
if (spend_today > 1.15 * budget_daily || roas_rolling_3h < roas_floor) {
  postAlert({
    severity: "critical",
    context: { spend_today, roas_rolling_3h, last_change_id },
  });
  mutateCampaign({
    resourceName: campaign,
    status: "PAUSED",
  });
  logKillSwitch("auto-paused", now());
}
```

Guardrail sketch for programmatic kill-switches. This snippet is illustrative, not production code; the live gate still needs pacing intelligence for shared budgets, cross-account guardrails, and seasonal overrides, all of which are being replay-tested against winter-holiday and back-to-school spend curves.
7. Baseline Agents & Leaderboard Rules — From Heuristics to RL to LLM+Tools
Why it matters: Leaderboards without transparent baselines devolve into marketing—you need anchor agents and rules that punish sandbagging. A credible benchmark requires transparent baseline agents to anchor progress and a clear set of rules to govern the leaderboard and prevent metric gaming.
7.1 Baseline Agent Implementations
The proposed benchmark will include four classes of baseline agents, representing a spectrum of sophistication.
| Agent Type | Description | Required Disclosures |
|---|---|---|
| Heuristic/Rule-Based | Predefined rules for bidding, budgeting, and keyword management—simple but transparent baseline [24]. | Full rule set, thresholds, and logical conditions. |
| Contextual Bandit | Algorithms like LinUCB/Thompson Sampling handle adaptive decisions for ad placement. | Training data source, hyperparameters (learning rates, exploration parameters), and compute budget. |
| Reinforcement Learning | Sequential decision-making (e.g., DQN) to maximize rewards under budget constraints [25]. | Training data, RL algorithm, network architecture, hyperparameters, reward shaping, compute budget. |
| LLM+Tools | LLM orchestrations integrated with the Google Ads API for planning, creatives, diagnostics [26]. | Base LLM, toolset (API surface), prompting strategies, compute/API costs. |
Status: The RL baseline will go live once anonymized observation spaces and log replays clear consent review; we will publish both artifacts so external teams can reproduce the reference policy gradients without guesswork.
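As a flavor of the first baseline class, a rule-based bid updater might look like the sketch below. The thresholds and dead band are illustrative disclosures of exactly the kind the table's "Required Disclosures" column demands; none of this is a shipped Ads-Bench baseline.

```typescript
// Deliberately simple heuristic baseline: raise bids on keywords beating
// the CPA target, cut bids on laggards, hold inside a dead band.
interface KeywordStats {
  bid: number; // current max CPC bid ($)
  cpa: number; // observed cost per acquisition ($); 0 = no conversions yet
}

function heuristicBidUpdate(
  kw: KeywordStats,
  targetCpa: number,
  step = 0.1, // illustrative 10% bid adjustment
): number {
  if (kw.cpa === 0) return kw.bid;                           // no data: hold
  if (kw.cpa < targetCpa * 0.8) return kw.bid * (1 + step);  // outperforming
  if (kw.cpa > targetCpa * 1.2) return kw.bid * (1 - step);  // overspending
  return kw.bid;                                             // within dead band
}
```

The value of such a baseline is not performance but transparency: every decision is auditable from the rule set alone, anchoring the low end of the leaderboard.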
7.2 Leaderboard Governance and Rules
The leaderboard will be governed by a clear set of rules to ensure fair and meaningful comparisons [4].
- Submission Protocol: Participants must submit a complete package including agent code, model weights, random seeds, and detailed logs. Disclosures of hardware, compute resources, and normalized cost/latency metrics are mandatory, and each team is capped at two active submissions per quarter. Finals lock 72 hours before evaluation to keep the double-blind process intact [4].
- Anti-Overfitting Controls: To ensure generalization, final evaluation uses a private, hidden test set that is periodically refreshed. Agents are also tested on their ability to generalize to new, unseen advertiser accounts and verticals [4].
- Eligibility and Gating: To appear on the leaderboard, an agent must meet minimum prerequisites for uptime, ethical guidelines, and baseline performance. Clear pass/fail thresholds are defined for critical safety and policy compliance metrics [4].
- Versioning: The benchmark will be managed by designated maintainers with a public update cadence and strict adherence to semantic versioning to ensure stability and transparency.
8. Deployment Gate via Offline → Online OPE — DR & SNIPW in Action
Why it matters: Offline evaluation is cheaper and safer than live traffic, but only if the estimators are robust enough to gate what reaches production.
A critical component of the benchmark framework is the use of Off-Policy Evaluation (OPE) to create a data-driven "gate" between offline testing and expensive online A/B tests [27]. This allows for the safe, efficient, and rapid assessment of new agent policies using historical logged data, ensuring that only statistically superior and safe policies are advanced to live traffic.
The methodology will employ a suite of OPE estimators to manage the inherent bias-variance trade-off [27]. Key estimators include:
- Inverse Probability Weighting (IPW) / Self-Normalized IPW (SNIPW): Provides unbiased estimates but can have high variance. SNIPW trades a small amount of bias for increased stability [27].
- Direct Method (DM): Relies on a model of expected rewards.
- Doubly Robust (DR) / Self-Normalized DR (SNDR): Combines IPW with a reward model, providing a consistent estimate if either the propensity model or the reward model is correct. This "double robustness" is highly justified for complex ad auction environments [27].
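The estimators above can be written down in a few lines. The sketch below evaluates a target policy on toy logged data; the interface fields (`propensity`, `targetProb`, `qLogged`, `qTarget`) are our illustrative names, not any library's schema.

```typescript
// Sketch of the four estimator families on logged bandit data.
// Each entry records the behavior policy's propensity for the logged
// action, the target policy's probability for that action, the observed
// reward, a (possibly misspecified) reward-model prediction q(x, a) for
// the logged action, and the reward model's expectation under the
// target policy, qTarget(x) = sum_a pi_e(a|x) * q(x, a).

interface LogEntry {
  reward: number;      // observed reward (e.g. conversion value)
  propensity: number;  // pi_b(a | x): behavior policy probability
  targetProb: number;  // pi_e(a | x): target policy probability
  qLogged: number;     // reward-model prediction for the logged action
  qTarget: number;     // reward-model expectation under the target policy
}

// IPW: unbiased, but variance blows up when propensities are small.
function ipw(logs: LogEntry[]): number {
  return logs.reduce((s, e) => s + (e.targetProb / e.propensity) * e.reward, 0) / logs.length;
}

// SNIPW: normalize by the summed weights, trading a little bias for stability.
function snipw(logs: LogEntry[]): number {
  let num = 0, den = 0;
  for (const e of logs) {
    const w = e.targetProb / e.propensity;
    num += w * e.reward;
    den += w;
  }
  return num / den;
}

// DM: trust the reward model entirely.
function directMethod(logs: LogEntry[]): number {
  return logs.reduce((s, e) => s + e.qTarget, 0) / logs.length;
}

// DR: reward model plus an importance-weighted correction; consistent if
// either the propensity model or the reward model is correct.
function doublyRobust(logs: LogEntry[]): number {
  return logs.reduce((s, e) => {
    const w = e.targetProb / e.propensity;
    return s + e.qTarget + w * (e.reward - e.qLogged);
  }, 0) / logs.length;
}
```

A quick sanity check: when the target policy equals the behavior policy and the reward model is perfect, all four estimators collapse to the empirical mean reward.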
A formal gating process will be established where a new agent policy is only approved for a live A/B test if its offline OPE evaluation demonstrates a statistically significant improvement over the baseline and meets all safety criteria [27]. This will streamline experimentation and reduce the cost and risk of testing suboptimal policies online [27].
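One way to make "statistically significant improvement" operational is a percentile bootstrap over the logged data: resample, recompute a self-normalized value estimate, and approve only if the lower confidence bound clears the baseline. The sketch below is one possible gate under assumed inputs (importance weight and reward per logged decision); the alpha level and resample count are illustrative.

```typescript
// Sketch of the offline gate: resample logged (weight, reward) pairs,
// recompute a self-normalized value estimate on each resample, and
// approve the candidate only if the lower percentile bound beats the
// baseline's value. Thresholds here are illustrative, not prescribed.

type WeightedReward = { w: number; r: number };

function snipwValue(data: WeightedReward[]): number {
  let num = 0, den = 0;
  for (const d of data) { num += d.w * d.r; den += d.w; }
  return num / den;
}

function bootstrapLowerBound(
  data: WeightedReward[],
  alpha = 0.05,
  resamples = 2000,
): number {
  const estimates: number[] = [];
  for (let b = 0; b < resamples; b++) {
    const sample: WeightedReward[] = [];
    for (let i = 0; i < data.length; i++) {
      sample.push(data[Math.floor(Math.random() * data.length)]);
    }
    estimates.push(snipwValue(sample));
  }
  estimates.sort((a, b) => a - b);
  return estimates[Math.floor(alpha * resamples)];
}

function gate(data: WeightedReward[], baselineValue: number): boolean {
  // Approve for a live A/B test only when the one-sided lower bound of
  // the candidate's offline value exceeds the incumbent baseline.
  return bootstrapLowerBound(data) > baselineValue;
}
```

A production gate would add the safety pass/fail criteria on top of this value comparison; the bootstrap only answers the "statistically superior" half of the question.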
9. Implementation Roadmap — 5-Month Milestones & Resourcing
Status — Planned for 2026: Development has not started. The timeline below is a proposed schedule beginning January 2026, contingent on securing resources, completing legal review, and finalizing governance. Milestones may shift based on privacy audit outcomes and stakeholder feedback.
A condensed, 5-month roadmap (January 2026 – May 2026) is proposed to develop and launch Ads-Bench, minimizing risk and aligning with necessary privacy, legal, and technical reviews.
| Phase | Months | Key Milestones |
|---|---|---|
| Phase 1: Foundation & Simulation | January 2026 – February 2026 | |
| Phase 2: Task & Metric Integration | March 2026 – April 2026 | |
| Phase 3: Advanced Features & Beta | May 2026 | |
10. Risk Register & Mitigations
The creation of this benchmark carries legal, financial, and technical risks. A pre-emptive risk management strategy is essential.
| Risk Category | Risk Description | Mitigation Strategy |
|---|---|---|
| Legal & Privacy | Exposure of PII from training data, violating GDPR/CCPA. | Prioritize synthetic data generation (AuctionNet model) to break the link to real individuals and enforce strict de-identification [5]. |
| Financial | Agents cause large, uncontrolled overspend in simulation or real accounts. | Mandate budget-cap adherence and kill-switch tests as pass/fail gates [4]. |
| Technical | Simulation lacks fidelity, so offline wins fail online. | Run rigorous fidelity validation plus back-testing against historical outcomes [7]. |
| Reputational | Benchmark gets “gamed” via overfitting to public tests. | Maintain a large, refreshed hidden set and cross-account generalization checks [4]. |
By addressing these risks proactively, we aim to substantially reduce the likelihood and impact of the top benchmark-creation risks and protect the long-term credibility and value of Ads-Bench.
11. SWE-Bench Parallels & Workflow Orchestration
The benchmark is designed to mirror SWE-Bench so every submission produces a “solution patch” that can be replayed, diffed, and rated once the suite is live [1]. A full run will therefore include:
- Issue Intake: A structured spec (business objective, diagnostics, constraints) is ingested exactly the way SWE-Bench hands an agent a GitHub issue. Human and LLM judges jointly confirm that the agent’s understanding matches the brief before execution [17].
- Plan + Tool Trace: The agent produces a reasoning trace, then commits a set of ordered Google Ads API calls. This patch is versioned so raters can diff it against the pre-task state and track every budget, asset, and audience change [4].
- Double Review: Each patch is graded twice—first by calibrated LLM judges for throughput, then by expert ad managers who focus on indistinguishability, rationale depth, and clarity, mirroring SWE-Bench’s human-in-the-loop evaluation [20].
SWE-Bench taught us that publishing reproducible patches is the fastest way to debug agent behavior; Ads-Bench applies the same principle to campaign edits, policy appeals, and rollbacks.
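The "replay, diff, rate" loop above can be sketched in miniature: model a campaign's mutable fields as a flat record, apply the ordered patch, and diff post-task state against the pre-task snapshot. Field names here are illustrative stand-ins, not the real Google Ads API schema.

```typescript
// Sketch: apply an ordered list of mutations ("the solution patch")
// to a campaign snapshot, then diff the result against the pre-task
// state so raters can see exactly which budgets, statuses, or bids
// changed. Field names are illustrative, not the real API schema.

type FieldValue = string | number;
type CampaignState = Record<string, FieldValue>;
type Mutation = { field: string; value: FieldValue };

function applyPatch(state: CampaignState, patch: Mutation[]): CampaignState {
  const next = { ...state };
  for (const m of patch) next[m.field] = m.value;
  return next;
}

function diff(
  before: CampaignState,
  after: CampaignState,
): Record<string, [FieldValue | undefined, FieldValue | undefined]> {
  const changes: Record<string, [FieldValue | undefined, FieldValue | undefined]> = {};
  for (const key of Object.keys({ ...before, ...after })) {
    if (before[key] !== after[key]) changes[key] = [before[key], after[key]];
  }
  return changes;
}

// Pre-task snapshot and an ordered two-step patch.
const preTask: CampaignState = { status: "ENABLED", dailyBudgetMicros: 50_000_000, targetRoas: 3.5 };
const patch: Mutation[] = [
  { field: "dailyBudgetMicros", value: 40_000_000 },
  { field: "targetRoas", value: 4.0 },
];
const postTask = applyPatch(preTask, patch);
```

In the real harness each `Mutation` would correspond to a versioned Google Ads API call, but the rating workflow is the same: the diff, not the raw API log, is what judges score.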
12. API & Interface Requirements
Ads-Bench will also evaluate whether an agent respects the same API ergonomics, diagnostics, and rate limits as a senior practitioner. The interface is split into observation, action, and constraint layers.
12.1 Observation Surfaces — searchStream everything
- Performance snapshots: `GoogleAdsService.search` queries over `Campaign`, `AdGroup`, `AdGroupAd`, and `KeywordView` resources expose real-time metrics like `metrics.roas`, `metrics.cpa`, and `metrics.conversion_value` [29].
- Creative quality: `Asset` and `AssetGroupAsset` reports include Google's "Best/Good/Low" asset ratings so the agent can prioritize refreshes in PMax asset groups [3].
- Policy + diagnostics: Access to `PolicyTopicEntry` and `Recommendation` resources lets the agent triage disapprovals, appeals, and first-party data gaps before taking action [9][12].
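Queries against these observation surfaces are expressed in the Google Ads Query Language (GAQL). The sketch below assembles an illustrative performance-snapshot query; the selected fields are examples drawn from the public GAQL field list, and the authoritative schema lives in the Google Ads API reference, so treat this as a shape, not a contract.

```typescript
// Sketch: assemble a GAQL query for a campaign performance snapshot.
// The actual call would go through GoogleAdsService.search or
// searchStream; here we only build the query text.

function buildSnapshotQuery(days: 7 | 14 | 30): string {
  return [
    "SELECT",
    "  campaign.id,",
    "  campaign.status,",
    "  metrics.cost_micros,",
    "  metrics.conversions_value",
    "FROM campaign",
    `WHERE segments.date DURING LAST_${days}_DAYS`,
    "ORDER BY metrics.cost_micros DESC",
  ].join("\n");
}

const query = buildSnapshotQuery(30);
```

Constraining `days` to the literal union `7 | 14 | 30` mirrors the fact that GAQL `DURING` accepts a fixed set of named ranges rather than arbitrary integers.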
12.2 Action Surfaces — deterministic mutate calls
- Campaign & budget control: `CampaignService.mutateCampaigns` and `CampaignBudgetService.mutateCampaignBudgets` adjust pacing, bidding, and shared budgets in one transaction [29].
- Asset orchestration: `AdGroupAdService`, `AssetService`, and `AssetGroupAssetService` update creatives, inject new videos, or relink asset groups without breaking existing structures.
- Audience + experiment ops: `AdGroupCriterionService.mutateAdGroupCriteria`, `UserListService`, and `CampaignExperimentService` enable keyword sculpting, audience refreshes, and holdout tests within the same benchmark scenario.
- Offline conversion hygiene: `ConversionUploadService` and policy-aware retries ensure offline signals remain synchronized with Google Ads measurement [12].
12.3 Operational Constraints — rate limits, batches, and failsafes
- Rate limits & quotas: The simulator enforces realistic per-customer QPS ceilings, so agents must batch writes via `GoogleAdsService.mutate` instead of spamming single-field updates [30].
- Partial failure handling: Batch operations intentionally return partial failures to verify the agent can retry idempotently, back off, and emit telemetry for human review.
- Long-running jobs: Report downloads and experiment boots are modeled as asynchronous operations; the agent must poll and reconcile results without blocking critical guardrails.
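The retry behavior the constraint layer checks for can be sketched as follows. The `sendBatch` function is a stand-in for a real API client (so it can be simulated end to end); the backoff constants are illustrative, and idempotency is assumed to come from stable operation ids.

```typescript
// Sketch: idempotent retry with exponential backoff and jitter for a
// batched mutate whose response may contain partial failures. Only the
// failed operations are resubmitted on each attempt.

type Op = { id: string };
type BatchResult = { succeeded: string[]; failed: Op[] };

async function mutateWithRetry(
  ops: Op[],
  sendBatch: (ops: Op[]) => Promise<BatchResult>, // stand-in for a real client
  maxAttempts = 5,
): Promise<string[]> {
  const done: string[] = [];
  let pending = ops;
  for (let attempt = 1; pending.length > 0 && attempt <= maxAttempts; attempt++) {
    const result = await sendBatch(pending);
    done.push(...result.succeeded);
    pending = result.failed; // retry only what failed (idempotent by id)
    if (pending.length > 0) {
      // Exponential backoff with jitter keeps retries under QPS ceilings.
      const delayMs = Math.min(2 ** attempt * 100, 5000) * (0.5 + Math.random() / 2);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return done;
}
```

The telemetry hook the benchmark looks for would sit where `pending` is reassigned: every partial failure should be logged before the retry, so a human can audit what the agent re-attempted and why.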
Future Work — Ads-Bench RTB
Why it matters: Readers should know exactly how the benchmark would graduate from the Google Ads account layer described in this proposal to the RTB gauntlet we envision building after the initial release.
OpenRTB + Protobuf support (vNext, not live). We still owe a stateful connector that ingests OpenRTB 2.6 bid requests/responses via Protobuf so the simulator can replay bidstreams at scale. That code will only ship after the correlation study proves auction KPIs track with the account-level metrics documented above.
Real-time Bidding + Marketplace APIs (planned). The RTB module will pair Authorized Buyers and Marketplace endpoints so agents can ride the same pipes a large DSP uses. Until the privacy review signs off, those APIs stay dark and the current release remains Google Ads–only.
Sub-60 ms callout quotas and dual kill switches (planned). We are instrumenting a latency harness that enforces p95/p99 callout budgets under 60 ms, adds bid-level kill switches, and logs dual-disclosure events. None of that instrumentation is live; it belongs to Ads-Bench vNext.
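To make the planned latency harness concrete, the p95/p99 budget check reduces to a percentile computation over recorded callout latencies. The nearest-rank method and the 60 ms default below are illustrative; the live harness would stream samples through a histogram rather than sorting arrays.

```typescript
// Sketch: nearest-rank percentiles over callout latencies, plus a
// pass/fail check against the sub-60 ms budget described above.

function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

function withinCalloutBudget(samplesMs: number[], budgetMs = 60): boolean {
  return percentile(samplesMs, 95) <= budgetMs && percentile(samplesMs, 99) <= budgetMs;
}
```

Note that enforcing both p95 and p99 means a handful of slow callouts in a large sample is enough to fail the gate, which is the intended behavior for bid-level kill switches.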
gps-phoebe value-injection pipelines (experimental). A gps-phoebe layer is being prototyped to inject brand, compliance, and budget priors into RTB decisions so auction edits stay aligned with human taste even under adversarial load. It will remain experimental until reviewer data shows a measurable drop in policy escalations.
Every roadmap item above will replace the interim vendor-reported stats with peer-reviewed measurements once the studies complete. We will keep threading these milestones into the public roadmap so readers see a single narrative arc rather than two disjoint stories.
13. Appendices
(Appendices to include detailed tables, a full glossary of terms, Google Ads API reference stubs for the observation and action interfaces, and complete mathematical definitions for all metrics used in the composite scoring rubric.)
References
1. ^ philschmid/ai-agent-benchmark-compendium. https://github.com/philschmid/ai-agent-benchmark-compendium
2. ^ Define your evaluation metrics | Generative AI on Vertex AI. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval
3. ^ Google Ads AI Max vs Manual Optimization. https://groas.ai/post/google-ads-ai-max-vs-manual-optimization-performance-comparison-2025
4. ^ AI benchmarking framework measures real-world .... https://aisera.com/blog/enterprise-ai-benchmark/
5. ^ AuctionNet: A Novel Benchmark for Decision-Making in Large-Scale .... https://proceedings.neurips.cc/paper_files/paper/2024/hash/ab9b7c23edfea0011507f7e1eae82cd2-Abstract-Datasets_and_Benchmarks_Track.html
6. ^ Google's AI advisors: agentic tools to drive impact and .... https://blog.google/products/ads-commerce/ads-advisor-and-analytics-advisor/
7. ^ AuctionNet: A Novel Benchmark for Decision-Making in Large-Scale .... https://arxiv.org/html/2412.10798v1
8. ^ Drive peak campaign performance with new agentic capabilities. https://blog.google/products/ads-commerce/ai-agents-marketing-advisor/
9. ^ Google Ads policies - Advertising Policies Help. https://support.google.com/adspolicy/answer/6008942?hl=en
10. ^ Machine Learning-Powered Agents for Optimized Product .... https://www.mdpi.com/2673-4591/100/1/36
11. ^ The hidden risks of Google's automated advertising | Windsorborn. https://windsorborn.com/insights/thinking/the-hidden-risks-of-googles-automated-advertising
12. ^ User-provided data matching | Ads Data Hub. https://developers.google.com/ads-data-hub/guides/user-provided-data-matching
13. ^ Google Ads AI Agents - How To Run Them in 2025. https://ppc.io/blog/google-ads-ai-agents
14. ^ Google Ads Benchmarks for YOUR Industry [Updated!]. https://www.wordstream.com/blog/ws/2016/02/29/google-adwords-industry-benchmarks
15. ^ Evaluate your AI agents with Vertex Gen .... https://cloud.google.com/blog/products/ai-machine-learning/introducing-agent-evaluation-in-vertex-ai-gen-ai-evaluation-service
16. ^ 2025 AI Safety Index - Future of Life Institute. https://futureoflife.org/ai-safety-index-summer-2025/
17. ^ Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena - arXiv. https://arxiv.org/abs/2306.05685
18. ^ Artificial intelligence vs. human expert: Licensed mental health .... https://pmc.ncbi.nlm.nih.gov/articles/PMC12169703/
19. ^ Understanding Human Evaluation Metrics in AI - Galileo AI. https://galileo.ai/blog/human-evaluation-metrics-ai
20. ^ Rubric evaluation: A comprehensive framework for generative AI .... https://wandb.ai/wandb_fc/encord-evals/reports/Rubric-evaluation-A-comprehensive-framework-for-generative-AI-assessment--VmlldzoxMzY5MDY4MA
21. ^ LLMs-as-Judges: A Comprehensive Survey on LLM-based .... https://arxiv.org/html/2412.05579v2
22. ^ Open Bandit Pipeline; a python library for bandit algorithms and off .... https://zr-obp.readthedocs.io/en/latest/
23. ^ Trademarks - Advertising Policies Help. https://support.google.com/adspolicy/answer/6118?hl=en
24. ^ Heuristic optimization algorithms for advertising campaigns. https://docta.ucm.es/bitstreams/3fa537ed-aa9f-44ca-85cc-bebaa5d9927b/download
25. ^ Deep Reinforcement Learning for Online Advertising Impression in .... https://arxiv.org/abs/1909.03602
26. ^ Google Launches Gemini-Powered AI Agents ... - ADWEEK. https://www.adweek.com/media/google-ai-agent-ads-analytics-advisor/
27. ^ Off-Policy Evaluation and Counterfactual Methods in .... https://arxiv.org/abs/2501.05278
28. ^ About automated bidding | Google Ads Help. https://support.google.com/google-ads/answer/2979071
29. ^ Google Ads API services overview. https://developers.google.com/google-ads/api/docs/get-started/services
30. ^ Rate limits and quotas | Google Ads API best practices. https://developers.google.com/google-ads/api/docs/best-practices/rate-limits
31. ^ GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. https://arxiv.org/abs/2510.04374
32. ^ Measuring the performance of our models on real-world tasks. OpenAI. https://openai.com/index/gdpval/
33. ^ OpenAI says top AI models are reaching expert territory on real-world knowledge work. The Decoder. https://the-decoder.com/openai-says-top-ai-models-are-reaching-expert-territory-on-real-world-knowledge-work/
34. ^ The AI Productivity Index (APEX): Measuring Executive-Level Performance Across Professions. https://arxiv.org/abs/2509.25721
35. ^ Introducing APEX: AI Productivity Index (Mercor leaderboard). https://www.mercor.com/blog/introducing-apex-ai-productivity-index/
36. ^ AI Is Learning to Do the Jobs of Doctors, Lawyers, and Consultants. TIME. https://time.com/7322386/ai-mercor-professional-tasks-data-annotation/