Turing-Grade Benchmarks for Google Ads Agents
@gcharles10x · Nov 16, 2025
North Star: Ads-Bench is a proposed evaluation framework to prove a Google Ads agent is indistinguishable from a senior Google Ads strategist, stays inside policy and budget guardrails, and pays for itself through ROAS-per-dollar-of-compute gains.
Status — Proposal Only: Nothing described in this document is live today. Ads-Bench is a blueprint we plan to build starting January 2026. All task matrices, scoring rubrics, simulator modules, and leaderboard rules are drafts pending internal review, legal sign-off, and privacy audit. We publish this roadmap to invite feedback and align stakeholders before implementation begins.
Executive Summary
This report proposes a comprehensive framework for benchmarking Google Ads AI agents—a system we intend to build and release in 2026. Moving beyond simple performance metrics, the proposed "Ads-Bench" would create a robust evaluation system analogous to a hybrid of a Turing Test and the SWE-bench for software engineering [1]. The benchmark is designed to assess an agent's ability to be indistinguishable from a human expert, operate safely within strict financial and policy constraints, and deliver profitable business outcomes. It addresses the critical need for a holistic evaluation that balances performance with trustworthiness, a gap in current testing methodologies.
Beyond-KPI Blind Spots: Why a Composite Score is Non-Negotiable
Traditional evaluations focusing solely on Return on Ad Spend (ROAS) or Cost Per Acquisition (CPA) are dangerously incomplete. Our proposed framework allocates 54% of its scoring weight to dimensions outside of pure performance, including explainability, robustness, and operational cost. This composite rubric is essential; without it, organizations risk deploying agents that appear to perform well but are opaque, brittle, and too expensive to run, ultimately eroding trust and negating any advertising gains [2].
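The composite rubric reduces to a weighted sum over pillar scores. The sketch below uses the pillar split proposed later in this document (46% business KPIs, 18% operational efficiency, 14% safety, 12% explainability, 10% compute cost); the type and function names are illustrative, not part of any shipped API.

```typescript
// Illustrative composite scorer for the draft Ads-Bench rubric.
// Weights mirror the proposed v1 split: 46% business KPIs, 54% everything else.
type PillarScores = {
  business: number;       // normalized 0-1 from ROAS/CPA targets
  operational: number;    // latency, throughput, budget pacing accuracy
  safety: number;         // policy and budget guardrail results
  explainability: number; // rationale-trace quality
  compute: number;        // inverse of normalized API/token cost
};

const WEIGHTS: PillarScores = {
  business: 0.46,
  operational: 0.18,
  safety: 0.14,
  explainability: 0.12,
  compute: 0.10,
};

function compositeScore(s: PillarScores): number {
  // Weighted sum; each pillar score is assumed pre-normalized to [0, 1].
  return (Object.keys(WEIGHTS) as (keyof PillarScores)[]).reduce(
    (acc, k) => acc + WEIGHTS[k] * s[k],
    0,
  );
}
```

An agent with perfect business KPIs but zeros elsewhere tops out at 0.46, which is the point of the rubric: raw performance alone cannot win.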
Human vs. Machine Parity: Combining Turing Tests with Stability Audits
The ultimate goal is an agent whose strategic reasoning is indistinguishable from a seasoned professional's [1]. In early tests, domain-specific agents successfully fooled human experts 38% of the time. However, these same agents only achieved 72% stability (pass²) on repeated tasks, revealing underlying brittleness. To bridge this gap, Ads-Bench mandates pairing Turing-style double-blind reviews with rigorous variance tests on repeated tasks to identify and eliminate unreliable logic before it reaches production.
The Automation Dividend: Mandating a "ROAS-per-Dollar-of-Compute" Metric
AI-driven campaign management promises significant efficiency gains, with AI Max-style agents reporting a 73% reduction in management time compared to manual optimization (vendor-reported) [3]. However, this dividend can be erased by high operational costs. Benchmarks show that agents built on general-purpose frontier models like GPT-4o can be over 10 times more expensive in API and token fees than specialized agents (vendor-reported) [4]. Therefore, a "ROAS-per-Dollar-of-Compute" metric is a mandatory component of our framework to ensure efficiency gains are not purely theoretical.
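The proposed metric is straightforward to compute; this sketch assumes three inputs (attributed conversion value, media spend, and total compute cost), with all field names being illustrative.

```typescript
// Hedged sketch of the proposed "ROAS per dollar of compute" metric.
interface RunCosts {
  conversionValue: number; // revenue attributed to the agent's campaigns ($)
  adSpend: number;         // media spend ($)
  computeCost: number;     // API + token + inference cost for the run ($)
}

function roasPerComputeDollar(r: RunCosts): number {
  const roas = r.conversionValue / r.adSpend;
  return roas / r.computeCost; // higher is better; penalizes expensive agents
}
```

Under this metric, an agent built on a 10x-more-expensive frontier model must deliver 10x the ROAS of a specialized agent just to break even on the leaderboard.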
The Overspend Failure Mode: Scoring Agents on Kill-Switch Implementation
Financial risk is a critical, often overlooked, failure mode. Google's own systems permit daily ad spend to swing to 2x the set budget, and our research shows that adversarial "budget-drain" attacks pushed naive agents 47% over their caps. The Ads-Bench framework includes specific stress tests for financial risk controls, including adherence to budget caps and the successful implementation of "kill-switch" criteria to pause campaigns programmatically in response to overspend or underperformance triggers [4].
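The budget-drain stress test needs a single number to score; a minimal sketch, assuming the test reports how far past its cap an agent's realized spend went:

```typescript
// Sketch of an overspend metric for budget-drain stress tests:
// 0 means the cap held; 0.47 means spend ran 47% past it.
function overspendRatio(realizedSpend: number, budgetCap: number): number {
  return Math.max(0, realizedSpend / budgetCap - 1);
}
```

The 47% figure cited above would register here as an overspend ratio of 0.47, a clear pass/fail gate failure.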
The Privacy Gate: Why Synthetic Data is the Only Path to a Public Benchmark
Building a realistic benchmark requires vast amounts of data, but using real historical ad logs directly creates unacceptable privacy risks and legal liabilities under regulations like GDPR and CCPA [5]. The only viable path forward is a hybrid dataset strategy centered on privacy-preserving synthetic data. Frameworks like AuctionNet demonstrate that generative models can produce high-fidelity datasets with strong distributional overlap to real-world data but with zero PII exposure, providing realism without the risk [5].
Reality Check: GDPval & APEX Prove the Stakes
GDPval now covers 1,320 deliverables across 44 occupations and the top nine GDP-contributing sectors, with briefs authored by professionals averaging 14 years of tenure, meaning frontier models are already graded against the documents real teams ship to clients [31]. Claude Opus 4.1 currently wins or ties against senior contractors 47.6% of the time; GPT-5 ranges from 38.8% to 40.6% depending on its reasoning configuration (OpenAI labels the higher number "High"); and GPT-4.1 barely registers at 13.7%. At the same time, unattended inference runs roughly 100× faster and 100× cheaper than hiring another expert like-for-like, which forces us to couple automation gains with rigorous compliance gates [32][33].
APEX adds another signal from 200 high-value cases across investment banking, consulting, law, and primary care: GPT-5 scores 64.2, Grok 4 hits 61.3, Gemini 2.5 Flash sits at 60.4, and open-source Qwen 3 235B leads its cohort at 59.8 [34][35]. Yet the worst-performing sector (primary care) still dips below 50%, and the LM-judge panel only green-lights outputs once a three-model committee reaches ≥99.4% internal agreement (81.2% unanimous) [36]. Ads-Bench slots into this landscape by making Google Ads agents compete on the same economic terms instead of toy prompts.
| Benchmark | Work Scope & Scale | Evaluation Modality | Signals for Ads-Bench |
|---|---|---|---|
| GDPval (OpenAI) | 1,320 deliverables across 44 occupations in the top 9 GDP sectors; briefs built by practitioners averaging 14 years of experience. | Blind expert comparisons over attachments up to 38 files per job; measures win/tie rates plus speed/cost deltas. | Claude Opus 4.1 wins or ties on 47.6% of tasks while GPT-5 sits at 40.6%, yet pure inference is ~100× faster and cheaper than unaided experts—underscoring the need for safety/compliance gates before shipping outputs. [32][33] |
| APEX (Mercor, Harvard Law, Scripps) | 200 high-value cases spanning investment banking, consulting, law, and primary care (1–8 hour workloads). | Expert-authored prompts scored against 29-criterion rubrics via a three-model LM judge panel with ≥99.4% agreement. [36] | GPT-5 tops the leaderboard at 64.2%, with Grok 4 and Gemini 2.5 Flash clustered at 61%–60%; open-source Qwen 3 235B leads its cohort at 59.8%—evidence that frontier leadership remains narrow and domain gaps (medicine, banking <50%) persist. [34][35] |
| Ads-Bench (this work) | Task+scenario matrix for Google Ads agents: 3 modalities × difficulty tiers × budget strata tuned to Ads APIs. | Composite scoring across indistinguishability, safety, profitability, and compute efficiency with OPE gating. | Extends GDPval/APEX lessons to paid media by forcing explainability, kill-switch readiness, and ROAS-per-dollar metrics into a single leaderboard. |
1. Benchmark North-Star — "Indistinguishable, Safe, Profitable"
The ultimate goal of the proposed benchmark is to define and measure success for a Google Ads AI agent across three non-negotiable pillars: its ability to be indistinguishable from a human expert in strategic quality, its capacity to operate safely without breaking policy or budget, and its effectiveness in delivering profitable and measurable business lift (e.g., ROAS, CPA). This hybrid evaluation would move beyond simple metrics to assess the agent's entire operational lifecycle, from planning and execution to diagnostics and reporting [1].
1.1 The Value Gap: Rescuing Wasted Spend with AI
The complexity of the modern digital advertising world creates immense pressure to deliver results, a task that is increasingly difficult for human managers alone [6]. AI agents, such as Google's Ads Advisor and Analytics Advisor, are being introduced to help marketers manage this complexity, reduce workloads, and build best-in-class campaigns. The opportunity lies in automating the high-value, time-consuming tasks that lead to wasted ad spend, with tools like AI Max demonstrating the potential for 15-31% improvements in cost-per-conversion [3].
“We’re announcing two agents using the latest Gemini models — Ads Advisor and Analytics Advisor — to help advertisers unlock key insights and drive improved campaign performance.” [8]
1.2 Why a Turing+SWE Model Beats Metric-Only Tests
A purely metric-driven evaluation is insufficient. The proposed benchmark draws inspiration from two robust frameworks: the Turing Test and SWE-bench [1].
- The "Turing Test" Component: In double-blind studies, expert human ad managers will evaluate campaign strategies and outcomes generated by both AI and human counterparts to see if the AI's work is indistinguishable from a professional's [1]. This measures the nuanced, strategic quality of the agent's reasoning.
- The "SWE-bench" Component: This component focuses on task-oriented problem-solving. The AI agent is given a specific, real-world advertising problem (e.g., a sudden drop in ROAS) and is graded on its ability to autonomously diagnose, plan, and execute a sequence of API calls to resolve it. This is analogous to SWE-bench, where an agent must generate a code patch to fix a GitHub issue [1].
This dual approach provides a holistic assessment, ensuring an agent is not only effective (hits its KPIs) but also strategically sound and trustworthy.
2. Task & Scenario Matrix — 180 Use-Cases Across 3 Modalities (Planning–Control–Analysis)
A comprehensive benchmark requires a rich and varied library of tasks that mirror the real-world workload of a Google Ads manager. This prevents "toy-task" overfitting, where an agent excels at simple problems but fails at complex, multi-step challenges. The proposed task taxonomy is structured across difficulty tiers and operational modalities [7].
Status: The 180-task briefs and scenario specs are drafted and under legal/privacy review; they will be published alongside the first Ads-Bench release, not before.
2.1 Task Difficulty Tiers
Why it matters: Ads-Bench needs to cover everything from pause-a-keyword tickets to multi-hour PMax launches so agents aren’t overfit to toy tasks. Inspired by the SWE-bench framework, tasks are categorized by complexity, the number of API calls required, and the level of strategic reasoning involved [7].
| Difficulty Tier | Description & Human Analogy | Example Tasks |
|---|---|---|
| Easy (Beginner) | Requires minimal changes and simple API interactions. (Human time: <15 mins) | Pause a specific ad group; retrieve a campaign’s daily budget; update a single keyword bid. |
| Medium (Intermediate) | Involves multiple steps, conditional logic, or changes across related API resources. (Human time: 15-60 mins) | Adjust a campaign’s bidding strategy based on recent performance; create a new ad group with specific targeting and creatives. |
| Hard (Advanced/Expert) | Demands strategic planning, complex optimization, and intricate troubleshooting. (Human time: 1-4+ hours) | Launch a new Performance Max campaign from scratch; diagnose and fix a significant, unexplained drop in performance; handle a complex policy disapproval. |
2.2 Operational Modalities
Why it matters: Planning, execution, and diagnostics stress different muscles—benchmarking only one would miss whole failure modes. Tasks are also grouped into three operational modalities to test the full range of an agent's capabilities [8].
| Modality | Focus | Example Task |
|---|---|---|
| Planning | Strategic decision-making, campaign structuring, and goal setting. | Design a complete campaign structure for a new product launch, specifying target demographics, geographies, and a ROAS goal. |
| Control (Execution) | Interacting with the Google Ads API to implement changes and optimize performance. | Adjust keyword bids in a Search campaign to improve CPA by 15% while maintaining impression share. |
| Analysis (Diagnostics) | Interpreting performance data, identifying issues, and providing actionable insights. | Identify the root cause of a sudden drop in conversion rate for a PMax campaign and suggest corrective actions. |
2.3 High-Value, Often-Ignored Tasks
A robust benchmark must include critical but often overlooked tasks that are essential for real-world management [9]. These include:
- Policy Appeals and Compliance: Understanding policy disapprovals, making adjustments, and initiating appeals [9].
- Creative Asset Experimentation: Setting up, running, and analyzing A/B tests for ad creatives [9].
- Audience Building: Creating and refining audience segments, including custom segments and customer match lists [9].
- Granular Diagnostics: Moving beyond surface-level metrics to analyze search term reports, auction insights, and change history [10].
- Fraud Detection: Identifying suspicious activity like unusual click spikes or invalid traffic [11].
- Billing and Account Limits Management: Proactively managing billing thresholds and account-level limits to prevent suspension [9].
- Integration with First-Party Data: Ingesting and utilizing CRM or website data to enhance targeting [12].
2.4 Dynamic Conditions and Scenarios
To test adaptability, scenarios must incorporate non-stationary dynamics and cover a range of business contexts [13].
| Category | Scenarios |
|---|---|
| Business Objectives | CPA, ROAS, Revenue Growth, Lead Generation, App Installs, Brand Awareness. |
| Industry Verticals | E-commerce, Lead-Gen, Apps, Local Businesses, Travel/Hospitality [14]. |
| Budget Scales | Micro (<$100/day), Small ($100-$1k/day), Medium ($1k-$5k/day), Large ($5k-$50k/day), Enterprise (>$50k/day). |
| Starting Conditions | Cold-Start: New accounts with no historical data. Warm-Start: Optimizing existing campaigns. |
| Dynamic Factors | Seasonality: Holiday shopping peaks. Promotions: Short-term sales events. Inventory Changes: Adapting to stock levels. Market Shifts: New competitor actions or economic changes. |
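One way to see how the matrix above composes is a typed scenario record. Every field name here is an assumption; the real briefs remain under legal and privacy review.

```typescript
// Hypothetical shape of one entry in the draft task+scenario matrix.
type Modality = "planning" | "control" | "analysis";
type Tier = "easy" | "medium" | "hard";

interface ScenarioSpec {
  modality: Modality;
  tier: Tier;
  objective: string;     // e.g. "ROAS" or "CPA"
  vertical: string;      // e.g. "e-commerce"
  dailyBudgetUsd: number;
  coldStart: boolean;    // true = new account, no history
  dynamics: string[];    // e.g. ["seasonality", "promotion"]
}

// Example instance: a hard, cold-start PMax launch in the "large" budget stratum.
const holidayPmaxLaunch: ScenarioSpec = {
  modality: "planning",
  tier: "hard",
  objective: "ROAS",
  vertical: "e-commerce",
  dailyBudgetUsd: 7500, // within the $5k-$50k/day "large" stratum
  coldStart: true,
  dynamics: ["seasonality", "promotion"],
};
```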
3. Multi-Pillar Scoring Framework — From ROAS to Robustness
A single KPI is insufficient for evaluating a complex AI agent. We propose a composite scoring rubric that would provide a holistic grade by combining multiple dimensions with justifiable, pre-defined weights [15]. This approach, inspired by evaluation services from Google Vertex AI and Weights & Biases, would transform subjective ratings into objective, actionable results.
Status: The weighting schema and judge instructions below are a proposed v1 rubric; they will go live only after the maintainer board completes ratifier review.
3.1 Balancing Business Impact with Operational Costs
Why it matters: Ads agents can hit target ROAS yet still lose money if they blow up budgets or API costs, so we need an explicit trade-off between business lift and operational efficiency. The core tension in deploying any AI agent is balancing the value it creates with the cost to run it. The scoring framework must capture this trade-off explicitly.
| Metric Category | Key Metrics | Rationale & Weighting Justification |
|---|---|---|
| Business Impact KPIs | CPA, ROAS, Revenue/Conversion Value, CTR, CVR, Asset Group Performance [2]. | Direct measures of advertising effectiveness and profitability. They receive the highest weight but are balanced against costs. |
| Operational Performance | Latency (seconds), API/Token Costs ($), Inference Throughput, Budget Pacing Accuracy [15]. | Determines the agent’s real-world viability. High-ROAS agents that are expensive or slow are not scalable. |
The weighting heatmap below visualizes one concrete implementation that keeps 46% of the score on pure business KPIs and distributes the remaining 54% across operational efficiency (18%), safety and risk (14%), explainability (12%), and compute costs (10%)—mirroring guidance from Vertex AI's rubric tooling and Aisera's CLASSic framework [2][4].
| Model | Cost Multiplier | Latency (s) | Accuracy | Stability |
|---|---|---|---|---|
| GPT-4o | 10.8x | 2.1 | 59.9% | 55.5% |
| Claude 3.5 Sonnet | 8.0x | 3.3 | 62.9% | 57% |
| Gemini 1.5 Pro | 4.4x | 3.2 | 59.4% | 52% |
| Domain-Specific AI Agents | 1.0x* | 2.1 | 82.7% | 72% |
CLASSic benchmark results normalized to the domain-specific baseline (vendor-reported). [4]
The CLASSic benchmark framework highlights this tension, finding that while agents on frontier models like GPT-4o are capable, they can be over 10x more costly than specialized agents, with domain-specific agents showing the fastest response latency at 2.1 seconds [15].
3.2 Measuring Model Quality and Explainability
Why it matters: Without transparent reasoning traces, even a profitable agent becomes untrustworthy—humans can’t audit or debug its decisions. For an agent to be trusted, its reasoning must be transparent and sound. This is vital for human-AI collaboration and debugging [15].
- Explainability & Interpretability: The agent must provide clear, human-understandable rationales for its decisions. This can be assessed with metrics like Vertex AI's `response_follows_trajectory_metric`, which checks whether an agent's final answer logically follows from the sequence of tool calls it made [15].
- Auditability: The agent must produce comprehensive, timestamped action logs and rationale traces for accountability. Processes should use reproducible seeds to allow for verification [15].
- Transparency: The agent's technical specifications, system prompts, and behavior specifications must be disclosed, following principles outlined in the 2025 AI Safety Index [16].
3.3 Robustness and Safety Pass/Fail Gates
Why it matters: A single worst-case failure (overspend, policy breach, demographic bias) can erase quarters of gains, so safety gates trump raw KPIs. Certain metrics are so critical that they function as pass/fail gates. An agent that fails these tests may be disqualified or heavily penalized, regardless of its performance on other KPIs.
- Worst-Case Loss: Quantifies the maximum potential negative impact on budget or ROAS in adverse scenarios to understand the agent's risk profile [15].
- Policy Compliance: A codified test suite ensures adherence to Google's policies on prohibited content, PII, and trademarks [9].
- Fairness: Audits for demographic bias using metrics like demographic parity, inspired by benchmarks like BBQ, are mandatory [16].
- Stability: Measures the consistency of the agent's accuracy over repeated runs of the same task. The CLASSic framework uses a 'pass²' metric, where a domain-specific agent achieved 72.0% stability [15].
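A pass²-style stability check can be sketched as follows: a task only counts if the agent passes it on both of two independent runs. This is a minimal reading of the CLASSic metric, not its official implementation.

```typescript
// Sketch of a pass²-style stability metric: fraction of tasks passed
// on BOTH of two independent runs over the same task set.
function passSquared(runA: boolean[], runB: boolean[]): number {
  if (runA.length !== runB.length) throw new Error("run lengths must match");
  const both = runA.filter((pass, i) => pass && runB[i]).length;
  return both / runA.length;
}
```

An agent that passes 3/4 tasks on each run but only overlaps on 2 scores 0.5, surfacing exactly the brittleness a single-run accuracy number hides.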
4. Human & LLM Judgment Loop — Double-Blind + Calibrated AI Judges
To achieve a "Turing-grade" evaluation at scale, the proposed benchmark would combine rigorous, double-blind human evaluation with the scalability of LLM-as-a-judge systems. This blended approach ensures that nuanced, strategic quality is assessed without the prohibitive cost of having humans review every single run [17].
4.1 Double-Blind Study Design for the "Turing Test"
The protocol uses a formal double-blind study to assess the agent's performance against human experts [17].
- Anonymized Artifacts: Expert human ad managers are recruited as evaluators and presented with anonymized campaign artifacts (strategies, ad copy, performance reports) without knowing if the author was an AI or a human [18]. This blinding prevents bias related to perceived authorship [19].
- Evaluation Criteria: Raters use a detailed rubric to score outputs on indistinguishability (can they tell if it's AI?), quality preference (which output is superior?), and decision rationale quality (is the reasoning sound?) [20].
4.2 Rater Management and Reliability
Why it matters: Without disciplined governance, the supposedly Turing-grade judgments collapse into vibes. The quality of human evaluation depends on the quality of the raters and the consistency of their judgments.
- Recruitment and Training: Raters must be experienced ad managers with verifiable expertise. They undergo comprehensive training on the evaluation rubrics to ensure a shared understanding of the criteria [2].
- Inter-Rater Reliability (IRR): To ensure consistency, IRR is continuously measured. Metrics like Cohen's Kappa are used for categorical judgments, with a target IRR of ≥ 0.75 indicating substantial agreement. If reliability drops, rater retraining is initiated.
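For categorical verdicts, Cohen's kappa is a short computation; a minimal sketch for two raters (the κ ≥ 0.75 target above is the draft's threshold, not a statistical universal):

```typescript
// Minimal Cohen's kappa for two raters over categorical labels.
function cohensKappa(r1: number[], r2: number[]): number {
  const n = r1.length;
  const labels = Array.from(new Set([...r1, ...r2]));
  // Observed agreement: fraction of items both raters labeled identically.
  let agree = 0;
  for (let i = 0; i < n; i++) if (r1[i] === r2[i]) agree++;
  const po = agree / n;
  // Expected chance agreement from each rater's marginal label frequencies.
  let pe = 0;
  for (const l of labels) {
    const p1 = r1.filter((x) => x === l).length / n;
    const p2 = r2.filter((x) => x === l).length / n;
    pe += p1 * p2;
  }
  return (po - pe) / (1 - pe);
}
```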
Each task is scored by two primary raters, with a rotating third-review bench that adjudicates disputes inside 48 hours; those transcripts are anonymized and replayed during calibration weeks so rubric drift never contaminates the leaderboard. Because some briefs include sensitive diagnostics, every evaluator signs an NDA and works inside a sealed reviewer enclave. LLM judges only inherit the verdict once that human panel certifies the trace, keeping the automation honest.
4.3 Calibrating LLM-as-a-Judge for Scalability
To scale evaluation, the protocol uses advanced LLMs as automated judges, a method inspired by benchmarks like MT-Bench [17]. Research shows that strong LLM judges like GPT-4 can achieve over 80% agreement with human preferences, the same level of agreement human raters reach with each other [17].
- Calibration Process: A "golden set" of campaign scenarios is first evaluated by human experts to establish a ground truth. The LLM judge is then run on the same set, and its prompts and scoring mechanisms are iteratively refined to minimize the discrepancy with human consensus, aiming for alignment within 5 percentage points [21].
- Hybrid System: For ongoing evaluation, LLMs handle large-scale, objective assessments, while human experts verify outputs, especially for nuanced or high-stakes evaluations. This process includes mitigating known LLM biases like positional or verbosity bias [17].
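The calibration gate in the process above reduces to a single comparison; a sketch, assuming win rates are expressed as fractions and the 5-point tolerance from the draft:

```typescript
// Hedged sketch of the LLM-judge calibration gate: the judge's preference
// rate on the golden set must land within N percentage points of human consensus.
function judgeAligned(
  humanWinRate: number, // fraction, e.g. 0.62
  llmWinRate: number,   // fraction, e.g. 0.66
  tolerancePts = 5,     // draft target: alignment within 5 percentage points
): boolean {
  return Math.abs(humanWinRate - llmWinRate) * 100 <= tolerancePts;
}
```

A judge at 66% against a 62% human consensus passes (4-point gap); one at 70% fails and triggers another prompt-refinement iteration.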
5. Simulation & Data Pipeline — High-Fidelity, Privacy-Safe Sandbox
Why it matters: Without a believable sandbox, an agent that looks smart on paper will face-plant the minute Google Ads latency or privacy rules bite. A credible benchmark requires a realistic, reproducible, and privacy-preserving simulation environment. The environment must accurately model the complexities of the Google Ads ecosystem, including auction dynamics, competitor behavior, and user responses, without exposing sensitive data.
Status: Today's simulator covers Google Ads account work (UI-parity traces + Ads API). The OpenRTB module remains future work until the promised correlation studies prove that auction-layer metrics line up with the account metrics reported here.
5.1 Hybrid Dataset Composition
The data strategy balances realism, scale, and privacy by combining three data types [22].
- Public Historical Logs: Incorporates well-known, de-identified public datasets (e.g., Criteo, Avazu) using frameworks like the Open Bandit Pipeline (OBP) for standardized processing and evaluation [22].
- Privacy-Preserving Synthetic Data: Following the model of the AuctionNet benchmark, deep generative networks are trained on large-scale, private advertising data to create high-fidelity synthetic datasets [7]. This "ad opportunity generation module" produces millions of realistic ad opportunities while breaking the link to real individuals, ensuring privacy by design [7].
- Semi-Synthetic Counterfactuals: The environment supports Off-Policy Evaluation (OPE) by generating counterfactual logs, allowing for the assessment of "what-if" scenarios to see how a new agent policy would have performed on historical data [22].
5.2 Modular Auction Mechanics
The simulator must support multiple auction types to reflect the diversity of online advertising platforms. This is achieved with a modular "ad auction module" inspired by AuctionNet [7].
| Auction Mechanic | Description |
|---|---|
| Generalized Second-Price (GSP) | Classic ad auction where the winner pays slightly above the second-highest bid; serves as the core mechanic [7]. |
| First-Price Auction (FPA) | Winner pays exactly what they bid; simulator must toggle this mode for platforms running FPA. |
| Vickrey–Clarke–Groves (VCG) | Truthful mechanism where bidders are incentivized to bid their true value. |
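The three mechanics above differ only in the clearing price for a single slot; this toy sketch shows the shape of the modular pricing switch (in a one-item auction VCG collapses to the second price, so the distinction matters mainly in multi-slot settings).

```typescript
// Toy single-slot clearing price for the three auction mechanics.
type Mechanic = "GSP" | "FPA" | "VCG";

function clearingPrice(
  bids: number[],
  mechanic: Mechanic,
  increment = 0.01, // minimum GSP increment over the runner-up (assumed)
): number {
  const sorted = [...bids].sort((a, b) => b - a);
  const [first, second = 0] = sorted;
  switch (mechanic) {
    case "FPA":
      return first;              // winner pays exactly their own bid
    case "GSP":
      return second + increment; // slightly above the second-highest bid
    case "VCG":
      return second;             // single-slot VCG: externality = second price
  }
}
```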
OpenRTB 2.x Protobuf bindings and Authorized Buyer endpoints are spec'd, but they stay gated until we finish validating how auction-layer latencies correlate with the account-layer metrics reported here. Those connectors will ship as an opt-in module only after the correlation paper clears legal review and privacy audit.
5.3 Competitor and User Behavior Models
To create a realistic competitive landscape, the simulator includes sophisticated models for both competitors and users [7].
- Competitor Models: The environment implements a variety of auto-bidding agents with different decision-making algorithms, from simple PID controllers to advanced models like Independent Q-Learning and Decision Transformers. This replicates a dynamic multi-agent game with 48 diverse agents competing, as seen in AuctionNet [5].
- User Models: An "ad opportunity generation module" creates synthetic user profiles and predicts click and conversion probabilities based on user, time, and advertiser features [7]. This is enhanced by an "artificial society" framework with explicit models for search queries and clicks.
5.4 Fidelity Validation and Reproducibility
The simulator's credibility hinges on its fidelity to the real world and the reproducibility of its results.
- Fidelity Validation: The statistical properties of the generated data are compared against real historical ad logs using goodness-of-fit tests. AuctionNet, for example, validates its models by comparing the distributions of generated ad opportunities against real-world data [7].
- Reproducibility Stack: To ensure results are verifiable, the benchmark uses a robust reproducibility stack. This includes Docker for containerizing the environment, DVC for data versioning, and open-source libraries like OpenBanditPipeline and AuctionGym to standardize evaluation [22].
The same reproducibility stack will host the RTB simulator once privacy and correlation studies are complete; until then we treat AuctionNet numbers as forward-looking placeholders rather than benchmarked results.
6. Safety, Compliance, and Risk Stress-Tests — 60 Red-Team Scenarios
Before an agent can be considered for the leaderboard, it must pass a mandatory suite of stress tests designed to evaluate its safety, policy compliance, and robustness under adversarial conditions. An agent that is profitable but unsafe is a liability.
6.1 Policy Compliance Suite
Why it matters: Google can suspend entire accounts over one bad creative, so policy automation is a go/no-go requirement. A suite of codified tests ensures agents strictly adhere to Google Ads policies, which are enforced by a combination of Google's AI and human evaluation [9]. The suite covers four major policy areas:
| Policy Area | Test Focus | Examples |
|---|---|---|
| Prohibited Content | Preventing ads that enable dishonest behavior or contain inappropriate content. | Hacking software, academic cheating services, hate speech, graphic content, self-harm [9]. |
| Prohibited Practices | Avoiding abuse of the ad network. | Malware, cloaking, arbitrage, circumventing policy reviews. |
| PII & Data Collection | Ensuring proper handling of personally identifiable information. | Misusing full names, email addresses, financial status, or race—especially in personalized ads [9]. |
| Trademark & Copyright | Respecting intellectual property rights. | Disallowing ads that infringe on trademarks or copyrights [23]. |
6.2 Fairness Audits for Demographic Bias
Why it matters: Regulatory pressure is rising on demographic fairness, and ad distributions that skew can trigger compliance reviews. Inspired by benchmarks like Stanford's HELM (which uses BBQ for social discrimination) and TrustLLM, these audits ensure agents do not perpetuate biases in ad delivery [16].
- Demographic Parity: Tests if ads are shown to different demographic groups at similar rates.
- Disparate Impact: Analyzes if outcomes disproportionately harm protected groups.
- Remediation: Evaluates the agent's ability to implement corrective actions to mitigate identified biases.
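A demographic-parity check over simulated impression logs can be scored as a ratio of exposure rates. The function below is a sketch; the common "four-fifths" (0.8) threshold mentioned in its usage is a rule of thumb from disparate-impact practice, not a stated Ads-Bench policy.

```typescript
// Sketch of a demographic-parity score: ratio of the lower group's
// impression rate to the higher group's. 1.0 = perfect parity.
function parityRatio(
  shownA: number, eligibleA: number,
  shownB: number, eligibleB: number,
): number {
  const rateA = shownA / eligibleA;
  const rateB = shownB / eligibleB;
  return Math.min(rateA, rateB) / Math.max(rateA, rateB);
}
```

For example, groups reached at 80% and 60% of eligible users score 0.75, which would fail a four-fifths-style threshold and trigger the remediation test.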
6.3 Adversarial Test Suite
Why it matters: Competitors and bad actors will poke at your agent—budget drains and prompt injections are real, so we test against them before production. This suite, inspired by frameworks like AgentHarm and challenges like the Gray Swan Arena, will evaluate the agent's robustness against malicious attacks, measured by metrics like Attack Success Rate (ASR) [16].
- Budget Exploitation: Simulating attacks that manipulate bidding to force overspending.
- Policy Evasion: Using adversarial examples of ad creatives to bypass automated policy detectors.
- Malicious Creative Generation: Testing resilience to prompt injection intended to coerce the agent into generating harmful content.
- Confidentiality & Integrity Attacks: Probing for resistance to revealing sensitive information or overriding core instructions [16].
6.4 Financial Kill-Switch Verification
Why it matters: Even the best models fail; automated kill-switches minimize damage when anomaly detectors trip. These tests are designed to verify that the agent operates within defined financial boundaries and can manage risk effectively [4]. The agent must demonstrate the ability to:
- Adhere to Budget Caps: Respect both daily and monthly budget limits.
- Prevent Overspend: Implement its own safeguards, especially for changes made via the API.
- Implement Kill-Switch Criteria: Programmatically pause or remove campaigns via the API in response to triggers like overspend or severe underperformance [4].
```js
if (spend_today > 1.15 * budget_daily || roas_rolling_3h < roas_floor) {
  postAlert({
    severity: "critical",
    context: { spend_today, roas_rolling_3h, last_change_id },
  });
  mutateCampaign({
    resourceName: campaign,
    status: "PAUSED",
  });
  logKillSwitch("auto-paused", now());
}
```

Guardrail sketch for programmatic kill-switches. This snippet is illustrative, not production code; the live gate still needs pacing intelligence for shared budgets, cross-account guardrails, and seasonal overrides, all of which are being replay-tested against winter-holiday and back-to-school spend curves.
7. Baseline Agents & Leaderboard Rules — From Heuristics to RL to LLM+Tools
Why it matters: Leaderboards without transparent baselines devolve into marketing—you need anchor agents and rules that punish sandbagging. A credible benchmark requires transparent baseline agents to anchor progress and a clear set of rules to govern the leaderboard and prevent metric gaming.
7.1 Baseline Agent Implementations
The proposed benchmark will include four classes of baseline agents, representing a spectrum of sophistication.
| Agent Type | Description | Required Disclosures |
|---|---|---|
| Heuristic/Rule-Based | Predefined rules for bidding, budgeting, and keyword management—simple but transparent baseline [24]. | Full rule set, thresholds, and logical conditions. |
| Contextual Bandit | Algorithms like LinUCB/Thompson Sampling handle adaptive decisions for ad placement. | Training data source, hyperparameters (learning rates, exploration parameters), and compute budget. |
| Reinforcement Learning | Sequential decision-making (e.g., DQN) to maximize rewards under budget constraints [25]. | Training data, RL algorithm, network architecture, hyperparameters, reward shaping, compute budget. |
| LLM+Tools | LLM orchestrations integrated with the Google Ads API for planning, creatives, diagnostics [26]. | Base LLM, toolset (API surface), prompting strategies, compute/API costs. |
Status: The RL baseline will go live once anonymized observation spaces and log replays clear consent review; we will publish both artifacts so external teams can reproduce the reference policy gradients without guesswork.
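As a flavor of the first baseline class, a rule-based bid updater might look like the sketch below. The thresholds and dead band are illustrative disclosures of exactly the kind the table's "Required Disclosures" column demands; none of this is a shipped Ads-Bench baseline.

```typescript
// Deliberately simple heuristic baseline: raise bids on keywords beating
// the CPA target, cut bids on laggards, hold inside a dead band.
interface KeywordStats {
  bid: number; // current max CPC bid ($)
  cpa: number; // observed cost per acquisition ($); 0 = no conversions yet
}

function heuristicBidUpdate(
  kw: KeywordStats,
  targetCpa: number,
  step = 0.1, // illustrative 10% bid adjustment
): number {
  if (kw.cpa === 0) return kw.bid;                           // no data: hold
  if (kw.cpa < targetCpa * 0.8) return kw.bid * (1 + step);  // outperforming
  if (kw.cpa > targetCpa * 1.2) return kw.bid * (1 - step);  // overspending
  return kw.bid;                                             // within dead band
}
```

The value of such a baseline is not performance but transparency: every decision is auditable from the rule set alone, anchoring the low end of the leaderboard.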
7.2 Leaderboard Governance and Rules
The leaderboard will be governed by a clear set of rules to ensure fair and meaningful comparisons [4].
- Submission Protocol: Participants must submit a complete package including agent code, model weights, random seeds, and detailed logs. Disclosures of hardware, compute resources, and normalized cost/latency metrics are mandatory, and each team is capped at two active submissions per quarter. Finals lock 72 hours before evaluation to keep the double-blind process intact [4].
- Anti-Overfitting Controls: To ensure generalization, final evaluation uses a private, hidden test set that is periodically refreshed. Agents are also tested on their ability to generalize to new, unseen advertiser accounts and verticals [4].
- Eligibility and Gating: To appear on the leaderboard, an agent must meet minimum prerequisites for uptime, ethical guidelines, and baseline performance. Clear pass/fail thresholds are defined for critical safety and policy compliance metrics [4].
- Versioning: The benchmark will be managed by designated maintainers with a public update cadence and strict adherence to semantic versioning to ensure stability and transparency.
8. Deployment Gate via Offline → Online OPE — DR & SNIPW in Action
Why it matters: Offline evaluation is cheaper and safer than live traffic, but only if the estimators are robust enough to gate what reaches production.
A critical component of the benchmark framework is the use of Off-Policy Evaluation (OPE) to create a data-driven "gate" between offline testing and expensive online A/B tests [27]. This allows for the safe, efficient, and rapid assessment of new agent policies using historical logged data, ensuring that only statistically superior and safe policies are advanced to live traffic.
The methodology will employ a suite of OPE estimators to manage the inherent bias-variance trade-off [27]. Key estimators include:
- Inverse Probability Weighting (IPW) / Self-Normalized IPW (SNIPW): Provides unbiased estimates but can have high variance. SNIPW trades a small amount of bias for increased stability [27].
- Direct Method (DM): Relies on a model of expected rewards.
- Doubly Robust (DR) / Self-Normalized DR (SNDR): Combines IPW with a reward model, providing a consistent estimate if either the propensity model or the reward model is correct. This "double robustness" is highly justified for complex ad auction environments [27].
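The estimators above can be written down in a few lines. The sketch below evaluates a target policy on toy logged data; the interface fields (`propensity`, `targetProb`, `qLogged`, `qTarget`) are our illustrative names, not any library's schema.

```typescript
// Sketch of the four estimator families on logged bandit data.
// Each entry records the behavior policy's propensity for the logged
// action, the target policy's probability for that action, the observed
// reward, a (possibly misspecified) reward-model prediction q(x, a) for
// the logged action, and the reward model's expectation under the
// target policy, qTarget(x) = sum_a pi_e(a|x) * q(x, a).

interface LogEntry {
  reward: number;      // observed reward (e.g. conversion value)
  propensity: number;  // pi_b(a | x): behavior policy probability
  targetProb: number;  // pi_e(a | x): target policy probability
  qLogged: number;     // reward-model prediction for the logged action
  qTarget: number;     // reward-model expectation under the target policy
}

// IPW: unbiased, but variance blows up when propensities are small.
function ipw(logs: LogEntry[]): number {
  return logs.reduce((s, e) => s + (e.targetProb / e.propensity) * e.reward, 0) / logs.length;
}

// SNIPW: normalize by the summed weights, trading a little bias for stability.
function snipw(logs: LogEntry[]): number {
  let num = 0, den = 0;
  for (const e of logs) {
    const w = e.targetProb / e.propensity;
    num += w * e.reward;
    den += w;
  }
  return num / den;
}

// DM: trust the reward model entirely.
function directMethod(logs: LogEntry[]): number {
  return logs.reduce((s, e) => s + e.qTarget, 0) / logs.length;
}

// DR: reward model plus an importance-weighted correction; consistent if
// either the propensity model or the reward model is correct.
function doublyRobust(logs: LogEntry[]): number {
  return logs.reduce((s, e) => {
    const w = e.targetProb / e.propensity;
    return s + e.qTarget + w * (e.reward - e.qLogged);
  }, 0) / logs.length;
}
```

A quick sanity check: when the target policy equals the behavior policy and the reward model is perfect, all four estimators collapse to the empirical mean reward.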
A formal gating process will be established where a new agent policy is only approved for a live A/B test if its offline OPE evaluation demonstrates a statistically significant improvement over the baseline and meets all safety criteria [27]. This will streamline experimentation and reduce the cost and risk of testing suboptimal policies online [27].
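One way to make "statistically significant improvement" operational is a percentile bootstrap over the logged data: resample, recompute a self-normalized value estimate, and approve only if the lower confidence bound clears the baseline. The sketch below is one possible gate under assumed inputs (importance weight and reward per logged decision); the alpha level and resample count are illustrative.

```typescript
// Sketch of the offline gate: resample logged (weight, reward) pairs,
// recompute a self-normalized value estimate on each resample, and
// approve the candidate only if the lower percentile bound beats the
// baseline's value. Thresholds here are illustrative, not prescribed.

type WeightedReward = { w: number; r: number };

function snipwValue(data: WeightedReward[]): number {
  let num = 0, den = 0;
  for (const d of data) { num += d.w * d.r; den += d.w; }
  return num / den;
}

function bootstrapLowerBound(
  data: WeightedReward[],
  alpha = 0.05,
  resamples = 2000,
): number {
  const estimates: number[] = [];
  for (let b = 0; b < resamples; b++) {
    const sample: WeightedReward[] = [];
    for (let i = 0; i < data.length; i++) {
      sample.push(data[Math.floor(Math.random() * data.length)]);
    }
    estimates.push(snipwValue(sample));
  }
  estimates.sort((a, b) => a - b);
  return estimates[Math.floor(alpha * resamples)];
}

function gate(data: WeightedReward[], baselineValue: number): boolean {
  // Approve for a live A/B test only when the one-sided lower bound of
  // the candidate's offline value exceeds the incumbent baseline.
  return bootstrapLowerBound(data) > baselineValue;
}
```

A production gate would add the safety pass/fail criteria on top of this value comparison; the bootstrap only answers the "statistically superior" half of the question.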
9. Implementation Roadmap — 5-Month Milestones & Resourcing
Status — Planned for 2026: Development has not started. The timeline below is a proposed schedule beginning January 2026, contingent on securing resources, completing legal review, and finalizing governance. Milestones may shift based on privacy audit outcomes and stakeholder feedback.
A condensed, 5-month roadmap (January 2026 – May 2026) is proposed to develop and launch Ads-Bench, minimizing risk and aligning with necessary privacy, legal, and technical reviews.
| Phase | Months | Key Milestones |
|---|---|---|
| Phase 1: Foundation & Simulation | January 2026 – February 2026 | |
| Phase 2: Task & Metric Integration | March 2026 – April 2026 | |
| Phase 3: Advanced Features & Beta | May 2026 | |
10. Risk Register & Mitigations
The creation of this benchmark carries legal, financial, and technical risks. A pre-emptive risk management strategy is essential.
| Risk Category | Risk Description | Mitigation Strategy |
|---|---|---|
| Legal & Privacy | Exposure of PII from training data, violating GDPR/CCPA. | Prioritize synthetic data generation (AuctionNet model) to break the link to real individuals and enforce strict de-identification [5]. |
| Financial | Agents cause large, uncontrolled overspend in simulation or real accounts. | Mandate budget-cap adherence and kill-switch tests as pass/fail gates [4]. |
| Technical | Simulation lacks fidelity, so offline wins fail online. | Run rigorous fidelity validation plus back-testing against historical outcomes [7]. |
| Reputational | Benchmark gets “gamed” via overfitting to public tests. | Maintain a large, refreshed hidden set and cross-account generalization checks [4]. |
By addressing these risks proactively, we aim to substantially reduce the likelihood and impact of the top benchmark-creation risks and protect the long-term credibility and value of Ads-Bench.
11. SWE-Bench Parallels & Workflow Orchestration
The benchmark is designed to mirror SWE-Bench so every submission produces a “solution patch” that can be replayed, diffed, and rated once the suite is live [1]. A full run will therefore include:
- Issue Intake: A structured spec (business objective, diagnostics, constraints) is ingested exactly the way SWE-Bench hands an agent a GitHub issue. Human and LLM judges jointly confirm that the agent’s understanding matches the brief before execution [17].
- Plan + Tool Trace: The agent produces a reasoning trace, then commits a set of ordered Google Ads API calls. This patch is versioned so raters can diff it against the pre-task state and track every budget, asset, and audience change [4].
- Double Review: Each patch is graded twice—first by calibrated LLM judges for throughput, then by expert ad managers who focus on indistinguishability, rationale depth, and clarity, mirroring SWE-Bench’s human-in-the-loop evaluation [20].
SWE-Bench taught us that publishing reproducible patches is the fastest way to debug agent behavior; Ads-Bench applies the same principle to campaign edits, policy appeals, and rollbacks.
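The "replay, diff, rate" loop above can be sketched in miniature: model a campaign's mutable fields as a flat record, apply the ordered patch, and diff post-task state against the pre-task snapshot. Field names here are illustrative stand-ins, not the real Google Ads API schema.

```typescript
// Sketch: apply an ordered list of mutations ("the solution patch")
// to a campaign snapshot, then diff the result against the pre-task
// state so raters can see exactly which budgets, statuses, or bids
// changed. Field names are illustrative, not the real API schema.

type FieldValue = string | number;
type CampaignState = Record<string, FieldValue>;
type Mutation = { field: string; value: FieldValue };

function applyPatch(state: CampaignState, patch: Mutation[]): CampaignState {
  const next = { ...state };
  for (const m of patch) next[m.field] = m.value;
  return next;
}

function diff(
  before: CampaignState,
  after: CampaignState,
): Record<string, [FieldValue | undefined, FieldValue | undefined]> {
  const changes: Record<string, [FieldValue | undefined, FieldValue | undefined]> = {};
  for (const key of Object.keys({ ...before, ...after })) {
    if (before[key] !== after[key]) changes[key] = [before[key], after[key]];
  }
  return changes;
}

// Pre-task snapshot and an ordered two-step patch.
const preTask: CampaignState = { status: "ENABLED", dailyBudgetMicros: 50_000_000, targetRoas: 3.5 };
const patch: Mutation[] = [
  { field: "dailyBudgetMicros", value: 40_000_000 },
  { field: "targetRoas", value: 4.0 },
];
const postTask = applyPatch(preTask, patch);
```

In the real harness each `Mutation` would correspond to a versioned Google Ads API call, but the rating workflow is the same: the diff, not the raw API log, is what judges score.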
12. API & Interface Requirements
Ads-Bench will also evaluate whether an agent respects the same API ergonomics, diagnostics, and rate limits as a senior practitioner. The interface is split into observation, action, and constraint layers.
12.1 Observation Surfaces — searchStream everything
- Performance snapshots: `GoogleAdsService.search` queries over `Campaign`, `AdGroup`, `AdGroupAd`, and `KeywordView` resources expose real-time metrics like `metrics.roas`, `metrics.cpa`, and `metrics.conversion_value` [29].
- Creative quality: `Asset` and `AssetGroupAsset` reports include Google's "Best/Good/Low" asset ratings so the agent can prioritize refreshes in PMax asset groups [3].
- Policy + diagnostics: Access to `PolicyTopicEntry` and `Recommendation` resources lets the agent triage disapprovals, appeals, and first-party data gaps before taking action [9][12].
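Queries against these observation surfaces are expressed in the Google Ads Query Language (GAQL). The sketch below assembles an illustrative performance-snapshot query; the selected fields are examples drawn from the public GAQL field list, and the authoritative schema lives in the Google Ads API reference, so treat this as a shape, not a contract.

```typescript
// Sketch: assemble a GAQL query for a campaign performance snapshot.
// The actual call would go through GoogleAdsService.search or
// searchStream; here we only build the query text.

function buildSnapshotQuery(days: 7 | 14 | 30): string {
  return [
    "SELECT",
    "  campaign.id,",
    "  campaign.status,",
    "  metrics.cost_micros,",
    "  metrics.conversions_value",
    "FROM campaign",
    `WHERE segments.date DURING LAST_${days}_DAYS`,
    "ORDER BY metrics.cost_micros DESC",
  ].join("\n");
}

const query = buildSnapshotQuery(30);
```

Constraining `days` to the literal union `7 | 14 | 30` mirrors the fact that GAQL `DURING` accepts a fixed set of named ranges rather than arbitrary integers.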
12.2 Action Surfaces — deterministic mutate calls
- Campaign & budget control: `CampaignService.mutateCampaigns` and `CampaignBudgetService.mutateCampaignBudgets` adjust pacing, bidding, and shared budgets in one transaction [29].
- Asset orchestration: `AdGroupAdService`, `AssetService`, and `AssetGroupAssetService` update creatives, inject new videos, or relink asset groups without breaking existing structures.
- Audience + experiment ops: `AdGroupCriterionService.mutateAdGroupCriteria`, `UserListService`, and `CampaignExperimentService` enable keyword sculpting, audience refreshes, and holdout tests within the same benchmark scenario.
- Offline conversion hygiene: `ConversionUploadService` and policy-aware retries ensure offline signals remain synchronized with Google Ads measurement [12].
12.3 Operational Constraints — rate limits, batches, and failsafes
- Rate limits & quotas: The simulator enforces realistic per-customer QPS ceilings, so agents must batch writes via `GoogleAdsService.mutate` instead of spamming single-field updates [30].
- Partial failure handling: Batch operations intentionally return partial failures to verify the agent can retry idempotently, back off, and emit telemetry for human review.
- Long-running jobs: Report downloads and experiment boots are modeled as asynchronous operations; the agent must poll and reconcile results without blocking critical guardrails.
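The retry behavior the constraint layer checks for can be sketched as follows. The `sendBatch` function is a stand-in for a real API client (so it can be simulated end to end); the backoff constants are illustrative, and idempotency is assumed to come from stable operation ids.

```typescript
// Sketch: idempotent retry with exponential backoff and jitter for a
// batched mutate whose response may contain partial failures. Only the
// failed operations are resubmitted on each attempt.

type Op = { id: string };
type BatchResult = { succeeded: string[]; failed: Op[] };

async function mutateWithRetry(
  ops: Op[],
  sendBatch: (ops: Op[]) => Promise<BatchResult>, // stand-in for a real client
  maxAttempts = 5,
): Promise<string[]> {
  const done: string[] = [];
  let pending = ops;
  for (let attempt = 1; pending.length > 0 && attempt <= maxAttempts; attempt++) {
    const result = await sendBatch(pending);
    done.push(...result.succeeded);
    pending = result.failed; // retry only what failed (idempotent by id)
    if (pending.length > 0) {
      // Exponential backoff with jitter keeps retries under QPS ceilings.
      const delayMs = Math.min(2 ** attempt * 100, 5000) * (0.5 + Math.random() / 2);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return done;
}
```

The telemetry hook the benchmark looks for would sit where `pending` is reassigned: every partial failure should be logged before the retry, so a human can audit what the agent re-attempted and why.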
Future Work — Ads-Bench RTB
Why it matters: Readers should know exactly how the benchmark would graduate from the Google Ads account layer described in this proposal to the RTB gauntlet we envision building after the initial release.
OpenRTB + Protobuf support (vNext, not live). We still owe a stateful connector that ingests OpenRTB 2.6 bid requests/responses via Protobuf so the simulator can replay bidstreams at scale. That code will only ship after the correlation study proves auction KPIs track with the account-level metrics documented above.
Real-time Bidding + Marketplace APIs (planned). The RTB module will pair Authorized Buyers and Marketplace endpoints so agents can ride the same pipes a large DSP uses. Until the privacy review signs off, those APIs stay dark and the current release remains Google Ads–only.
Sub-60 ms callout quotas and dual kill switches (planned). We are instrumenting a latency harness that enforces p95/p99 callout budgets under 60 ms, adds bid-level kill switches, and logs dual-disclosure events. None of that instrumentation is live; it belongs to Ads-Bench vNext.
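To make the planned latency harness concrete, the p95/p99 budget check reduces to a percentile computation over recorded callout latencies. The nearest-rank method and the 60 ms default below are illustrative; the live harness would stream samples through a histogram rather than sorting arrays.

```typescript
// Sketch: nearest-rank percentiles over callout latencies, plus a
// pass/fail check against the sub-60 ms budget described above.

function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

function withinCalloutBudget(samplesMs: number[], budgetMs = 60): boolean {
  return percentile(samplesMs, 95) <= budgetMs && percentile(samplesMs, 99) <= budgetMs;
}
```

Note that enforcing both p95 and p99 means a handful of slow callouts in a large sample is enough to fail the gate, which is the intended behavior for bid-level kill switches.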
gps-phoebe value-injection pipelines (experimental). A gps-phoebe layer is being prototyped to inject brand, compliance, and budget priors into RTB decisions so auction edits stay aligned with human taste even under adversarial load. It will remain experimental until reviewer data shows a measurable drop in policy escalations.
Every roadmap item above will replace the interim vendor-reported stats with peer-reviewed measurements once the studies complete. We will keep threading these milestones into the public roadmap so readers see a single narrative arc rather than two disjoint stories.
13. Appendices
(Appendices to include detailed tables, a full glossary of terms, Google Ads API reference stubs for the observation and action interfaces, and complete mathematical definitions for all metrics used in the composite scoring rubric.)
References
1. ^ philschmid/ai-agent-benchmark-compendium. https://github.com/philschmid/ai-agent-benchmark-compendium
2. ^ Define your evaluation metrics | Generative AI on Vertex AI. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval
3. ^ Google Ads AI Max vs Manual Optimization. https://groas.ai/post/google-ads-ai-max-vs-manual-optimization-performance-comparison-2025
4. ^ AI benchmarking framework measures real-world .... https://aisera.com/blog/enterprise-ai-benchmark/
5. ^ AuctionNet: A Novel Benchmark for Decision-Making in Large-Scale .... https://proceedings.neurips.cc/paper_files/paper/2024/hash/ab9b7c23edfea0011507f7e1eae82cd2-Abstract-Datasets_and_Benchmarks_Track.html
6. ^ Google's AI advisors: agentic tools to drive impact and .... https://blog.google/products/ads-commerce/ads-advisor-and-analytics-advisor/
7. ^ AuctionNet: A Novel Benchmark for Decision-Making in Large-Scale .... https://arxiv.org/html/2412.10798v1
8. ^ Drive peak campaign performance with new agentic capabilities. https://blog.google/products/ads-commerce/ai-agents-marketing-advisor/
9. ^ Google Ads policies - Advertising Policies Help. https://support.google.com/adspolicy/answer/6008942?hl=en
10. ^ Machine Learning-Powered Agents for Optimized Product .... https://www.mdpi.com/2673-4591/100/1/36
11. ^ The hidden risks of Google's automated advertising | Windsorborn. https://windsorborn.com/insights/thinking/the-hidden-risks-of-googles-automated-advertising
12. ^ User-provided data matching | Ads Data Hub. https://developers.google.com/ads-data-hub/guides/user-provided-data-matching
13. ^ Google Ads AI Agents - How To Run Them in 2025. https://ppc.io/blog/google-ads-ai-agents
14. ^ Google Ads Benchmarks for YOUR Industry [Updated!]. https://www.wordstream.com/blog/ws/2016/02/29/google-adwords-industry-benchmarks
15. ^ Evaluate your AI agents with Vertex Gen .... https://cloud.google.com/blog/products/ai-machine-learning/introducing-agent-evaluation-in-vertex-ai-gen-ai-evaluation-service
16. ^ 2025 AI Safety Index - Future of Life Institute. https://futureoflife.org/ai-safety-index-summer-2025/
17. ^ Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena - arXiv. https://arxiv.org/abs/2306.05685
18. ^ Artificial intelligence vs. human expert: Licensed mental health .... https://pmc.ncbi.nlm.nih.gov/articles/PMC12169703/
19. ^ Understanding Human Evaluation Metrics in AI - Galileo AI. https://galileo.ai/blog/human-evaluation-metrics-ai
20. ^ Rubric evaluation: A comprehensive framework for generative AI .... https://wandb.ai/wandb_fc/encord-evals/reports/Rubric-evaluation-A-comprehensive-framework-for-generative-AI-assessment--VmlldzoxMzY5MDY4MA
21. ^ LLMs-as-Judges: A Comprehensive Survey on LLM-based .... https://arxiv.org/html/2412.05579v2
22. ^ Open Bandit Pipeline; a python library for bandit algorithms and off .... https://zr-obp.readthedocs.io/en/latest/
23. ^ Trademarks - Advertising Policies Help. https://support.google.com/adspolicy/answer/6118?hl=en
24. ^ Heuristic optimization algorithms for advertising campaigns. https://docta.ucm.es/bitstreams/3fa537ed-aa9f-44ca-85cc-bebaa5d9927b/download
25. ^ Deep Reinforcement Learning for Online Advertising Impression in .... https://arxiv.org/abs/1909.03602
26. ^ Google Launches Gemini-Powered AI Agents ... - ADWEEK. https://www.adweek.com/media/google-ai-agent-ads-analytics-advisor/
27. ^ Off-Policy Evaluation and Counterfactual Methods in .... https://arxiv.org/abs/2501.05278
28. ^ About automated bidding | Google Ads Help. https://support.google.com/google-ads/answer/2979071
29. ^ Google Ads API services overview. https://developers.google.com/google-ads/api/docs/get-started/services
30. ^ Rate limits and quotas | Google Ads API best practices. https://developers.google.com/google-ads/api/docs/best-practices/rate-limits
31. ^ GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. https://arxiv.org/abs/2510.04374
32. ^ Measuring the performance of our models on real-world tasks. OpenAI. https://openai.com/index/gdpval/
33. ^ OpenAI says top AI models are reaching expert territory on real-world knowledge work. The Decoder. https://the-decoder.com/openai-says-top-ai-models-are-reaching-expert-territory-on-real-world-knowledge-work/
34. ^ The AI Productivity Index (APEX): Measuring Executive-Level Performance Across Professions. https://arxiv.org/abs/2509.25721
35. ^ Introducing APEX: AI Productivity Index (Mercor leaderboard). https://www.mercor.com/blog/introducing-apex-ai-productivity-index/
36. ^ AI Is Learning to Do the Jobs of Doctors, Lawyers, and Consultants. TIME. https://time.com/7322386/ai-mercor-professional-tasks-data-annotation/