SRE Multi-Agent Platform - Powered by Google ADK
A production-grade Site Reliability Engineering (SRE) platform built on a multi-agent AI architecture. Engineered with the Google Agent Development Kit (ADK), deployed on GCP, with full observability via Cloud Trace, Langfuse, LangSmith, Prometheus, and Grafana.
Executive Summary
Modern SRE is traditionally reactive: humans get paged, scramble to find dashboards, and manually triage alerts. This platform demonstrates a paradigm shift: AI agents perform the heavy lifting of infrastructure management and incident response.
- 15 distinct AI agents across 2 specialized squads: Squad A (Proactive Platform Engineers) and Squad B (Reactive SRE Responders), orchestrated by a central routing agent
- Service Onboarding (Day 0/1): Auto-generates architecture blueprints, SLO recommendations, and monitoring configurations
- Platform Health (Day 2+): Proactively monitors infrastructure, API quotas, and metric ingestion pipelines
- P1 Incident Response: Executes a 5-phase sequential pipeline activating 9 agents, from parallel triage to remediation, communications, and automated postmortem generation
- Enterprise Observability: Every agent action is traced, logged, and scored, with full span hierarchies (tool latency, token counts, agent delegation chains) visible in Cloud Trace and Langfuse
Table of Contents
- Architecture Overview
- Agent Design
- Infrastructure Stack
- Observability Stack
- Agent Evaluation Framework
- Observability Tool Comparison
- Deployment Architecture
- Conceptual Repository Structure
- Roadmap
- Consulting & Implementation
Architecture Overview
End-to-End System Data Flow
```
+------------------------------------------------------------------------------------+
|                             GCP INFRASTRUCTURE LAYER                               |
|                                                                                    |
|   Target Services                  Observability Backend                           |
|   +---------------------+          +--------------------------------+              |
|   | App Containers      |-scrape-> | Prometheus                     |              |
|   | /metrics endpoint   |          | Grafana                        |              |
|   +---------------------+          | Alertmanager                   |              |
|                                    +---------------+----------------+              |
|                                                    | alert fired (SLO breach)      |
|   Serverless Tooling                               v                               |
|   +---------------------------+       +---------------------------+                |
|   | Cloud Functions           |       | Pub/Sub Event Bus         |                |
|   | (Prometheus, Logs, etc.)  |       +-------------+-------------+                |
|   +------------+--------------+                     |                              |
|                | HTTP calls from agents             | triggers                     |
|                v                                    v                              |
|   +----------------------------------------------------------------------------+  |
|   |                         AGENT ORCHESTRATION LAYER                          |  |
|   |                                                                            |  |
|   |                  +------------------------------------+                    |  |
|   |                  | SRE Platform Agent (root)          |                    |  |
|   |                  | Classifies intent -> routes to     |                    |  |
|   |                  | Squad A (proactive) or             |                    |  |
|   |                  | Squad B (reactive/incident)        |                    |  |
|   |                  +-----------------+------------------+                    |  |
|   |                                    |                                       |  |
|   |                  +-----------------+-----------------+                     |  |
|   |                  v                                   v                     |  |
|   |   +---------------------+   +----------------------------------+          |  |
|   |   | SQUAD A             |   | SQUAD B                          |          |  |
|   |   | (SequentialAgent)   |   | (SequentialAgent pipeline)       |          |  |
|   |   |                     |   |                                  |          |  |
|   |   | > Architect         |   | Phase 1: Acknowledge             |          |  |
|   |   | > Simulator         |   | Phase 2: Triage <- ParallelAgent |          |  |
|   |   | > Watchdog          |   |   +- Investigator                |          |  |
|   |   | > Tuner             |   |   +- Dependency Analyst          |          |  |
|   |   +---------------------+   |   +- Security Analyst            |          |  |
|   |                             |   +- Logs Analyst                |          |  |
|   |                             | Phase 3: RCA (synthesis)         |          |  |
|   |                             | Phase 4: Operator (CONFIRM) +    |          |  |
|   |                             |          Comms Officer           |          |  |
|   |                             | Phase 5: Scribe -> Postmortem    |          |  |
|   |                             +----------------------------------+          |  |
|   +----------------------------------------------------------------------------+  |
|                                                                                    |
|   OBSERVABILITY LAYER (cross-cutting: all layers active simultaneously)            |
|   +----------------------------------------------------------------------------+  |
|   | ADK Web UI      -> Real-time trace tree, session state, YAML event stream  |  |
|   | OTel Spans      -> Cloud Trace & Langfuse (execution trees, token counts)  |  |
|   | Structured Logs -> Cloud Logging (JSON, duration_ms, status per tool)      |  |
|   | Custom Metrics  -> Cloud Monitoring (duration, call counts, error rates)   |  |
|   +----------------------------------------------------------------------------+  |
+------------------------------------------------------------------------------------+
```
Agent State Machine & Execution Flow

Key Architectural Decisions
| Decision | Rationale |
|---|---|
| Sequential Pipelines | Guarantees phase ordering in incident response: triage must complete before remediation begins |
| Parallel Triage | Investigator, Security, Dependency, and Logs agents run simultaneously, drastically reducing MTTR |
| Session State Handoffs | Agents share data via decoupled session.state rather than direct API calls, improving fault tolerance |
| CONFIRM Safety Gate | Hard lock on the Operator agent: no infrastructure mutation without explicit human approval |
| Asynchronous Telemetry | Metric exporters run in background queues to prevent observability overhead from delaying agent response |
| OTLP for Langfuse | Reuses ADK's existing OpenTelemetry provider, requiring zero additional instrumentation on the agent code |
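The asynchronous-telemetry decision can be sketched with a stdlib-only exporter: the agent's hot path only enqueues an event and returns, while a background worker ships it. Class, method, and metric names here are illustrative, not ADK or Cloud Monitoring APIs.

```python
import queue
import threading

class AsyncMetricExporter:
    """Buffers metric events and flushes them on a background thread,
    so instrumentation never blocks the agent's response path."""

    def __init__(self):
        self._queue = queue.Queue()
        self._exported = []  # stand-in for a real Cloud Monitoring client
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, name, value, labels=None):
        # Called from the agent's hot path: enqueue and return immediately.
        self._queue.put({"name": name, "value": value, "labels": labels or {}})

    def _drain(self):
        while True:
            event = self._queue.get()
            if event is None:  # sentinel pushed by flush() to stop the worker
                break
            self._exported.append(event)  # a real exporter would ship this here

    def flush(self):
        # FIFO ordering guarantees all prior events drain before the sentinel.
        self._queue.put(None)
        self._worker.join()
        return self._exported

exporter = AsyncMetricExporter()
exporter.record("tool_calls_total", 1, {"tool": "query_internal_metrics"})
exporter.record("tool_duration_ms", 7042, {"tool": "query_internal_metrics"})
events = exporter.flush()
print(len(events))  # 2
```

The same pattern underlies the custom-metrics layer in the observability stack: the agent pays only the cost of a queue put, never a network round trip.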
Agent Design
Squad A - Proactive Platform Engineers
Handles proactive optimization, new service onboarding, alert tuning, and system health checks.
- Architecture Agent: Detects tech stacks from repositories, generates SLO YAML configs, and drafts monitoring infrastructure as code (Terraform)
- Simulator Agent: Backtests new alert configurations against historical telemetry to prevent alert fatigue before rules go live
- Data Health Watchdog: Monitors the monitoring system itself by validating metric ingestion rates, checking API quotas, and pinging agent infrastructure health
- Tuner Agent: Analyzes historical alert noise and metric distributions to dynamically recommend P95/P99 threshold adjustments
Squad B - Reactive SRE Responders
Handles reactive P1/P2 incident response via a strict 5-phase SequentialAgent pipeline.
Phase 1: Acknowledge
- Triggers PagerDuty on-call escalation and spins up a dedicated Slack incident channel
Phase 2: Triage (4 agents run in parallel)
- Investigator: Executes PromQL queries and calculates composite service health scores
- Dependency Analyst: Checks external API health and upstream cloud provider status
- Security Analyst: Audits Cloud Armor/WAF rules for active attack patterns
- Logs Analyst: Fetches recent error logs and correlates them with distributed trace IDs
Phase 3: Root Cause Analysis
- The Commander LLM synthesizes all parallel triage output into a root cause hypothesis
Phase 4: Remediation
- Operator: Executes infrastructure fixes (e.g., Kubernetes rollback, WAF rule update); requires explicit human CONFIRM
- Comms Officer: Updates public status pages and broadcasts internal Slack notifications
Phase 5: Postmortem
- Scribe: Auto-drafts a structured incident timeline and creates follow-up Jira tickets
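The five phases above can be sketched as a plain-Python runner with a fan-out in Phase 2. In practice the ADK SequentialAgent/ParallelAgent classes replace this loop, and the analyst outputs below are canned stand-ins for real tool calls:

```python
from concurrent.futures import ThreadPoolExecutor

def acknowledge(state):
    state["channel"] = "#inc-demo"  # stand-in for PagerDuty + Slack setup

def triage(state):
    # Phase 2: four analysts run in parallel, each writing its own key
    analysts = {
        "investigator": lambda: "latency p99 breached on checkout",
        "dependency": lambda: "upstream payment API degraded",
        "security": lambda: "no active attack patterns",
        "logs": lambda: "spike in 504s correlated with trace abc123",
    }
    with ThreadPoolExecutor(max_workers=len(analysts)) as pool:
        futures = {name: pool.submit(fn) for name, fn in analysts.items()}
        state["triage"] = {name: f.result() for name, f in futures.items()}

def root_cause(state):
    state["rca"] = "hypothesis: payment API degradation -> checkout latency"

def remediate(state):
    # The Operator would pause here for a human CONFIRM before mutating infra
    state["remediation"] = "proposed: roll back checkout (awaiting CONFIRM)"

def postmortem(state):
    state["postmortem"] = f"timeline drafted for {state['channel']}"

state = {}
for phase in (acknowledge, triage, root_cause, remediate, postmortem):
    phase(state)  # strict ordering: each phase sees all prior phases' state

print(sorted(state))
```

The shared `state` dict plays the role of ADK's session.state: phases never call each other directly, they only read and write the handoff store.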
The CONFIRM Safety Gate
The Operator agent features a hard safety lock. For any mutable infrastructure action (Kubernetes resource changes, WAF rule updates), the system prompt enforces:
- Propose the exact action to be taken with full context
- Halt execution and wait for explicit human CONFIRM input in the next message
- Execute only upon receiving the CONFIRM keyword, never autonomously
This is the most critical production safety pattern in the system. Agents are first responders; humans remain decision-makers.
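A minimal sketch of the gate (the exception and function names are illustrative; in the platform this behavior is enforced through the Operator's system prompt rather than application code):

```python
class ConfirmationRequired(Exception):
    """Raised when a mutable action is attempted without human approval."""

def execute_mutation(action, human_reply=None):
    """Gate every infrastructure mutation behind an explicit CONFIRM keyword.
    The first call proposes the action and halts; execution happens only
    when the next human message is exactly CONFIRM."""
    if human_reply != "CONFIRM":
        raise ConfirmationRequired(f"PROPOSED: {action}. Reply CONFIRM to execute.")
    return f"EXECUTED: {action}"

# Propose first: the agent halts and surfaces the plan to the human.
try:
    execute_mutation("kubectl rollout undo deployment/checkout")
except ConfirmationRequired as exc:
    print(exc)

# Execute only after an explicit CONFIRM from the operator on call.
print(execute_mutation("kubectl rollout undo deployment/checkout",
                       human_reply="CONFIRM"))
```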
Infrastructure Stack
The platform is designed to interface with modern cloud-native environments:
| Component | Technology | Role |
|---|---|---|
| Compute & Workloads | Cloud Run, GKE, Compute Engine | Host target services and agent runtime |
| Telemetry Generation | Prometheus + Cloud Logging | Scrapes /metrics endpoints, collects native logs |
| Event Routing | Alertmanager -> Pub/Sub | Routes SLO breach alerts to trigger agent workflows |
| Agent Tools | Cloud Run Functions | Serverless HTTP functions called by agents (PromQL, Logs, Health) |
| Secrets | Secret Manager | API keys and tokens, never hardcoded |
| CI/CD | Cloud Build | Auto-deploys agent updates on code push |
Observability Stack (7-Layer)
This platform implements one of the most comprehensive observability patterns available for multi-agent systems. Every action is logged, traced, and scored.
| Layer | Tool | What It Captures |
|---|---|---|
| 1 | ADK Web UI | Real-time trace tree, full prompt/response, session state; best for development |
| 2 | Structured Logs | JSON logs per tool call: duration_ms, status, agent, call_id |
| 3 | Cloud Trace | GCP distributed tracing: deep agent hierarchy + GenAI token counts |
| 4 | Agent Evaluation | Automated scoring against 10 golden scenarios (tool accuracy + keyword coverage) |
| 5 | LangSmith | Tool-level tracing; useful for LangChain/LangGraph team comparisons |
| 6 | Langfuse | OTLP export: cost dashboards, P95 latency, usage capacity planning |
| 7 | Cloud Monitoring | Async custom metrics: tool_calls_total, tool_duration_ms for Grafana |
| 7 | Cloud Monitoring | Async custom metrics: tool_calls_total, tool_duration_ms for Grafana |
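Layer 2 can be sketched as a decorator that prints one JSON log line per tool call. The `@observe` name mirrors the conceptual repository structure, but this body is a stdlib-only assumption, not the proprietary implementation:

```python
import functools
import json
import time
import uuid

def observe(tool_fn):
    """Emit one structured JSON log line per tool call, with the field
    names from the table above: duration_ms, status, agent, call_id."""
    @functools.wraps(tool_fn)
    def wrapper(*args, agent="unknown", **kwargs):
        record = {"tool": tool_fn.__name__, "agent": agent,
                  "call_id": str(uuid.uuid4())}
        start = time.perf_counter()
        try:
            result = tool_fn(*args, **kwargs)
            record["status"] = "ok"
            return result
        except Exception:
            record["status"] = "error"
            raise
        finally:
            record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
            print(json.dumps(record))  # stdout -> Cloud Logging on Cloud Run
    return wrapper

@observe
def check_gcp_quotas(project):
    # A hypothetical tool body; the real one would query the quota APIs.
    return {"project": project, "quota_ok": True}

check_gcp_quotas("demo-project", agent="watchdog")
```

Because Cloud Run forwards stdout to Cloud Logging automatically, a plain `print` of a JSON object is enough to feed log-based metrics.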
Example Telemetry Profile
When an agent performs a full system check, the resulting OpenTelemetry span waterfall captures every sub-action:
Span Waterfall (example run):

```
invocation                                    ████████████████████  ~50s
  invoke_agent sre_platform [GenAI]           ████████████████████
    call_llm                                  ████████████████████
      generate_content [gemini-2.5-flash]     ████████████
    invoke_agent squad_a_coordinator          ████████████
      invoke_agent watchdog                   ████████
        watchdog.query_internal_metrics       ███      ~7s   <- custom span
        watchdog.alert_platform_team          █        ~1s   <- custom span
        watchdog.check_gcp_quotas             ██       ~11s  <- custom span
        watchdog.ping_agent_runtime           █        ~2ms  <- custom span
```
All spans appear simultaneously in Cloud Trace and Langfuse via a single OTLP export. The ADK Web UI shows them in real-time during execution.
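A stdlib-only approximation of those custom spans is shown below; a real deployment would use the OpenTelemetry SDK's tracer and OTLP exporter instead of this hand-rolled context manager, and the span names are taken from the waterfall above:

```python
import time
from contextlib import contextmanager

SPANS = []  # stand-in for an OTLP exporter buffer

@contextmanager
def span(name, depth=0):
    """Record a named, timed span; depth mirrors nesting in the waterfall."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        SPANS.append((name, depth, elapsed_ms))

with span("invoke_agent watchdog"):
    with span("watchdog.ping_agent_runtime", depth=1):
        time.sleep(0.002)  # ~2ms, as in the example waterfall
    with span("watchdog.query_internal_metrics", depth=1):
        pass  # a real run would call the Prometheus tool here

# Inner spans finish (and are appended) before their parent closes.
for name, depth, ms in SPANS:
    print(" " * depth * 2 + f"{name} {ms:.1f}ms")
```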
Agent Evaluation Framework
Testing LLMs requires a shift from unit tests to probabilistic evaluation. This platform includes an automated scoring pipeline.
Scoring Formula
Tool Accuracy = (expected tools called) / (total expected tools)
Keyword Coverage = (expected keywords in response) / (total expected keywords)
Overall Score = (Tool Accuracy + Keyword Coverage) / 2
PASS = Overall Score ≥ 0.70 (production gate: ≥ 0.90)
What Gets Tested
- Tool Accuracy: Did the agent select the optimal tool set for the scenario?
- Keyword Coverage: Did the final synthesis contain the required technical context?
- Pass/Fail Gates: CI/CD pipelines require a ≥ 90% overall score across 10 golden scenarios (e.g., latency_incident, security_incident, resource_saturation) before any prompt update is deployed
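The scoring formula translates directly into code. This sketch assumes set-valued tool-call records and case-insensitive keyword matching, both illustrative choices rather than details of the scoring engine:

```python
def score_run(called_tools, response_text, expected_tools, expected_keywords):
    """Implements the scoring formula above: tool accuracy and keyword
    coverage averaged into an overall score with a 0.70 pass gate."""
    tool_accuracy = sum(t in called_tools for t in expected_tools) / len(expected_tools)
    keyword_coverage = sum(k.lower() in response_text.lower()
                           for k in expected_keywords) / len(expected_keywords)
    overall = (tool_accuracy + keyword_coverage) / 2
    return {"tool_accuracy": tool_accuracy,
            "keyword_coverage": keyword_coverage,
            "overall": overall,
            "passed": overall >= 0.70}

# Hypothetical golden scenario: all expected tools called, 2 of 3 keywords hit.
result = score_run(
    called_tools={"query_prometheus", "fetch_error_logs"},
    response_text="p99 latency breach traced to pod OOMKilled restarts",
    expected_tools={"query_prometheus", "fetch_error_logs"},
    expected_keywords=["latency", "OOMKilled", "rollback"],
)
print(result)  # overall = (1.0 + 2/3) / 2 ≈ 0.83 -> PASS at the 0.70 gate
```

Note that this run would pass the 0.70 development gate but fail the 0.90 production gate, which is exactly the distinction the CI/CD pipeline enforces.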
Observability Tool Comparison
For enterprise implementations, the observability stack is tailored to the client's ecosystem:
| Client Profile | Recommended Stack |
|---|---|
| Enterprise / GCP-Native | Cloud Trace + Cloud Monitoring + Native Logs |
| Multi-Cloud / Framework-Agnostic | Langfuse (self-hosted or cloud) + custom metrics |
| LangChain / LangGraph Teams | LangSmith native integration |
| Strict Compliance / Data Sovereignty | Langfuse self-hosted within client VPC |
| Small Company / Budget-Conscious | Langfuse cloud free tier + ADK Web UI |
Deployment Architecture
Development (Current)
Agents run locally via adk web, connecting to GCP services (Cloud Trace, Langfuse) over the internet.
Production Target (Cloud Run)
```
Cloud Run (containerized ADK agent)
├── stdout logs   -> Cloud Logging (automatic)
├── Cloud Logging -> Cloud Monitoring log-based metrics
├── OTel spans    -> Cloud Trace (native, low latency)
└── OTel spans    -> Langfuse (same OTLP code, zero changes)
```
Deploying to Cloud Run requires zero code changes: the observability stack auto-upgrades to native Cloud Logging ingestion.
Conceptual Repository Structure
Note: This represents the architectural structure of the implementation. Specific file names and internal logic are proprietary.
```
Multi-Agent-SRE-Platform/
├── core_engine/
│   ├── agent_router/        # Root intent classification & routing
│   ├── observability/       # OTel setup, @observe decorator, async metric queues
│   └── memory_management/   # Session state and context window handlers
│
├── squads/
│   ├── proactive_squad_a/   # Architect, Simulator, Watchdog, Tuner
│   └── reactive_squad_b/    # 5-phase pipeline: Acknowledge -> Triage -> RCA -> Remediate -> Postmortem
│
├── integrations/
│   ├── cloud_providers/     # GCP / AWS API wrappers
│   ├── monitoring_tools/    # Prometheus, Grafana, Datadog hooks
│   └── communication/       # Slack, PagerDuty, Jira integration logic
│
├── evaluation/
│   ├── golden_scenarios/    # JSONL datasets for agent testing (10 scenarios)
│   └── scoring_engine/      # Automated CI/CD pass/fail logic
│
└── infrastructure_as_code/  # Terraform modules for deploying the agent framework
```
Roadmap
- Cross-Incident Memory: Implement Vertex AI RAG Memory so agents correlate a current incident with a similar one resolved months ago
- Multi-Human Approval: Expand the CONFIRM gate to require M-of-N approvals for highly destructive operations
- A/B Prompt Testing: Integrate Langfuse Datasets to scientifically compare prompt strategies against MTTR metrics
- LLM-as-a-Judge: Upgrade evaluation from keyword matching to semantic scoring using a critic LLM
- Domain Extensions: Apply the same architecture to Customer Support, Finance/Compliance, and HR/Onboarding use cases
Consulting & Implementation
An SRE team doesn't scale linearly with incidents. Agents do.
This architecture is available as a blueprint for enterprise implementation. We specialize in designing, building, and safely deploying multi-agent AI systems tailored to your unique infrastructure and operational workflows.
Our Principles:
- Agents as First Responders, Humans as Decision-Makers: Safety gates (CONFIRM loops) are mandatory for all mutable infrastructure actions
- Observability is Non-Negotiable: You cannot trust what you cannot measure; we instrument every token and API call
- Evaluation Gates Deployments: No AI logic reaches production without passing a rigorous, automated evaluation suite
If you're exploring multi-agent automation for your DevOps, SRE, or Platform Engineering teams, feel free to reach out to discuss architecture reviews or custom implementations.
Write to us at: contact@qubitlyventures.com