AVEVA Core AI Services Platform

Problem Space

Standardising AI consumption across AVEVA's product teams meant solving three coupled problems at once: a shared cloud-native service surface that teams could safely build on, a measurement layer that could tell whether AI changes actually helped or hurt, and a way to expose a heterogeneous backend estate — REST, GraphQL, OData, and existing MCP servers — as well-shaped tools that agents could call without getting confused. Without all three, the platform would have been either an unmeasurable demo or a thin pass-through that left every product team to reinvent its own evaluation and tool-wrapping logic.

Architecture & Patterns

AKS-based cloud-native platform with versioned public APIs and SDKs, plus secure-by-design controls for enterprise consumption.
LLM-as-Judge evaluation framework grading candidate outputs across correctness, faithfulness, tool-call accuracy, and schema compliance — with judge prompts versioned and calibrated against human-graded samples to prevent drift.
Auto-tuning model routing: the same evaluation set runs across a panel of candidate models, and the routing loop picks the cheapest one clearing the quality threshold — FinOps for inference.
Synthetic query generation and randomised holdout sampling, so evaluation distributions are broader than any one person's intuition and demonstrably excluded from any fine-tuning or prompt-tuning step.
Domain-specific Profunctor that wraps REST, GraphQL, OData, and existing MCP services and re-projects them as MCP tools — one adapter layer, four input families, one consistent output shape.
Dynamic MCP-side metadata (descriptions, tags, parameter names) exposed as a tunable layer separate from the underlying service.
Tool-confusion as a first-class metric: wrong-tool selection, malformed parameter shapes, and ambiguous-pick retry rates are measured per metadata variant and gate which framings get promoted.
Defence-in-depth agentic guard rails: intent classifier in front of the LLM, tool-level allow-list, and output-side action guard — so a single failure mode does not escalate into an out-of-scope action.
Refusal correctness graded as a first-class axis by the LLM-as-Judge harness (false-allow / false-refuse rates per model and per prompt variant).
Diagnosed-and-fixed catalogue of long-running agentic failure modes: context degradation, specification drift, runtime tool-selection errors, cascade failures across multi-step agents, and silent failures.
LLM-as-Judge harness reused as a regression suite — every diagnosed failure mode lives on as a named scenario set so future prompt or model changes cannot silently reintroduce it.
Multi-agent / multi-sub-agent context architecture: per-sub-agent isolation with scoped views per role contract, just-in-time context retrieval at invocation, explicit hierarchical hand-off contracts between parent and sub-agents, and boundary-level context-budget pruning.
Production-grade observability embedded in agentic workflows.
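
As an illustration of the refusal-correctness axis listed above, the false-allow and false-refuse rates can be computed from graded cases roughly as follows. This is a minimal sketch, not the platform's implementation: the `GradedCase` record and `refusal_rates` helper are hypothetical names, and it assumes each case has already been labelled with whether a refusal was expected.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class GradedCase:
    """Hypothetical record: one graded request from an evaluation run."""
    should_refuse: bool  # ground truth: the request is outside the allow-list
    did_refuse: bool     # observed: the agent refused


def refusal_rates(cases: Iterable[GradedCase]) -> Tuple[float, float]:
    """Return (false_allow_rate, false_refuse_rate) for a set of graded cases."""
    cases = list(cases)
    out_of_scope = [c for c in cases if c.should_refuse]
    in_scope = [c for c in cases if not c.should_refuse]

    # False allow: an out-of-scope request that the agent answered anyway.
    false_allow = sum(1 for c in out_of_scope if not c.did_refuse)
    # False refuse: an in-scope request that the agent wrongly refused.
    false_refuse = sum(1 for c in in_scope if c.did_refuse)

    fa_rate = false_allow / len(out_of_scope) if out_of_scope else 0.0
    fr_rate = false_refuse / len(in_scope) if in_scope else 0.0
    return fa_rate, fr_rate
```

Keeping the two rates separate matters because they trade off against each other: a guard that refuses everything scores zero on false allows while being useless on in-scope requests.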

Tools & Stack

Azure, AKS, Python, MCP, OpenAPI, GraphQL, OData, pytest

Business Outcomes

Established a reusable platform foundation that enabled consistent, secure consumption of AI capabilities across multiple product teams.
Reduced LLM inference cost by ~30% via auto-tuned model routing backed by measured quality rather than vendor marketing.
Reduced prompt and model evaluation cycle time by ~60% through the reusable LLM-as-Judge framework and tabulated metric grids.
Improved detection of weak AI responses by ~35% through structured, multi-axis evaluation criteria.
Reduced duplicated AI platform engineering by ~40% by absorbing evaluation and service-to-MCP wrapping into a shared layer rather than leaving product teams to reinvent it.
Constrained the agentic chatbot to a sanctioned task allow-list, producing honest, predictable refusals for everything else, verified via graded refusal correctness.
Stabilised long-running agentic workflows by diagnosing and fixing five distinct classes of advanced failure and locking each as a permanent regression scenario in the evaluation harness.
Made multi-sub-agent compositions reliably scalable by treating context as a first-class system primitive: each sub-agent saw only what it needed at the point it needed it, eliminating wholesale transcript forwarding and the unbounded cost, leakage, and confusion that such forwarding produces.

Reusable Narrative Snippets

Shared cloud-native AI services platform on Azure / AKS consumed across product teams and partners, providing stable versioned APIs and SDKs and absorbing the cross-cutting work — security, evaluation, tool wrapping — that would otherwise have been reinvented in every product team.

Built an LLM-as-Judge evaluation framework producing per-prompt and per-model metric grids across correctness, faithfulness, tool-call accuracy and schema compliance, paired with an auto-tuning loop that selected the cheapest model meeting the quality threshold — cutting inference cost ~30% — and synthetic query generation plus randomised holdout sampling that kept golden-set overfitting out of the score.
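
The routing rule described above (cheapest model that clears the quality threshold) can be sketched in a few lines. This is an illustrative sketch only: the `Candidate` fields, the model names, and the cost and quality figures are made-up assumptions, and the quality score is assumed to be a mean judge score on the holdout set in the range 0 to 1.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one model's run over the evaluation set."""
    name: str
    cost_per_1k_tokens: float
    quality: float  # assumed: mean judge score on the holdout set, 0..1


def pick_route(candidates: Iterable[Candidate], threshold: float) -> Optional[Candidate]:
    """Return the cheapest candidate whose quality meets the threshold, or None."""
    eligible = [c for c in candidates if c.quality >= threshold]
    if not eligible:
        return None  # nothing clears the bar; the caller decides the fallback
    # Cheapest first; on a cost tie, prefer the higher-quality model.
    return min(eligible, key=lambda c: (c.cost_per_1k_tokens, -c.quality))


# Illustrative panel with invented numbers.
panel = [
    Candidate("small", 0.2, 0.78),
    Candidate("medium", 1.0, 0.91),
    Candidate("large", 5.0, 0.95),
]
chosen = pick_route(panel, threshold=0.90)
```

Returning `None` rather than silently falling back to the most expensive model keeps the "no candidate is good enough" case visible to whoever owns the threshold.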

Designed a domain-specific Profunctor that wraps REST, GraphQL, OData and existing MCP services as MCP tools with dynamically tunable descriptions, tags and parameter names, and wired the Profunctor into the LLM-as-Judge harness so metadata variants were promoted or retired against a measured tool-confusion metric rather than authored once and left to rot.
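
One way to picture the adapter described above is a wrapper that reshapes arguments on the way in and results on the way out (the `dimap` operation that gives profunctors their name), with tool metadata held as a separately swappable layer. The sketch below is an assumption-laden illustration, not the actual adapter: `Tool`, `ToolMetadata`, `fake_rest_get_asset`, and the payload shapes are all hypothetical, and no real MCP SDK calls are used.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class ToolMetadata:
    """Hypothetical tunable layer: what the agent sees when choosing a tool."""
    name: str
    description: str
    tags: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Tool:
    metadata: ToolMetadata
    call: Callable[[Dict[str, Any]], Any]

    def dimap(self, pre: Callable, post: Callable) -> "Tool":
        """Adapt inputs with `pre` and outputs with `post`; the backend call is untouched."""
        inner = self.call
        return replace(self, call=lambda args: post(inner(pre(args))))

    def with_metadata(self, metadata: ToolMetadata) -> "Tool":
        """Swap the agent-facing framing without changing behaviour."""
        return replace(self, metadata=metadata)


def fake_rest_get_asset(params: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for a real HTTP call; returns a backend-shaped payload.
    return {"value": [{"Id": params["assetId"], "Name": "Pump-7"}]}


raw = Tool(ToolMetadata("get_asset", "Raw asset lookup"), fake_rest_get_asset)
tool = raw.dimap(
    pre=lambda args: {"assetId": args["asset_id"]},   # agent-facing name -> backend name
    post=lambda body: {"content": body["value"]},     # backend shape -> uniform tool result
)
variant_b = tool.with_metadata(
    ToolMetadata("find_asset", "Look up one asset by its identifier", ("assets",))
)
```

Because `variant_b` shares the same call path as `tool`, any difference in wrong-tool or malformed-parameter rates between the two can be attributed to the metadata framing alone.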

Helped build and define a defence-in-depth guard-rail layer for the agentic chatbot — an intent classifier that mapped requests to an allow-listed task taxonomy or short-circuited to a refusal, a tool-level allow-list so off-list capabilities weren't in the agent's surface, and an output-side second-pass guard — with refusal correctness graded as a first-class axis in the LLM-as-Judge framework.
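
The three layers described above can be sketched as one request handler. Everything here is a simplified assumption for illustration: the task taxonomy, the keyword-based `classify_intent` (standing in for a trained classifier), the tool names, and the shape of the agent's return value are hypothetical.

```python
from typing import Any, Callable, Dict, Optional, Set

# Hypothetical task taxonomy and the tools each task may use.
ALLOWED_TASKS: Set[str] = {"asset_lookup", "report_summary"}
TOOLS_BY_TASK: Dict[str, Set[str]] = {
    "asset_lookup": {"find_asset"},
    "report_summary": {"get_report"},
}
REFUSAL = "Sorry, that request is outside what this assistant is set up to do."


def classify_intent(text: str) -> Optional[str]:
    """Toy keyword classifier standing in for a trained intent model."""
    lowered = text.lower()
    if "asset" in lowered:
        return "asset_lookup"
    if "report" in lowered:
        return "report_summary"
    return None


def handle(text: str, agent: Callable[[str, Set[str]], Dict[str, Any]]) -> str:
    # Layer 1: the intent classifier short-circuits anything off the allow-list.
    task = classify_intent(text)
    if task not in ALLOWED_TASKS:
        return REFUSAL

    # Layer 2: the agent is only offered the tools sanctioned for this task.
    allowed_tools = TOOLS_BY_TASK[task]
    action = agent(text, allowed_tools)  # assumed shape: {"tool": ..., "answer": ...}

    # Layer 3: the output-side guard re-checks the action the agent actually took.
    if action["tool"] not in allowed_tools:
        return REFUSAL
    return action["answer"]
```

The third check looks redundant when the second layer works, which is the point: it only fires when an earlier layer has already failed.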

Debugged advanced agentic failure modes across long-running multi-step agents — context degradation, specification drift, runtime tool-selection errors, cascade failures and silent failures — and turned each diagnosed-and-fixed mode into a named regression scenario in the LLM-as-Judge evaluation harness, so future prompt and model changes could not silently reintroduce previously fixed classes of failure.

Designed advanced context architectures for multi-agent / multi-sub-agent systems so each agent received only the context it needed at the point of need, combining per-sub-agent isolation with scoped views, just-in-time retrieval at invocation, explicit hierarchical hand-off contracts between parent and sub-agents, and boundary-level context-budget pruning. Context became a first-class system primitive rather than a side-effect of how agents happened to be wired.
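
Scoped views and boundary-level pruning can be illustrated with a small hand-off helper. This is a sketch under stated assumptions: `RoleContract` and `scoped_view` are hypothetical names, the context store is a plain dictionary, and a character budget stands in for a real token budget.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RoleContract:
    """Hypothetical hand-off contract between a parent and one sub-agent."""
    role: str
    allowed_keys: Tuple[str, ...]  # in priority order, most important first
    budget_chars: int              # crude stand-in for a token budget


def scoped_view(store: Dict[str, str], contract: RoleContract) -> Dict[str, str]:
    """Build the context a sub-agent receives: allowed keys only, pruned to budget."""
    view: Dict[str, str] = {}
    remaining = contract.budget_chars
    for key in contract.allowed_keys:
        if remaining <= 0:
            break  # budget exhausted; lower-priority keys are dropped
        if key not in store:
            continue
        value = store[key][:remaining]  # truncate at the boundary, not downstream
        view[key] = value
        remaining -= len(value)
    return view


# Illustrative store: the sub-agent never sees keys outside its contract.
store = {
    "ticket": "Pump-7 vibration alarm",
    "customer_pii": "secret",
    "history": "x" * 100,
}
contract = RoleContract("diagnostics", ("ticket", "history"), budget_chars=30)
view = scoped_view(store, contract)
```

Building the view at invocation time, rather than forwarding the parent's transcript, is what keeps both cost and leakage bounded by the contract instead of by conversation length.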

Source Notes

Derived from role responsibilities and achievements in config/madu_profile.json; reconciled with JobVia export (madu_alikor_export.json).
Confidence: high