Principal AI Engineer - Cloud & Platform Services

Company Context

AVEVA is a global industrial software leader delivering secure, scalable platform services for industrial and engineering software.

Summary

Principal AI Engineer - Cloud & Platform Services at AVEVA, 2025-09 to present.

Responsibilities

Led the design and delivery of cloud-native Core AI Services forming a shared platform consumed across product teams and partners.
Engineered fault-tolerant platform services on Azure, including AKS-based deployments and operational patterns for reliability and scale.
Designed secure, versioned public APIs and SDKs to standardise access to AI capabilities across teams and integrations.
Embedded authentication, authorisation and data protection controls to meet enterprise security and privacy requirements.
Drove adoption of agentic coding best practices as a key contributor to the Hypervelocity Engineering initiative, establishing patterns for AI-assisted development using GitHub Copilot CLI, Claude Code, and ChatGPT Codex.
Designed and developed agentic workflows that accelerate both software engineering delivery and business process automation.
Built an LLM-as-Judge evaluation framework that produced per-prompt and per-model metrics — turning prompt and model changes from subjective judgement into measurable deltas across correctness, faithfulness, tool-call accuracy and schema compliance.
Built an auto-tuning loop that ran candidate prompts and queries across a panel of models and picked the cheapest one meeting the quality threshold, reducing inference spend by roughly 30% without regressing answer quality.
Designed a domain-specific Profunctor — a typed adapter that wraps any REST, GraphQL, OData or existing MCP service and re-projects it as an MCP tool with dynamically tunable metadata (descriptions, tags, parameter names) — and a paired evaluation framework that measured whether metadata reshaping actually reduced agent tool-confusion.
Helped build and define defence-in-depth agentic guard rails for the chatbot — an intent classifier in front of the LLM that mapped each request to an allow-listed task taxonomy or short-circuited to a refusal, a tool-level allow-list so off-list capabilities weren't even present in the agent's surface, and an output-side guard that inspected proposed actions before execution — with refusal correctness graded as a first-class axis in the LLM-as-Judge framework.
Diagnosed and remediated advanced agentic failure modes in long-running multi-step agents — context degradation as transcript clutter crowded out signal, specification drift across multi-step tasks, runtime tool-selection errors distinct from metadata-driven confusion, cascade failures where one corrupted step poisoned downstream agents, and silent failures that returned confident but wrong output — and turned each fix into a permanent named regression scenario in the LLM-as-Judge evaluation harness.
Designed advanced context architectures for multi-agent / multi-sub-agent systems so each agent received the right context — and only the right context — at the point of need. Combined per-sub-agent context isolation with scoped views per role, just-in-time context retrieval against authoritative sources at invocation time, explicit hierarchical hand-off contracts so parents passed down only what a sub-agent needed, and boundary-level context-budget management so summarisation and pruning happened at each hop rather than letting history accumulate unbounded.

Evaluation Framework

A meaningful share of this work was building the measurement layer that everything else stood on. Without a harness, every prompt tweak, model swap, or routing change is a gut call dressed up as engineering. I treated evaluation as a first-class platform capability: prompt, model, and routing changes had to defend themselves against numbers, not vibes, before they shipped.

The framework is LLM-as-Judge at the core: judge prompts grade candidate outputs across multiple axes — correctness, faithfulness to the underlying data, tool-call accuracy, and schema compliance — and the judges themselves are versioned and calibrated against human-graded samples so they can't silently drift. Each run produces a tabulated metric grid (model × prompt-variant × question-set) that makes regressions obvious and improvements legible. On top of that I built an auto-tuning loop: the same evaluation set runs across a panel of candidate models, scores are ranked, and the loop picks the cheapest model that still clears the quality threshold. That single change — treating model selection as FinOps backed by measured quality rather than vendor marketing — drove the ~30% inference-cost reduction the platform now ships with.
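
The selection rule at the heart of the auto-tuning loop can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the `ModelResult` type, the model labels, the scores, and the prices are all hypothetical, and the mean judge score is assumed to be already aggregated over the evaluation set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    model: str                  # hypothetical model label
    mean_score: float           # mean judge score in [0, 1] over the evaluation set
    cost_per_1k_tokens: float   # illustrative price, not real vendor pricing


def pick_cheapest_passing(results, threshold):
    """Return the cheapest result whose score clears the threshold, else None."""
    passing = [r for r in results if r.mean_score >= threshold]
    if not passing:
        return None
    # Tie-break on score so equal-cost candidates resolve deterministically.
    return min(passing, key=lambda r: (r.cost_per_1k_tokens, -r.mean_score))


panel = [
    ModelResult("model-large", 0.93, 10.0),
    ModelResult("model-medium", 0.90, 3.0),
    ModelResult("model-small", 0.78, 0.5),
]
choice = pick_cheapest_passing(panel, threshold=0.88)
print(choice.model if choice else "no model clears the threshold")  # prints: model-medium
```

Returning `None` when nothing passes keeps the decision explicit: the caller has to choose between raising the budget and lowering the bar, rather than silently falling back to a weaker model.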

The hardest part was making sure we were measuring generalisation rather than memorisation. Hand-curated golden sets quietly become test sets you train against — prompts get tuned until they pass the known questions, and the score stops meaning what it used to. I introduced two countermeasures. The first is synthetic query generation: tooling that programmatically synthesises candidate questions across the domain so the evaluation surface is broader than any one person's intuition. The second is randomised holdout sampling — every evaluation run pulls a fresh sample of queries that were demonstrably excluded from any fine-tuning or prompt-tuning step. Together they keep score deltas honest: when a number moves, it reflects real behaviour change rather than a tighter fit to the same handful of questions we've already optimised against.
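
A minimal sketch of the holdout rule described above, assuming queries are plain strings and that the set of queries used in any tuning step is tracked somewhere. The function name and the query identifiers are hypothetical.

```python
import random


def sample_holdout(candidate_queries, tuned_against, k, seed=None):
    """Draw k distinct queries that never appeared in any tuning step."""
    excluded = set(tuned_against)
    # Sort so that a fixed seed gives a reproducible sample.
    pool = sorted(q for q in set(candidate_queries) if q not in excluded)
    if len(pool) < k:
        raise ValueError("holdout pool is smaller than the requested sample")
    return random.Random(seed).sample(pool, k)


candidates = [f"q{i}" for i in range(10)]   # e.g. synthetic queries
tuned = ["q0", "q1", "q2"]                  # queries already used for tuning
holdout = sample_holdout(candidates, tuned, k=4, seed=7)
print(holdout)
```

Failing loudly when the pool is too small is deliberate in this sketch: quietly topping the sample up with tuned queries would defeat the purpose of the holdout.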

Service-to-MCP Profunctor

The agents we were enabling had to call across a heterogeneous estate — internal REST endpoints, GraphQL schemas, OData feeds for engineering data, and several existing MCP servers. The obvious path — write a bespoke MCP adapter per service — would have ossified into N copies of the same plumbing, each diverging in subtle ways and each carrying its own bugs. So I built one layer instead: a domain-specific Profunctor that takes a service interface in any of those four shapes and projects it as an MCP tool.

A Profunctor, in plain terms, lets you transform both the inputs and the outputs of a function-like thing — which is exactly what bridging two protocol surfaces requires. One layer, four input families (REST / GraphQL / OData / MCP), one consistent MCP-shaped output. More importantly, the adapter exposes the MCP-side surface — tool description, tags, parameter names — as a tunable layer separate from the underlying service. Agents reason about which tool to call partly via those strings, so the same backend can be wrapped in several framings and tested independently without touching the service it points at.
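
The core idea can be shown with a small `dimap` sketch: one function maps the caller's arguments into the shape the backend expects, another maps the backend's response into the shape the caller wants. Everything here is illustrative. The `Adapter` class, the fake REST function, and the field names are assumptions, and the output dictionary is a simplified stand-in rather than the actual MCP result schema.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")
D = TypeVar("D")


@dataclass(frozen=True)
class Adapter(Generic[A, B]):
    run: Callable[[A], B]

    def dimap(self, pre: Callable[[C], A], post: Callable[[B], D]) -> "Adapter[C, D]":
        """Adapt the input side with `pre` and the output side with `post`."""
        return Adapter(lambda c: post(self.run(pre(c))))


def fake_rest_get_asset(params: dict) -> dict:
    # Stand-in for a REST call; no network is involved.
    return {"status": 200, "body": {"id": params["asset_id"], "state": "running"}}


rest = Adapter(fake_rest_get_asset)
tool = rest.dimap(
    pre=lambda args: {"asset_id": args["assetTag"]},   # tool-side name -> backend name
    post=lambda resp: {"isError": resp["status"] != 200, "text": resp["body"]["state"]},
)
result = tool.run({"assetTag": "P-101"})
print(result)  # prints: {'isError': False, 'text': 'running'}
```

Because `pre` is the only place the tool-facing parameter name appears, renaming `assetTag` to test a different framing never touches the backend function.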

That tunability only matters if you can measure whether your changes help, so the Profunctor wires straight into the evaluation harness. Description, tag, and parameter-name variants run through the LLM-as-Judge framework with a specific metric of interest: tool-confusion — the rate of wrong-tool selection, malformed parameter shapes, and ambiguous-pick retries. Metadata changes that didn't move the needle were retired; ones that did got promoted, with the supporting numbers sitting next to them in the run log rather than living in someone's head. The result is a closed loop: heterogeneous services come in one side, well-shaped MCP tools come out the other, and the metadata on those tools is continuously tuned against measured agent behaviour rather than authored once and left to rot.
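
As a rough sketch of how a tool-confusion rate per metadata variant might be computed from trial logs: a trial counts as confused if the wrong tool was picked, the parameters were malformed, or a retry was needed. The record fields and variant names are hypothetical, not the harness's actual schema.

```python
from collections import defaultdict


def tool_confusion_rates(trials):
    """Return {variant: share of trials showing any tool-confusion signal}."""
    totals = defaultdict(int)
    confused = defaultdict(int)
    for t in trials:
        totals[t["variant"]] += 1
        wrong_tool = t["chosen_tool"] != t["expected_tool"]
        if wrong_tool or not t["params_valid"] or t["retries"] > 0:
            confused[t["variant"]] += 1
    return {v: confused[v] / totals[v] for v in totals}


trials = [
    {"variant": "terse", "expected_tool": "get_asset", "chosen_tool": "list_assets",
     "params_valid": True, "retries": 0},
    {"variant": "terse", "expected_tool": "get_asset", "chosen_tool": "get_asset",
     "params_valid": True, "retries": 0},
    {"variant": "descriptive", "expected_tool": "get_asset", "chosen_tool": "get_asset",
     "params_valid": True, "retries": 0},
    {"variant": "descriptive", "expected_tool": "get_asset", "chosen_tool": "get_asset",
     "params_valid": True, "retries": 0},
]
rates = tool_confusion_rates(trials)
print(rates)  # prints: {'terse': 0.5, 'descriptive': 0.0}
```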

Agentic Guard Rails

An agent that can attempt anything is one that fails in unbounded ways. Enterprise consumers of the platform needed a clear contract: a fixed list of tasks the chatbot performs, and an honest, predictable refusal for everything else. I helped build and define that contract — and crucially, helped define it as a defence-in-depth layer rather than a single text-layer constraint. A sufficiently determined prompt will talk its way around a system-prompt instruction; it cannot talk its way around a tool that isn't there.

The guard rails sit in three independent layers. First, an intent classifier in front of the LLM: every incoming request is mapped onto the allow-listed task taxonomy or short-circuited to a refusal before the model ever sees free-form input. Second, a tool-level allow-list: the agent is only ever handed the MCP tools needed for its sanctioned tasks, so off-list capabilities aren't merely discouraged — they aren't present in the surface the agent can reason over. Third, an output-side guard: a second-pass check inspects the proposed action before any tool is actually invoked, blocking anything that drifted off-scope despite the earlier gates. Three independent layers means a single failure mode — a classifier miss, a prompt-injection attempt, an unexpected emergent behaviour — doesn't escalate into an out-of-scope action.
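
The three layers can be sketched as a single request handler. This is an illustration under stated assumptions: the task taxonomy, the tool names, and the keyword-based `classify_intent` are hypothetical stand-ins (a real classifier would be a model, not keyword matching), and `propose_action` represents whatever agent call returns the proposed tool invocation.

```python
REFUSAL = {"status": "refused", "reason": "request is outside the supported task list"}

# Hypothetical taxonomy: task name -> tools the agent may be handed for it.
TASK_TOOLS = {
    "asset_status": {"get_asset_status"},
    "report_export": {"export_report"},
}


def classify_intent(request):
    """Keyword stand-in for a real intent classifier; returns a task or None."""
    text = request.lower()
    if "status" in text:
        return "asset_status"
    if "export" in text:
        return "report_export"
    return None


def handle(request, propose_action):
    task = classify_intent(request)                    # layer 1: intent gate
    if task is None:
        return REFUSAL
    tools = TASK_TOOLS[task]                           # layer 2: tool allow-list
    action = propose_action(request, sorted(tools))    # agent only sees these tools
    if action.get("tool") not in tools:                # layer 3: output-side guard
        return {"status": "blocked", "reason": "proposed action is off-scope"}
    return {"status": "allowed", "action": action}


well_behaved = lambda request, tools: {"tool": tools[0]}
rogue = lambda request, tools: {"tool": "delete_asset"}

print(handle("What is the status of P-101?", well_behaved)["status"])  # prints: allowed
print(handle("Write me a poem", well_behaved)["status"])               # prints: refused
print(handle("What is the status of P-101?", rogue)["status"])         # prints: blocked
```

The third call shows why the layers are independent: the request passes the intent gate, but the action is still blocked because the proposed tool is outside the allow-list.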

Guard-rail behaviour is only trustworthy if it's measured, so we made refusal correctness a first-class graded axis in the LLM-as-Judge framework. Off-list prompts must be cleanly refused — with the right refusal copy, not a hallucinated attempt or a partial answer that quietly does the wrong thing; on-list prompts must actually be served rather than being over-refused into uselessness. The same metric grid that reports answer correctness and tool-call accuracy now reports false-allow and false-refuse rates per model and per prompt variant. Guard-rail changes therefore ship the same way prompt changes do — with a defensible metric grid behind them, not a "looks fine on the demo" pass — and we can spot drift, regressions, and over-refusal the moment they appear in the numbers.
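
A minimal sketch of the two rates, assuming each graded prompt is reduced to a pair of booleans: whether it was on the allow-list and whether the system refused it. The function name and the sample data are hypothetical.

```python
def refusal_rates(cases):
    """cases: iterable of (is_on_list, refused) pairs, one per graded prompt."""
    cases = list(cases)
    off_list = [refused for is_on_list, refused in cases if not is_on_list]
    on_list = [refused for is_on_list, refused in cases if is_on_list]
    # False-allow: an off-list prompt that was not refused.
    false_allow = off_list.count(False) / len(off_list) if off_list else 0.0
    # False-refuse: an on-list prompt that was refused anyway.
    false_refuse = on_list.count(True) / len(on_list) if on_list else 0.0
    return {"false_allow": false_allow, "false_refuse": false_refuse}


graded = [
    (False, True), (False, False),                             # off-list prompts
    (True, False), (True, False), (True, True), (True, False), # on-list prompts
]
print(refusal_rates(graded))  # prints: {'false_allow': 0.5, 'false_refuse': 0.25}
```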

Debugging Advanced Agentic Failures

Long-running multi-step agents fail in ways that single-turn LLM tests don't catch. Five classes recurred often enough across the platform to deserve named, separate attention — and they were where most of the hands-on debugging time actually went.

The first was context degradation. As transcripts and tool outputs accumulate over a long loop, the signal needed for the current decision gets crowded out by historical noise the agent must still scan past. The symptom is an agent that started a task competently and ends it incoherently for no visible reason. The fix wasn't bigger context windows — it was rolling summarisation at structured checkpoints, deliberate context windowing that drops what's no longer relevant, and retrieval-based re-grounding against authoritative sources rather than against the agent's own scrolled-back history.

The second was specification drift. Multi-step tasks quietly redefine the original goal: a constraint gets forgotten, one sub-goal is satisfied at the expense of another, or the agent reinterprets what "done" means each turn until it lands somewhere adjacent to the ask rather than on it. Mitigated by pinning the specification into a stable context region that doesn't roll off, and by adding periodic re-grounding checkpoints that force the agent back to the original ask before it commits to the next step.
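
One way to picture the pinned region is a context builder that always places the specification first and lets only the turn history roll off. This is a deliberately simple sketch with hypothetical names; a real system would combine it with summarisation rather than plain truncation.

```python
def build_context(spec, history, max_recent):
    """Pin the specification at the top; keep only the most recent turns."""
    # Guard the zero case: history[-0:] would return the whole list.
    recent = history[-max_recent:] if max_recent > 0 else []
    return [f"SPECIFICATION (pinned): {spec}"] + recent


history = [f"turn {i}" for i in range(1, 11)]
context = build_context("Export P-101 readings as CSV, metric units only", history, max_recent=3)
print(context[1:])  # prints: ['turn 8', 'turn 9', 'turn 10']
```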

The third was runtime tool-selection — distinct from the metadata-driven tool-confusion the Profunctor addressed. Even with perfectly tuned descriptions, agents pick the wrong tool for the step at hand, build redundant chains where one call would do, or fail to reach for a relevant tool that is sitting right there in the surface. Addressed with structured planning prompts that force an explicit plan before execution, per-step tool-call traces flagged where the chosen tool's output didn't actually satisfy the step's intent, and runtime gates that demand justification when a low-utility tool is picked.

The fourth was cascade failure. In multi-step or multi-agent flows, a single upstream error corrupts every downstream consumer — and the final output looks confident because each downstream agent did its job correctly given the corrupted input. The danger is precisely that nothing throws. Mitigated by per-step provenance, confidence propagation so downstream consumers can see how trustworthy their inputs are, and detectors that flag any output that depends on an already-flagged upstream step rather than letting corruption travel silently through the pipeline.
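
The dependency-based detector can be sketched as a taint propagation over the step graph: any step that depends, directly or transitively, on a flagged step is itself flagged. The step names and the `depends_on` mapping are hypothetical, and the graph is assumed to be small enough for a simple fixed-point loop.

```python
def tainted_steps(depends_on, flagged):
    """Return every step that is flagged or depends on a flagged step."""
    tainted = set(flagged)
    changed = True
    while changed:  # repeat until no new step becomes tainted
        changed = False
        for step, upstream in depends_on.items():
            if step not in tainted and any(u in tainted for u in upstream):
                tainted.add(step)
                changed = True
    return tainted


pipeline = {
    "fetch": [],
    "parse": ["fetch"],
    "summarise": ["parse"],
    "lookup": [],
    "report": ["summarise", "lookup"],
}
print(sorted(tainted_steps(pipeline, flagged={"parse"})))
# prints: ['parse', 'report', 'summarise']
```

Note that `lookup` stays clean while `report` does not: one corrupted input is enough to mark a consumer as untrustworthy.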

The fifth was the hardest: silent failures. Output that is schema-valid, tone-confident, and quietly wrong. No exception, no retry, no obvious signal — just an answer that looks right and isn't. Mitigated with structural consistency checks, retrieval-grounded verification of factual claims, and second-pass validators (often LLM-as-Judge-shaped) that compare the proposed answer against expected shapes and sources before it's ever surfaced to a user.

Verifying the Fixes

Diagnosing a failure mode once is necessary but not sufficient — the same class of failure will quietly reappear after the next prompt tweak or model swap if nothing is defending against it. For each of the five classes above I built a named scenario set inside the LLM-as-Judge framework: adversarial questions or multi-step tasks engineered to deterministically reproduce the failure. Fix the underlying cause, watch the relevant axis on the metric grid move, and the scenario set then stays in the suite as a permanent regression test. Future prompt swaps, model swaps, and routing changes can't silently reintroduce a class of failure that's already been diagnosed and fixed — if they try, the relevant scenario set lights up in the next evaluation run.
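
A rough sketch of how named scenario sets might be organised and run, under heavy simplification: here a case passes on exact equality with an expected answer, whereas the framework described above grades with judge prompts. The scenario names, tasks, and the toy agent are all hypothetical.

```python
def run_regressions(scenarios, agent):
    """Return {failure_class: [names of failing cases]} for classes with failures."""
    failures = {}
    for failure_class, cases in scenarios.items():
        failed = [c["name"] for c in cases if agent(c["task"]) != c["expected"]]
        if failed:
            failures[failure_class] = failed
    return failures


SCENARIOS = {
    "specification_drift": [
        {"name": "keeps-unit-constraint", "task": "convert 2 km", "expected": "2000 m"},
    ],
    "silent_failure": [
        {"name": "rejects-unknown-asset", "task": "status of X-999", "expected": "unknown asset"},
    ],
}


def toy_agent(task):
    # Stand-in agent that handles one task correctly and bluffs on the other.
    return "2000 m" if task == "convert 2 km" else "running"


print(run_regressions(SCENARIOS, toy_agent))
# prints: {'silent_failure': ['rejects-unknown-asset']}
```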

The deeper point of this was to stop agentic reliability from being a heroics-driven activity dependent on whoever happened to remember the last incident. Each named failure mode promoted from a debug session into the harness is one more piece of institutional memory the platform now defends automatically, and the backlog of advanced agentic failures we have encountered becomes a finite, shrinking list rather than an open-ended source of surprises. That's the difference between an agentic platform that demos well and one that other teams can actually build on with a straight face.

Multi-Agent Context Architecture

Multi-agent and multi-sub-agent systems have a context problem distinct from single-agent loops. Bundling the whole transcript and forwarding it to every sub-agent guarantees confusion, leakage, and unbounded cost; starving sub-agents of context guarantees they pick the wrong action. The right answer sits in the middle and has to be designed deliberately. At AVEVA I built advanced context architectures that treated context as a first-class system primitive rather than as a side-effect of how agents happen to be wired together.

The first mechanism is per-sub-agent context isolation. Each sub-agent has its own scoped context view defined by its role contract: what it can see, what it can't, what shape its inputs arrive in. Sibling sub-agents don't share context unless something explicitly hands it across; agent roles get a contract not unlike a function signature, and the rest of the system's working state stays out of their sight. That is what stops one sub-agent's prior reasoning from quietly steering an unrelated sub-agent in the next branch, and it is what keeps the system possible to reason about rather than emergently entangled.

The second is just-in-time retrieval at invocation. Rather than bundling everything an agent might conceivably need up front, context is assembled and injected at the moment a sub-agent is actually invoked. That includes pulling against authoritative sources at invocation time rather than relying on whatever happened to be in the parent's history — the agent doesn't reason against stale scrolled-back state, it reasons against fresh scoped material retrieved for the specific call. This shrinks token cost and tightens relevance in the same move.

The third is hierarchical hand-off contracts. Parent agents pass down only the slice of context each sub-agent needs, through an explicit hand-off contract. The hand-off is data, not transcript: a structured payload with the inputs, references, and constraints the sub-agent must respect — not a copy of the parent's working state. That makes the parent–child boundary a real interface you can reason about, version, and evaluate, rather than a fuzzy "whatever the parent had in scope".
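
A hand-off contract of this kind might look like the following sketch: a frozen payload built from an explicit list of keys, so nothing else in the parent's state can leak across. The `HandOff` type, its fields, and the example state are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandOff:
    task: str
    inputs: dict
    references: tuple = ()
    constraints: tuple = ()


def make_handoff(parent_state, task, needed_keys, references=(), constraints=()):
    """Copy only the declared keys from the parent's state into the payload."""
    missing = [k for k in needed_keys if k not in parent_state]
    if missing:
        raise KeyError(f"parent state lacks required inputs: {missing}")
    inputs = {k: parent_state[k] for k in needed_keys}
    return HandOff(task, inputs, tuple(references), tuple(constraints))


parent_state = {
    "transcript": ["...many turns of parent reasoning..."],
    "asset_id": "P-101",
    "unit_system": "metric",
}
handoff = make_handoff(
    parent_state,
    task="fetch latest readings",
    needed_keys=["asset_id", "unit_system"],
    constraints=["read-only"],
)
print(handoff.inputs)  # prints: {'asset_id': 'P-101', 'unit_system': 'metric'}
```

The parent's transcript never reaches the sub-agent because it was never named in `needed_keys`; the contract is the allow-list.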

The fourth is boundary-level budget management. Every agent boundary enforces a context budget. Summarisation, structured pruning, and dropping material that is no longer load-bearing happen at each hop rather than at some optimistic final step. Combined with the other three mechanisms, context shape is enforced architecturally — agents see what they need to see, when they need to see it, and the system doesn't degrade as the call tree gets deeper or wider. That is what made multi-sub-agent compositions a viable building block on the platform rather than an unbounded reliability hazard.
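
Budget enforcement at a boundary can be sketched as a priority-based pruning pass. This is an assumption-laden illustration: token cost is approximated by whitespace word count, priorities are assigned by hand, and items are dropped outright rather than summarised.

```python
def enforce_budget(items, budget, count_tokens=lambda text: len(text.split())):
    """items: list of (priority, text). Keep highest priorities that fit the budget,
    returning the kept texts in their original order."""
    # sorted() is stable, so equal priorities keep their original relative order.
    ranked = sorted(enumerate(items), key=lambda pair: -pair[1][0])
    kept, used = set(), 0
    for index, (_, text) in ranked:
        cost = count_tokens(text)
        if used + cost <= budget:
            kept.add(index)
            used += cost
    return [text for index, (_, text) in enumerate(items) if index in kept]


items = [
    (1, "old tool output one two three"),  # 6 words, low priority
    (3, "task constraints here"),          # 3 words, high priority
    (2, "latest result ok"),               # 3 words, medium priority
]
print(enforce_budget(items, budget=6))
# prints: ['task constraints here', 'latest result ok']
```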

Outcomes

Established a reusable platform foundation that enabled consistent, secure consumption of AI capabilities across multiple product teams.
Improved platform adoption by providing stable, versioned APIs and SDKs that reduced integration friction between teams.
Shaped organisational best practices for agentic coding, enabling engineering teams to leverage AI-powered development tools effectively and consistently.
Established a reusable agentic AI platform foundation, enabling secure AI workflows 50% faster.
Improved detection of weak AI responses by 35% through structured evaluation criteria.
Reduced AI evaluation cycle time by 60% through reusable evaluation workflows.
Embedded production-grade observability into agentic workflows, reducing debugging time by 45%.
Enabled data-driven LLM model selection, reducing inference cost by 30%.
Reduced duplicated AI platform engineering effort by 40%.
Built an automated LLM evaluation framework, reducing prompt/model evaluation cycles by 60%.
Reduced manual AI output review effort by 50% through automated judgement mechanisms.
Improved release confidence by making API contracts explicit and versioned.
Built an LLM-as-Judge evaluation framework producing per-prompt and per-model metric grids across correctness, faithfulness, tool-call accuracy and schema compliance — with judge prompts versioned and calibrated against human-graded samples to prevent drift.
Auto-tuned model routing reduced inference cost by ~30% by always picking the cheapest model meeting the evaluation threshold — treating model selection as FinOps backed by measured quality rather than vendor marketing.
Introduced synthetic query generation and randomised holdout sampling, eliminating overfitting against hand-curated golden sets and making score deltas reflect real generalisation.
Built a domain-specific Profunctor that wrapped REST, GraphQL, OData and existing MCP services as MCP tools with dynamically tunable metadata, and used the evaluation framework to measure and reduce agent tool-confusion.
Bounded the agentic chatbot to a defensible task allow-list via a three-layer guard — intent classifier in front of the LLM, tool-level allow-list, and output-side action guard — so off-list requests produced honest refusals rather than unbounded attempts.
Made refusal correctness a graded axis in the LLM-as-Judge framework, with false-allow and false-refuse rates tracked per model and per prompt variant so guard-rail changes shipped against numbers rather than vibes.
Diagnosed and fixed five classes of advanced agentic failure — context degradation, specification drift, runtime tool-selection errors, cascade failures across multi-step agents, and silent failures producing confident but wrong output.
Used the LLM-as-Judge framework as a regression suite for those failure modes — each fix locked in as a named scenario set so future model and prompt changes could not silently reintroduce a previously diagnosed class of failure.
Designed advanced multi-agent / multi-sub-agent context architectures with per-sub-agent isolation, just-in-time retrieval at invocation, explicit hierarchical hand-off contracts, and boundary-level context-budget pruning — so each agent saw only the context it needed at the point it needed it.
Made context a first-class system primitive across agentic compositions, eliminating wholesale transcript forwarding and the unbounded cost, leakage and confusion it produces, and making multi-sub-agent composition a viable building block rather than a reliability hazard.

Reusable CV Bullets

Led the design and delivery of cloud-native Core AI Services forming a shared platform consumed across product teams and partners.
Engineered fault-tolerant platform services on Azure, including AKS-based deployments and operational patterns for reliability and scale.
Designed secure, versioned public APIs and SDKs to standardise access to AI capabilities across teams and integrations.
Embedded authentication, authorisation and data protection controls to meet enterprise security and privacy requirements.
Drove adoption of agentic coding best practices as a key contributor to the Hypervelocity Engineering initiative, establishing patterns for AI-assisted development using GitHub Copilot CLI, Claude Code, and ChatGPT Codex.
Designed and developed agentic workflows that accelerate both software engineering delivery and business process automation.
Established a reusable platform foundation that enabled consistent, secure consumption of AI capabilities across multiple product teams.
Improved platform adoption by providing stable, versioned APIs and SDKs that reduced integration friction between teams.
Shaped organisational best practices for agentic coding, enabling engineering teams to leverage AI-powered development tools effectively and consistently.
Established a reusable agentic AI platform foundation, enabling secure AI workflows 50% faster.
Improved detection of weak AI responses by 35% through structured evaluation criteria.
Reduced AI evaluation cycle time by 60% through reusable evaluation workflows.
Embedded production-grade observability into agentic workflows, reducing debugging time by 45%.
Enabled data-driven LLM model selection, reducing inference cost by 30%.
Reduced duplicated AI platform engineering effort by 40%.
Built an automated LLM evaluation framework, reducing prompt/model evaluation cycles by 60%.
Reduced manual AI output review effort by 50% through automated judgement mechanisms.
Improved release confidence by making API contracts explicit and versioned.
Built an LLM-as-Judge evaluation framework producing per-prompt and per-model metric grids across correctness, faithfulness, tool-call accuracy and schema compliance — with judge prompts versioned and calibrated against human-graded samples.
Built an auto-tuning loop that ran candidate prompts across a panel of models and picked the cheapest one meeting the evaluation threshold, reducing inference cost by ~30%.
Introduced synthetic query generation and randomised holdout sampling to eliminate overfitting against hand-curated golden sets and keep evaluation score deltas honest.
Designed a domain-specific Profunctor wrapping REST, GraphQL, OData and existing MCP services as MCP tools with dynamically tunable descriptions, tags, and parameter names.
Built a tool-confusion measurement framework that routed MCP metadata variants through the LLM-as-Judge harness, promoting metadata changes that demonstrably reduced wrong-tool selection and parameter-shape errors.
Helped build and define defence-in-depth agentic guard rails — intent classifier in front of the LLM, tool-level allow-list, and output-side action guard — that constrained the chatbot to a sanctioned task list and produced honest refusals for everything else.
Made refusal correctness a graded axis in the LLM-as-Judge evaluation framework, treating false-allow and false-refuse rates as ship-blocking metrics alongside correctness and tool-call accuracy.
Debugged and remediated advanced agentic failure modes in long-running multi-step agents — context degradation, specification drift, runtime tool-selection errors, cascade failures between agents, and silent failures producing confident but wrong output — with each fix verified through the LLM-as-Judge evaluation harness.
Built named regression scenarios into the LLM-as-Judge framework for every diagnosed agentic failure mode, so future prompt or model changes could not silently reintroduce a previously fixed class of failure.
Designed advanced context architectures for multi-agent / multi-sub-agent systems so each agent received the right context and only the right context at the point of need — combining per-sub-agent isolation with scoped views, just-in-time retrieval at invocation, explicit hierarchical hand-off contracts, and boundary-level context-budget pruning.
Treated context as a first-class system primitive across agentic compositions rather than a side-effect of how agents happen to be wired — enforcing context shape architecturally so multi-sub-agent compositions stayed reliable as call trees deepened.

Evidence / Source Notes

Source: config/madu_profile.json → work_experience[]; reconciled with JobVia export (madu_alikor_export.json).
Confidence: high