What's Happening?
Jenn Tejada, the Executive Chair of PagerDuty, highlighted the risks that AI agents pose as they move from experimental to production environments. In an interview with Forbes, Tejada emphasized the challenge of 'model drift,' a failure mode in which AI systems deviate from their intended performance over time. Unlike a traditional software crash, model drift can go unnoticed while flawed actions accumulate, surfacing only once significant issues arise. Tejada pointed to the sharp increase in hyperscaler AI infrastructure spending, projected to reach $725 billion in 2026, as evidence of how quickly AI is being integrated into production systems. She advocated using AIOps platforms to monitor AI agents alongside traditional infrastructure, so that humans can intervene before minor failures escalate into major outages such as the AWS incident in October 2025.
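To make the idea of drift going unnoticed more concrete, here is a minimal Python sketch of one way a team might watch for it. It is an illustration only, not PagerDuty's method or any AIOps product's API: the `DriftMonitor` class, the quality score it tracks, and the baseline, window, and tolerance values are all hypothetical assumptions chosen for the example.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Hypothetical monitor: flags drift when the rolling mean of a
    quality score falls more than `tolerance` below a fixed baseline."""

    def __init__(self, baseline, window=50, tolerance=0.1):
        self.baseline = baseline      # expected score, assumed known in advance
        self.tolerance = tolerance    # allowed drop before flagging
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def record(self, score):
        """Add one observation and report whether drift is now suspected."""
        self.scores.append(score)
        return self.drifted()

    def drifted(self):
        # Wait for a full window so a single bad result does not raise an alert.
        if len(self.scores) < self.scores.maxlen:
            return False
        return mean(self.scores) < self.baseline - self.tolerance


monitor = DriftMonitor(baseline=0.9, window=5, tolerance=0.1)
healthy = [monitor.record(0.9) for _ in range(5)]
degraded = [monitor.record(0.7) for _ in range(3)]
print(healthy, degraded)
```

No single degraded score trips the monitor in this sketch; the flag is raised only after several accumulate, which is the point at which a human could be paged to review the agent's recent actions.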
Why It's Important?
The integration of AI agents into production systems represents a significant shift in how technology is deployed and managed. The potential for model drift and other failure modes poses a risk to the stability and reliability of these systems, which are increasingly critical to business operations. The substantial investment in AI infrastructure underscores the importance of addressing these risks proactively. For engineering and Site Reliability Engineering (SRE) teams, the ability to detect and mitigate AI agent failures is crucial to maintaining service continuity and preventing costly outages. This development highlights the need for robust monitoring and intervention strategies as AI becomes more embedded in operational workflows.
What's Next?
As AI continues to be integrated into production environments, companies are likely to invest more in AIOps platforms and other monitoring tools to manage the risks associated with AI agent failures. Engineering teams may need to develop new skills and processes to effectively instrument for model drift and other AI-specific failure modes. Additionally, the industry may see increased collaboration between AI developers and operations teams to ensure that AI systems are both innovative and reliable. The ongoing evolution of AI technology will likely drive further advancements in monitoring and intervention strategies, as well as regulatory scrutiny to ensure the safe deployment of AI in critical applications.