What's Happening?
Jenn Tejada, Executive Chair of PagerDuty, has raised concerns about the risks associated with AI systems as they transition from experimental phases to production environments. In an interview, Tejada highlighted the potential for failure modes such as model drift, which can be difficult to detect and may lead to significant operational disruptions. She pointed to the substantial increase in AI infrastructure spending, estimated at $725 billion for 2026, as evidence of the rapid adoption of AI technologies. Tejada emphasized the need for AIOps platforms to monitor AI agents closely to prevent small failures from escalating into major outages, referencing a significant AWS outage in October 2025 as a cautionary example.
Why It's Important?
The shift of AI technologies from experimental to production stages is a critical juncture for businesses and industries that rely on these systems. Undetected failures such as model drift threaten operational stability and can cause costly disruptions. As companies invest heavily in AI infrastructure, robust monitoring and intervention mechanisms become essential to keeping AI deployments reliable and effective. Organizations therefore need to prioritize AI governance and risk management to guard against such failures.
What's Next?
Organizations are likely to invest in advanced monitoring tools and AIOps platforms to better manage AI systems and reduce the risk of failures in AI agents. Engineering and SRE teams may focus on strategies to detect and address model drift and other failure modes proactively. The industry could see closer collaboration between AI developers and infrastructure providers to improve the resilience and reliability of AI systems. There may also be a push for regulatory frameworks to guide the safe and ethical deployment of AI technologies across sectors.