AI for DevOps: Cutting Cost, Reducing Risk, Speeding Delivery
Organizations collectively spend billions every month on DevOps processes, yet bad code still makes it into production, causing downtime, extra cost, and reputational harm. With so much at stake, DevOps would seem a natural fit for automation through AI and machine learning. At least one company is developing exactly that, but it's probably not a name you would guess.
You don’t have to look very far to find evidence of DevOps disasters. CRN has a list of the biggest cloud outages so far in 2022, which includes big names like Google Cloud, Apple, and IBM. And who can forget the Slack outage that occurred in February?
The underlying causes of these outages differ. In some cases, it’s a network configuration error; in others, a database update gone bad. DNS errors remain commonplace, and fat fingers have yet to be banished from the IT kingdom.
But upon closer inspection, there is a common theme across many, if not most, of these stories: an erroneous change was moved to production when it shouldn’t have been. (We’ll give Google some slack on the severed undersea cable that impacted its service in June, but we’re wondering why Microsoft didn’t detect sooner the power oscillations that caused fans to automatically shut down in an Azure data center.)
None of this is easy. Modern software development is extremely complex, with a thousand moving parts that must be synchronized. The process of moving software from development to production, which touches both development and operations and is collectively termed DevOps, is rife with complexity and potential tripwires. The practice of letting tech professionals pick their own tools brings its own set of complications.
Up to this point, the solution has been to throw lots of manpower at the DevOps problem. Developers, testers, deployment managers, and SREs spend many hours keeping track of various updates and configuration changes in the hopes that nothing gets by them. Some organizations have begun to move toward a standard set of tools to reduce complexity, but that hasn’t made much of a dent yet.
The folks at Digital.ai have a fundamentally different approach. Instead of relying on humans to spot problems or trying to force a standardized set of tools in the DevOps or CI/CD (continuous integration/continuous delivery) realm, Digital.ai uses machine learning techniques to predict the likelihood that a given piece of new code or code update will cause problems.
According to Florian Schouten, the company’s vice president of product management, Digital.ai’s predictive solution starts by ingesting historical data from DevOps platforms and tools such as Git, Jenkins, Azure DevOps, Jira, and ServiceNow. Digital.ai then feeds that data into classification algorithms, which detect patterns across those past change events.
“Most organizations have 3,000 to 5,000 changes a month that will feed into the model,” Schouten says. “It will capture all the aspects of those, let’s say, 5,000 monthly change events, such as who the team is, what infrastructure was changed, what testing was done during the software development cycle, who the developer or developing team was, how many defects were found during testing, and all these other environmental factors that then can be correlated to the success and failure of past changes.”
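To make that concrete, here is a minimal sketch of what training a classifier on historical change events could look like, written in Python with scikit-learn. The field names, data values, and model choice are illustrative assumptions, not Digital.ai’s actual pipeline or schema.

```python
# A minimal sketch of the classification step described above, NOT Digital.ai's
# implementation; field names, values, and the model choice are illustrative.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical historical change events pulled from tools such as Git, Jenkins,
# Jira, and ServiceNow. Each row is one past change; 'failed' marks whether it
# caused an incident after deployment.
events = pd.DataFrame({
    "team_id":         [1, 1, 2, 3, 2, 3, 1, 2],
    "infra_changed":   [0, 1, 1, 0, 1, 0, 1, 1],   # did the change touch infrastructure?
    "tests_run":       [120, 15, 40, 200, 5, 90, 60, 10],
    "defects_in_test": [0, 4, 2, 1, 6, 0, 1, 5],
    "files_changed":   [3, 45, 12, 8, 60, 2, 20, 33],
    "failed":          [0, 1, 0, 0, 1, 0, 0, 1],   # outcome label from past incidents
})

X = events.drop(columns="failed")
y = events["failed"]

# Fit a simple classifier; a production system would train on thousands of
# monthly change events and many more environmental features.
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Score the likelihood that a pending change will cause a failure.
pending = pd.DataFrame([{
    "team_id": 2, "infra_changed": 1, "tests_run": 8,
    "defects_in_test": 5, "files_changed": 50,
}])
print(f"predicted failure risk: {model.predict_proba(pending)[0, 1]:.0%}")
```

The point of the sketch is simply that past change events, once labeled with their outcomes, become training data for an ordinary supervised classifier.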
Once trained, Digital.ai’s algorithm can then be used to predict the likelihood that a current change will cause problems. The company’s offering can detect more than 80 risk factors out of the box, with a likelihood score generated for each one. The software development manager can use this to make decisions about the need for additional review before hitting the “go live” button.
“If it’s 1% [chance of causing a failure], OK, let it go. I’m not going to spend any time on it,” Schouten tells Datanami. “If it’s a 60% likely thing? I better take a look and route it to the right people for review.”
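The routing decision itself can be as simple as mapping that predicted probability to an action. Below is a minimal sketch of that threshold logic; the cutoff values are assumptions loosely based on the numbers Schouten cites, not Digital.ai’s defaults.

```python
# A sketch of threshold-based routing for a scored change; cutoffs are illustrative.

def route_change(failure_probability: float) -> str:
    """Map a predicted failure probability to a review decision."""
    if failure_probability < 0.05:
        return "auto-approve: negligible risk, let it go"
    if failure_probability < 0.50:
        return "standard review: normal approval flow"
    return "escalate: route to the right people for extra scrutiny"

for risk in (0.01, 0.35, 0.60):
    print(f"{risk:.0%} predicted failure risk -> {route_change(risk)}")
```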
This approach can bring multiple benefits. For starters, the extra layer of scrutiny can help avoid an outage that could have a devastating impact on an organization. It can also save money by making better use of existing resources, “which on its own can pay for the solution and often be in the millions of dollars, given how many people tend to be involved and how many change events there are,” Schouten says.
Read the rest of the story at our sister publication Datanami.
Alex Woodie has written about IT as a technology journalist for more than a decade. He brings extensive experience from the IBM midrange marketplace, including topics such as servers, ERP applications, programming, databases, security, high availability, storage, business intelligence, cloud, and mobile enablement. He resides in the San Diego area.