Liwaiwai Liwaiwai
  • /
  • Artificial Intelligence
  • Machine Learning
  • Robotics
  • Engineering
    • Architecture
    • Design
    • Software
    • Hybrid Cloud
    • Data
  • Learning
  • About
  • /
  • Artificial Intelligence
  • Machine Learning
  • Robotics
  • Engineering
    • Architecture
    • Design
    • Software
    • Hybrid Cloud
    • Data
  • Learning
  • About
Liwaiwai Liwaiwai
  • /
  • Artificial Intelligence
  • Machine Learning
  • Robotics
  • Engineering
    • Architecture
    • Design
    • Software
    • Hybrid Cloud
    • Data
  • Learning
  • About
  • Artificial Intelligence
  • DevOps
  • Platforms

Advancing Azure Service Quality With Artificial Intelligence: AIOps

  • July 4, 2020
  • liwaiwai.com

“In the era of big data, insights collected from cloud services running at the scale of Azure quickly exceed the attention span of humans. It’s critical to identify the right steps to maintain the highest possible quality of service based on the large volume of data collected. In applying this to Azure, we envision infusing AI into our cloud platform and DevOps process, becoming AIOps, to enable the Azure platform to become more self-adaptive, resilient, and efficient. AIOps will also support our engineers to take the right actions more effectively and in a timely manner to continue improving service quality and delighting our customers and partners. This post continues our Advancing Reliability series highlighting initiatives underway to keep improving the reliability of the Azure platform. The post that follows was written by Jian Zhang, our Program Manager overseeing these efforts, as she shares our vision for AIOps, and highlights areas of this AI infusion that are already a reality as part of our end-to-end cloud service management.”—Mark Russinovich, CTO, Azure


This post includes contributions from Principal Data Scientist Manager Yingnong Dang and Partner Group Software Engineering Manager Murali Chintalapati.


Partner with liwaiwai.com
for your next big idea.
Let us know here.



From our partners:

CITI.IO :: Business. Institutions. Society. Global Political Economy.
CYBERPOGO.COM :: For the Arts, Sciences, and Technology.
DADAHACKS.COM :: Parenting For The Rest Of Us.
ZEDISTA.COM :: Entertainment. Sports. Culture. Escape.
TAKUMAKU.COM :: For The Hearth And Home.
ASTER.CLOUD :: From The Cloud And Beyond.
LIWAIWAI.COM :: Intelligence, Inside and Outside.
GLOBALCLOUDPLATFORMS.COM :: For The World's Computing Needs.
FIREGULAMAN.COM :: For The Fire In The Belly Of The Coder.
ASTERCASTER.COM :: Supra Astra. Beyond The Stars.
BARTDAY.COM :: Prosperity For Everyone.

 

As Mark mentioned when he launched this Advancing Reliability blog series, building and operating a global cloud infrastructure at the scale of Azure is a complex task with hundreds of ever-evolving service components, spanning more than 160 datacenters and across more than 60 regions. To rise to this challenge, we have created an AIOps team to collaborate broadly across Azure engineering teams and partnered with Microsoft Research to develop AI solutions to make cloud service management more efficient and more reliable than ever before. We are going to share our vision on the importance of infusing AI into our cloud platform and DevOps process. Gartner referred to something similar as AIOps (pronounced “AI Ops”) and this has become the common term that we use internally, albeit with a larger scope. Today’s post is just the start, as we intend to provide regular updates to share our adoption stories of using AI technologies to support how we build and operate Azure at scale.

Why AIOps?

There are two unique characteristics of cloud services:

  • The ever-increasing scale and complexity of the cloud platform and systems
  • The ever-changing needs of customers, partners, and their workloads

To build and operate reliable cloud services during this constant state of flux, and to do so as efficiently and effectively as possible, our cloud engineers (including thousands of Azure developers, operations engineers, customer support engineers, and program managers) heavily rely on data to make decisions and take actions. Furthermore, many of these decisions and actions need to be executed automatically as an integral part of our cloud services or our DevOps processes. Streamlining the path from data to decisions to actions involves identifying patterns in the data, reasoning, and making predictions based on historical data, then recommending or even taking actions based on the insights derived from all that underlying data.

Infusing AI into cloud platform and DevOps – with AI at the center of Customers, Engineering, and Services.
Figure 1. Infusing AI into cloud platform and DevOps.

The AIOps vision

AIOps has started to transform the cloud business by improving service quality and customer experience at scale while boosting engineers’ productivity with intelligent tools, driving continuous cost optimization, and ultimately improving the reliability, performance, and efficiency of the platform itself. When we invest in advancing AIOps and related technologies, we see this ultimately provides value in several ways:

  • Higher service quality and efficiency: Cloud services will have built-in capabilities of self-monitoring, self-adapting, and self-healing, all with minimal human intervention. Platform-level automation powered by such intelligence will improve service quality (including reliability, and availability, and performance), and service efficiency to deliver the best possible customer experience.
  • Higher DevOps productivity: With the automation power of AI and ML, engineers are released from the toil of investigating repeated issues, manually operating and supporting their services, and can instead focus on solving new problems, building new functionality, and work that more directly impacts the customer and partner experience. In practice, AIOps empowers developers and engineers with insights to avoid looking at raw data, thereby improving engineer productivity.
  • Higher customer satisfaction: AIOps solutions play a critical role in enabling customers to use, maintain, and troubleshoot their workloads on top of our cloud services as easily as possible. We endeavor to use AIOps to understand customer needs better, in some cases to identify potential pain points and proactively reach out as needed. Data-driven insights into customer workload behavior could flag when Microsoft or the customer needs to take action to prevent issues or apply workarounds. Ultimately, the goal is to improve satisfaction by quickly identifying, mitigating, and fixing issues.
Read More  IBM and Salesforce Team Up To Help Businesses Accelerate Adoption of Trustworthy AI

My colleagues Marcus Fontoura, Murali Chintalapati, and Yingnong Dang shared Microsoft’s vision, investments, and sample achievements in this space during the keynote AI for Cloud–Toward Intelligent Cloud Platforms and AIOps at the AAAI-20 Workshop on Cloud Intelligence in conjunction with the 34th AAAI Conference on Artificial Intelligence. The vision was created by a Microsoft AIOps committee across cloud service product groups including Azure, Microsoft 365, Bing, and LinkedIn, as well as Microsoft Research (MSR). In the keynote, we shared a few key areas in which AIOps can be transformative for building and operating cloud systems, as shown in the chart below.

AI for Cloud: AI Ops and AI-Serving Platform showing example use cases in AI for Systems, AI for DevOps, and AI for Customers.
Figure 2. AI for Cloud: AIOps and AI-Serving Platform.

AIOps

Moving beyond our vision, we wanted to start by briefly summarizing our general methodology for building AIOps solutions. A solution in this space always starts with data—measurements of systems, customers, and processes—as the key of any AIOps solution is distilling insights about system behavior, customer behaviors, and DevOps artifacts and processes. The insights could include identifying a problem that is happening now (detect), why it’s happening (diagnose), what will happen in the future (predict), and how to improve (optimize, adjust, and mitigate). Such insights should always be associated with business metrics—customer satisfaction, system quality, and DevOps productivity—and drive actions in line with prioritization determined by the business impact. The actions will also be fed back into the system and process. This feedback could be fully automated (infused into the system) or with humans in the loop (infused into the DevOps process). This overall methodology guided us to build AIOps solutions in three pillars.

AIOps methodologies: Data (Customer/System/DevOps), insights (Detect/Diagnose/Predict/Optimize), and actions (Mitigate/Avert future pain/Optimize usage config/Improve architecture & process).
Figure 3. AIOps methodologies: Data, insights, and actions.

AI for systems

Today, we’re introducing several AIOps solutions that are already in use and supporting Azure behind the scenes. The goal is to automate system management to reduce human intervention. As a result, this helps to reduce operational costs, improve system efficiency, and increase customer satisfaction. These solutions have already contributed significantly to the Azure platform availability improvements, especially for Azure IaaS virtual machines (VMs). AIOps solutions contributed in several ways including protecting customers’ workload from host failures through hardware failure prediction and proactive actions like live migration and Project Tardigrade and pre-provisioning VMs to shorten VM creation time.

Read More  Dynatrace Named A Leader In Artificial Intelligence For IT Operations (AIOps) Report By Independent Research Firm

Of course, engineering improvements and ongoing system innovation also play important roles in the continuous improvement of platform reliability.

  • Hardware Failure Prediction is to protect cloud customers from interruptions caused by hardware failures. We shared our story of Improving Azure Virtual Machine resiliency with predictive ML and live migration back in 2018. Microsoft Research and Azure have built a disk failure prediction solution for Azure Compute, triggering the live migration of customer VMs from predicted-to-fail nodes to healthy nodes. We also expanded the prediction to other types of hardware issues including memory and networking router failures. This enables us to perform predictive maintenance for better availability.
  • Pre-Provisioning Service in Azure brings VM deployment reliability and latency benefits by creating pre-provisioned VMs. Pre-provisioned VMs are pre-created and partially configured VMs ahead of customer requests for VMs. As we described in the IJCAI 2020 publication, As we described in the AAAI-20 keynote mentioned above,  the Pre-Provisioning Service leverages a prediction engine to predict VM configurations and the number of VMs per configuration to pre-create. This prediction engine applies dynamic models that are trained based on historical and current deployment behaviors and predicts future deployments. Pre-Provisioning Service uses this prediction to create and manage VM pools per VM configuration. Pre-Provisioning Service resizes the pool of VMs by destroying or adding VMs as prescribed by the latest predictions. Once a VM matching the customer’s request is identified, the VM is assigned from the pre-created pool to the customer’s subscription.

AI for DevOps

AI can boost engineering productivity and help in shipping high-quality services with speed. Below are a few examples of AI for DevOps solutions.

  • Incident management is an important aspect of cloud service management—identifying and mitigating rare but inevitable platform outages. A typical incident management procedure consists of multiple stages including detection, engagement, and mitigation stages. Time spent in each stage is used as a Key Performance Indicator (KPI) to measure and drive rapid issue resolution. KPIs include time to detect (TTD), time to engage (TTE), and time to mitigate (TTM).
Incident management procedures including Time to Detect (TTD), Time to Engage (TTE), and Time to Mitigate (TTM).
Figure 4. Incident management procedures.

As shared in Pre-Provisioning Service at the AAAI-20 conference, we have developed AI-based solutions that enable engineers not only to detect issues early but also to identify the right team(s) to engage and therefore mitigate as quickly as possible. Tight integration into the platform enables end-to-end touchless mitigation for some scenarios, which considerably reduces customer impact and therefore improves the overall customer experience.

  • Anomaly Detection provides an end-to-end monitoring and anomaly detection solution for Azure IaaS. The detection solution targets a broad spectrum of anomaly patterns that includes not only generic patterns defined by thresholds, but also patterns which are typically more difficult to detect such as leaking patterns (for example, memory leaks) and emerging patterns (not a spike, but increasing with fluctuations over a longer term). Insights generated by the anomaly detection solutions are injected into the existing Azure DevOps platform and processes, for example, alerting through the telemetry platform, incident management platform, and, in some cases, triggering automated communications to impacted customers. This helps us detect issues as early as possible.
Read More  Connecting Urban Environments With IoT And Digital Twins

For an example that has already made its way into a customer-facing feature, Dynamic Threshold is an ML-based anomaly detection model. It is a feature of Azure Monitor used through the Azure portal or through the ARM API. Dynamic Threshold allows users to tune their detection sensitivity, including specifying how many violation points will trigger a monitoring alert.

  • Safe Deployment serves as an intelligent global “watchdog” for the safe rollout of Azure infrastructure components. We built a system, code name Gandalf, that analyzes temporal and spatial correlation to capture latent issues that happened hours or even days after the rollout. This helps to identify suspicious rollouts (during a sea of ongoing rollouts), which is common for Azure scenarios, and helps prevent the issue propagating and therefore prevents impact to additional customers. We provided details on our safe deployment practices in this earlier blog post and went into more detail about how Gandalf works in our USENIX NSDI 2020 paper and slide deck.

AI for customers

To improve the Azure customer experience, we have been developing AI solutions to power the full lifecycle of customer management. For example, a decision support system has been developed to guide customers towards the best selection of support resources by leveraging the customer’s service selection and verbatim summary of the problem experienced. This helps shorten the time it takes to get customers and partners the right guidance and support that they need.

AI-serving platform

To achieve greater efficiencies in managing a global-scale cloud, we have been investing in building systems that support using AI to optimize cloud resource usage and therefore the customer experience. One example is Resource Central (RC), an AI-serving platform for Azure that we described in Communications of the ACM. It collects telemetry from Azure containers and servers, learns from their prior behaviors, and, when requested, produces predictions of their future behaviors. We are already using RC to predict many characteristics of Azure Compute workloads accurately, including resource procurement and allocation, all of which helps to improve system performance and efficiency.

Looking towards the future

We have shared our vision of AI infusion into the Azure platform and our DevOps processes and highlighted several solutions that are already in use to improve service quality across a range of areas. Look to us to share more details of our internal AI and ML solutions for even more intelligent cloud management in the future. We’re confident that these are the right investment solutions to improve our effectiveness and efficiency as a cloud provider, including improving the reliability and performance of the Azure platform itself.

 

Mark Russinovich Chief Technology Officer, Microsoft Azure

Source: Microsoft Azure


For enquiries, product placements, sponsorships, and collaborations, connect with us at [email protected]. We'd love to hear from you!

Our humans need coffee too! Your support is highly appreciated, thank you!

liwaiwai.com

Related Topics
  • AIOps
  • devops
  • Microsoft Azure
You May Also Like
OpenAI
View Post
  • Artificial Intelligence
  • Platforms

How We Interact With Information: The New Era Of Search

  • September 28, 2023
View Post
  • Artificial Intelligence
  • Engineering
  • Machine Learning
  • Platforms

Bring AI To Looker With The Machine Learning Accelerator

  • September 28, 2023
View Post
  • Artificial Intelligence
  • Technology

Microsoft And Mercy Collaborate To Empower Clinicians To Transform Patient Care With Generative AI

  • September 27, 2023
View Post
  • Artificial Intelligence
  • Technology

NASA’s Mars Rovers Could Inspire A More Ethical Future For AI

  • September 26, 2023
View Post
  • Artificial Intelligence
  • Platforms

Oracle CloudWorld 2023: 6 Key Takeaways From The Big Annual Event

  • September 25, 2023
View Post
  • Artificial Intelligence

3 Ways AI Can Help Communities Adapt To Climate Change In Africa

  • September 25, 2023
Robotic Hand | Lights
View Post
  • Artificial Intelligence
  • Technology

Nvidia H100 Tensor Core GPUs Come To Oracle Cloud

  • September 24, 2023
View Post
  • Artificial Intelligence
  • Engineering
  • Technology

AI-Driven Tool Makes It Easy To Personalize 3D-Printable Models

  • September 22, 2023
A Field Guide To A.I.
Navigate the complexities of Artificial Intelligence and unlock new perspectives in this must-have guide.
Now available in print and ebook.

charity-water



Stay Connected!
LATEST
  • OpenAI 1
    How We Interact With Information: The New Era Of Search
    • September 28, 2023
  • 2
    Bring AI To Looker With The Machine Learning Accelerator
    • September 28, 2023
  • 3
    3 Questions: A New PhD Program From The Center For Computational Science And Engineering
    • September 28, 2023
  • 4
    Microsoft And Mercy Collaborate To Empower Clinicians To Transform Patient Care With Generative AI
    • September 27, 2023
  • 5
    NASA’s Mars Rovers Could Inspire A More Ethical Future For AI
    • September 26, 2023
  • 6
    Oracle CloudWorld 2023: 6 Key Takeaways From The Big Annual Event
    • September 25, 2023
  • 7
    3 Ways AI Can Help Communities Adapt To Climate Change In Africa
    • September 25, 2023
  • Robotic Hand | Lights 8
    Nvidia H100 Tensor Core GPUs Come To Oracle Cloud
    • September 24, 2023
  • 9
    AI-Driven Tool Makes It Easy To Personalize 3D-Printable Models
    • September 22, 2023
  • 10
    Huawei: Advancing a Flourishing AI Ecosystem Together
    • September 22, 2023

about
About
Hello World!

We are liwaiwai.com. Created by programmers for programmers.

Our site aims to provide materials, guides, programming how-tos, and resources relating to artificial intelligence, machine learning and the likes.

We would like to hear from you.

If you have any questions, enquiries or would like to sponsor content, kindly reach out to us at:

[email protected]

Live long & prosper!
Most Popular
  • Coffee | Laptop | Notebook | Work 1
    First HP Work Relationship Index Shows Majority of People Worldwide Have an Unhealthy Relationship with Work
    • September 20, 2023
  • 2
    Huawei Connect 2023: Accelerating Intelligence For Shared Success
    • September 20, 2023
  • 3
    Applying Generative AI To Product Design With BigQuery DataFrames
    • September 21, 2023
  • 4
    Combining AI With A Trusted Data Approach On IBM Power To Fuel Business Outcomes
    • September 21, 2023
  • Microsoft and Adobe 5
    Microsoft And Adobe Partner To Deliver Cost Savings And Business Benefits
    • September 21, 2023
  • /
  • Artificial Intelligence
  • Explore
  • About
  • Contact Us

Input your search keywords and press Enter.