Automated Analysis User Guide

Purpose

To provide actionable insights and comprehensive scoring of application performance, security, quality, and reliability during Argo Rollout with the OpsMx Automated Analysis and Verification platform.

Target Audience

  • SRE

  • DevOps Engineer

  • Developer

  • Development Manager

  • SecOps Engineer

  • Application Owner

Understanding Risk Criteria for Various Stakeholders in Software Delivery

Software updates carry the risk of causing business disruptions and security incidents. To mitigate this risk, modern enterprises rely on reducing the impact of changes through gradual delivery with Argo Rollouts. These organizations partner with OpsMx to thoroughly assess the risk of changes across multiple dimensions, ensuring a risk-free update.

The ISD Automated Analysis and Verification conducts thorough analysis during every new software release or update and provides the following scores.

  • Performance Score: Evaluates application performance compared to previous releases, with a focus on meeting Service Level Objectives (SLO).

  • Quality Score: Measures quality metrics for applications compared to previous releases.

  • Reliability Score: Assesses the reliability of the application with the change.

  • Security Score: Evaluates the security risk associated with the software update, including any new libraries introduced and potential risks.

  • Business Score: Assesses the impact of the software change on business operations.

The top outcomes or questions answered by ISD Automated Analysis are as follows:

  1. Should you roll forward or roll back the new release?

  2. Does the new release meet the criteria of key stakeholders such as SRE, Developer, DevOps Engineer, Development Manager, Application Owner, and SecOps Engineer?

  3. Does the application still meet the SLO after the change?

  4. Has the reliability of the application been maintained after the change?

  5. Is the application's security risk still acceptable after the update?

  6. Does the application continue to meet or exceed business objectives after the change?

The list below describes each stakeholder's priority and key metrics in an enterprise that frequently delivers software innovation.

SRE
Priority: Application reliability and meeting SLOs
Key metrics:

  • Application performance: latency, error rate, throughput

  • Infrastructure metrics: CPU utilization, memory utilization, network and disk usage

  • Mean time to recovery (MTTR): the average amount of time it takes for the system to recover from an outage

  • Mean time between failures (MTBF): the average amount of time between system failures

Developer
Priority: Code quality and meeting business objectives
Key metrics:

  • Quality metrics: number and type of issues and their trend across releases

  • Error rates and failure rates: provide insight into the stability and reliability of the software

  • Performance metrics: response time, throughput, 95th percentile

  • Resource usage metrics: memory usage, network and disk usage; help identify scalability issues and optimize resource utilization

  • Operational metrics: uptime, availability, and mean time to recovery; ensure the software is meeting availability and reliability goals

  • Security metrics: number of security breaches, vulnerabilities, and compliance with security standards; provide insight into the security posture and help identify areas for improvement

DevOps Engineer
Priority: Meeting SLOs and minimizing security risks
Key metrics:

  • Application and business SLOs

  • Compliance with industry regulations and standards

  • Time to remediation: the time it takes to fix security vulnerabilities

Development Manager
Priority: Meeting business objectives and code quality
Key metrics:

  • Deployment metrics: speed and frequency of deployments, such as the number of deployments per day and the time it takes to deploy a new release

  • Quality metrics: number and types of quality issues in releases and the trend of issues release over release

Application Owner
Priority: Meeting SLOs and minimizing business impact
Key metrics:

  • Change failure rate: percentage of changes that result in failures

  • Mean time to recovery (MTTR): the average amount of time it takes for the system to recover from a failure or outage

  • Application and business SLOs

SecOps Engineer
Priority: Minimizing security risks and meeting business objectives
Key metrics:

  • Security vulnerabilities: number and severity of security vulnerabilities

  • Compliance with industry regulations and standards

  • Time to remediation: the time it takes to fix security vulnerabilities

  • Change failure rate: percentage of changes that result in failures

  • Mean time to recovery (MTTR): time to recover from failures

Understanding the Automated Analysis and Verification Report:

The OpsMx ISD Automated Analysis and Verification report is generated once analysis is enabled to run before, during, or after a canary or blue-green deployment strategy. The analysis uses log and metrics data to assess the risk involved.

Multiple analyses can be performed during the rollout, as desired. The report for each analysis can be accessed from the Argo Rollout Dashboard or the Analysis History page in the ISD platform, as depicted below.
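For context, analysis is typically invoked from the Rollout's canary (or blue-green) strategy, and each analysis step produces its own report. The sketch below shows a minimal Argo Rollouts canary with two inline analysis steps; the Rollout name and the AnalysisTemplate name (which would be configured to call the ISD/OpsMx provider) are illustrative only, and the pod template is omitted for brevity.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: sample-app                  # illustrative name
spec:
  # replicas, selector, and pod template omitted for brevity
  strategy:
    canary:
      steps:
        - setWeight: 25
        # first automated analysis before shifting more traffic
        - analysis:
            templates:
              - templateName: opsmx-analysis   # illustrative AnalysisTemplate backed by the ISD provider
        - setWeight: 50
        # second automated analysis at the higher traffic weight
        - analysis:
            templates:
              - templateName: opsmx-analysis
        - setWeight: 100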

Let us understand the automated analysis report as seen below.

Analysis Report Explanation:

  1. Overall Score and Status: The Overall Score is a composite of the log and metrics analysis scores and is used to determine whether the rollout should continue or be stopped. If the score falls below a specified threshold, the rollout is automatically stopped to prevent a potentially faulty release from being deployed. However, it is possible to override the score and proceed with the roll forward. A failing score indicates that critical metrics in the new release are below expectations and that critical errors may be present in the log output of the new release.

  2. Log Summary Score: This score is based solely on the analysis of log data and provides information on the number of critical, error, and warning log clusters (groups of related log messages).

  3. Metrics Summary Score: This score is based solely on the analysis of metrics data and provides information on the number of metrics that failed automated statistical analysis, and whether any critical or high-priority watchlist metrics have failed.

    • Performance Score: The Performance Score is a sub-score under the Metrics Summary that consolidates performance-related metrics and provides an evaluation of the new release's performance.

  4. Security Score: The Security Score provides a consolidated analysis of security-related changes in new releases, including the increased risk associated with new CVEs (Common Vulnerabilities and Exposures) in the new release.

  5. Quality Score: The Quality Score provides a summary of issues found in the new release and comparisons to past releases, and categorizes the types of issues.

How is the Score Calculated

ISD performs two types of analysis during an Argo Rollout: log analysis and metrics analysis. Let's first examine how the log score is calculated in order to understand how to interpret it.

Log Score Calculation

ISD uses Natural Language Processing (NLP) to categorize the types of "events" (such as Critical, Error, etc.) in the logs of the baseline and new release. An event is a log line or group of lines, depending on the log type. For example, a Java exception log with a stack trace of multiple lines is considered one event. Similarly, log lines for errors from Python code are also considered one event.
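For illustration, a multi-line Java exception like the one below (the timestamp, class names, and message are made up) would be grouped and scored as a single event:

2023-02-14 10:32:07 ERROR [http-nio-8080-exec-3] OrderController - Failed to process order
java.lang.NullPointerException: customer must not be null
    at com.example.orders.OrderController.processOrder(OrderController.java:42)
    at com.example.orders.OrderController.handleRequest(OrderController.java:27)
    ... 12 common frames omitted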

Once the events are categorized, they are compared to the baseline to further classify the event type as "expected," "unexpected," or "ignored".

Type of event: Expected
Description: An event of a similar type appears in both the baseline and new release logs; it will not affect the scoring.

Type of event: Unexpected
Description: The event is seen for the first time in the new release, or it is seen with a higher frequency (more events of that type compared to the baseline). An unexpected event will have an impact on the log score.

Type of event: Ignored
Description: An event that is categorized as general information or routine log messages; it does not impact scoring.

Algorithm Rules for Understanding Log Messages:

It's important to note that only "Unexpected" events impact scoring. The following algorithm rules help you understand the log messages:

  • Critical events in the new release will result in a score of ZERO, as they will be treated as showstoppers.

  • Error events in the new release will result in a lowered score, with the exact impact noted in the report. The algorithm that decides the impact of the error is based on analyzing the overall events with and without the error. If the error causes a significant change (such as a crash or more unexpected events), the impact is higher.

  • Warning events in the new release will result in a lowered score, but with a much lower impact on the overall score compared to an error. As stated before, the impact of specific Warning events will be noted in the report.

Reclassification and rescoring of events is possible. This typically occurs when the NLP algorithm classifies an error as critical or vice versa. The SRE or Developer can then reclassify and provide "Supervised Input" to the algorithm, which will not only re-score the current analysis but will also remember the reclassification for future analysis.

Metrics Score Calculation

The metric score is calculated from the metrics analyzed in the system, and each metric is given equal weight unless modified by the user. The score is the total weight of the passing metrics divided by the total weight of all metrics, multiplied by 100. For example, if four metrics are being analyzed and one of them fails the analysis, the score will be 75; if two metrics fail, the score will be 50.

Metrics | Critical | Weight | Analysis
Metric1 | - | 1 | Success
Metric2 | - | 1 | Success
Metric3 | - | 1 | Failure
Metric4 | - | 1 | Success
Total Score: 75

It is possible to label a metric as critical, similar to labeling a log event as critical. If a metric is marked as critical and it fails, the overall metric score will be zero. Exercise caution when marking metrics as critical, so that the rollout is not stopped by the failure of a metric that is not truly a showstopper.

Metrics | Critical | Weight | Analysis
Metric1 | - | 1 | Success
Metric2 | - | 1 | Success
Metric3 | Yes | 1 | Failure
Metric4 | - | 1 | Success
Total Score: 0

The weights of metrics can be adjusted to reflect their importance. If a metric is deemed more important, its analysis failure will have a greater impact on the overall score. For instance, if one of four metrics is assigned twice the weight (2), the impact of its analysis failure on the overall score will be 40% (⅖) instead of the default 25%.

Metrics | Critical | Weight | Analysis
Metric1 | - | 1 | Success
Metric2 | - | 1 | Success
Metric3 | - | 2 | Failure
Metric4 | - | 1 | Success
Total Score: 60
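With these weights, the score works out to 100 × (1 + 1 + 1) / (1 + 1 + 2 + 1) = 60, consistent with the failed metric accounting for 40% of the total weight.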

How to Interpret Scores

The higher the score, the closer the new release is to the baseline and the stakeholders' expectations. A score of 100 is considered ideal, while a score of 80 or above is considered good for promotion. On the other hand, a score below 50 is considered a failure and the rollout should be stopped due to severe issues found in the new release. Scores between 50 and 80 are considered marginal and should be reviewed by SREs or Developers to determine the quality and risk of the release.

Actions to Take on Log Analysis

The following actions can be taken based on the results of the automated log analysis during an Argo Rollout:

  • Download and share specific log messages with developers or peers for feedback or to submit a bug report based on newly discovered issues.

  • Reclassify and annotate log events to accurately reflect their impact based on your application scenario. This allows others to understand why the change occurred.

  • Annotate log events with root cause or diagnosis, as well as next steps, to provide prior knowledge and aid in faster resolution of critical or error events in future release rollouts.

  • Diagnose log events by determining when they occurred during the analysis window and by correlating them with other log and metric events during the analysis run.

  • Investigate further by quickly accessing the filtered and complete log messages of the baseline and new release during the rollout in a log monitoring tool such as Elasticsearch.

  • Share the full log analysis report with your development or SRE peers to get their insights and assistance in diagnosing any issues. The report includes the context of the baseline and new releases as well as other analyses performed by ISD. Sharing this information helps to provide a clear picture of the situation.

Actions to Take on Metric Analysis

The following actions can be taken when reviewing the metric analysis:

  • Diagnose critical and other metrics and their deviations from the baseline metrics.

  • Take note that if certain metrics are designated as critical, their failure in the analysis will result in an overall score of zero.

Edit the Log and Metrics Template

The ISD platform provides default log and metric templates for different applications and monitoring tools. However, custom log messages or metrics may be required for individual applications based on the needs of developers, SREs, or the organization. Editing the log and metric templates to add to or modify the defaults is simple and straightforward.

Edit the Log Template

A typical log template looks like the following:

apiVersion: v1
kind: ConfigMap
metadata:
 name: elasticsearch-log-generic-ext
data:
 elasticsearch-log-generic-ext: |
   monitoringProvider: ELASTICSEARCH
   accountName: elastic-account-name
   index: kubernetes*
   filterKey: kubernetes.pod_name
   responseKeywords: log
   # If the errorTopics array is not defined, the default set of error topics is applied.
   # If the errorTopics array is given and disableDefaultErrorTopics is false (the default), the given list is added to the default list,
   # and if an errorString matches an existing errorTopic, the default is overridden by the custom one.
   # If the errorTopics array is given and disableDefaultErrorTopics is true, only the given list is applied.
   errorTopics:
   - errorString: ArrayIndexOutOfBounds
     topic: ERROR
   - errorString: NullPointerException
     topic: ERROR
   tags:
   - errorString: FATAL
     tag: FatalErrors

To edit an existing error topic, modify the errorString as needed and update the topic to ERROR, CRITICAL, WARN, or IGNORE. For example, we can change the topic for an error event or add a new custom error event as below.

- errorString: ArrayIndexOutOfBounds
  topic: CRITICAL
- errorString: MYCUSTOMERRORMESSAGE
  topic: ERROR

Edit the Metrics Template

A typical metrics template looks like the following:

apiVersion: v1
kind: ConfigMap
metadata:
 name: prometheus-verifier
data:
 prometheus-verifier: |
   accountName: opsmx-prom
   advancedProvider: PROMETHEUS
   metricTemplateSetup:
     percentDiffThreshold: hard
     groups:
       - metrics:
           - metricType: ADVANCED
             name: "avg(container_memory_usage_bytes{ pod=~\".*${pod_key}.*\"})"
         group: "Memory Usage By Pod Name"
       - metrics:
           - metricType: ADVANCED
             name: "avg(rate(container_cpu_usage_seconds_total{ pod=~\".*${pod_key}.*\"}[2m]) * 100)"
         group: CPU Usage By Pod Name

Modify the template to add metrics under an existing group or to add a new group with its own set of metrics. For example, the snippet below adds a new performance group with two metrics, application latency and error rate, for the New Relic monitoring tool.

       - metrics:
           - metricType: ADVANCED
             name: "SELECT average(duration)*1000 FROM Transaction WHERE appName = '${app_key}' AND host RLIKE '.*${host_key}.*'"
           - metricType: ADVANCED
             name: "SELECT count(apm.service.error.count) / count(apm.service.transaction.duration) FROM Metric WHERE appName = '${app_key}' AND host RLIKE '.*${host_key}.*'"
         group: "Application Performance"

Multi-Service Analysis

If you want to evaluate the new release of multiple services in your application using Argo, follow the steps below:

  • Configure one critical service for progressive rollout using Argo Rollout. Deploy the remaining services in the application using a standard Kubernetes deployment.

    Note: Multi-service rollout is not supported in Argo CD as of version 2.6 and Argo Rollout 1.4. Progressive rollout can only be performed for one service at a time.

  • Set up ISD to analyze the critical service during Argo Rollout. To evaluate the other services at the same time, specify the dependent services in the ISD provider configuration as shown below. ISD can evaluate all services during the rollout even if only one of them is undergoing a progressive Argo Rollout.

    apiVersion: v1
    kind: ConfigMap
    metadata:
     name: opsmx-provider-config-multiservice
    data:
     providerConfig: |
       application: multiservice
       baselineStartTime: "2021-11-03T11:29:46.915Z"
       canaryStartTime: "2021-11-03T11:29:46.915Z"
       lifetimeMinutes: 10
       gitops: false
       passScore: 90
       serviceList:
       - serviceName: ratings
         logTemplateName: logsvc1
         logScopeVariables: "kubernetes.container_name"
         baselineLogScope: "baseapp-rest-1"
         canaryLogScope: "canaryapp-rest-1"
         metricScopeVariables: "$app_key"
         baselineMetricScope: "service:baseapp"
         canaryMetricScope: "service:canaryapp"
         metricTemplateName: metricsvc1
         metricTemplateVersion: "v10.0"
       - serviceName: reviews
         logTemplateName: logsvc1
         logScopeVariables: "kubernetes.container_name"
         baselineLogScope: "baseapp-rest-1"
         canaryLogScope: "canaryapp-rest-1"
         metricScopeVariables: "$app_key"
         baselineMetricScope: "service:baseapp"
         canaryMetricScope: "service:canaryapp"
         metricTemplateName: metricsvc2
         metricTemplateVersion: "v5.0"

  • ISD will analyze all services specified in the configuration and provide a consolidated report. For example, in the below scenario, both the ratings and reviews services were evaluated simultaneously, and the risk of the new release was identified for both.

Summary

The ISD platform offers a sophisticated machine learning-based analysis tool for your Argo Rollout that significantly reduces the risk during the progressive release of new applications.

ISD automated analysis utilizes log and metric data sources to automatically evaluate and identify potential risks in new releases, streamlining the analysis process and reducing the potential for errors. Visit https://www.opsmx.com/continuous-verification-for-argo/ for more information.
