Best Practices for setting up Verification
OpsMx recommends the following best practices for setting up verification and getting the most out of the Verification feature in ISD for Argo.
Log data is generally large and unstructured. It can become very hard to analyse if the log collection and analysis mechanism is not designed well.
The following practices can make log analysis simpler and more efficient, and help you get the best results from log analysis tools like Autopilot (ISD).
Log volume can grow enormous very quickly. Such logs are best analysed with the help of indexes. Log indexes not only reduce the amount of data being analysed but also confine the analysis to a defined scope. Most log collection platforms come with a pre-configured list of default indexes. However, the default indexes are generic and fail to provide focused results. For an optimal implementation, define custom indexes to filter logs more effectively.
An analysis run should be executed with a well-defined scope. In Autopilot, scope is defined with filter keys, which can be a comma-separated list of variables such as namespace, pod hash, application name or component name. Define the filter keys that best suit the logs you are interested in.
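The idea of scoping by filter keys can be sketched conceptually as follows. This is only an illustration of how a comma-separated key list narrows a set of log records; the helper names, label names and values are hypothetical, and it does not describe how Autopilot applies filter keys internally.

```python
def parse_filter_keys(raw):
    """Split a comma-separated filter key string into a clean list of keys."""
    return [key.strip() for key in raw.split(",") if key.strip()]


def in_scope(record, scope):
    """Return True when the record matches every key/value pair in the scope."""
    return all(record.get(key) == value for key, value in scope.items())


# Hypothetical filter keys and values for one analysis run.
keys = parse_filter_keys("namespace, app, pod_hash")
scope = dict(zip(keys, ["prod", "checkout", "7d9f8c6b5"]))

# Hypothetical log records carrying the same labels.
records = [
    {"namespace": "prod", "app": "checkout", "pod_hash": "7d9f8c6b5", "message": "started"},
    {"namespace": "prod", "app": "checkout", "pod_hash": "58c4d7f9a", "message": "old replica"},
    {"namespace": "dev", "app": "checkout", "pod_hash": "7d9f8c6b5", "message": "dev noise"},
]
scoped_messages = [r["message"] for r in records if in_scope(r, scope)]
```

The narrower the scope, the fewer unrelated lines reach the analysis, which is why including the pod hash separates the new release from older replicas.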
No matter how good the analysis engine is, its results depend entirely on the quality of the logs. Use the right logging level, write descriptive log messages, use standard logging libraries, and include unique identifiers in logs to improve the end-to-end flow from logs to analysis results.
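As a minimal sketch of these logging practices, the following uses the standard `logging` module in Python. The logger name `checkout-service`, the `request_id` field, the `LIMIT` value and the message texts are hypothetical and only illustrate the idea of levels, descriptive messages and a unique identifier on every line.

```python
import io
import logging
import uuid

LIMIT = 1000.00  # hypothetical business limit used only for this example


def build_logger(stream):
    """Return a logger that writes level, logger name and request id on every line."""
    logger = logging.getLogger("checkout-service")  # hypothetical service name
    logger.setLevel(logging.INFO)  # DEBUG messages are filtered out
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s request_id=%(request_id)s %(message)s"
    ))
    logger.addHandler(handler)
    return logger


def handle_payment(logger, request_id, amount):
    """Log one request with the same unique identifier attached to every message."""
    # LoggerAdapter injects request_id into each record it emits.
    log = logging.LoggerAdapter(logger, {"request_id": request_id})
    log.debug("raw payload received")  # dropped: below the INFO level
    log.info("payment request received amount=%.2f", amount)
    if amount > LIMIT:
        log.error("payment rejected: amount %.2f exceeds limit %.2f", amount, LIMIT)
    else:
        log.info("payment accepted amount=%.2f", amount)


stream = io.StringIO()
logger = build_logger(stream)
request_id = str(uuid.uuid4())
handle_payment(logger, request_id, 1500.0)
print(stream.getvalue(), end="")
```

Because every line carries the level and the same `request_id`, the lines belonging to one request can be grouped and classified without guessing from the message text alone.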
Autopilot classifies log lines into different levels based on string patterns found in them. These error classification patterns are known as Error Topics. A set of known Error Topics comes pre-configured in the analysis engine and is enabled by default. If you have your own set of error strings that might help classify logs better, make the most of Error Topics by configuring as many as possible.
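Pattern-based classification of this kind can be sketched as a simple substring match. This is an illustration of the concept only, not the matching logic of Autopilot; the patterns and the labels `critical`, `error`, `warn` and `ignore` are hypothetical values chosen for the example.

```python
# Hypothetical topics: each pairs a string pattern with a classification label.
# These are illustrative values, not the Error Topics that ship with Autopilot.
ERROR_TOPICS = [
    ("OutOfMemoryError", "critical"),
    ("Connection refused", "error"),
    ("deprecated", "warn"),
]


def classify(line, topics=ERROR_TOPICS, default="ignore"):
    """Return the label of the first topic whose pattern occurs in the line."""
    lowered = line.lower()
    for pattern, label in topics:
        if pattern.lower() in lowered:  # case-insensitive substring match
            return label
    return default


sample_logs = [
    "java.lang.OutOfMemoryError: Java heap space",
    "dial tcp 10.0.0.5:5432: connection refused",
    "GET /health 200 3ms",
]
labels = [classify(line) for line in sample_logs]
```

Adding an application-specific string to the topic list changes how matching lines are labelled, which is the effect of configuring your own Error Topics.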
Analysis is a continuously evolving process. A set of error topics and tags that looks right in theory may still leave room for improvement once actual analysis runs are executed. For this, the reclassification feature is available directly in the analysis report window. Analyze the report and reclassify wherever needed. Changes take effect from the next run onwards.
Metric data is structured and consists of numeric data points computed by the metric’s formula. Metric data analysis is an effective way to observe performance trends over an interval of time. The approach starts with deciding which metrics to monitor, when to monitor them, and how to derive effective conclusions from them.
The following practices can make metric analysis simpler and more efficient.
All metrics should be clearly defined so that an organization can benchmark its success. One way to keep metrics understandable is to use the SMART (specific, measurable, achievable, relevant, time-based) model. The Achievable step in this model is particularly important. There’s no point setting targets that cannot be achieved, as people will feel defeated even before they begin.
Autopilot’s Metric Evaluation is based on queries supported by the integrated monitoring platform. Queries should be designed to match the requirements of the analysis. For example, an infrastructure health analysis should carry infrastructure-focused queries such as CPU, memory and IO, whereas an application-focused analysis should carry queries for average response time, error rate or other application performance metrics.
Autopilot expects each query to return a single time series. Define the filter keys precisely so that the query returns a single data point per sample. For example, in Prometheus, when measuring the response time of a specific microservice, set a combination of replica set hash (pod hash), app name and component name as the filter keys to get exactly the data needed. If the performance of a specific endpoint is to be measured, add a URL-based key to the set of filters.
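The following sketch shows how a set of filter keys could be turned into a Prometheus query that yields one series. The metric names `http_request_duration_seconds_sum` and `http_request_duration_seconds_count`, the label names `app`, `component`, `pod_template_hash` and `url`, and their values are assumptions for illustration; use the names your own exporters expose.

```python
def build_selector(metric, filters):
    """Build a PromQL selector with exact-match label filters.

    Label values are inserted as-is; values containing quotes or
    backslashes would need escaping, which this sketch does not do.
    """
    matchers = ",".join(f'{key}="{value}"' for key, value in sorted(filters.items()))
    return f"{metric}{{{matchers}}}"


def avg_response_time_query(filters, window="5m"):
    """Average request duration: rate of the sum divided by rate of the count.

    The outer sum() collapses any remaining label combinations so that
    the query returns a single time series.
    """
    total = build_selector("http_request_duration_seconds_sum", filters)
    count = build_selector("http_request_duration_seconds_count", filters)
    return f"sum(rate({total}[{window}])) / sum(rate({count}[{window}]))"


# Hypothetical filter keys for one microservice release.
service_filters = {"app": "checkout", "component": "api", "pod_template_hash": "7d9f8c6b5"}
service_query = avg_response_time_query(service_filters)

# Adding a URL-based key narrows the same query to a single endpoint.
endpoint_query = avg_response_time_query({**service_filters, "url": "/pay"})
print(service_query)
print(endpoint_query)
```

If a query still returns several series, it usually means a distinguishing label is missing from the filters or the aggregation.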
Each organisation can have different SLAs and may want to measure metrics accordingly. Know your golden metrics and configure them for analysis. Generally, performance-based golden metrics include the following:
- Application Performance Index
- Application Throughput
- Application Response Time
- Application Error Rate
- CPU Consumption
- Memory Consumption
- Average Time to Recover (in case of downtime)
Each metric can have a different risk direction. For example, reduced error rate is good, whereas a higher error rate as compared to baseline is bad.
Know the type of metric and decide which direction of deviation should be considered a risk. Autopilot provides three risk directions: Higher, Lower and HigherAndLower.
- Higher: The metric is a problem when it increases beyond the threshold.
- Lower: The metric is a problem when it decreases beyond the threshold.
- HigherAndLower (Higher or Lower): The metric is a problem when it either increases beyond the upper threshold or decreases beyond the lower threshold.
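The three directions can be sketched as a comparison of the percentage deviation from the baseline against a threshold. This is a simplified illustration, not the scoring used by Autopilot: the function name `is_risky` is hypothetical, and it assumes a single symmetric threshold percentage rather than separate upper and lower thresholds.

```python
def is_risky(baseline, new, direction, threshold_pct):
    """Return True when the deviation breaches the threshold in a risky direction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a percentage deviation")
    # Positive when the new value is above the baseline, negative when below.
    deviation_pct = (new - baseline) * 100.0 / abs(baseline)
    if direction == "Higher":
        return deviation_pct > threshold_pct
    if direction == "Lower":
        return deviation_pct < -threshold_pct
    if direction == "HigherAndLower":
        return abs(deviation_pct) > threshold_pct
    raise ValueError(f"unknown risk direction: {direction}")


# Error rate: only an increase is a risk.
error_rate_risky = is_risky(baseline=2.0, new=2.5, direction="Higher", threshold_pct=10)
# Throughput: only a drop is a risk.
throughput_risky = is_risky(baseline=100, new=80, direction="Lower", threshold_pct=10)
# A metric expected to stay stable: a move either way is a risk.
stable_risky = is_risky(baseline=100, new=105, direction="HigherAndLower", threshold_pct=10)
```

With the wrong direction configured, a genuine regression passes unnoticed: the same error rate increase evaluated with `Lower` would not be flagged.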
Each metric and environment can have its own definition of tolerance. For example, in a business-critical application, a response time deviation of 10-20% can be considered huge, whereas an organisation’s internal portals may tolerate a similar deviation.
Know the environment and the capacity of the metric when setting threshold percentages.
When analysis is done in a limited environment with small numbers, the deviations can be unrealistic. For example, in a test environment where the number of users is limited, the average response time may be very low (say 50 ms). A deviation of 10 ms in this case amounts to a 20% deviation for the data point.
Analysis works best where dense data with realistic numbers is available.
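The arithmetic behind the 50 ms example can be checked with a short calculation. The 500 ms baseline is a hypothetical contrast added here to show how the same absolute change shrinks in percentage terms on larger numbers.

```python
def percent_deviation(baseline_ms, observed_ms):
    """Percentage change of the observed value relative to the baseline."""
    return (observed_ms - baseline_ms) * 100.0 / baseline_ms


# Test environment from the example: 50 ms baseline, 10 ms slower.
small_baseline = percent_deviation(50.0, 60.0)

# Hypothetical busier environment: 500 ms baseline, the same 10 ms slower.
large_baseline = percent_deviation(500.0, 510.0)

print(f"50 ms baseline:  {small_baseline:.1f}% deviation")
print(f"500 ms baseline: {large_baseline:.1f}% deviation")
```

The same 10 ms shift is a 20% deviation on the small baseline and a 2% deviation on the larger one, so a threshold tuned in one environment may not transfer to the other.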
Not all metrics are critical during an analysis; some are merely informative. In contrast, there are some highly critical metrics, such as error rate, without which the analysis may not produce conclusive results.
Autopilot has three criticality levels: Critical (High), MustHave (Medium) and Normal (Low). Evaluate how important each metric’s presence is during analysis and set the criticality level accordingly.