THIS IS VERY NICE, BUT WHAT IS OBSERVABILITY?
Again, following Gartner: observability, by its very nature, must look at the full stack of data available. Looking at a single layer provides only a siloed view. To deliver the digital experience necessary to remain competitive, enterprises must go beyond infrastructure and make their digital business observable.
HOW DOES THIS TRANSLATE INTO MONITORING TECHNOLOGY?
The first thing to be aware of in monitoring is that there is no monitoring without instrumentation. Instrumentation, or the lack of it, defines what can be monitored. A second important point is that observability is used in an application context, where the full stack means the application stack starting from the OS. More importantly, if we bring these two points together, we can define observability as the technology to monitor applications that you can instrument, with the monitoring starting from the application framework that you use. Examples are applications that you write yourself in Java, .NET, Go, JavaScript, … In short, observability is the technology available to monitor your applications, from the framework up to the application itself, once you instrument them.
SO, WHAT INSTRUMENTATION IS AVAILABLE?
Most instrumentation is delivered by companies that provide products (software and hardware) but are not monitoring experts. What they deliver as instrumentation is a myriad of possible parameters: well intended, but very difficult to understand and monitor. Commercial off-the-shelf software does not provide much insight (observability) into its applications; most of it relies on the supporting frameworks and OS to provide the instrumentation. When you build your own application, you can choose whether or not to instrument it correctly. It is this opportunity that you need to grasp to make your application “observable”. Instrumentation must be added during the development and testing phases to deliver the metrics needed to monitor your application correctly. The categories of monitoring instrumentation are simple; the implementation is more difficult. The three main categories of instrumentation are:
- Metrics
- Log files
- Tracing
Metrics are available most of the time and are delivered as part of the underlying operating system or framework. This is part of the “full stack monitoring”, so to speak. These parameters reflect the resource consumption of your application. Nothing special here; most companies already monitor these metrics.
Log files are important for troubleshooting and have long been used for that purpose. For applications, log files can go beyond troubleshooting: with more recent technologies, they often contain application performance metrics as well. For application developers, log files are an easy target, as most developers already keep log files to troubleshoot their own work.
Tracing is the most recent of the three technologies and is less widely used today. It gives developers the possibility to identify performance bottlenecks within the application and to benchmark their applications before bringing them to production. This requires an effort from your development team to instrument the application and to use the tools that measure its performance and capacity.
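To make the three categories concrete, here is a minimal sketch that uses only the Python standard library. The application name `orders`, the function `handle_order`, and the helper `traced` are hypothetical names chosen for illustration; a real application would normally hand this work to an instrumentation library rather than roll its own.

```python
import logging
import time
from collections import Counter
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("orders")  # hypothetical application name

request_count = Counter()  # metric: how often each operation ran
span_durations = []        # trace data: (operation, seconds) per timed section


@contextmanager
def traced(operation):
    """Time one section of code and record it as a very simple 'span'."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        span_durations.append((operation, elapsed))
        # Log entry that also carries a performance figure.
        log.info("operation=%s duration_ms=%.2f", operation, elapsed * 1000)


def handle_order(items):
    """Hypothetical business function instrumented with all three signals."""
    request_count["handle_order"] += 1
    with traced("handle_order"):
        with traced("price_items"):
            total = sum(price * qty for price, qty in items)
    return total


total = handle_order([(10.0, 2), (2.5, 4)])
```

The nested `traced` blocks show why tracing helps to find bottlenecks: each section of the request gets its own timing, so a slow step stands out from the total.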
WHEN TO USE WHICH INSTRUMENTATION?
Now that we know which instrumentation is available, the next question arises: what to use when? There is of course no simple answer to this question. The easiest approach is to use all of the technologies, but this can become overkill, as you need to keep a balance between the application and its monitoring (just as you do with security).
My answer is a bit more nuanced, so here are a few pointers.
The first is the value of your application. The value of an application is measured by the cost of its unavailability, which takes two forms: direct financial loss and loss of opportunity.
If you suffer a direct (substantial) financial loss when your application is unavailable or underperforms, then the quicker you restore it to a good working state, the better. So, from a monitoring point of view, you want a proactive indication when something goes haywire, and metrics alone may not be sufficient. It is good practice to always provide logging for your application and, in this case, to trace it as well, so that you know when its performance starts to deteriorate. Keep in mind that tracing is, most of the time, only available for applications written in-house.
If unavailability leads to loss of opportunity, the cost of unavailability might be harder to calculate. Even so, the faster the application is restored to a good working state, the better. In terms of monitoring, tracing might be overkill; metrics and logging certainly are not. Of course, this is for everyone to decide for themselves.
The second factor to consider is the nature of the application. Is it a single application stack without much interaction with other systems, or is it an application with a lot of interfaces with other systems?
For applications that interact a lot with other applications, it is important to monitor how the application behaves in its interaction with other components and APIs. Next to tracing and logging, it is important that the APIs are monitored, and not only from an uptime perspective. You want to know how much the APIs are used and to be aware when they get stuck. This means that you need to trace the calls to the APIs, so that you can follow both the messages they consume and the responses they return.
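Monitoring an API beyond uptime can be sketched as a thin wrapper around each call. Everything named here (`ApiStats`, `monitored_call`, `fake_inventory_api`, and the `slow_after` threshold) is hypothetical and only illustrates the idea of recording consumption, failures, and slow responses per API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ApiStats:
    """Per-API counters; the field names are illustrative."""
    calls: int = 0
    errors: int = 0
    slow_calls: int = 0
    latencies: list = field(default_factory=list)


def monitored_call(stats, func, *args, slow_after=1.0, **kwargs):
    """Call func and record usage, failures, and suspiciously slow responses."""
    stats.calls += 1
    start = time.perf_counter()
    try:
        return func(*args, **kwargs)
    except Exception:
        stats.errors += 1
        raise
    finally:
        elapsed = time.perf_counter() - start
        stats.latencies.append(elapsed)
        if elapsed > slow_after:
            stats.slow_calls += 1


def fake_inventory_api(sku):
    """Stand-in for a real remote call."""
    if sku == "missing":
        raise LookupError(sku)
    return {"sku": sku, "in_stock": 7}


stats = ApiStats()
result = monitored_call(stats, fake_inventory_api, "A-100")
try:
    monitored_call(stats, fake_inventory_api, "missing")
except LookupError:
    pass  # the failure is already counted in stats.errors
```

Note that this wrapper only flags a slow call after it returns. Detecting a call that never returns would additionally need a timeout on the call itself.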
Another factor to consider: is the application static in terms of usage or is it very dynamic in terms of usage and resources?
For dynamic applications, it is important to monitor the behavior in relation to resource usage. Tracing might be considered, so that you know what is happening with the dynamics of the application. Metrics and logging are a must. In this case you want to know how the application behaves under end-user load and to find bottlenecks when the number of users starts to rise.
For static applications or applications with few interactions, logging for troubleshooting and metrics for performance monitoring might suffice.
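The rules of thumb above can be summarized in a small helper. This is only one possible reading of the guidance, not a definitive rule set; the function name, its parameters, and the label `api_monitoring` are assumptions made for illustration.

```python
def recommend_instrumentation(direct_financial_loss, many_interfaces, dynamic_usage):
    """Return the signals suggested by the rules of thumb in the text."""
    signals = {"metrics", "logging"}  # the baseline in every case discussed
    # Tracing is suggested for high-value, highly connected, or dynamic applications.
    if direct_financial_loss or many_interfaces or dynamic_usage:
        signals.add("tracing")
    # Applications with many interfaces also need their APIs monitored.
    if many_interfaces:
        signals.add("api_monitoring")
    return sorted(signals)


static_app = recommend_instrumentation(False, False, False)
integration_hub = recommend_instrumentation(True, True, False)
```

Here `static_app` contains only logging and metrics, while `integration_hub` adds tracing and API monitoring.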
When you start with instrumentation, distinguish between observability in development and observability in operations. Development is interested in the correct functioning of the application in a limited environment. Operations needs tools to monitor the behavior of the application in real-life situations.
WHAT TO USE AS INSTRUMENTATION?
If you ask most vendors what instrumentation (tracing) to use, the answer is simple: their own proprietary technology, with lock-in as a result. I do not think that this is the way to go. For tracing, there is a mature open standard supported by the Cloud Native Computing Foundation: OpenTelemetry.
OpenTelemetry provides an observability framework for cloud-native software. OpenTelemetry is a collection of tools, APIs, and SDKs. You use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) for analysis to understand your software's performance and behavior.
OpenTelemetry provides both instrumentation for many languages and frameworks and exporters for many tools. See also opentelemetry.io. OpenTelemetry has become a de facto standard for instrumenting applications, and if you have not yet included tracing in your application, I recommend using OpenTelemetry.
WHAT ABOUT YOUR APPLICATIONS IN THE CLOUD?
When you bring your applications to the cloud, you still need to instrument and monitor them. The cloud adds another factor to the equation: the resources that you use are dynamic and billed on a pay-for-what-you-use principle.
This principle will become more and more important for your applications, as you do not want to spend money where none is needed. So, the resource usage of your application in the cloud should be optimized at every moment. The only way to do this is to monitor your application and size its resources according to the number of users. Monitoring can drive the necessary increase or decrease of resources in the cloud, and with it the price that you pay for those resources.
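As a minimal sketch of monitoring that drives resources, the function below turns a monitored user count into an instance count. The capacity of 100 users per instance and the limits of 1 and 10 instances are made-up values; in practice they would come from your own load tests and budget.

```python
import math


def desired_instances(active_users, users_per_instance, min_instances=1, max_instances=10):
    """Size the deployment from a monitored user count, within fixed limits."""
    needed = math.ceil(active_users / users_per_instance)
    return max(min_instances, min(max_instances, needed))


quiet_night = desired_instances(0, 100)     # falls back to the minimum
normal_day = desired_instances(250, 100)    # 250 users need 3 instances
peak_event = desired_instances(5000, 100)   # capped at the maximum
```

The upper limit protects the budget, and the lower limit keeps the application available when nobody is using it.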
AND WHERE DOES AI COME INTO THE EQUATION?
AI is no longer a buzzword and is increasingly used in monitoring. We use AI models in application monitoring to pinpoint root causes during troubleshooting, to set dynamic thresholds, and for a variety of other functions. AI can be generic, but we think that training the models in a specific context will be more productive than using AI in a generic context.
We think that our custom model approach is closer to reality for use with critical applications and our customers will benefit more from a customized model adapted to their needs.
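To illustrate what a dynamic threshold is, here is a deliberately simple statistical stand-in. It is not an AI model and not the custom model approach described above; it only shows a threshold that follows recent history instead of being fixed. The function names, the factor `k`, and the sample values are assumptions.

```python
import statistics


def dynamic_threshold(history, k=3.0):
    """Threshold that follows recent behavior: mean plus k standard deviations."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return mean + k * spread


def is_anomalous(value, history, k=3.0):
    """Flag a new measurement that exceeds the threshold derived from history."""
    return value > dynamic_threshold(history, k)


recent_response_ms = [100, 102, 98, 101, 99]  # made-up sample measurements
spike = is_anomalous(150, recent_response_ms)
normal = is_anomalous(103, recent_response_ms)
```

Because the threshold is derived from the application's own recent measurements, the same code adapts to a fast service and a slow one without manual tuning.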
BRING IT ALL TOGETHER
More and more tool providers offer full observability in a single tool. These are called “full stack” monitoring tools. However, do not let marketing fool you. Most of these tools provide only part of the solution and have a steep learning curve.
CONCLUSION
Tracing and logging serve different purposes, and the type of your application determines which technology approach is best.
Consider up front what you would like to achieve with application monitoring. Choosing the right techniques and technologies before you start is a must.