Observability in Distributed Systems

Mark Mishaev
7 min read · Mar 15, 2020


To provide observability for distributed applications, we need to monitor service health, alert on issues, support root cause investigation with distributed call traces, and enable diagnosis by creating a searchable index of aggregated application and system logs.

In this post I’d like to share my perspective on system observability and its challenges.

Microservices Architecture

The microservices style of distributed systems has become very popular because it allows us to build highly scalable applications made up of decoupled, event-processing services that each fulfill a single responsibility. The price we pay in return, however, is a lack of consistent visibility and the difficulty of understanding the overall business flow of such systems.

Observability Defined

We can define system observability as “the ability to examine and understand the system state.”

More formally, I’d define it as follows:

System observability solves the problem of asking arbitrary questions about arbitrary system behavior in real time.

Ultimately, observability should be an attribute of almost every stage of the SDLC (design, development, testing, deployment), while embracing the simple fact that no system is completely reliable, especially distributed systems, which are directly impacted by network issues (see the “Fallacies of distributed computing”).

Consistent visibility means that during development we want to know exactly how our changes affect the system, and while running in production we want to know about unexpected behaviors. Ease of debugging is crucial for the maintenance and evolution of robust systems, since it gives us confidence that we can quickly react to system failures and fix them.

Monitoring and Observability

To better understand what makes a system observable, let’s compare observability with monitoring in its more traditional sense.

In my opinion, their goals are different: observability doesn’t replace monitoring but is complemented by it, as we will see in the following sections.

Monitoring is one of the means of achieving observability, and the main difference is that you can’t monitor what you don’t know about.

Traditional monitoring is geared toward a human-in-the-loop paradigm, where the system operator gets alerts about issues that we more or less “expect” to happen or that require a “human touch” to resolve.

An observable system, on the other hand, provides comprehensive context about the system’s internals, which is vital for dealing with unexpected failures and gaining insight into underlying issues.

As systems become more distributed, with hundreds (or even thousands) of services communicating with each other in many ways, we see traditional monitoring and alerting tools giving way to sophisticated platforms that provide out-of-the-box remediation capabilities.

Another reason traditional monitoring and alerting fail to serve their purpose in distributed systems is that those systems are designed to fail fast or to temporarily degrade their quality of service.

Troubleshooting System Failures

We can test the degree of system observability by measuring the time required to discover and remediate a system failure.

This process usually starts with high-level failure indicators and continues with the analysis of deeper contextual signals, such as exceptions, traces, audit files, and hardware resource utilization.

Once we have a solid assumption about what the problem might be, we start looking for more evidence that confirms or refutes our initial conjecture. This process is very similar to traditional code debugging, with the difference that we’re “debugging” high-level system flows. Such system-level debugging can be effective only if the system produces observability data that carries essential, easy-to-digest data points.

Traditional QA and Observability

“Observability in our software systems has always been valuable and has become even more so in this era of cloud and microservices. However, the observability we add to our systems tends to be rather low level and technical in nature, and too often it seems to require littering our codebase with crufty, verbose calls to various logging, instrumentation, and analytics frameworks.” ~ Martin Fowler

We have all worked (or still work) in companies that employ manual and partially automated testing as a gate to releasing software into production. Most of these companies have dedicated teams of quality engineers who run tests during development and make sure the software being released is thoroughly tested (often in abused “hardening sprints”) before an official product release or handover to the operations/deployment teams.

This approach is gradually being replaced by the DevOps mindset, where development teams take responsibility for operating the software they so proudly author :)

This forces software engineers to think ahead about the means and tools that make system testing in pre- and post-production as efficient as possible. They change their attitude toward testing and make it an integral part of the development lifecycle by adding comprehensive automated tests, creating “production-like” environments, and assessing system resilience, trying to find loopholes and discover latent failure modes.

Achieving a high level of debuggability entails a deep understanding of the system’s operational context and its intricate interdependencies. This understanding is complemented by knowledge of system internals such as deployment semantics, bootstrap flow, discovery and routing mechanisms, data consistency modes, I/O models, and so on.

Writing Observable Code

Writing code that natively supports observability is hard to do in hindsight. We need to analyze system requirements and choose the right instrumentation, take care of proper logging and exception handling, expose relevant application metrics, and so on. These attributes need to be ingrained from day one, and many frameworks provide out-of-the-box capabilities that are just “waiting” to be used.
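For instance, exposing application metrics can be as simple as wiring in a metrics library early on. The sketch below uses the Python prometheus_client package; the metric names, port, and checkout handler are hypothetical and only illustrate the idea of instrumenting code from day one.

```python
# A hypothetical sketch: instrumenting a handler with prometheus_client.
# Metric names, port, and the handler itself are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("checkout_requests_total",
                   "Total number of checkout requests handled")
LATENCY = Histogram("checkout_request_seconds",
                    "Checkout request latency in seconds")

def handle_checkout() -> None:
    REQUESTS.inc()                              # count every request
    with LATENCY.time():                        # record handler duration
        time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)                     # metrics served at :8000/metrics
    while True:
        handle_checkout()
```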

The three pillars of observability

  1. Log aggregation
  2. Distributed tracing
  3. Alerting

Logging

Let’s start with the basics: logging. When implemented properly, logs can convey rich contextual information about system activity. There is a caveat, though: finding the right balance of what to log is an art. It’s like packing a suitcase for a trip; it’s so tempting to add line after line after line, you know... just in case. This can lead to a poor signal-to-noise ratio and also has cost implications from a storage perspective.

As with everything else, setting goals for what we’re trying to achieve helps. In the case of observability, logs should tell us a coherent story about the main system flows. That defines what you want to log and, more importantly, what you don’t want to log. Logs are highly customizable by nature, and with proper care they can be so powerful that we may not even need dedicated tools for application metrics or tracing.
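As a rough illustration of logging the “main flow” with just enough context, here is a minimal sketch using Python’s standard logging module. The field names (order_id, user_id) and the order flow are made-up examples, not a prescribed schema.

```python
# A minimal sketch of logging the milestones of a business flow using
# Python's standard logging module. Field names are hypothetical.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("orders")

def charge_payment(order_id: str) -> None:
    pass  # stand-in for a call to a real payment service

def place_order(order_id: str, user_id: str) -> None:
    # Log the milestones of the flow, not every internal detail.
    log.info("order received order_id=%s user_id=%s", order_id, user_id)
    try:
        charge_payment(order_id)
        log.info("payment charged order_id=%s", order_id)
    except Exception:
        # log.exception keeps the stack trace for later debugging
        log.exception("payment failed order_id=%s", order_id)
        raise

place_order("o-123", "u-42")
```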

Log Aggregation

As opposed to monolithic applications, inspecting the logs of a single service is no longer enough to locate and diagnose problems. Since a single client request triggers a chain of calls to downstream services, we need to inspect the logs of each of them to get a useful answer.

Log aggregation achieves this by collecting log data from multiple sources and bringing it together in a central place. This is usually done by installing agents on the relevant hosts that stream service log data to a centralized log management system for further processing. There are many open-source and third-party tools and services, such as the ELK stack, that provide powerful indexing, search, and visualization mechanisms and let you analyze log data in real time.
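One common way to make application logs friendly to such a pipeline is to emit them as structured JSON on stdout and let the agent do the shipping. The sketch below is a minimal example using Python’s standard library; the field names and the “payment-service” label are illustrative assumptions only.

```python
# A sketch of emitting JSON-formatted logs to stdout so a shipping agent
# (e.g. Filebeat or Fluentd) can forward them to a central index such as
# Elasticsearch. Field names and the service label are assumptions.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": "payment-service",  # helps filtering in the central index
        }
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("payments").info("refund issued for order o-123")
```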

Distributed Tracing

Distributed tracing is request-centric logging that provides a holistic view of what is happening within the system.

A single trace is a chain of related events that represents the end-to-end flow of a request through a distributed system. In addition to the flow itself, a trace can carry information about the request structure as well. In microservices-based systems, traces allow us to understand which services participate in a request, while the request data provides the required context.

Each trace is assigned a globally unique ID, the correlation ID, which is propagated throughout the request flow. Usually the correlation ID is generated by the first service that receives the client request. It is then passed on to downstream services in request headers, or in message headers in the case of asynchronous messaging. Correlated service calls allow us to calculate various operational and performance metrics.
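A minimal sketch of this propagation, assuming plain HTTP services and the Python requests library, might look like the following. The X-Correlation-ID header name and the downstream URL are illustrative assumptions; many systems use standards such as W3C Trace Context or a tracing library instead of rolling their own.

```python
# A sketch of correlation-ID propagation over HTTP using the requests
# library. The X-Correlation-ID header and the downstream URL are
# illustrative assumptions, not an established convention of any system.
import uuid
import requests

CORRELATION_HEADER = "X-Correlation-ID"

def handle_incoming_request(headers: dict) -> str:
    # The first (edge) service generates the ID if the client didn't send one.
    correlation_id = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    call_downstream(correlation_id)
    return correlation_id

def call_downstream(correlation_id: str) -> None:
    # Every downstream call (and every log line) carries the same ID, so
    # entries belonging to one transaction can be stitched together later.
    requests.get(
        "https://inventory.internal/api/stock",  # hypothetical downstream service
        headers={CORRELATION_HEADER: correlation_id},
        timeout=2,
    )
```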

From the debuggability perspective, distributed tracing makes it possible to perform root cause analysis (RCA) and, in case of a failure, to trace all the log entries that belong to the same failed transaction.

Alerting

Alerting is the responsive arm of a monitoring system that performs actions based on changes in a specific metric. Alert rules are usually composed of two parts: a set of conditions, and a set of actions to perform when the alert is triggered.
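To make the “conditions plus actions” shape concrete, here is a toy Python sketch. The metric, threshold, and notify() hook are made-up examples; in practice you would express such rules in your monitoring platform (for example, Prometheus alerting rules) rather than in application code.

```python
# A toy sketch of the "conditions plus actions" structure of an alert rule.
# The metric, threshold, and notify() hook are made-up examples.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AlertRule:
    name: str
    condition: Callable[[float], bool]  # evaluated against the latest metric value
    action: Callable[[str], None]       # what to do when the condition holds

def notify(message: str) -> None:
    print(f"ALERT: {message}")          # stand-in for paging, Slack, email, etc.

error_rate_rule = AlertRule(
    name="high_error_rate",
    condition=lambda error_rate: error_rate > 0.05,  # more than 5% of requests fail
    action=notify,
)

def evaluate(rule: AlertRule, value: float) -> None:
    if rule.condition(value):
        rule.action(f"{rule.name} fired (value={value})")

evaluate(error_rate_rule, 0.12)  # -> ALERT: high_error_rate fired (value=0.12)
```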

One of the primary benefits of alerting is that it allows system administrators to disengage from the system: the system itself notifies us when something important happens or is probably about to happen. When automatic remediation isn’t possible, we want to manually analyze the alert, mitigate the problem, and learn what should be fixed to prevent the same problem from surfacing in the future.

For alerting to be effective, we need to keep the noise low and the signal high. In general, we want alerts that clearly present the problem and its cause. We want them to be as reliable and actionable as possible; otherwise they just end up crying wolf and being ignored.

When creating alerting rules, asking the following questions can help us decrease noise and avoid false positives:

  1. Why is this alert important?
  2. Is it urgent?
  3. Is it actionable?
  4. Will it help me avoid a similar condition in the future?
  5. Can I automate the response to this alert?

It’s important to highlight, however, that the main purpose of alerting is still to draw human attention to the current status of the monitored metric. The human responding to the alert can then investigate the problem and act accordingly.

Conclusion

The goal of observability is not to collect logs, metrics, or traces. It is to build an engineering culture based on facts and feedback, and then to spread that culture within the broader organization.

I’ll conclude by saying that building observability in is no longer optional. Being able to identify production issues effectively not only builds customer confidence and allows sustainable system operations, but also frees engineering teams to do what they are supposed to do: move forward with product development and add real business value at a steady pace.

References

  1. https://learning.oreilly.com/library/view/distributed-systems-observability/9781492033431/
  2. https://landing.google.com/sre/sre-book/toc/
  3. https://www.digitalocean.com/community/tutorials/an-introduction-to-metrics-monitoring-and-alerting
  4. https://logz.io/blog/logging-best-practices/
  5. https://www.youtube.com/watch?v=JfYIExKiMa8

Originally published at https://www.linkedin.com.
