
Observability

Posted on: June 21, 2024 (7 min read)
O11Y

In this article, I cover observability (or o11y) and explain why building it into your services is essential for a resilient and scalable product.

What is observability?

Observability is the ability to query our system’s telemetry data to understand its current state and find “unknown unknowns”. It allows us to answer the question, “Why is this happening?”

For a system to be observable, its services must be instrumented to emit signals (telemetry data) such as traces, metrics, and logs.
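As a rough illustration of what instrumentation means, the sketch below uses only the Python standard library to emit all three signals from a single function. The names `place_order`, `METRICS`, and `SPANS` are hypothetical, and the in-memory stores merely stand in for a real telemetry backend; production code would normally rely on an instrumentation library instead.

```python
import logging
import time
import uuid
from collections import defaultdict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")

# Hypothetical in-memory stores standing in for a telemetry backend.
METRICS = defaultdict(int)  # counter name -> value
SPANS = []                  # finished spans, one dict per unit of work


def place_order(cake: str) -> str:
    """Handle one order and emit a trace span, a metric, and a log."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    try:
        order_id = f"order-{trace_id[:8]}"  # stand-in for real business logic
        METRICS["orders_placed_total"] += 1  # metric: how many orders so far
        # Log: a timestamped event, tagged with the trace id for correlation.
        logger.info("order placed cake=%s trace_id=%s", cake, trace_id)
        return order_id
    finally:
        # Trace span: what ran, under which trace, and for how long.
        SPANS.append({
            "trace_id": trace_id,
            "name": "place_order",
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


order_id = place_order("banana")
```

Putting the same trace id in the log line and in the span is what later lets an engineer jump from a slow trace to the logs of that exact request.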

The three signals of telemetry

[Figures: Metrics, Trace]

Why do we need observability?

To build resilient products, engineers must understand how the system behaves. Observability gives an overview of the system and its performance over time, helps explain why a service behaves in a particular way, and supports troubleshooting. Observability is also a continuous process that must evolve with the product: as new features ship, engineers must be able to query how those features perform.

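Benefits of observability

The main benefits are a better developer experience when debugging, proactive issue detection, reduced incident response time, and increased confidence in releasing changes.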

Use Case

Imagine a product called BananaCake, a cake delivery system. The service is running in production when customers suddenly start having trouble ordering cakes. This triggers a P1 incident, since the product’s main feature is unreliable. The first question from the Client Support team is: what is happening?

There are two possible scenarios:

Scenario A - The system is not observable, and the engineers need significant time and effort to troubleshoot and fix the problem. Meanwhile, the business loses revenue and its reputation suffers.

Scenario B - The system produces telemetry data, which makes it observable and helps engineers quickly answer what is failing and why customers cannot order cakes. Ideally, the system triggers an alert before customers notice the problem. For example, latency (a metric) has increased significantly over the last few minutes, triggering an alert. The on-call engineer acknowledges the alert and starts investigating. They look at the traces to find where requests spend the most time. After identifying the service causing the issue, they check its logs to find the root cause. In this case, an error log message shows that the service fails to write to the database because of the number of open connections. The engineers deploy a fix and monitor the system to confirm that the change solved the problem.
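The investigation in Scenario B can be sketched as three small queries, one per signal. This is only an illustration: the service names, the sample data, and the 500 ms threshold are all made up, and each span duration is assumed to be time spent in that service alone.

```python
from statistics import mean

# Hypothetical telemetry for the incident; every value here is made up.
latency_ms = [120, 130, 125, 900, 950, 1100]  # metric: recent request latencies
spans = [                                      # trace: spans of one slow request
    {"service": "api-gateway", "duration_ms": 40},
    {"service": "order-service", "duration_ms": 980},
    {"service": "payment-service", "duration_ms": 60},
]
logs = [
    {"service": "api-gateway", "level": "INFO", "message": "request received"},
    {"service": "order-service", "level": "ERROR",
     "message": "database write failed: too many open connections"},
]

LATENCY_THRESHOLD_MS = 500  # assumed alert threshold


def should_alert(samples, window=3):
    """Metrics tell us WHEN: alert if the recent average exceeds the threshold."""
    return mean(samples[-window:]) > LATENCY_THRESHOLD_MS


def slowest_service(trace_spans):
    """Traces tell us WHERE: pick the service with the longest span."""
    return max(trace_spans, key=lambda s: s["duration_ms"])["service"]


def error_logs(records, service):
    """Logs tell us WHY: collect error messages from the suspect service."""
    return [r["message"] for r in records
            if r["service"] == service and r["level"] == "ERROR"]


if should_alert(latency_ms):
    suspect = slowest_service(spans)
    causes = error_logs(logs, suspect)
    print(suspect, causes)
```

In a real system these would be queries against a metrics store, a tracing backend, and a log store, but the order of the questions stays the same.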

In this example, we have described a common practice of using observability and monitoring tools to proactively identify issues and ensure the product keeps running with minimal client impact.

Monitoring vs Observability

Monitoring is a subset of observability. It is the ability to use well-known metrics or KPIs to tell us WHEN things are wrong and to trigger alerts on anomalies. For example, an alert is triggered if the system error rate exceeds 5%. Observability, on the other hand, is the capability to query our telemetry data and understand WHY the system behaves in a certain way. For example, by querying the traces and logs, we can see that the service is having trouble querying the database.
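The monitoring half of this distinction is simple enough to sketch as a rule. The function names below are hypothetical; only the 5% threshold comes from the example above.

```python
ERROR_RATE_THRESHOLD = 0.05  # the 5% threshold from the example above


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def should_trigger_alert(total_requests: int, failed_requests: int) -> bool:
    """Monitoring rule: fire only when the error rate exceeds the threshold."""
    return error_rate(total_requests, failed_requests) > ERROR_RATE_THRESHOLD


print(should_trigger_alert(1000, 60))  # 6% of requests failed
print(should_trigger_alert(1000, 20))  # 2% of requests failed
```

A rule like this says that something is wrong, but nothing about the cause. Answering why still requires querying the traces and logs.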


Adopting Observability

Observability has become more prevalent in recent years due to the adoption of distributed systems like microservices, serverless, etc., which are inherently harder to observe due to their complexity.

Challenges

To help mitigate these challenges, the OpenTelemetry project was created under the Cloud Native Computing Foundation (CNCF) through the merger of the OpenCensus and OpenTracing projects.

OpenTelemetry

OpenTelemetry (OTel) is an open-source observability framework that standardises the format of telemetry data (logs, traces, and metrics) and the process of collecting and exporting it.


In summary, observability is essential for building resilient and scalable products. It involves querying system telemetry data, including logs, metrics, and traces, to understand the current state and identify unknown issues. Observability provides benefits such as a better developer experience when debugging, proactive issue detection, reduced incident response time, and increased confidence in releasing changes. We also covered the OpenTelemetry project, which helps standardise and simplify observability by providing a framework for collecting and exporting telemetry data.