Observability

In this article, I cover observability (or o11y) and explain why building it into your services is essential for a resilient and scalable product.

What is observability?

Observability is the ability to query our system’s telemetry data to understand its current state and find “unknown unknowns”. It allows us to answer the question, “Why is this happening?”

For a system to be observable, its services must be instrumented to emit signals, or telemetry data, such as traces, metrics, and logs.

The three signals of telemetry

Logs are structured or unstructured text records that describe an event at a specific time. They are crucial for diagnosing what is happening in the application, for example when debugging exception messages.
- Unstructured: 29/07/2024 16:47 [API-RESPONSE] "GET" "/v1/products/6c40070f" Status Code: 200. Elapsed Milliseconds: 3
- Structured (preferred, since it can be easily searched, filtered, and processed to enable more advanced analytics): {"time": "29/07/2024 16:47", "message":"[API-RESPONSE] GET /v1/products/6c40070f", "http.url": "/v1/products/6c40070f", "http.status_code": 200, "http.method": "GET"}
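
As a rough illustration of the structured form, here is a minimal sketch that emits a JSON record with Python's standard `logging` module. The `JsonFormatter` class, the `example.api` logger name, and the `fields` key are assumptions made for this example, not part of any particular logging library.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record, "%d/%m/%Y %H:%M"),
            "message": record.getMessage(),
        }
        # Structured attributes arrive through `extra={"fields": {...}}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("example.api")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_response(logger: logging.Logger, method: str, url: str, status_code: int) -> None:
    """Emit one structured record for a completed HTTP request."""
    logger.info(
        "[API-RESPONSE] %s %s",
        method,
        url,
        extra={"fields": {"http.url": url, "http.status_code": status_code, "http.method": method}},
    )


if __name__ == "__main__":
    buffer = io.StringIO()
    log_response(build_logger(buffer), "GET", "/v1/products/6c40070f", 200)
    record = json.loads(buffer.getvalue())
    # Filtering on a field is a dictionary lookup, not a regular expression.
    print(record["http.status_code"])
```

Because every attribute is a named field, a log backend can filter on `http.status_code` or `http.url` directly instead of parsing free text.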

Metrics are quantitative measurements captured over time. They help measure the system’s performance and identify trends and patterns; they reveal symptoms, not causes.
- Example: The RED (Rate, Errors, Duration) metrics help understand how a service is performing. They track the number of requests, the number of errors, and how long requests take, typically using counters for requests and errors and a histogram or summary for duration. With these, you can monitor the application’s error rate over time and trigger an alert if it reaches a threshold.
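
To make the RED idea concrete, the following is a minimal in-memory sketch. The `RedMetrics` class and its method names are hypothetical, and treating only 5xx responses as errors is an assumption; a real service would normally use a metrics library rather than keep raw samples in a list.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class RedMetrics:
    """In-memory RED metrics for one service (illustrative only)."""

    requests: int = 0  # Rate: total requests seen
    errors: int = 0  # Errors: requests that failed
    durations_ms: List[float] = field(default_factory=list)  # Duration samples

    def record(self, status_code: int, duration_ms: float) -> None:
        self.requests += 1
        if status_code >= 500:  # assumption: only 5xx responses count as errors
            self.errors += 1
        self.durations_ms.append(duration_ms)

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def duration_percentile(self, percentile: float) -> float:
        """Nearest-rank percentile of the recorded durations."""
        if not self.durations_ms:
            return 0.0
        ordered = sorted(self.durations_ms)
        rank = max(1, math.ceil(percentile / 100 * len(ordered)))
        return ordered[rank - 1]


if __name__ == "__main__":
    metrics = RedMetrics()
    for status, duration in [(200, 3.0), (200, 5.0), (500, 250.0), (200, 4.0)]:
        metrics.record(status, duration)
    print(metrics.requests, metrics.error_rate(), metrics.duration_percentile(95))
```

Comparing `error_rate()` against a chosen threshold is then enough to decide whether an alert should fire.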

Traces are contextual data about a specific request, propagated across the entire system. They give visibility into the request’s flow through the system, helping identify bottlenecks and where issues are happening.
- Example: A trace is created when the service receives an HTTP request and is enriched with metadata (HTTP method, endpoint, trace ID). The trace context is propagated through the system, from database calls to external services, which gives a complete overview of that request. In case of an error, we can use the trace identifier to track the request and understand in which part of the system the error is happening.
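
The flow described above can be sketched in a few lines. This is a toy, in-process tracer written for illustration: `Span`, `start_span`, `spans_for_trace`, and `handle_request` are hypothetical names, not the API of any tracing SDK, and the sketch does not cover propagating context across processes.

```python
import contextlib
import contextvars
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: Dict[str, object] = field(default_factory=dict)
    duration_ms: float = 0.0
    error: Optional[str] = None


FINISHED_SPANS: List[Span] = []  # stand-in for a tracing backend
_current_span = contextvars.ContextVar("current_span", default=None)


@contextlib.contextmanager
def start_span(name: str, attributes: Optional[Dict[str, object]] = None) -> Iterator[Span]:
    """Open a span that inherits the trace ID of the currently active span."""
    parent = _current_span.get()
    span = Span(
        name=name,
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        attributes=dict(attributes or {}),
    )
    token = _current_span.set(span)
    started = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span.error = repr(exc)  # keep the failure attached to the span
        raise
    finally:
        span.duration_ms = (time.perf_counter() - started) * 1000
        _current_span.reset(token)
        FINISHED_SPANS.append(span)


def spans_for_trace(trace_id: str) -> List[Span]:
    """Return every finished span that belongs to one request."""
    return [span for span in FINISHED_SPANS if span.trace_id == trace_id]


def handle_request() -> Span:
    """Hypothetical request handler with a database call and an external call."""
    with start_span("GET /v1/products/6c40070f", {"http.method": "GET"}) as root:
        with start_span("db.query"):
            pass  # database call would go here
        with start_span("external.call"):
            pass  # call to an external service would go here
    return root
```

Calling `spans_for_trace(root.trace_id)` after `handle_request()` returns the root span and both child spans, each with its own duration and, if something failed, the error that was raised inside it.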

Why do we need observability?

To build resilient products, engineers must understand how the system behaves. Observability gives an overview of the system and its performance over time. It allows engineers to understand why a service behaves in a particular way and helps them troubleshoot issues. Observability is also a continuous process that must evolve along with the product: as new features are added, engineers must be able to query how those features perform.

Benefits of observability

- It provides a great developer experience when investigating issues, which boosts engineering satisfaction and reduces troubleshooting time.
- It helps you proactively find unknowns that can impact your clients. You should be able to identify issues before they reach clients.
- It reduces the mean time to detect (MTTD) and the mean time to recover (MTTR) from an incident, which helps maintain service quality and avoid breaching service level agreements (SLAs).
- It increases confidence when releasing changes, since engineers can observe how the system reacts, for example whether a change increased latency or impacted availability.
- It enables monitoring your system based on telemetry data and setting alerts to react to issues as they occur.

Use Case

Imagine a product called BananaCake, a cake delivery system. The service is running in production, and suddenly, customers start experiencing issues ordering cakes. This triggers a P1 incident since the product’s main feature is unreliable. The first question from the Client Support team is: What is happening?

There are two possible scenarios:

Scenario A - The system is not observable, and the engineers need significant effort and time to troubleshoot and fix the problem. Meanwhile, the business is losing revenue and damaging its reputation.

Scenario B - The system produces telemetry data, which makes it observable and helps engineers quickly answer what is failing and why customers cannot order new cakes. Ideally, the system would trigger an alert before customers notice the problem. For example, latency (a metric) increases significantly over the last few minutes, triggering an alert. The on-call engineer acknowledges the alert and starts investigating. They look into the traces to find where requests spend the most time. After identifying the service causing the issue, the engineer looks at the logs to understand the root cause. In this case, they find an error log message indicating that the service fails when writing to the database because of the number of open connections. The engineers follow up with a fix and monitor the system to confirm that the change solved the problem.

In this example, we have described a common practice of using observability and monitoring tools to proactively identify issues and ensure the product keeps running with minimal client impact.

Monitoring vs Observability

Monitoring is a subset of observability. It is the ability to use well-known metrics or KPIs to tell us WHEN things are wrong and trigger alerts upon anomalies. For example, an alert is triggered if the system error rate exceeds 5%. Observability, on the other hand, is the capability to query our telemetry data and understand WHY the system behaves in a certain way. For example, by querying the traces and logs, we can understand there is an issue querying the database.
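
The difference can be sketched over a list of structured log records. The field names `service` and `message`, the sample data, and the function names are assumptions made for this example; only the 5% threshold comes from the text.

```python
from collections import Counter
from typing import Any, Dict, List

ERROR_RATE_THRESHOLD = 0.05  # the 5% threshold used in the text


def error_rate(records: List[Dict[str, Any]]) -> float:
    """Share of records whose status code is a server error."""
    if not records:
        return 0.0
    failed = sum(1 for record in records if record["http.status_code"] >= 500)
    return failed / len(records)


def alert_fires(records: List[Dict[str, Any]]) -> bool:
    """Monitoring: a known metric crosses a known threshold (WHEN)."""
    return error_rate(records) > ERROR_RATE_THRESHOLD


def error_causes(records: List[Dict[str, Any]]) -> Counter:
    """Observability: an ad hoc query over telemetry (WHY)."""
    return Counter(
        (record["service"], record["message"])
        for record in records
        if record["http.status_code"] >= 500
    )


if __name__ == "__main__":
    # Hypothetical sample: 18 successful requests and 2 database failures.
    ok = {"service": "orders", "message": "ok", "http.status_code": 200}
    failed = {
        "service": "orders-db",
        "message": "too many open connections",
        "http.status_code": 500,
    }
    records = [ok] * 18 + [failed] * 2
    if alert_fires(records):
        print(error_causes(records).most_common(1))
```

The alert only says that the error rate crossed the threshold; grouping the failing records by service and message is what points to the database.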

Adopting Observability

Observability has become more prevalent in recent years due to the adoption of distributed architectures such as microservices and serverless, which are inherently harder to observe due to their complexity.

Challenges

- It is hard to keep telemetry data consistent across the whole organisation, since ownership is split across multiple teams with different practices.
- Teams use different tools and programming languages.
- Services communicate using different protocols (HTTP, gRPC, AMQP, etc.), which makes it hard to consolidate a view of the system’s behaviour.
- Transferring all the telemetry data adds overhead to the infrastructure.

To help mitigate these challenges, the Cloud Native Computing Foundation (CNCF) created the OpenTelemetry project, which resulted from the merger of the OpenCensus and OpenTracing projects.

OpenTelemetry

OpenTelemetry (OTel) is an open-source observability framework that standardises the format of telemetry data (logs, traces and metrics) and the process of collecting and exporting it.

- Telemetry data follows a single specification, which keeps it consistent across services and teams.
- OpenTelemetry offers SDKs for most major programming languages, which eases adoption across teams. The SDKs are designed to simplify integration, making it easier for engineers to instrument their services.
- It uses an open protocol (the OpenTelemetry Protocol, OTLP) to standardise how telemetry is transmitted, which makes it vendor-agnostic. Migrating vendors then becomes largely a matter of changing the exporter. For example, suppose your company instruments its services with the OTel SDKs in Java and Go and uses New Relic to monitor the system. Migrating from New Relic to a different tool such as Datadog would mainly require changing the OTel exporter, while the instrumentation itself stays unchanged, since both tools consume the same standard.
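
The exporter idea can be illustrated with a small design sketch. Everything here is hypothetical: the `Exporter` protocol, `VendorAExporter`, `VendorBExporter`, and `handle_order` are invented for this example and are not the OpenTelemetry SDK API, which configures exporters in its own way.

```python
import json
from typing import Any, Dict, List, Protocol


class Exporter(Protocol):
    """Hypothetical exporter interface; not the OpenTelemetry SDK API."""

    def export(self, records: List[Dict[str, Any]]) -> None: ...


class VendorAExporter:
    """Stand-in for one backend: keeps serialised records in memory."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    def export(self, records: List[Dict[str, Any]]) -> None:
        self.sent.extend(json.dumps(record, sort_keys=True) for record in records)


class VendorBExporter:
    """Stand-in for another backend: keeps whole batches in memory."""

    def __init__(self) -> None:
        self.batches: List[List[Dict[str, Any]]] = []

    def export(self, records: List[Dict[str, Any]]) -> None:
        self.batches.append(list(records))


def handle_order(exporter: Exporter) -> None:
    """Instrumented code depends only on the Exporter interface."""
    exporter.export([{"name": "order.created", "http.status_code": 200}])


if __name__ == "__main__":
    # Switching backend changes only which exporter is constructed.
    for exporter in (VendorAExporter(), VendorBExporter()):
        handle_order(exporter)
```

Because `handle_order` only knows about the interface, switching from one backend to the other does not touch the instrumented code.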

In summary, observability is essential for building resilient and scalable products. It involves querying system telemetry data, including logs, metrics, and traces, to understand the current state and identify unknown issues. Observability provides benefits such as a better developer experience when debugging, proactive issue detection, reduced incident response time and increased confidence in releasing changes. We also covered the OpenTelemetry project, which helps standardise and simplify observability by providing a framework for collecting and exporting telemetry data.