What is Graph Club?
Graph Club is a meeting, held weekly or every two weeks, where the team reviews the system’s observability. During this meeting, the team looks at the service level indicators (SLIs), reviews alerts and incidents, and investigates anomalies from the past weeks.
Why do we need the Graph Club?
Systems evolve, and so must their observability. A common mistake is for a team to set up observability for a service once and then stop maintaining it. If the team stops trusting the telemetry data and alerts because of noise, it is likely to miss real issues. Maintaining the system’s observability requires continuous effort, and Graph Club gives the team a regular slot to review and improve it proactively. Observability should also be part of the definition of done when delivering new features: if your team releases a new feature, they must be able to observe how it is performing.
Goals
- Understand how the system behaves
- Discover unknowns and anomalies
- Share knowledge within the team
- Improve observability
Benefits
- Mentor the team so that everyone gets familiar with the observability tools (Datadog, Honeycomb, Kibana, Grafana, etc.).
- It is an excellent opportunity for managers to get more technical and understand the system in detail.
- Continuously improve the system’s observability:
  - Review metrics (availability, latency, errors).
  - Improve the telemetry data: logging, metrics and tracing.
  - Refine alerts and reduce noise.
  - Create monitoring dashboards.
- Prepare the system for potential incidents and help the team become more confident in handling them, e.g. calculating impact and debugging issues.
- Update run books to help the on-caller during an incident.
Adopting Graph Club
To get started with the Graph Club, you will need to do the following:
- Nominate a facilitator - It is essential to have someone facilitating the meeting. If possible, rotate the role weekly so that the team does not rely on a single person. A good approach is to nominate the in-hours on-caller.
- Create an agenda - The agenda is a document for recording the findings from the meeting and the follow-up actions.
- Set up a recurring meeting - Schedule a meeting every week or every two weeks, attach a link to the agenda and invite the whole team. In some scenarios, it might make sense to invite people from other teams to investigate issues with cross-team dependencies.
- Training - For a productive meeting and the team’s benefit, consider providing training on the observability tools so that everyone is familiar with them.
Meeting Structure
The following agenda is just an example, and you should refine it to what makes sense for your team. The agenda should be a guide to help the facilitator and give some structure to the meeting. Include links to the monitoring dashboards to save time during the meeting.
Agenda
- Review all incidents and critical alerts from the previous weeks.
  - (link to the incident dashboard)
  - Are there any noisy alerts that can be tidied up?
- Review service metrics
  - (link to the service metrics dashboard)
  - Do any services have less than 99.9% Internal Availability?
    - Why has this happened?
  - Do any services have p99 latency > 150ms?
    - Why has this happened?
  - Are there any anomalies in the product KPIs? e.g. drop in conversion
  - How is the infrastructure performing (CPU, Memory, Disk, etc.)?
  - How mature is each service, e.g. are there security issues or outdated dependencies?
- Investigate service errors
  - (link to a logging query that shows the service errors)
  - Are there any new service errors? Any unexpected increases?
- Monitoring tools
  - Are the current dashboards relevant and helpful to debug issues?
  - Is the run book useful in case of an incident?
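As a rough illustration of the two threshold questions in the example agenda (availability below 99.9% and p99 latency above 150ms), the sketch below flags the services worth discussing. Everything in it is an assumption made for illustration: the `ServiceStats` structure and the function names are hypothetical, the numbers would normally come from your observability tool rather than from hand-built objects, and the nearest-rank percentile is only one of several common definitions, so results may differ slightly from what your dashboards show.

```python
import math
from dataclasses import dataclass

# Thresholds taken from the example agenda; adjust them to your own targets.
AVAILABILITY_TARGET_PCT = 99.9
P99_LATENCY_LIMIT_MS = 150.0


@dataclass
class ServiceStats:
    """Hypothetical per-service summary exported from a metrics tool."""
    name: str
    total_requests: int
    failed_requests: int
    latencies_ms: list[float]


def availability_pct(stats: ServiceStats) -> float:
    """Percentage of requests that did not fail."""
    if stats.total_requests == 0:
        return 100.0  # no traffic: treat as available (an assumption)
    ok = stats.total_requests - stats.failed_requests
    return 100.0 * ok / stats.total_requests


def p99_ms(latencies_ms: list[float]) -> float:
    """99th percentile latency using the nearest-rank method."""
    if not latencies_ms:
        return 0.0
    ordered = sorted(latencies_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def services_to_discuss(all_stats: list[ServiceStats]) -> list[str]:
    """Return one line per breached threshold, ready to paste into the agenda."""
    findings = []
    for stats in all_stats:
        avail = availability_pct(stats)
        if avail < AVAILABILITY_TARGET_PCT:
            findings.append(
                f"{stats.name}: availability {avail:.2f}% "
                f"is below {AVAILABILITY_TARGET_PCT}%"
            )
        p99 = p99_ms(stats.latencies_ms)
        if p99 > P99_LATENCY_LIMIT_MS:
            findings.append(
                f"{stats.name}: p99 latency {p99:.0f}ms "
                f"is above {P99_LATENCY_LIMIT_MS:.0f}ms"
            )
    return findings


if __name__ == "__main__":
    # Made-up sample data for two hypothetical services.
    sample = [
        ServiceStats("checkout", 10_000, 20, [50.0] * 98 + [200.0, 300.0]),
        ServiceStats("search", 10_000, 5, [40.0] * 100),
    ]
    for line in services_to_discuss(sample):
        print(line)
```

A list like this only tells the team where to look. The "Why has this happened?" follow-up still needs the dashboards, logs and traces linked from the agenda.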
After the Meeting
- The facilitator is responsible for creating the follow-up actions, e.g. raising tickets or collaborating with other teams.
- Document learnings from the session (new errors, anomalies, etc.)
- Update the agenda with the topics discussed and add relevant links (log queries, metrics).
- In case of dependencies with other teams, the facilitator should raise the issue with the respective teams.
In conclusion, by adopting the Graph Club meeting, teams can proactively review the system’s observability, fostering a culture of continuous improvement and vigilance. During the meeting, the facilitator is responsible for ensuring the team follows the agenda. The team should focus on reviewing the service metrics, dashboards and alerts, and on creating follow-up actions that enhance the system’s observability.