From Firefighting to Innovation: Using Observability to Boost Software Delivery

Observability is the key to shifting from reactive troubleshooting to proactive problem-solving. By embedding observability into pipelines and workflows, teams can monitor performance, catch issues early, and improve software delivery. In a recent demo, Liatrio showed how pairing Backstage with observability tools like Honeycomb enables better developer workflows, faster incident response, and more reliable delivery.

What Is Observability and Why Is It Crucial?

Observability involves analyzing logs, metrics, and traces to understand a system’s internal state and detect issues. Telemetry automates the collection and analysis of this data, enabling teams to monitor performance and troubleshoot efficiently. Instead of reacting to issues after they escalate, teams can proactively identify and resolve problems using real-time telemetry.
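
As a purely illustrative example of what emitting one of these signals looks like, the TypeScript sketch below wraps an operation in a trace span using the OpenTelemetry API. The tracer, span, and attribute names are assumptions for illustration, not the instrumentation described in the demo.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Tracer name is illustrative; any service or plugin name works here.
const tracer = trace.getTracer('backstage-scaffolder');

export async function runTemplate(templateRef: string): Promise<void> {
  // Wrap the operation in a span so it shows up as a trace downstream.
  await tracer.startActiveSpan('scaffolder.run-template', async span => {
    try {
      span.setAttribute('template.ref', templateRef);
      // ...actual template execution would happen here...
    } catch (err) {
      // Mark the span as errored so error-rate SLIs can count it.
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

Once spans like this flow through a collector to a backend such as Grafana or Honeycomb, the same data can power dashboards, SLOs, and alerts.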

Identifying Issues and Setting Clear Objectives

When we first began our observability journey, we struggled to extract actionable insights. Our initial setup, a basic collector connected to Grafana, highlighted the need to establish clear objectives. We focused on Service Level Objectives (SLOs) for error rates, especially errors that broke templates or prevented users from accessing the site.

Once onboarded to Honeycomb, we encountered roughly three million events per day. By filtering out the noise, we cut the daily event count to just under 500,000 meaningful events, saving time and reducing telemetry costs for clients who are billed by event volume.
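
The team did this filtering in their telemetry pipeline; as a hedged sketch of the general idea, a custom head sampler in the OpenTelemetry Node SDK can drop obviously noisy spans (health checks, static assets) before they are ever exported. The route list and sampler name below are assumptions, not the actual filter rules used in this project.

```typescript
import {
  AlwaysOnSampler,
  Sampler,
  SamplingDecision,
  SamplingResult,
} from '@opentelemetry/sdk-trace-base';
import { Attributes, Context, Link, SpanKind } from '@opentelemetry/api';

// Routes treated as noise: health checks and static asset requests.
const NOISY_ROUTES = ['/healthcheck', '/favicon.ico', '/static'];

export class DropNoisySpansSampler implements Sampler {
  private readonly delegate = new AlwaysOnSampler();

  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[],
  ): SamplingResult {
    const route = String(attributes['http.route'] ?? '');
    if (NOISY_ROUTES.some(prefix => route.startsWith(prefix))) {
      // Drop the span entirely; it never reaches the backend,
      // so it never counts against event-based billing.
      return { decision: SamplingDecision.NOT_RECORD };
    }
    // Everything else is kept.
    return this.delegate.shouldSample(
      context, traceId, spanName, spanKind, attributes, links,
    );
  }

  toString(): string {
    return 'DropNoisySpansSampler';
  }
}
```

The same kind of filtering can also be done centrally in an OpenTelemetry Collector, which is useful when many services share one pipeline.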

Implementing SLOs, SLIs, and Actionable Alerts

To ensure proactive monitoring, we created SLOs and Service Level Indicators (SLIs) for error rates, latency, and availability. Error monitoring became our main focus, with two key types of burn rate alerts in place:

  • General Alerts: Triggered when 10% of the error budget is burned, allowing the team to investigate before users are affected.
  • Immediate Alerts: Triggered when 8% of the error budget is consumed within two hours, signaling critical issues that require immediate action.

These alerts ensured that small issues didn’t escalate, providing teams with early warning signals to maintain system stability.
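
To make the thresholds concrete, here is a minimal, hypothetical sketch of the burn-rate arithmetic behind alerts like these. The SLO target and window size are invented for illustration; in practice the observability backend evaluates burn rates against the SLO you configure there.

```typescript
// Minimal sketch of the burn-rate style checks described above.
// The SLO target and window are illustrative assumptions only.

interface SloConfig {
  target: number;        // e.g. 0.999 => 99.9% of requests must succeed
  windowEvents: number;  // total events expected over the SLO window
}

// Fraction of the error budget consumed by the errors seen so far.
function budgetBurned(errors: number, slo: SloConfig): number {
  const budget = (1 - slo.target) * slo.windowEvents; // allowed errors
  return budget === 0 ? 1 : errors / budget;
}

type Alert = 'none' | 'general' | 'immediate';

// "General" fires at 10% of budget over the whole window;
// "Immediate" fires at 8% of budget within a two-hour slice.
function evaluateAlerts(
  errorsInWindow: number,
  errorsLastTwoHours: number,
  slo: SloConfig,
): Alert {
  if (budgetBurned(errorsLastTwoHours, slo) >= 0.08) return 'immediate';
  if (budgetBurned(errorsInWindow, slo) >= 0.10) return 'general';
  return 'none';
}

// Example: a 99.9% SLO over one million events allows 1,000 errors,
// so 120 errors in the window (12% of budget) triggers a general alert.
console.log(evaluateAlerts(120, 30, { target: 0.999, windowEvents: 1_000_000 }));
```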

Using Dashboards to Drive Continuous Improvement

Dashboards built with Honeycomb provided actionable insights into system performance, template usage, traffic, and latency. By tracking metrics such as template adoption and usage patterns, we identified areas where developer workflows could be optimized. Telemetry data also enabled teams to make data-driven decisions, continuously refining their workflows to improve user experience and delivery efficiency.
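
As one illustrative way to feed a template-usage dashboard, the sketch below records a counter with the OpenTelemetry metrics API. The metric and attribute names are assumptions; the dashboards in this story were built from trace and analytics data in Honeycomb rather than this exact metric.

```typescript
import { metrics } from '@opentelemetry/api';

// Counter a dashboard can break down by template name to show adoption.
const meter = metrics.getMeter('backstage-scaffolder');
const templateRuns = meter.createCounter('scaffolder.template.runs', {
  description: 'Number of times each software template is executed',
});

export function recordTemplateRun(templateName: string, succeeded: boolean) {
  templateRuns.add(1, {
    'template.name': templateName,
    'run.outcome': succeeded ? 'success' : 'error',
  });
}
```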

Why Observability Transforms Delivery

Observability is more than a technical necessity—it’s a competitive advantage. By proactively identifying bottlenecks and errors, teams can reduce downtime, optimize their workflows, and deliver software more reliably. This approach builds trust between teams and stakeholders by ensuring stable and predictable deployments.

Liatrio’s experience with Backstage and Honeycomb showcases how observability can transform software delivery for organizations of all sizes. Contact Liatrio today to learn how we can help your organization implement observability practices and accelerate software delivery.

TRANSCRIPT

Travis and I want to talk about our Backstage journey to enable observability. In November of last year, during a team rechartering session, we decided that one of our key milestones would be to sit down and take observability seriously, enabling it for ourselves. Today, we’re going to cover a few things. First, I’ll give a brief overview of observability and telemetry, go over our problem statement, and then Travis will walk you through our timeline. Before diving into definitions, I just want to note that while we’re showing Backstage’s journey, many of the practices and tools like Backstage and Honeycomb are applicable to any application. Observability and telemetry are technology-agnostic practices.

Observability is the practice of understanding the internal state of a system by analyzing its outputs, such as logs, metrics, and traces. Telemetry is the automated collection, transmission, and analysis of data from systems to monitor performance and detect issues. We needed to shift from constantly reacting to issues to a proactive approach. Instead of waiting to troubleshoot problems, we wanted to use telemetry data to identify critical issues early, measure real-world usage of templates, plugins, and our catalog, and prioritize improvements and features. This way, we could optimize developer workflows, meet managerial requirements, and improve the application and overall customer experience. With that, I’ll pass it to Travis.

Thanks, Connor! I’ll walk you through our journey and highlight a few key decisions we made along the way. Back in October 2024, before officially chartering this observability milestone, we wanted observability but didn’t know exactly what we needed. We deployed an initial collector connected to our old Grafana instance, but we weren’t sure how to use the data effectively. Later, we learned that the observability team was switching to Honeycomb, and we signed up as early adopters. We clarified our objectives and decided to focus on Service Level Objectives (SLOs) around errors, such as templates not working or users being unable to access the site. To prepare, we initialized auto-instrumentation, installed the analytics plugin, and made sure everything was ready for onboarding.

Once onboarded to Honeycomb, we saw a massive influx of traces and needed to filter the data to identify what was useful versus noise. After we gained confidence in the setup, we promoted the collector to production and enabled observability in both dev and prod environments. We created SLOs and Service Level Indicators (SLIs) and agreed to prioritize error monitoring. For this project, error tracking was key, but other metrics like latency and availability are also important. With our SLOs and SLIs set, we created actionable alerts to help team members quickly respond to issues. Afterward, we built dashboards to visualize telemetry data for traffic, template usage, and latency, allowing us to address non-critical but important patterns.

Filtering Events: When we initially pointed the Gateway Collector v2 to our system, we saw millions of daily events—too much noise. We had to prioritize what data was meaningful, cutting down our daily event count from around three million to just under 500,000 while keeping the important information. This process saved time and potentially reduced costs for clients whose telemetry tools charge by event volume.

SLOs and SLIs: We considered monitoring latency, throughput, availability, and errors, but we ultimately focused on error rates. This allowed us to set clear goals for improving user experience and identify problems early, beyond what we could see through traditional logging.

Alerts: We configured two types of burn rate alerts:

  • General Alerts: Triggered when we burn through 10% of our error budget. These notify the Backstage team to investigate before users are affected.
  • Immediate Alerts: Triggered when 8% of the budget is consumed within two hours. These indicate a critical issue that requires immediate attention.

After validating and testing the alerts, we began leveraging dashboards to further analyze telemetry data. My favorite dashboards show template usage and Backstage traffic, giving us insights into user behavior and areas where we can improve. Thanks to the KH team’s contributions, we can also monitor traffic and access patterns through the app, ensuring smooth operations. By using dashboards, we’ve extended our observability efforts beyond SLO monitoring. We can now track metrics like latency, availability, and template usage more effectively.

Why This Matters: The benefits for clients are significant. Teams can shift from reactive to proactive issue resolution, identifying problems before they affect users. Telemetry data provides actionable insights, enabling data-driven decisions that optimize applications and enhance user experience. A better understanding of user behavior fosters trust and satisfaction. This journey has also been a valuable team exercise. We’ve learned a lot about our application and improved our collaboration, thanks to the pairings with the TagOli team. With that, I’ll hand it back to Connor.

Thanks, Travis! To wrap up, our observability journey shows how teams can improve their workflows and optimize delivery. We’ve learned a lot, and this experience will help us apply similar solutions with clients. Thanks to the TagOli team for their support over the past few months. Now, let’s open it up for questions or comments.