Achieving Delivery Excellence: Centralized Workflows and Observability in Action

Scaling software delivery effectively requires two key pillars: centralized workflows and robust observability. Without them, teams face duplicated effort, delayed troubleshooting, and slower deployments. In a recent demo, Liatrio showed how shared workflows and event-driven telemetry can help development teams catch issues faster, reduce operational overhead, and drive continuous improvement.

The Problem: Workflow Duplication and Lack of Visibility

Enterprises often struggle with workflow duplication when individual teams build custom pipelines for their Maven, Go, or Python services. This duplication creates maintenance challenges and accumulates significant technical debt whenever updates or fixes must be rolled out across teams. Our solution treats shared workflows as a centrally managed product: teams contribute improvements while benefiting from consistency and streamlined updates.

Another issue arises from limited observability within CI/CD processes. Bugs in build pipelines or feature flag misconfigurations can go unnoticed until they cause major disruptions. We addressed this by integrating observability directly into CI/CD workflows, ensuring teams can detect and resolve problems before they escalate.

Scenario 1: Enabling Observability in Application Development

We showcased a smooth development process using Backstage starter kits. Developers created a Golang service and onboarded it to observability tooling. A feature flag, implemented with FlagD and OpenFeature, gated a new feature that failed roughly 1 in 10 requests to simulate a real-world bug. We then stress-tested the setup using K6 load testing.
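To make the flag-gated failure concrete, here is a minimal sketch in Go using the OpenFeature SDK with the flagd provider. It is illustrative rather than the demo's actual main.go: the flag key "new-feature", the service name "demo-service", and the handler are placeholders, and exact import paths and constructor signatures vary across SDK and provider versions.

```go
// Minimal sketch of flag-gated failure injection: the OpenFeature Go SDK
// evaluates a boolean flag served by a flagd sidecar, and while the flag is on,
// roughly 1 in 10 requests return a 500.
package main

import (
	"math/rand"
	"net/http"

	flagd "github.com/open-feature/go-sdk-contrib/providers/flagd/pkg"
	"github.com/open-feature/go-sdk/openfeature"
)

func main() {
	// flagd runs as a sidecar, so the provider's default localhost settings apply.
	openfeature.SetProvider(flagd.NewProvider())
	client := openfeature.NewClient("demo-service")

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// The defaultVariant in the flagd config decides whether this evaluates to true.
		enabled, err := client.BooleanValue(
			r.Context(), "new-feature", false, openfeature.EvaluationContext{},
		)
		if err == nil && enabled && rand.Intn(10) == 0 {
			// Simulated bug: ~10% of requests fail while the feature flag is on.
			http.Error(w, "simulated failure", http.StatusInternalServerError)
			return
		}
		w.Write([]byte("ok"))
	})

	http.ListenAndServe(":8080", nil)
}
```

Because the failure lives behind the flag, turning the flag off in the flagd configuration stops the errors without touching application code.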

As failures accumulated, our service-level objective (SLO) triggered Slack notifications, enabling the team to quickly investigate and resolve the issue. By analyzing traces and performance metrics, we pinpointed the root cause, minimizing the impact on availability and performance.
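For the traces themselves, a small amount of instrumentation in the failing path is enough to make the flag-gated errors easy to query. The sketch below assumes the service is already instrumented with OpenTelemetry (for example via HTTP middleware); the attribute key is a placeholder, not the demo's exact field.

```go
// Sketch: annotate the active request span when the flag-gated path fails, so
// trace queries (e.g. in Honeycomb) can group errors by flag state and the SLO
// burn points directly at the offending code path.
package handlers

import (
	"net/http"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// failRequest records the simulated failure on the request's span before returning a 500.
func failRequest(w http.ResponseWriter, r *http.Request) {
	span := trace.SpanFromContext(r.Context())
	// "feature_flag.new_feature" is an illustrative attribute key.
	span.SetAttributes(attribute.Bool("feature_flag.new_feature", true))
	span.SetStatus(codes.Error, "simulated failure behind feature flag")
	http.Error(w, "simulated failure", http.StatusInternalServerError)
}
```

With an attribute like this in place, a query filtered on it separates flag-induced errors from everything else in the trace data.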

Scenario 2: Observability in CI/CD Workflows

On the CI/CD side, we used shared GitHub Actions workflows that were semantically versioned and pinned, with Renovate keeping workflow and dependency versions up to date. An OpenTelemetry collector captured traces from GitHub events, giving the team visibility into workflow failures and version-specific issues. With real-time monitoring dashboards, teams could visualize failure points, track build times, and prioritize optimizations.

This approach not only streamlined failure detection but also helped optimize build performance. As build times increased, teams could identify and address bottlenecks, ensuring faster feedback loops and improved efficiency.

Why Centralized Workflows and Observability Matter

Centralized workflows act as the backbone of efficient delivery, reducing technical debt and making updates easier to manage. When combined with observability, teams gain actionable insights into their systems, allowing them to proactively address issues, optimize performance, and improve collaboration.

By embedding observability into the software development lifecycle, teams can troubleshoot faster, automate dependency management, and deploy with confidence.

Want to transform your delivery process with shared workflows and real-time observability? Contact Liatrio today to see how we can help optimize your CI/CD pipelines and streamline software delivery.

TRANSCRIPT

So, hey, we're the TagO11y crew. We did some cool things in preparation for a potential client demo. It was a real demo; the client was just a potential one, in case you were curious about how I phrased that. We wanted to give context about the demo, what we built for it, and how it ties nicely into what we do at Liatrio. We're going to talk about the tale of two services: the context, what we built for the demo, and how we could use this internally. The scenario is that we had a great call with a potential client. During that call, we talked through some challenges they were facing. They mentioned they'd love to see a demo of specific scenarios, particularly ones involving introducing a bug, both as an application developer and within workflows, identifying the bug, and fixing it. These core operational tasks are essential for both platform teams and application developers.

Before diving into the demo, we talked about centralized workflows, because the client was interested in centralizing their workflows. This is a common pattern we see in enterprises. If you've worked in GitLab, you've probably seen the .gitlab-ci.yml files. Similarly, GitHub uses shared workflows and custom actions. We talked about common problems when workflows aren't centralized, particularly as organizations scale. For example, team one might build workflows for Maven and Go services and a frontend service, duplicating a lot of work. Every other team does the same, resulting in massive duplication. This duplication is painful, and we shared real-world experiences where fixing this type of duplication caused significant technical debt and migration headaches.

To address this, we painted a mental model where shared workflows are treated as a product. A platform team owns them, but they are innersourced and centrally managed. This makes upgrades easier since workflows are shared and maintained consistently across teams. Team one might use the Maven and Go service workflows and get shared workflows for tasks like linting, testing, and deployment. Team two might use the Maven workflow but build something different for Python. Team three, a low-level systems team, decided to use Rust and built their own Rust-specific pipelines while inheriting the shared workflows for common tasks. Over time, they could contribute their Rust workflows back to the shared pool for others to use. We discussed our point of view on platform engineering, what works well, and how to improve things. Then we moved on to what we actually built for the demo, focusing on two scenarios. The first scenario was from the perspective of an application developer, and for that, I'll pass it off to Jarrett.

Good morning, everybody. Let me get my screen share up. Are you all able to see the Backstage template? Great. The first scenario starts with an application developer creating a new service using a starter kit. For this demo, the starter kit consisted of a sequence of Backstage templates. We started with a Golang template to create a basic repository and application. From there, we onboarded the application to the platform and completed Honeycomb onboarding to start emitting telemetry and making useful queries and visualizations. Once the starter kit was set up and the service was onboarded, the developer introduced a new feature to production behind a feature flag. We implemented this feature flag using the FlagD backend and the OpenFeature SDK.
In the deployment YAML file, FlagD runs as a sidecar container and serves the flag value from a codified JSON ConfigMap. The feature flag is controlled by the defaultVariant attribute: when it is set to "on," the feature is enabled. In the application's code (in main.go), we initialized an OpenFeature provider using FlagD as the backend. Then we used an OpenFeature client to evaluate the feature flag's value and make decisions based on it. For this demo, when the feature flag was enabled, there was a 1-in-10 chance of failing an HTTP request to simulate a real bug.

We created traffic to test this feature flag using K6 performance testing. K6 made it easy to set up test scenarios. We had two scenarios: a high-load test with 100 virtual users making requests every 0.1 seconds for one minute, and a laid-back test with 10 virtual users making requests every second for 10 minutes. With the feature flag enabled, errors began to crop up in the system. We defined an SLO for latency and availability using the "endpoint is available and fast" SLI, ensuring requests were completed within a specified duration. If the error budget was exceeded, we expected to receive a notification in Slack. As expected, after the feature was introduced and errors occurred, the SLO triggered a Slack notification. This alerted the team to investigate and resolve the issue using traces to pinpoint the bug. That's the scenario from the application developer's perspective. Now, I'll pass it back to Adriel.

Thanks, Jarrett! That was a great example of how application developers can have a smooth experience introducing features while maintaining observability and alerts. If you're curious about the starter kit, it was originally built for Lululemon, and there's a recorded demo available if you want to explore it further. You can also play around with it in Backstage. The second scenario involved the same concept but focused on the workflow side of the house. For that, I'll hand it off to Sam.

Thanks, Adriel. Can everyone see the TagO11y GitHub Actions repo? Great. We created a dedicated repo for shared GitHub Actions in our demo project. This example showcases a build and test workflow that we semantically versioned and pinned using Renovate. For this scenario, we introduced an error into the workflow and monitored it using Honeycomb queries. We had an OpenTelemetry collector creating traces from GitHub events, and the query showed errors occurring within the build and test workflow. Drilling down into one of the traces revealed the specific step that failed. We created a trigger in Honeycomb to detect any workflow failures, grouped by workflow reference and service name. This allowed us to see which release version caused the failure. Additionally, we set up Slack notifications to alert the team when a failure occurred, specifying the version of the workflow that triggered the issue. This way, even if a developer merged a pull request and didn't notice a failure until hours later, the platform team would be alerted and able to act.

This interconnected approach to the software development lifecycle (SDLC) ensures all components operate together effectively. Beyond just failure notifications, it's important to track metrics like build times. As build times increase, we can address and optimize them using the same SLO/SLI principles. This demo connected many pieces of our existing infrastructure, and we believe it went pretty well.