Fairy Tales of Workflow Orchestration

Chris O'Hara, Achille Roussel, Julien Fabre, Thomas Pelletier | Sep 28th, 2023

The microservice programming model has shown its limits, and the industry is progressively adopting durable execution engines to reach new levels of scale in distributed systems engineering.

Durable execution has, in the majority, taken the shape of workflow orchestrators. But are those frameworks the holy grail of distributed systems or just a stepping stone?

Once upon a time, in hyper-scale land…

While building and operating Segment’s data infrastructure a few years ago, we experienced firsthand how challenging it can be to run distributed systems that process millions of requests per second. Delivering customer data reliably was one of the most critical business requirements: all core product features relied on it, but more importantly, the trust that customers put in the company was in direct correlation with the reliability of the data pipeline.

To respond to this problem, we designed and developed a workflow orchestration engine that powered the delivery of user events to hundreds of destinations on the internet (Centrifuge).

At a high level, the solution we developed somewhat resembled workflow orchestrators we can find in today’s ecosystem.

This framework raised the abstraction level, giving engineers a composable building block that they could use to express the multiple stages of execution needed to implement integrations with internal and external services while delegating the responsibility of reliably executing the workflows to the engine. The decoupling was highly successful; it unlocked orders of magnitude of infrastructure scale and, over time, grew to serve more and more use cases across pillars of the engineering org, becoming a key competitive advantage for the product.

However, software engineers are creative individuals, and ultimately, the workflow orchestration model became too restrictive. Constraints started to arise in many shapes at several steps of the software development process.

Developers wanted to do more with the orchestration engine, and often, solutions required customizing pieces of logic in ways that the workflow framework wasn’t well suited to handle:

  • Integration with existing software is challenging; workflow orchestration requires adapting large portions of a system to be expressed in the framework. Iterative adoption is difficult for users of the workflow platform and systems participating in it.
  • Testing of workflows is much more complex than using common testing constructs that developers were used to working with. This shortcoming leads to productivity losses, instability, and unpredictable system behaviors, highly increasing the overall cost of the solution.
  • Customizing the behavior of the execution graph, the subset of control flow that can be expressed only partially satisfies the needs of developers. Over time, the effectiveness of decoupling the orchestration and application layers diminished; evolving the product often required modifying the engine to support the features that had to be built into the program.
  • Observing and understanding the behavior of applications constructed as orchestrated workflows became highly complex. Logic that used to be written in a single program was now deconstructed across multiple steps. At hyper-scale, classic observability tools fall short of helping solve this problem.

The challenges we faced were not unique; other engineers also encountered similar difficulties and took the time to share their stories with us. You can read about some of these here.

After years of development and operations, it started feeling like workflow orchestration had created more problems than it had solved in the first place.

What if we returned to the drawing board and rethought how to bring scale and reliability to applications?

Durable coroutines: the missing primitive

To resume long-running workloads after a crash or restart, the system must persist the intermediate state periodically. Workflow orchestrators force a style of programming where larger tasks are broken up into steps, and the state to be persisted is simply the input to each step. The computation becomes coupled to the orchestrator and, as a result, becomes harder to build, test, and observe. Imagine writing simple code in your own style without sacrificing fault tolerance. If execution were durable — that is, if execution were to continue across process and node failures — the workflow orchestration abstractions would no longer be necessary.

Some developers may be familiar with coroutines from languages such as Lua and Kotlin. Others may be familiar with Python’s generators or Go’s goroutines, which are coroutines in disguise. Coroutines are functions that can be suspended and resumed. In some implementations, coroutines can explicitly yield control to their callers and other coroutines and send and receive data across yield points.

Although many languages offer coroutines in some form, the suspend and resume operations are intimately tied to the in-memory representation. To be useful in the context of durable execution, the state of a coroutine must be serializable.

A durable coroutine is a variant that can be suspended, serialized, and resumed in another process.

Durable coroutines are a novel solution to the problem of durable execution. When durable coroutines yield, they can be serialized and persisted. Since the serialized representation is not tied to a particular state of memory, a durable coroutine can survive a crash or restart; it can be loaded into a new process and resume from where it left off.

Durable coroutines do not enforce a restrictive programming style. Yield points can be added to any part of the program, either explicitly by a user or library author or implicitly by a compiler, without changing the shape or style of the code. These yield points serve as durable checkpoints but can also be entry points to a scheduler that simplifies the development and operation of distributed systems.

The next chapter of durable execution

Durable coroutines offer a powerful building block to create fault-tolerant and highly scalable applications, but the scope of the problem spans beyond serializing the state.

The execution of coroutines needs to be driven by a scheduler that orchestrates placements, pauses, state snapshots, and code resumption. Integrating durability in the scheduling decision creates opportunities to deliver advanced reliability and introspection features to the application in a non-intrusive fashion. Where and when the coroutines are scheduled can be optimized depending on resource availability or other logical constraints. This model retains the qualities of today’s workflow orchestrators while lifting the restrictions they impose on software developers.

To explore this new programming model, we launched an open-source project at github.com/dispatchrun/coroutine to develop a source-to-source Go compiler that turns regular code into durable programs, and this model could also apply across programming languages and domains.

🚀 If you are as excited as we are about durable execution or have ideas about how to build durable coroutines in other languages, we would love to get in touch; please reach out to us!