The microservice programming model has shown its limits, and the industry is progressively adopting durable execution engines to reach new levels of scale in distributed systems engineering.
Durable execution has mostly taken the shape of workflow orchestrators. But are these frameworks the holy grail of distributed systems, or just a stepping stone?
While building and operating Segment’s data infrastructure a few years ago, we experienced firsthand how challenging it can be to run distributed systems that process millions of requests per second. Delivering customer data reliably was one of the most critical business requirements: all core product features relied on it, but more importantly, the trust that customers put in the company correlated directly with the reliability of the data pipeline.
To address this problem, we designed and built Centrifuge, a workflow orchestration engine that powered the delivery of user events to hundreds of destinations on the internet.
At a high level, the solution we developed resembled the workflow orchestrators found in today’s ecosystem.
This framework raised the abstraction level, giving engineers a composable building block that they could use to express the multiple stages of execution needed to implement integrations with internal and external services while delegating the responsibility of reliably executing the workflows to the engine. The decoupling was highly successful; it unlocked orders of magnitude of infrastructure scale and, over time, grew to serve more and more use cases across pillars of the engineering org, becoming a key competitive advantage for the product.
However, software engineers are creative individuals, and ultimately, the workflow orchestration model became too restrictive. Constraints started to arise in many shapes at several steps of the software development process.
Developers wanted to do more with the orchestration engine, and often, solutions required customizing pieces of logic in ways that the workflow framework wasn’t well suited to handle.
The challenges we faced were not unique; other engineers also encountered similar difficulties and took the time to share their stories with us. You can read about some of these here.
After years of development and operations, it started to feel like workflow orchestration had created more problems than it solved.
What if we returned to the drawing board and rethought how to bring scale and reliability to applications?
To resume long-running workloads after a crash or restart, the system must persist the intermediate state periodically. Workflow orchestrators force a style of programming where larger tasks are broken up into steps, and the state to be persisted is simply the input to each step. The computation becomes coupled to the orchestrator and, as a result, becomes harder to build, test, and observe. Imagine writing simple code in your own style without sacrificing fault tolerance. If execution were durable — that is, if execution were to continue across process and node failures — the workflow orchestration abstractions would no longer be necessary.
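As a sketch of this style (the names and journal mechanism here are hypothetical, not any particular orchestrator's API), the engine persists each step's output and replays it on restart, which is why the task must be decomposed into steps in the first place:

```go
package main

import "fmt"

// Step is one unit of a decomposed task; only its input/output is durable.
type Step func(input string) (output string, err error)

// runWorkflow replays persisted outputs so that a restart skips steps that
// already completed. Here the "journal" is an in-memory map; a real engine
// would persist it to durable storage between steps.
func runWorkflow(steps []Step, journal map[int]string, input string) (string, error) {
	for i, step := range steps {
		if out, done := journal[i]; done {
			input = out // recovered from the journal, not re-executed
			continue
		}
		out, err := step(input)
		if err != nil {
			return "", err
		}
		journal[i] = out // persist before moving to the next step
		input = out
	}
	return input, nil
}

func main() {
	// Simulate a restart: step 0 already completed before the crash.
	journal := map[int]string{0: "validated:event"}
	steps := []Step{
		func(in string) (string, error) { return "validated:" + in, nil },
		func(in string) (string, error) { return "delivered:" + in, nil },
	}
	out, _ := runWorkflow(steps, journal, "event")
	fmt.Println(out) // delivered:validated:event
}
```

Note how the control flow lives in the engine's loop rather than in ordinary code: that inversion is the coupling the rest of this post tries to remove.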
Some developers may be familiar with coroutines from languages such as Lua and Kotlin. Others may be familiar with Python’s generators, or with Go’s goroutines, which behave much like coroutines scheduled by the runtime. Coroutines are functions that can be suspended and resumed. In some implementations, coroutines can explicitly yield control to their callers and other coroutines and send and receive data across yield points.
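In Go, the suspend/resume behavior can be emulated with an unbuffered channel (a sketch of the concept, not a language feature): the producer suspends at each send until the consumer pulls the next value, exchanging data across those yield points.

```go
package main

import "fmt"

// counter returns a resume function. The goroutine plays the role of the
// coroutine: each send on the unbuffered channel is a "yield" that blocks
// until the caller asks for the next value.
func counter(limit int) func() (int, bool) {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 1; i <= limit; i++ {
			out <- i // suspend here until the value is consumed
		}
	}()
	return func() (int, bool) {
		v, ok := <-out
		return v, ok
	}
}

func main() {
	next := counter(3)
	for v, ok := next(); ok; v, ok = next() {
		fmt.Println(v) // prints 1, 2, 3
	}
}
```

The limitation the next paragraph points at is visible here: the suspended state lives in the goroutine's stack and the channel, so it cannot be written to disk or moved to another process.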
Although many languages offer coroutines in some form, the suspend and resume operations are intimately tied to the in-memory representation. To be useful in the context of durable execution, the state of a coroutine must be serializable.
A durable coroutine is a variant that can be suspended, serialized, and resumed in another process.
Durable coroutines are a novel solution to the problem of durable execution. When durable coroutines yield, they can be serialized and persisted. Since the serialized representation is not tied to a particular state of memory, a durable coroutine can survive a crash or restart; it can be loaded into a new process and resume from where it left off.
Durable coroutines do not enforce a restrictive programming style. Yield points can be added to any part of the program, either explicitly by a user or library author or implicitly by a compiler, without changing the shape or style of the code. These yield points serve as durable checkpoints but can also be entry points to a scheduler that simplifies the development and operation of distributed systems.
Durable coroutines offer a powerful building block for creating fault-tolerant and highly scalable applications, but the scope of the problem extends beyond serializing state.
The execution of coroutines needs to be driven by a scheduler that orchestrates placement, pauses, state snapshots, and resumption. Integrating durability into scheduling decisions creates opportunities to deliver advanced reliability and introspection features to the application in a non-intrusive fashion. Where and when the coroutines are scheduled can be optimized depending on resource availability or other logical constraints. This model retains the qualities of today’s workflow orchestrators while lifting the restrictions they impose on software developers.
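One way to picture the scheduler's role (a simplified sketch with hypothetical types, not the project's API): at every yield point the scheduler regains control, can checkpoint the serializable state, and decides when to resume each coroutine.

```go
package main

import "fmt"

// Coroutine is a hypothetical unit of durable work: each Resume call runs
// up to the next yield point and reports whether the work is finished.
type Coroutine interface {
	Resume() (done bool)
	Snapshot() []byte // serializable state captured at the yield point
}

// adder counts to a limit, yielding after each increment.
type adder struct{ n, limit int }

func (a *adder) Resume() bool     { a.n++; return a.n >= a.limit }
func (a *adder) Snapshot() []byte { return []byte(fmt.Sprintf("n=%d", a.n)) }

// schedule drives coroutines round-robin. At every yield it could persist
// the snapshot, or hand the coroutine to another node to resume there.
func schedule(cs []Coroutine) {
	for len(cs) > 0 {
		next := cs[:0]
		for _, c := range cs {
			if done := c.Resume(); !done {
				_ = c.Snapshot() // checkpoint: persist or ship elsewhere
				next = append(next, c)
			}
		}
		cs = next
	}
}

func main() {
	schedule([]Coroutine{&adder{limit: 2}, &adder{limit: 3}})
	fmt.Println("all coroutines completed")
}
```

Because the checkpoint happens inside the scheduler loop rather than inside user code, durability and placement policies can evolve without touching the coroutines themselves.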
To explore this new programming model, we launched an open-source project at github.com/stealthrocket/coroutine: a source-to-source Go compiler that turns regular code into durable programs. The model could also apply across other programming languages and domains.
🚀 If you are as excited as we are about durable execution or have ideas about how to build durable coroutines in other languages, we would love to get in touch; please reach out to us!