[tech] Workflow service integration patterns & system design
This article is the third part of the series on workflows with AWS Step Functions, continuing from workflow idempotency.
Service Integration Patterns
A workflow needs to invoke operations on external systems through integrations. These can be other AWS services or custom non-AWS services such as HTTP endpoints. Step Functions offers three service integration patterns, listed below.
1 Request Response
In this mode, a call is made to the integration service, and if that call succeeds, the step in the workflow is marked successful. If the integration starts a long-running task, the workflow does not wait for that task to complete. It only initiates the task, in a fire-and-forget style.
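As a rough illustration, a Request Response task state could look like the sketch below, written as a Python dict that serializes to Amazon States Language. The queue URL, the state names and the `$.order` input path are placeholders invented for this example, not taken from the sample project.

```python
import json

# Hypothetical queue URL, used only for illustration.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"

request_response_state = {
    "Type": "Task",
    # No suffix on the resource ARN: the state completes as soon as the
    # SendMessage API call itself returns successfully.
    "Resource": "arn:aws:states:::sqs:sendMessage",
    "Parameters": {
        "QueueUrl": QUEUE_URL,
        # The ".$" suffix reads the value from the state input (assumed shape).
        "MessageBody.$": "$.order",
    },
    "Next": "NotifyCustomer",  # hypothetical next state
}

# Serialize the fragment as it would appear under "States" in a definition.
definition_json = json.dumps({"SendOrder": request_response_state}, indent=2)
print(definition_json)
```

Whatever consumes the message later has no influence on this state; only the outcome of the API call matters.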
2 Run a Job (.sync)
For some AWS services, it is possible to trigger a long-running task and have the workflow wait until the task completes. This integration is available for job-style services such as AWS Batch. The workflow's state transitions then follow the actual completion of the tasks, which simulates a synchronous flow. Use this pattern when the next state needs the output of the task.
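A minimal sketch of such a state, assuming an AWS Batch job: the only structural difference from Request Response is the `.sync` suffix on the resource ARN. The job name, ARNs, result path and next state are hypothetical values chosen for the example.

```python
import json

run_job_state = {
    "Type": "Task",
    # The ".sync" suffix makes the workflow wait until the Batch job finishes.
    "Resource": "arn:aws:states:::batch:submitJob.sync",
    "Parameters": {
        # All three values below are illustrative placeholders.
        "JobName": "nightly-report",
        "JobDefinition": "arn:aws:batch:us-east-1:123456789012:job-definition/report:1",
        "JobQueue": "arn:aws:batch:us-east-1:123456789012:job-queue/default",
    },
    # Keep the job result alongside the original input for the next state.
    "ResultPath": "$.batchResult",
    "Next": "PublishReport",  # hypothetical next state
}

run_job_json = json.dumps({"RunReportJob": run_job_state}, indent=2)
print(run_job_json)
```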
3 Asynchronous Callback with task token
For asynchronous tasks, the callback mechanism provides a way to pause a workflow till the task completes. This mechanism can be used in scenarios like requiring human action or even integration with external APIs.
When an integration is configured to wait for a task token, the token becomes available to the state through the context object as “$$.Task.Token”. The token is generated while the workflow runs, and a new value is generated for every execution. The task token is a long random string that can be shared with the external integration. Once the token has been handed over, the workflow pauses at the corresponding state. When the task is complete, the external client must call the Step Functions API with the task token (SendTaskSuccess or SendTaskFailure) to mark the task as succeeded or failed, after which the workflow resumes. For long-running tasks, you can also configure a heartbeat interval: the external client then reports progress with SendTaskHeartbeat, and the state times out if no heartbeat arrives within the interval.
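The sketch below shows what a callback state might look like when the token is handed over in an SQS message. The queue URL, the message field names (`orderId`, `taskToken`), the timeout values and the next state are assumptions made for illustration.

```python
import json

# Hypothetical queue URL, used only for illustration.
APPROVAL_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/approvals"

callback_state = {
    "Type": "Task",
    # ".waitForTaskToken" pauses the workflow until the token is returned.
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": APPROVAL_QUEUE_URL,
        "MessageBody": {
            "orderId.$": "$.orderId",       # read from the state input (assumed shape)
            "taskToken.$": "$$.Task.Token",  # read from the context object
        },
    },
    # Fail the state if no heartbeat arrives for 5 minutes (illustrative value).
    "HeartbeatSeconds": 300,
    # Upper bound for the whole wait, here one day (illustrative value).
    "TimeoutSeconds": 86400,
    "Next": "ShipOrder",  # hypothetical next state
}

callback_json = json.dumps({"WaitForApproval": callback_state}, indent=2)
print(callback_json)
```

Setting a timeout is a deliberate choice here: without one, a lost token would leave the execution paused until the execution itself expires.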
The diagram below shows a sample task token integration with an SQS message.
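On the consuming side of such a flow, the worker that reads the SQS message has to report back with the token. The function below is a minimal sketch, assuming the message body is JSON that carries the token under a hypothetical `taskToken` key; `complete_task` and `process` are names invented for this example. The client is passed in so that a `boto3.client("stepfunctions")` instance can be used in production and a stub in tests.

```python
import json


def complete_task(sfn_client, message_body, process):
    """Run `process` on one SQS message body and report the outcome.

    `sfn_client` is expected to offer send_task_success / send_task_failure
    with the same keyword arguments as the boto3 Step Functions client.
    Returns True when the task was reported as succeeded, False otherwise.
    """
    payload = json.loads(message_body)
    token = payload["taskToken"]  # assumed field name in the message
    try:
        result = process(payload)
    except Exception as exc:
        # Resume the workflow on its failure path.
        sfn_client.send_task_failure(
            taskToken=token,
            error=type(exc).__name__,
            cause=str(exc)[:256],  # keep the cause short
        )
        return False
    # Resume the workflow; `output` must be a JSON string.
    sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
    return True
```

For work that outlasts the configured heartbeat interval, the same client also offers `send_task_heartbeat(taskToken=...)`, which the worker would call periodically while processing.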
System designs with Workflows
A system (or service) is built by assembling various internal sub-systems. The system's functionality is exposed to external clients via APIs. Clients that belong to your customers are typically considered external, since you have limited control over how those clients can be changed.
When designing systems with workflows, you can choose to expose the workflow engine as a first-class integration point, or hide it behind a tightly controlled API endpoint as an internal implementation detail.
Exposed workflows
The workflow engine can be exposed to all clients and integrating services as a platform capability. Choose this option when working with a cohesive group of systems that are not exposed outside the company or organization.
In the diagram above (in the callback with task token section), the process (or service) responds to Step Functions with the task token. This implies that the calling service is aware of the AWS integration (rather than an HTTP API endpoint) and has the correct AWS IAM permissions to make the call. This is a scenario where the workflow engine (i.e. Step Functions) is exposed in the integration. All participants of the workflow have to learn the workflow engine APIs in addition to the state requirements. In return, the complete system has fewer abstractions.
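To make the permission requirement concrete, a participating service would need an IAM policy along the lines of the sketch below, shown as a Python dict for consistency with the other examples. The statement ID is made up, and the wildcard resource is an assumption for illustration; check the current IAM reference for Step Functions before narrowing or reusing it.

```python
import json

callback_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowTaskTokenCallbacks",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": [
                "states:SendTaskSuccess",
                "states:SendTaskFailure",
                "states:SendTaskHeartbeat",
            ],
            # Placeholder assumption: scope this down where your setup allows.
            "Resource": "*",
        }
    ],
}

policy_json = json.dumps(callback_policy, indent=2)
print(policy_json)
```

Every service that completes a step needs a policy like this, which is part of what it means for the workflow engine to be exposed.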
Internal workflows
For most systems, I recommend treating the workflow engine as an internal implementation detail that is not exposed to other clients or systems. The baseline for exposed APIs could be REST endpoints, with asynchronous APIs implemented internally using Step Functions workflows. Integrators and clients of the REST APIs then do not need to know about workflows; they only need to know the REST API contract. This keeps the design flexible for the future, at the cost of building more integration layers.
The following flow diagram shows the architecture of three services: a client, a system made of three components (API gateway, Lambda and Step Functions), and an external system.
The primary system exposes REST APIs through AWS API Gateway endpoints, which are backed by Lambda functions. The Lambda functions invoke the workflow APIs. View the complete example here. In this example, the workflow includes a step that needs to be completed by an external service. It is implemented by passing the task token to the external service as an HTTP POST parameter. The external service later invokes the REST API to update the status of the workflow asynchronously. The external system and clients only know the API contract exposed at the gateway layer.
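One way to keep the engine hidden is to start executions from a Lambda function behind the REST endpoint and return only an opaque identifier. The handler below is a sketch under assumptions: `make_start_handler`, the `requestId` response field and the 202 status code are choices made for this example, and the event is assumed to follow the API Gateway proxy format with a JSON string in `body`.

```python
import json
import uuid


def make_start_handler(sfn_client, state_machine_arn):
    """Build a Lambda-style handler that starts one workflow per request.

    `sfn_client` is expected to offer start_execution with the same keyword
    arguments as the boto3 Step Functions client.
    """

    def handler(event, context):
        body = json.loads(event.get("body") or "{}")
        # The client only ever sees this ID, never the execution ARN.
        request_id = str(uuid.uuid4())
        sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            name=request_id,  # doubles as the execution name
            input=json.dumps(body),
        )
        return {
            "statusCode": 202,  # accepted; processing continues asynchronously
            "body": json.dumps({"requestId": request_id}),
        }

    return handler
```

Because the response carries no Step Functions identifiers, the workflow could later be replaced by a different implementation without changing the REST contract.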
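The callback endpoint in such a design could be sketched as follows. This is not the code from the linked example; the request fields (`taskToken`, `status`, `result`, `reason`), the status values and the error name are hypothetical, and the event is assumed to follow the API Gateway proxy format. The Lambda function translates the REST call into the Step Functions API so that the external service never talks to AWS directly.

```python
import json


def make_callback_handler(sfn_client):
    """Build a Lambda-style handler that completes a paused workflow step.

    `sfn_client` is expected to offer send_task_success / send_task_failure
    with the same keyword arguments as the boto3 Step Functions client.
    """

    def handler(event, context):
        body = json.loads(event.get("body") or "{}")
        token = body.get("taskToken")
        status = body.get("status")
        if not token or status not in ("SUCCEEDED", "FAILED"):
            return {"statusCode": 400, "body": json.dumps({"error": "invalid request"})}
        if status == "SUCCEEDED":
            sfn_client.send_task_success(
                taskToken=token,
                output=json.dumps(body.get("result", {})),
            )
        else:
            sfn_client.send_task_failure(
                taskToken=token,
                error="ExternalTaskFailed",  # hypothetical error name
                cause=str(body.get("reason", ""))[:256],
            )
        return {"statusCode": 200, "body": json.dumps({"accepted": True})}

    return handler
```

Only this Lambda function needs IAM permissions for the task callbacks; the external service needs nothing beyond access to the REST endpoint.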
References
Service integration patterns from AWS docs
Architectural patterns - Choreography pattern
Sample workflow code