[tech] Workflow Idempotency
This post is a continuation of the workflow basics post.
Idempotency of a workflow
Idempotency is the property of certain operations such that they can be applied multiple times without changing the result beyond the initial application.
For example, in HTTP the methods GET, PUT, and DELETE should be implemented in an idempotent manner. If you GET a resource multiple times, then assuming no other changes occur, the result should be the same each time. Similarly, updating a resource with PUT or deleting it with DELETE multiple times should have the same effect as doing it once.
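These HTTP semantics can be illustrated with a minimal sketch. The in-memory `store` dict below is a stand-in for a server-side resource table (not a real HTTP server); `put` and `delete` are hypothetical handlers written to be idempotent:

```python
# A minimal in-memory sketch of idempotent HTTP semantics.
# `store` stands in for a server-side resource table.
store = {}

def put(resource_id, body):
    # PUT replaces the resource wholesale; repeating it changes nothing further.
    store[resource_id] = body
    return store[resource_id]

def delete(resource_id):
    # DELETE removes the resource if present; repeating it is a no-op.
    store.pop(resource_id, None)

# Applying the same PUT twice leaves the store in the same state.
put("user-1", {"name": "Ada"})
snapshot = dict(store)
put("user-1", {"name": "Ada"})
assert store == snapshot

# Deleting twice is equally safe.
delete("user-1")
delete("user-1")
assert "user-1" not in store
```

A non-idempotent operation, such as a POST that appends a new record on every call, would fail both of these checks.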
Idempotency is a useful property since it helps with recovery from failures in distributed systems. If an operation fails with a transient error, it can be retried. If it had actually completed deterministically (with either success or failure), then retrying it does not change the result.
Messaging systems like SQS provide an “at-least-once delivery” guarantee, which means that on rare occasions a message can be delivered multiple times. The Retry Pattern is recommended for HTTP clients to handle transient failures and improve the stability of an application. These are common scenarios where a client might trigger a workflow multiple times while expecting idempotency.
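A generic retry wrapper makes the connection concrete. This is a sketch of the Retry Pattern, not tied to any particular client library; `TransientError` is a placeholder for whatever error class marks retryable failures (timeouts, throttling, and so on):

```python
import time

class TransientError(Exception):
    """Stand-in for errors worth retrying (e.g. timeouts, throttling)."""

def retry(operation, attempts=3, delay=0.0):
    # Retry on transient errors. This is only safe when the operation is
    # idempotent: a call that "failed" from the client's point of view may
    # still have been applied on the server before the response was lost.
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientError as err:
            last_error = err
            time.sleep(delay)
    raise last_error

# An operation that fails once with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise TransientError("timeout")
    return "ok"

assert retry(flaky) == "ok"
```

If `flaky` triggered a workflow, the retry would trigger it twice, which is exactly why the workflow itself needs to be idempotent.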
Let’s get back to AWS Step Functions and look at how idempotency can be achieved.
Idempotency based on name
The API to start a new workflow execution is StartExecution. The parameters to be provided are:
stateMachineArn - identifier for the workflow
name - identifier for the workflow execution
input - input parameters for the first state of the workflow, as a JSON string
StartExecution is built as an idempotent operation.
If StartExecution is called with the same name and input as a currently running execution, the call succeeds and returns the same executionArn and startDate.
If StartExecution is called with a name that was already used within the 90-day retention period, and either the input is different or that execution has completed, then an error is returned signaling that the execution name already exists.
To build an idempotent operation based on workflows it is possible to use the name field. A parameter for the operation will need to be used as the name of the workflow execution. If the workflow execution throws an error with an existing name, it will be possible to lookup the previous execution via the DescribeExecution call and return the response. See an example of this pattern here.
Using the provided idempotency based on name is fairly simple. However, it is limited by the 90-day retention period, and it only provides idempotency at the scope of the entire workflow; it is not possible to retry individual steps within a workflow. See the following pattern for more complex scenarios.
Idempotency based on external state
For complex, long-running workflows, intermediate steps can fail. It may be desirable to restart such a workflow in a way that resumes from the failed intermediate steps. For this, each state within the workflow should itself be idempotent.
To make individual states (or activities, in the diagram above) in a workflow idempotent, extract the metadata about a workflow execution (or computation, as in the diagram) to an external datastore like DynamoDB. Each state within an execution can then look up the current metadata and resume or skip its work accordingly.
The diagram above is from an AWS talk, Under the Covers of AWS: Its Core Distributed Systems. The talk covers various primitives for building distributed systems, including workflows.
With metadata stored in an external service, the name of the workflow execution no longer matters. Instead, unique identifiers derived from the input are used as lookup keys in the metadata store, and each step can record additional information there.
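The step-level lookup described above can be sketched as follows. The `metadata` dict is a stand-in for an external store like DynamoDB, keyed by execution id and step name; real code would use conditional writes to guard against concurrent attempts:

```python
# Stand-in for an external metadata store (e.g. a DynamoDB table),
# keyed by (execution id, step name).
metadata = {}

def run_step(execution_id, step_name, work):
    key = (execution_id, step_name)
    record = metadata.get(key)
    if record is not None and record["status"] == "DONE":
        # Step already completed on a previous attempt: skip the work
        # and return the recorded result.
        return record["result"]
    # Otherwise do the work and record its completion.
    result = work()
    metadata[key] = {"status": "DONE", "result": result}
    return result

# A restarted execution re-runs every step, but completed work is skipped.
runs = {"n": 0}
def expensive_work():
    runs["n"] += 1
    return "artifact"

run_step("exec-1", "build", expensive_work)
run_step("exec-1", "build", expensive_work)  # replay after restart
assert runs["n"] == 1
```

Because the result itself is stored alongside the completion marker, a replayed step returns the same output as the original run, which is what downstream steps need to stay deterministic.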
Using an external datastore adds complexity to the solution, but it allows for more control. It also removes the 90-day limit on workflow idempotency from the previous pattern.
Next
Service Integration patterns with async flows