HA/DR Strategy for Self-Hosted Agents
How to set up your self-hosted agent in an Active/Passive configuration
For a high-availability configuration of the env0 self-hosted agent, install the agent in two Kubernetes clusters located in different availability zones. You can then configure the self-hosted agent in the following mode:
Active-Passive
- Set up the failover agent in a separate region or availability zone.
- Once you have confirmed that the agent is live and running (for example, via kubectl rollout status on the trigger deployment), scale the deployment down to 0:
kubectl scale deploy <agentname>.agent-trigger-deployment --replicas=0
If using the proxy pod, also scale the proxy deployment to 0:
kubectl scale deploy <agentname>.agent-proxy-deployment --replicas=0
- When an outage occurs in the primary cluster, scale the failover deployments back up to 1 (a scripted sketch of this step follows these commands):
kubectl scale deploy <agentname>.agent-trigger-deployment --replicas=1
If using the proxy pod, also scale the proxy deployment back to 1:
kubectl scale deploy <agentname>.agent-proxy-deployment --replicas=1
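Putting the failover step together, the snippet below is a minimal sketch of activating the standby agent. It assumes the agent runs in a namespace named env0-agent and that your kubeconfig has a context named standby for the failover cluster; these names, like the agent name, are placeholders to adjust to your installation.

```bash
#!/usr/bin/env bash
# Hypothetical failover sketch: activates the standby env0 agent.
# The context name (standby), namespace (env0-agent), and agent name
# are assumptions; replace them with values from your own installation.
set -euo pipefail

CTX="standby"        # kubeconfig context of the failover cluster (assumption)
NS="env0-agent"      # namespace the agent is installed in (assumption)
AGENT="<agentname>"  # your agent name, as used in the deployment names

# Bring the trigger deployment up on the standby cluster
kubectl --context "$CTX" -n "$NS" scale deploy "${AGENT}.agent-trigger-deployment" --replicas=1

# If you use the proxy pod, bring it up as well
kubectl --context "$CTX" -n "$NS" scale deploy "${AGENT}.agent-proxy-deployment" --replicas=1

# Wait until the trigger deployment is live before resuming deployments
kubectl --context "$CTX" -n "$NS" rollout status deploy "${AGENT}.agent-trigger-deployment" --timeout=120s
```

If the primary cluster is still reachable, scale its deployments to 0 first so that only one agent is active at a time.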
Things to consider 🤔
Working Directory
When failing over to a new cluster, the working directory may not be in sync or backed up. As a result, you most likely will not be able to resume or approve a deployment that was waiting for approval. Triggering a fresh redeployment will let you resume your deployments in the new cluster.
- Alternatively, you can use env0-Hosted Encrypted State to resume deployments across multiple clusters.
State File
When implementing a failover strategy for self-hosted env0 agents, it is critical to ensure that Terraform state files remain accessible and consistent across clusters.
Losing state sync
If the state is lost or becomes unsynchronized, deployments may fail or lead to infrastructure drift.
🟢 Remote State Storage (Recommended)
For best reliability, state should be stored in a remote backend that remains accessible regardless of cluster failures. This ensures that when a standby cluster is activated, it can resume deployments without losing state.
The common options are:
- env0's remote backend, which is backed up across two AWS regions, providing built-in redundancy
- Terraform's remote backend
- AWS S3 + DynamoDB for state locking (see the sketch after this list)
- Google Cloud Storage (GCS)
- Azure Blob Storage
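For the S3 + DynamoDB option, the backend's prerequisites can be provisioned ahead of time in a region unaffected by a cluster outage. The sketch below uses the AWS CLI; the bucket name, table name, and region are placeholders.

```bash
# Hypothetical prerequisites for an S3 + DynamoDB Terraform backend.
# Bucket name, table name, and region are placeholders; pick your own.

# Versioned S3 bucket to hold the state files
aws s3api create-bucket --bucket my-terraform-state --region us-east-1
aws s3api put-bucket-versioning --bucket my-terraform-state \
  --versioning-configuration Status=Enabled

# DynamoDB table used by Terraform for state locking (key must be LockID)
aws dynamodb create-table --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```

Because the bucket and table live outside either cluster, the standby agent can pick up the same state the moment it is scaled up.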
🟡 State Stored in the Agent Namespace (Less Ideal)
If the Terraform state is stored within the env0 agent's Kubernetes namespace (read more about it here), extra steps are needed to keep the state in sync between the primary and standby clusters. This may involve:
- Persistent storage replication across clusters.
- A manual or automated process to copy state from the failed cluster to the standby (see the sketch after this list).
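One possible approach, sketched below, is to copy the state data out of the primary cluster on a schedule while it is still healthy, so the standby already holds a recent copy at failover time. The context names, namespace, pod name, and path are all hypothetical; they depend on where your agent actually persists state.

```bash
#!/usr/bin/env bash
# Hypothetical state-sync sketch; run periodically while the primary is healthy.
# Context names (primary/standby), namespace, pod, and path are assumptions.
set -euo pipefail

NS="env0-agent"           # namespace of the agent (assumption)
POD="env0-state-pod"      # pod that mounts the state volume (hypothetical)
STATE_PATH="/data/state"  # location of the state data inside the pod (hypothetical)

# Copy the state out of the primary cluster (kubectl cp requires tar in the container)
kubectl --context primary -n "$NS" cp "$POD:$STATE_PATH" ./state-backup

# Push the copy into the standby cluster so it is ready at failover time
kubectl --context standby -n "$NS" cp ./state-backup "$POD:$STATE_PATH"
```

If your clusters already share replicated persistent storage, this copy step may be unnecessary.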