Release Execution Runbook
This runbook guides you through executing a complete LendOS release from branch creation through production deployment.
Target Audience: DevOps Engineers executing releases
Prerequisites:
- AWS Console access for RDS operations
- Kubernetes contexts configured:
  - `lendos-stg/SREOperatorAccess` (Staging)
  - `bxci-prod/SREOperatorAccess` (Production)
  - `lendos-dev/SREOperatorAccess` (Dev)
- AWS CLI assume-role working for `artifacts-prod/SREOperatorAccess`
- kubectl and k9s installed and working
- GitHub Codespaces access
- Access to Teams for product coordination (transitioning to ticket-based)
Overview: Release Phases​
A complete release follows this flow:
```text
Phase 1: Preparation
├─ Cut release branch from develop
├─ Create DAML upgrade branch
└─ Build DAML upgrade Docker images

Phase 2: Staging Deployment
├─ Merge release branch to main (triggers artifact builds)
├─ Update deployment configurations (values-tags.yaml)
├─ Execute DAML upgrade in Staging
├─ Product smoke testing
└─ Product approval for production

Phase 3: Production Deployment
├─ Execute DAML upgrade in Production
├─ Product smoke testing
└─ Complete gitflow (merge to develop)
```
Typical Timeline:
- Staging DAML upgrade: 30-45 minutes
- Production DAML upgrade: 1-1.5 hours
- Total release time: 2-3 hours (including coordination)
Phase 1: Preparation​
1.1 Cut Release Branch​
Release branches follow the CalVer format documented in BRANCHING.md.
```bash
# From develop (usually HEAD)
git checkout develop
git pull origin develop

# Create release branch (format: release/YYYY.R.H)
# Example: release/2025.4.0
git checkout -b release/2025.4.0
```
1.2 Update VERSION File​
CURRENT PROCESS (Manual - to be automated in future stories):
Edit the VERSION file at repository root to match the release version:
```bash
# Edit VERSION file
echo "2025.4.0" > VERSION
git add VERSION
git commit -m "Bump VERSION to 2025.4.0 for release"
git push origin release/2025.4.0
```
Why this matters: The VERSION file is the source of truth for the release version. Future automation will enforce this matches the branch name.
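The planned enforcement could look something like the following check. This is a hypothetical sketch: `version_matches_branch` is an illustrative helper, not an existing script, and in real use the arguments would come from `git branch --show-current` and the VERSION file.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check: the VERSION file must equal the release
# branch name minus its "release/" prefix.
set -euo pipefail

version_matches_branch() {
  local branch="$1" version="$2"
  # Strip the "release/" prefix and compare with the VERSION contents
  [ "${branch#release/}" = "$version" ]
}

if version_matches_branch "release/2025.4.0" "2025.4.0"; then
  echo "VERSION matches branch"
fi
# → VERSION matches branch
```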
1.3 Create DAML Upgrade Branch​
The DAML upgrade process requires a separate branch to generate upgrade code.
```bash
# From the release branch
git checkout -b upgrade/2025.4.0
```
Follow the instructions in services/portal-daml-upgrade/README.md:
```bash
cd services/portal-daml-upgrade

# Generate upgrade code
# Format: ./generate-upgrade.sh <old-version> <new-version> <old-data-loader-version>
# Example versions - adjust based on your actual current/target versions
./generate-upgrade.sh 0.1.92 0.1.93 0.1.92
```
Important Notes:
- The `generate-upgrade.sh` script regenerates files in `src/main/daml/Generated`
- Files in `src/main/daml/Upgrade` and `src/test/daml` won't be overwritten if they exist
- Test locally using the local setup process documented in the README
- CURRENT PROCESS: The upgrade branch is NOT merged back (to be improved in Story #5832)
1.4 Build and Push DAML Upgrade Docker Images​
Prerequisites:
- Launch a GitHub Codespace (or use a local environment with AWS access)
- Assume the `artifacts-prod/SREOperatorAccess` role
```bash
# Checkout upgrade branch in codespace
git fetch origin
git checkout upgrade/2025.4.0
cd services/portal-daml-upgrade

# Assume AWS role for ECR access
assume artifacts-prod/SREOperatorAccess

# Build and push images (creates both -init and -runner variants)
./docker-build.sh --push
```
Expected Output:
- Images tagged as `portal-daml-upgrade:{target_base}.{build}-init` and `portal-daml-upgrade:{target_base}.{build}-runner`
- Example: `portal-daml-upgrade:2025.8.1.1-init`, `portal-daml-upgrade:2025.8.1.1-runner`
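As a sketch of the tag scheme only (illustrative - `upgrade_image_tags` is not part of `docker-build.sh`), both variants share the same `{target_base}.{build}` prefix and differ only in suffix:

```shell
#!/usr/bin/env bash
# Illustrative derivation of the -init/-runner tag pair.
set -euo pipefail

upgrade_image_tags() {
  local target_base="$1" build="$2"
  echo "portal-daml-upgrade:${target_base}.${build}-init"
  echo "portal-daml-upgrade:${target_base}.${build}-runner"
}

upgrade_image_tags "2025.8.1" "1"
# → portal-daml-upgrade:2025.8.1.1-init
# → portal-daml-upgrade:2025.8.1.1-runner
```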
Verification:
```bash
# Verify images are in ECR
aws ecr describe-images \
  --repository-name right-pedal/portal-daml-upgrade \
  --region us-east-1 \
  --query 'imageDetails[*].imageTags' \
  --output table
```
Phase 2: Staging Deployment​
2.1 Run Preflight Checks​
Before merging, run preflight checks to validate the release branch:
```bash
# Run locally (basic validation)
lendos preflight

# Run against the PR (full validation with milestone and approval checks)
lendos preflight --pr <PR_NUMBER>
```
Checks performed:
- VERSION file matches release branch
- No generated code committed
- Milestone complete (all issues closed)
- Product team approval on PR
See Preflight Checks Documentation for details.
GitHub Actions: The workflow runs automatically on release PRs to main and must pass before merge.
2.2 Merge Release Branch to Main​
CURRENT PROCESS (To be improved in Stories #5825, #5826):
```bash
# Create PR: release/2025.4.0 → main
gh pr create \
  --base main \
  --head release/2025.4.0 \
  --title "Release 2025.4.0" \
  --body "Release 2025.4.0 - includes [list key features/fixes]"

# Once approved and preflight passes, merge
gh pr merge --merge
```
What Happens Next:
- GitHub Actions automatically builds all changed services
- Uses `codacy/git-version` to generate semver tags (e.g., `v0.0.829`, `v0.0.567`)
- Publishes images to ECR
Known Pain Point: We rebuild artifacts on main that were never tested. Future process will build RC artifacts on release branches and promote them (Stories #5825, #5826).
2.3 Identify Built Artifact Tags​
CURRENT PROCESS (Manual - to be automated in Story #5827):
After the main build completes, identify the generated tags:
```bash
# Check GitHub Actions for build completion
gh run list --branch main --limit 5

# Check ECR for latest tags
aws ecr describe-images \
  --repository-name right-pedal/lendos-portal-backend \
  --region us-east-1 \
  --query 'sort_by(imageDetails,&imagePushedAt)[-5:].imageTags' \
  --output table
```
Record the tags for:
- LendOS Portal Backend
- LendOS Portal Frontend
- Signing Portal Backend
- Signing Portal Frontend
2.4 Update values-tags.yaml for Staging​
Location: lendos-eks-workloads repository
CURRENT PROCESS (Manual - to be automated in Story #5827):
```bash
# Clone/update lendos-eks-workloads repo
cd /path/to/lendos-eks-workloads
git checkout main
git pull origin main
git checkout -b release/2025.4.0-values

# Edit values-tags files
# Update for staging first
# Files: values-tags-staging.yaml or similar (exact paths may vary)
```
Update the image tags for all services to the tags identified in the previous step.
```bash
git add .
git commit -m "Update values-tags for release 2025.4.0"
git push origin release/2025.4.0-values

# Create PR
gh pr create \
  --base main \
  --title "Update tags for release 2025.4.0" \
  --body "Updates image tags for release 2025.4.0 deployment"

# Get approval and merge
```
2.5 Execute DAML Upgrade in Staging​
See detailed instructions in Section 3: DAML Upgrade Procedure below
Use environment: Staging
- Kubernetes context: `lendos-stg/SREOperatorAccess`
- Canton Aurora cluster: `canton-0-1-54`
2.6 Product Smoke Testing & Approval​
Communication: Teams (transitioning to ticket-based approval)
Approver: Usually Luis (or designated backup)
Product Team Actions:
- Execute smoke tests against staging environment
- Verify key workflows function correctly
- Approve for production deployment
DevOps Actions:
- Monitor for any issues reported during smoke testing
- Be available to address any problems discovered
- Document any issues and resolutions
Phase 3: Production Deployment​
3.1 Update values-tags.yaml for Production​
Follow the same process as the staging values-tags update in Phase 2, but for the production values files.
3.2 Execute DAML Upgrade in Production​
See detailed instructions in Section 3: DAML Upgrade Procedure below
Use environment: Production (BXCI)
- Kubernetes context: `bxci-prod/SREOperatorAccess`
- Canton Aurora cluster: `canton-20241104-0324`
3.3 Product Smoke Testing​
Product team repeats smoke tests in production environment.
3.4 Complete Gitflow Process​
CURRENT PROCESS:
Per BRANCHING.md, after the release is merged to main, we need to sync those changes back to develop. We do this by opening a PR from main to develop:
```bash
# Create PR: main → develop
gh pr create \
  --base develop \
  --head main \
  --title "Sync release 2025.4.0 to develop" \
  --body "Syncs changes from release 2025.4.0 (now in main) back to develop"

# Review and merge
gh pr merge --merge
```
This ensures develop gets the actual merge commit from main, keeping the branches properly synchronized.
Cleanup Decisions:
- Release branch: Can be deleted or kept for reference after both merges
- Upgrade branch: CURRENT PROCESS is to leave it open (to be improved in Story #5832)
Section 3: DAML Upgrade Procedure​
This section provides detailed instructions for executing a DAML upgrade in any environment (Staging or Production).
Prerequisites​
Before starting the upgrade:
- Product Approval Obtained
  - For Staging: Ready for smoke test
  - For Production: Staging smoke test passed, approved for production
- Access Verified
  - Correct Kubernetes context configured
  - AWS console access for RDS snapshots
  - `lendos-eks-workloads` repository cloned locally
- DAML Upgrade Images Built
  - Images from Phase 1.4 available in ECR
  - Image tags known (e.g., `v0.1.93.0.1.92.1-init`, `v0.1.93.0.1.92.1-runner`)
3.1 Pre-Upgrade Steps​
3.1.1 Scale Down Services​
Services must be stopped before the upgrade to prevent contract conflicts.
Using kubectl:
```bash
# Set context (adjust for environment)
CONTEXT="lendos-stg/SREOperatorAccess"  # or bxci-prod/SREOperatorAccess for production

# Scale to 0 (order doesn't matter)
kubectl scale deployment lendos-portal --replicas=0 --context=$CONTEXT -n lendos-portal
kubectl scale deployment signing-portal --replicas=0 --context=$CONTEXT -n signing-portal
kubectl scale deployment pqs --replicas=0 --context=$CONTEXT -n pqs
kubectl scale deployment daml-http-json --replicas=0 --context=$CONTEXT -n daml-http-json

# Verify all pods are terminated
kubectl get pods --context=$CONTEXT -n lendos-portal
kubectl get pods --context=$CONTEXT -n signing-portal
kubectl get pods --context=$CONTEXT -n pqs
kubectl get pods --context=$CONTEXT -n daml-http-json
```
Using k9s (Alternative):
```bash
k9s --context=$CONTEXT
# Navigate to deployments (:deployments)
# Use namespace filters to find each service
# Select each service and scale to 0 (s key)
```
Services to scale down (each in its own namespace):
- `lendos-portal` (namespace: lendos-portal)
- `signing-portal` (namespace: signing-portal)
- `pqs` (namespace: pqs) - Participant Query Store
- `daml-http-json` (namespace: daml-http-json)
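Story #5824 aims to script these shutdown/startup operations. A hedged sketch in that direction is below: it derives the scale-down commands from a single service list and prints them for review rather than executing them (pipe the output to `bash` to actually run them). `SERVICES` and `scale_down_commands` are illustrative names, not an existing tool.

```shell
#!/usr/bin/env bash
set -euo pipefail

CONTEXT="${CONTEXT:-lendos-stg/SREOperatorAccess}"
# Each service lives in a namespace of the same name (see the list above)
SERVICES=(lendos-portal signing-portal pqs daml-http-json)

scale_down_commands() {
  local svc
  for svc in "${SERVICES[@]}"; do
    echo "kubectl scale deployment $svc --replicas=0 --context=$CONTEXT -n $svc"
  done
}

scale_down_commands
```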
3.1.2 Optional: Restart Canton (Performance Optimization)​
Restarting Canton components frees up GC space and can improve upgrade performance:
```bash
# Delete pods (they will automatically restart)
kubectl delete pod -l app=canton-participant --context=$CONTEXT -n canton-participant
kubectl delete pod -l app=canton-domain --context=$CONTEXT -n canton-domain

# Wait for pods to come back up
kubectl wait --for=condition=ready pod -l app=canton-participant --context=$CONTEXT -n canton-participant --timeout=300s
kubectl wait --for=condition=ready pod -l app=canton-domain --context=$CONTEXT -n canton-domain --timeout=300s
```
3.1.3 Take RDS Snapshot​
CRITICAL: Always take a snapshot before proceeding. This is your rollback point.
1. Log into the AWS Console
2. Navigate to RDS → Databases
3. Select the appropriate Canton Aurora cluster:
   - Staging (lendos-stg): `canton-0-1-54`
   - Production (bxci-prod): `canton-20241104-0324`
4. Actions → Take Snapshot
5. Name: `canton-upgrade-2025-4-0-YYYYMMDD-HHMM` (no dots - use dashes)
6. WAIT for the snapshot to complete (Status: Available)
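A hypothetical helper for the naming rule: RDS snapshot identifiers cannot contain dots, so the release version is flattened to dashes before the timestamp is appended. `snapshot_name` is an illustrative name, not an existing script.

```shell
#!/usr/bin/env bash
set -euo pipefail

snapshot_name() {
  local version="$1" stamp="$2"  # stamp e.g. "$(date +%Y%m%d-%H%M)"
  # ${version//./-} replaces every dot with a dash
  echo "canton-upgrade-${version//./-}-${stamp}"
}

snapshot_name "2025.4.0" "20250101-0900"
# → canton-upgrade-2025-4-0-20250101-0900
```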
Verification:
```bash
# Verify snapshot status via CLI
aws rds describe-db-cluster-snapshots \
  --db-cluster-snapshot-identifier canton-upgrade-2025-4-0-YYYYMMDD-HHMM \
  --region us-east-1 \
  --query 'DBClusterSnapshots[0].Status'
```
Do NOT proceed until status is "available".
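Rather than rerunning the CLI query by hand, the wait can be looped. The sketch below is hedged and illustrative (`wait_for_available` is not an existing script): the status source is passed in as a command so the loop is reusable, and in real use you would wrap the `aws rds describe-db-cluster-snapshots` query above in a small function and pass its name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# wait_for_available STATUS_CMD [TRIES]
# STATUS_CMD is any command that prints the current status; the loop
# returns 0 once it prints "available".
wait_for_available() {
  local status_cmd="$1" tries="${2:-60}" i status
  for ((i = 0; i < tries; i++)); do
    status="$("$status_cmd")"
    if [ "$status" = "available" ]; then
      echo "snapshot is available"
      return 0
    fi
    echo "snapshot status: $status (attempt $((i + 1))/$tries)"
    sleep "${SLEEP_SECS:-30}"
  done
  echo "timed out waiting for snapshot" >&2
  return 1
}
```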
3.2 Execute DAML Upgrade​
Location: lendos-eks-workloads/charts/daml-upgrade/
```bash
cd lendos-eks-workloads/charts/daml-upgrade/

# Run upgrade script
# Format: ./run-upgrade.sh --from X.X.XX --to X.X.XX <environment>
# Environment: "staging" or "bxci" (production)

# Example for staging
./run-upgrade.sh --from 0.1.92 --to 0.1.93 staging

# Example for production
./run-upgrade.sh --from 0.1.92 --to 0.1.93 bxci
```
What Happens: The 7 Stages​
The script executes 7 stages automatically via Helm jobs:
1. upload-dars - Uploads new DAR files to the ledger
2. init-upgrade-coordinator - Initializes the upgrade coordinator contract
3. initialize-upgraders - Creates upgrader contracts (runs in parallel)
4. upgrade-consent - Obtains consent from all parties for the upgrade
5. upgrade - Executes the actual contract migration (runs in parallel)
6. cleanup-upgraders - Removes upgrader contracts (runs in parallel)
7. cleanup-coordinator - Removes the upgrade coordinator contract
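For illustration only (run-upgrade.sh drives these stages itself via Helm jobs), the loop below documents the order and shows how the per-stage rerun flag maps onto it:

```shell
#!/usr/bin/env bash
set -euo pipefail

# The seven stages in the order run-upgrade.sh sequences them
STAGES=(upload-dars init-upgrade-coordinator initialize-upgraders
        upgrade-consent upgrade cleanup-upgraders cleanup-coordinator)

# Print the equivalent per-stage rerun command for each stage
for stage in "${STAGES[@]}"; do
  echo "./run-upgrade.sh --from 0.1.92 --to 0.1.93 staging --stage $stage"
done
```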
Important: The Helm chart is idempotent. If a stage fails, you can rerun it:
```bash
# Rerun a specific stage
./run-upgrade.sh --from 0.1.92 --to 0.1.93 staging --stage upload-dars
```
Monitor Progress​
View all upgrade jobs:
```bash
kubectl get jobs \
  --context=$CONTEXT \
  -n lendos-portal \
  -l io.lendos.app/current-version

# Watch for completion
kubectl get jobs \
  --context=$CONTEXT \
  -n lendos-portal \
  -l io.lendos.app/current-version \
  --watch
```
View logs for a specific stage:
```bash
# Example: upload-dars stage
kubectl logs \
  --context=$CONTEXT \
  -n lendos-portal \
  -l io.lendos.app/upgrade-stage=upload-dars \
  --follow

# Other stages: init-upgrade-coordinator, initialize-upgraders,
# upgrade-consent, upgrade, cleanup-upgraders, cleanup-coordinator
```
Expected Timeline:
- Staging: 30-45 minutes for all stages
- Production: 1-1.5 hours for all stages
3.3 Post-Upgrade Steps​
3.3.1 Sync ArgoCD Applications​
ArgoCD will deploy the new service images based on the updated values-tags.yaml.
Using ArgoCD UI:
1. Navigate to the ArgoCD dashboard
2. Sync the `lendos-portal` application
3. Sync the `signing-portal` application
4. Wait for the sync to complete (status: Healthy & Synced)
Using ArgoCD CLI:
```bash
# Sync both applications
argocd app sync lendos-portal --prune
argocd app sync signing-portal --prune

# Wait for sync to complete
argocd app wait lendos-portal --health
argocd app wait signing-portal --health
```
3.3.2 Scale Up Services​
Restore services to normal replica counts:
```bash
# Adjust replica counts based on environment (these are examples)
kubectl scale deployment pqs --replicas=1 --context=$CONTEXT -n pqs
kubectl scale deployment daml-http-json --replicas=1 --context=$CONTEXT -n daml-http-json
kubectl scale deployment lendos-portal --replicas=2 --context=$CONTEXT -n lendos-portal
kubectl scale deployment signing-portal --replicas=2 --context=$CONTEXT -n signing-portal

# Verify pods are starting (check each namespace)
kubectl get pods --context=$CONTEXT -n lendos-portal --watch
kubectl get pods --context=$CONTEXT -n signing-portal
kubectl get pods --context=$CONTEXT -n pqs
kubectl get pods --context=$CONTEXT -n daml-http-json
```
Note: If services were already scaled up by ArgoCD sync, they may already be running. Verify current state:
```bash
kubectl get deployments --context=$CONTEXT -n lendos-portal
kubectl get deployments --context=$CONTEXT -n signing-portal
kubectl get deployments --context=$CONTEXT -n pqs
kubectl get deployments --context=$CONTEXT -n daml-http-json
```
3.3.3 Validate Services​
Basic Validation (Current Process):
```bash
# Check all pods are running and ready
kubectl get pods --context=$CONTEXT -n lendos-portal
kubectl get pods --context=$CONTEXT -n signing-portal
kubectl get pods --context=$CONTEXT -n pqs
kubectl get pods --context=$CONTEXT -n daml-http-json

# Check backend logs for successful DAR loading
kubectl logs deployment/lendos-portal-backend --context=$CONTEXT -n lendos-portal --tail=100
kubectl logs deployment/signing-portal-backend --context=$CONTEXT -n signing-portal --tail=100
```
Key Success Indicators:
- All pods show "Running" status with "Ready" containers
- Backend logs show successful connection to ledger
- Backend logs show new DARs are available
- No error messages in startup logs
Known Gap: Current process relies on "backends started successfully" as primary validation. Future automation (Story #5828) will add:
- PQS contract count comparison (before/after)
- Automated health check verification
- Frontend availability checks
- More comprehensive validation
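As a sketch of the Story #5828 idea, the core comparison is simple: fail loudly if the PQS active-contract count after the upgrade differs from the count taken before. How the counts are obtained (today a manual PQS query) is left to the operator; `compare_contract_counts` is an illustrative name, not existing tooling.

```shell
#!/usr/bin/env bash
set -euo pipefail

compare_contract_counts() {
  local before="$1" after="$2"
  if [ "$before" -ne "$after" ]; then
    echo "MISMATCH: $before contracts before upgrade, $after after" >&2
    return 1
  fi
  echo "OK: $after contracts (unchanged)"
}

compare_contract_counts 1500 1500
# → OK: 1500 contracts (unchanged)
```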
3.4 Troubleshooting & Recovery​
Common Issues​
Issue: Upgrade stage fails
1. Check logs for the failed stage:
   ```bash
   kubectl logs -l io.lendos.app/upgrade-stage=<stage-name> --context=$CONTEXT -n lendos-portal
   ```
2. Identify the error (common: timeouts, connection issues, data validation)
3. Fix the underlying issue (if possible)
4. Rerun the specific stage:
   ```bash
   ./run-upgrade.sh --from X.X.XX --to X.X.XX <env> --stage <stage-name>
   ```
Issue: Services won't start after upgrade
- Check backend logs for DAR loading errors
- Verify upgrade completed all stages successfully
- Check the Canton participant is healthy:
  ```bash
  kubectl logs deployment/canton-participant --context=$CONTEXT -n canton-participant --tail=100
  ```
Issue: Data validation fails
- Check PQS for contract counts (manual query)
- Coordinate with product team on specific data issues
- May require fix-forward approach with additional migration
Rollback Strategies​
Preferred: Fix-Forward
DAML's contract nature makes forward fixes preferable to rollbacks:
- Identify the issue
- Create a fix in a hotfix branch
- Build and deploy the fix
- May require additional migration step if contract-related
Catastrophic Failure: RDS Restore
Use only if the upgrade is unrecoverable:
1. Stop all services (prevent further writes):
   ```bash
   kubectl scale deployment lendos-portal --replicas=0 --context=$CONTEXT -n lendos-portal
   kubectl scale deployment signing-portal --replicas=0 --context=$CONTEXT -n signing-portal
   kubectl scale deployment pqs --replicas=0 --context=$CONTEXT -n pqs
   kubectl scale deployment daml-http-json --replicas=0 --context=$CONTEXT -n daml-http-json
   kubectl delete pod -l app=canton-participant --context=$CONTEXT -n canton-participant
   ```
2. Restore the RDS snapshot via the AWS Console:
   - Navigate to the snapshot taken in step 3.1.3
   - Actions → Restore Snapshot
   - Restore to the same cluster identifier
   - Wait for the restore to complete (can take 15-30 minutes)
3. Revert service image tags:
   - Revert the values-tags.yaml PR in lendos-eks-workloads
   - Sync ArgoCD applications with the previous tags
4. Scale services back up:
   ```bash
   kubectl scale deployment pqs --replicas=1 --context=$CONTEXT -n pqs
   kubectl scale deployment daml-http-json --replicas=1 --context=$CONTEXT -n daml-http-json
   kubectl scale deployment lendos-portal --replicas=2 --context=$CONTEXT -n lendos-portal
   kubectl scale deployment signing-portal --replicas=2 --context=$CONTEXT -n signing-portal
   ```
5. Verify services are healthy with the previous version
Why this works: Service images themselves haven't fundamentally changed in a way that breaks compatibility - it's the DAML contracts that were upgraded. Rolling back the database restores the old contract state.
Section 4: Dev Environment (Special Case)​
The Dev environment follows a different process than Staging/Production.
Key Differences:
- Does NOT run DAML upgrades
- Manual sync-based deployment (not auto-deploy)
- Data reseeding via the `data-loader-service` pre-sync hook (idempotent)
- Used for testing features before release branches are cut
Dev Deployment Process​
1. Product requests sync (when they want to test merged features)
2. Generate a changelog of merged items since the last sync
3. Manually sync the ArgoCD apps:
   - `workload-lendos-eks-cluster-lendos-portal`
   - `workload-lendos-eks-cluster-signing-portal`
4. Image tags:
   - Typical approach: Sync to `main` branch tags (latest stable)
   - Alternative: Use specific tags if testing pre-release features
   - Format: `deploy/dev-YYYY.R.H.BUILD` (e.g., `deploy/dev-2025.4.0.3392`)
   - Current limitation: Both portals currently use the same tag, which can be problematic if services diverge. This needs improvement for independent service versioning.
5. Verify deployment:
   - Check that items from the "Waiting to Sync" column are deployed
   - Test in the dev environment
   - Move items on the project board: https://github.com/orgs/rplendos/projects/32/views/28
6. Data reseeding: The `data-loader-service` runs as a pre-sync hook and reloads test data automatically (idempotent)
Section 5: Reference Information​
Services Involved in Release​
LendOS Portal:
- Backend: Separate image tag (e.g., `v0.0.829`)
- Frontend: Separate image tag (e.g., `v0.0.567`)
- DAML DARs: Included in the DAML upgrade
Signing Portal:
- Backend: Separate image tag
- Frontend: Separate image tag
Supporting Services:
- `pqs` - Participant Query Store
- `daml-http-json` - JSON API for DAML
- `canton-participant` - Canton ledger participant node
- `canton-domain` - Canton domain (sequencer) node
- `data-loader-service` - Test data loader (dev environment only)
Version Formats​
Current State:
- Release Version (CalVer): `2025.4.0` (in the VERSION file)
- Auto-generated Semver Tags: `v0.0.829` (backend), `v0.0.567` (frontend)
- DAML Upgrade Images: `v0.1.93.0.1.92.1-runner`, `v0.1.93.0.1.92.1-init`
Desired Future State (per BRANCHING.md):
- Everything CalVer: `vYYYY.R.H`
  - `YYYY`: Year (2025)
  - `R`: Release counter (1, 2, 3...)
  - `H`: Hotfix counter (0, 1, 2...)
- RC Tags: `v2025.4.0-rc.1`, `v2025.4.0-rc.2`
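A hedged sketch of validating the desired `vYYYY.R.H` format, including `-rc.N` tags. The regex is illustrative, derived from the list above, and is not taken from any existing tooling.

```shell
#!/usr/bin/env bash
set -euo pipefail

is_calver_tag() {
  # vYYYY.R.H with an optional -rc.N suffix, years restricted to 20xx
  local re='^v20[0-9]{2}\.[0-9]+\.[0-9]+(-rc\.[0-9]+)?$'
  [[ "$1" =~ $re ]]
}

is_calver_tag "v2025.4.0"      && echo "release tag ok"
is_calver_tag "v2025.4.0-rc.1" && echo "rc tag ok"
is_calver_tag "v0.0.829"       || echo "legacy semver tag rejected"
```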
Key Repositories​
lendos (Main Monorepo):
- All service code
- `VERSION` file at the root
- `services/portal-daml-upgrade/` - DAML upgrade tooling
- Branches: `release/YYYY.R.H`, `upgrade/YYYY.R.H`

lendos-eks-workloads:
- Helm charts and deployment configs
- `values-tags.yaml` files for staging/prod
- `charts/daml-upgrade/` - DAML upgrade Helm chart + `run-upgrade.sh`
Environment Details​
| Environment | Context | Canton Cluster | DAML Upgrades | URLs |
|---|---|---|---|---|
| Dev | lendos-dev/SREOperatorAccess | (dev cluster) | No (data reload instead) | portal.dev.app.lendos.io, sign.dev.lendos.io |
| Staging | lendos-stg/SREOperatorAccess | canton-0-1-54 | Yes | (staging URLs) |
| Production | bxci-prod/SREOperatorAccess | canton-20241104-0324 | Yes | (production URLs) |
Known Gaps & Future Improvements​
This runbook documents the current process as of 2025-11-10. Several improvements are planned:
Immediate (Epic Scope)​
- Story #5822: Pre-flight validation checklist - COMPLETED (see Preflight Checks)
- Story #5832: Improve upgrade branch management (directories, no generated code)
- Story #5824: Script service shutdown/startup operations (reduce manual errors)
- Story #5825: CalVer RC build pipeline (build on release branches)
- Story #5826: Promote RC artifacts (no build on main)
- Story #5827: Automate values-tags.yaml updates
- Story #5828: Post-upgrade validation script (PQS counts, health checks)
Future (Post-Epic)​
- Automated release branch protection (scope control)
- Release coordination dashboard
- Standardize on CalVer everywhere
- Ticket-based product approval workflow
Related Documentation​
- `BRANCHING.md` - Gitflow process and CalVer format
- `services/portal-daml-upgrade/README.md` - DAML upgrade tooling instructions
- `lendos-eks-workloads/charts/daml-upgrade/README.md` - Helm chart documentation
- `docs/versioning-and-release.md` - Planned future versioning workflow
Feedback & Improvements​
This runbook is a living document. If you encounter issues, unclear instructions, or have suggestions for improvement:
- Document the issue during your release execution
- Create a GitHub issue with the label `documentation`
- Propose specific improvements based on real experience
Last Updated: 2025-11-10
Epic: #5818 - Enable Independent Release Execution
Stories: #5820 (DAML Upgrade), #5821 (Release Branch to Deployment)