Release Execution Runbook

This runbook guides you through executing a complete LendOS release from branch creation through production deployment.

Target Audience: DevOps Engineers executing releases

Prerequisites:

  • AWS Console access for RDS operations
  • Kubernetes contexts configured:
    • lendos-stg/SREOperatorAccess
    • bxci-prod/SREOperatorAccess (Production)
    • lendos-dev/SREOperatorAccess
  • AWS CLI assume-role working for artifacts-prod/SREOperatorAccess
  • kubectl and k9s installed and working
  • GitHub Codespaces access
  • Access to Teams for product coordination (transitioning to ticket-based)

Overview: Release Phases​

A complete release follows this flow:

Phase 1: Preparation
├─ Cut release branch from develop
├─ Create DAML upgrade branch
└─ Build DAML upgrade Docker images

Phase 2: Staging Deployment
├─ Merge release branch to main (triggers artifact builds)
├─ Update deployment configurations (values-tags.yaml)
├─ Execute DAML upgrade in Staging
├─ Product smoke testing
└─ Product approval for production

Phase 3: Production Deployment
├─ Execute DAML upgrade in Production
├─ Product smoke testing
└─ Complete gitflow (merge to develop)

Typical Timeline:

  • Staging DAML upgrade: 30-45 minutes
  • Production DAML upgrade: 1-1.5 hours
  • Total release time: 2-3 hours (including coordination)

Phase 1: Preparation​

1.1 Cut Release Branch​

Release branches follow the CalVer format documented in BRANCHING.md.

# From develop (usually HEAD)
git checkout develop
git pull origin develop

# Create release branch (format: release/YYYY.R.H)
# Example: release/2025.4.0
git checkout -b release/2025.4.0

1.2 Update VERSION File​

CURRENT PROCESS (Manual - to be automated in future stories):

Edit the VERSION file at repository root to match the release version:

# Edit VERSION file
echo "2025.4.0" > VERSION

git add VERSION
git commit -m "Bump VERSION to 2025.4.0 for release"
git push origin release/2025.4.0

Why this matters: The VERSION file is the source of truth for the release version. Future automation will enforce this matches the branch name.
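
Until that automation lands, a quick local sanity check can catch a mismatch before it reaches preflight. This is a sketch; `check_version_matches_branch` is a hypothetical helper, not part of the repo:

```shell
# Hypothetical helper: succeeds only when VERSION equals the
# release branch suffix (release/2025.4.0 -> 2025.4.0).
check_version_matches_branch() {
  local branch="$1" version="$2"
  case "$branch" in
    release/*) [ "${branch#release/}" = "$version" ] ;;
    *) return 1 ;;
  esac
}

# Usage from the repository root:
# check_version_matches_branch "$(git rev-parse --abbrev-ref HEAD)" "$(cat VERSION)" \
#   && echo "VERSION matches branch" || echo "MISMATCH - fix before pushing"
```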

1.3 Create DAML Upgrade Branch​

The DAML upgrade process requires a separate branch to generate upgrade code.

# From the release branch
git checkout -b upgrade/2025.4.0

Follow the instructions in services/portal-daml-upgrade/README.md:

cd services/portal-daml-upgrade

# Generate upgrade code
# Format: ./generate-upgrade.sh <old-version> <new-version> <old-data-loader-version>
# Example versions - adjust based on your actual current/target versions
./generate-upgrade.sh 0.1.92 0.1.93 0.1.92

Important Notes:

  • The generate-upgrade.sh script regenerates files in src/main/daml/Generated
  • Files in src/main/daml/Upgrade and src/test/daml won't be overwritten if they exist
  • Test locally using the local setup process documented in the README
  • CURRENT PROCESS: Upgrade branch is NOT merged back (to be improved in Story #5832)

1.4 Build and Push DAML Upgrade Docker Images​

Prerequisites:

  • Launch a GitHub Codespace (or use local environment with AWS access)
  • Assume the artifacts-prod/SREOperatorAccess role

# Checkout upgrade branch in codespace
git fetch origin
git checkout upgrade/2025.4.0

cd services/portal-daml-upgrade

# Assume AWS role for ECR access
assume artifacts-prod/SREOperatorAccess

# Build and push images (creates both -init and -runner variants)
./docker-build.sh --push

Expected Output:

  • Images tagged as portal-daml-upgrade:{target_base}.{build}-init and portal-daml-upgrade:{target_base}.{build}-runner
  • Example: portal-daml-upgrade:2025.8.1.1-init, portal-daml-upgrade:2025.8.1.1-runner

Verification:

# Verify images are in ECR
aws ecr describe-images \
--repository-name right-pedal/portal-daml-upgrade \
--region us-east-1 \
--query 'imageDetails[*].imageTags' \
--output table

Phase 2: Staging Deployment​

2.1 Run Preflight Checks​

Before merging, run preflight checks to validate the release branch:

# Run locally (basic validation)
lendos preflight

# Run against the PR (full validation with milestone and approval checks)
lendos preflight --pr <PR_NUMBER>

Checks performed:

  • VERSION file matches release branch
  • No generated code committed
  • Milestone complete (all issues closed)
  • Product team approval on PR

See Preflight Checks Documentation for details.

GitHub Actions: The workflow runs automatically on release PRs to main and must pass before merge.

2.2 Merge Release Branch to Main​

CURRENT PROCESS (To be improved in Stories #5825, #5826):

# Create PR: release/2025.4.0 → main
gh pr create \
--base main \
--head release/2025.4.0 \
--title "Release 2025.4.0" \
--body "Release 2025.4.0 - includes [list key features/fixes]"

# Once approved and preflight passes, merge
gh pr merge --merge

What Happens Next:

  • GitHub Actions automatically builds all changed services
  • Uses codacy/git-version to generate semver tags (e.g., v0.0.829, v0.0.567)
  • Publishes images to ECR

Known Pain Point: We rebuild artifacts on main that were never tested. Future process will build RC artifacts on release branches and promote them (Stories #5825, #5826).

2.3 Identify Built Artifact Tags​

CURRENT PROCESS (Manual - to be automated in Story #5827):

After the main build completes, identify the generated tags:

# Check GitHub Actions for build completion
gh run list --branch main --limit 5

# Check ECR for latest tags
aws ecr describe-images \
--repository-name right-pedal/lendos-portal-backend \
--region us-east-1 \
--query 'sort_by(imageDetails,&imagePushedAt)[-5:].imageTags' \
--output table

Record the tags for:

  • LendOS Portal Backend
  • LendOS Portal Frontend
  • Signing Portal Backend
  • Signing Portal Frontend
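
The lookups can be scripted in one pass. The backend repository name comes from the example above; the other three repository names are assumptions, so verify them against the actual ECR layout:

```shell
# Print the most recently pushed tag for a repository.
latest_ecr_tag() {
  aws ecr describe-images \
    --repository-name "$1" \
    --region us-east-1 \
    --query 'sort_by(imageDetails,&imagePushedAt)[-1].imageTags[0]' \
    --output text
}

# Usage (repository names below are assumed - verify against ECR):
# for repo in right-pedal/lendos-portal-backend right-pedal/lendos-portal-frontend \
#             right-pedal/signing-portal-backend right-pedal/signing-portal-frontend; do
#   echo "$repo: $(latest_ecr_tag "$repo")"
# done
```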

2.4 Update values-tags.yaml for Staging​

Location: lendos-eks-workloads repository

CURRENT PROCESS (Manual - to be automated in Story #5827):

# Clone/update lendos-eks-workloads repo
cd /path/to/lendos-eks-workloads

git checkout main
git pull origin main
git checkout -b release/2025.4.0-values

# Edit values-tags files
# Update for staging first
# Files: values-tags-staging.yaml or similar (exact paths may vary)

Update the image tags for all services to the tags identified in the previous section.
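
If the values files follow a regular layout, the edits can be scripted. This is a sketch assuming yq v4 and a hypothetical `<service>.image.tag` structure; check the real structure of the values files first:

```shell
# Hypothetical: set <service>.image.tag in a values file (yq v4 syntax).
update_tag() {
  local file="$1" service="$2" tag="$3"
  yq -i ".${service}.image.tag = \"${tag}\"" "$file"
}

# Example (file name and YAML path are assumptions):
# update_tag values-tags-staging.yaml lendosPortalBackend v0.0.829
```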

git add .
git commit -m "Update values-tags for release 2025.4.0"
git push origin release/2025.4.0-values

# Create PR
gh pr create \
--base main \
--title "Update tags for release 2025.4.0" \
--body "Updates image tags for release 2025.4.0 deployment"

# Get approval and merge

2.5 Execute DAML Upgrade in Staging​

See detailed instructions in Section 3: DAML Upgrade Procedure below

Use environment: Staging

  • Kubernetes context: lendos-stg/SREOperatorAccess
  • Canton Aurora cluster: canton-0-1-54

2.6 Product Smoke Testing & Approval​

Communication: Teams (transitioning to ticket-based approval)
Approver: Usually Luis (or designated backup)

Product Team Actions:

  • Execute smoke tests against staging environment
  • Verify key workflows function correctly
  • Approve for production deployment

DevOps Actions:

  • Monitor for any issues reported during smoke testing
  • Be available to address any problems discovered
  • Document any issues and resolutions

Phase 3: Production Deployment​

3.1 Update values-tags.yaml for Production​

Follow the same process used to update values-tags.yaml for staging, but edit the production values files instead.

3.2 Execute DAML Upgrade in Production​

See detailed instructions in Section 3: DAML Upgrade Procedure below

Use environment: Production (BXCI)

  • Kubernetes context: bxci-prod/SREOperatorAccess
  • Canton Aurora cluster: canton-20241104-0324

3.3 Product Smoke Testing​

Product team repeats smoke tests in production environment.

3.4 Complete Gitflow Process​

CURRENT PROCESS:

Per BRANCHING.md, after the release is merged to main, we need to sync those changes back to develop. We do this by opening a PR from main to develop:

# Create PR: main → develop
gh pr create \
--base develop \
--head main \
--title "Sync release 2025.4.0 to develop" \
--body "Syncs changes from release 2025.4.0 (now in main) back to develop"

# Review and merge
gh pr merge --merge

This ensures develop gets the actual merge commit from main, keeping the branches properly synchronized.

Cleanup Decisions:

  • Release branch: Can be deleted or kept for reference after both merges
  • Upgrade branch: CURRENT PROCESS stays open (to be improved in Story #5832)

Section 3: DAML Upgrade Procedure​

This section provides detailed instructions for executing a DAML upgrade in any environment (Staging or Production).

Prerequisites​

Before starting the upgrade:

  1. Product Approval Obtained

    • For Staging: Ready for smoke test
    • For Production: Staging smoke test passed, approved for production
  2. Access Verified

    • Correct Kubernetes context configured
    • AWS console access for RDS snapshots
    • lendos-eks-workloads repository cloned locally
  3. DAML Upgrade Images Built

    • Images from Phase 1.4 available in ECR
    • Image tags known (e.g., v0.1.93.0.1.92.1-init, v0.1.93.0.1.92.1-runner)

3.1 Pre-Upgrade Steps​

3.1.1 Scale Down Services​

Services must be stopped before the upgrade to prevent contract conflicts.

Using kubectl:

# Set context (adjust for environment)
CONTEXT="lendos-stg/SREOperatorAccess" # or bxci-prod/SREOperatorAccess for production

# Scale to 0 (order doesn't matter)
kubectl scale deployment lendos-portal --replicas=0 --context=$CONTEXT -n lendos-portal
kubectl scale deployment signing-portal --replicas=0 --context=$CONTEXT -n signing-portal
kubectl scale deployment pqs --replicas=0 --context=$CONTEXT -n pqs
kubectl scale deployment daml-http-json --replicas=0 --context=$CONTEXT -n daml-http-json

# Verify all pods are terminated
kubectl get pods --context=$CONTEXT -n lendos-portal
kubectl get pods --context=$CONTEXT -n signing-portal
kubectl get pods --context=$CONTEXT -n pqs
kubectl get pods --context=$CONTEXT -n daml-http-json

Using k9s (Alternative):

k9s --context=$CONTEXT

# Navigate to deployments (:deployments)
# Use namespace filters to find each service
# Select each service and scale to 0 (s key)

Services to scale down (each in their own namespace):

  • lendos-portal (namespace: lendos-portal)
  • signing-portal (namespace: signing-portal)
  • pqs (namespace: pqs) - Participant Query Store
  • daml-http-json (namespace: daml-http-json)
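
Until Story #5824 scripts these operations, a small loop reduces manual errors. This sketch relies on each deployment name matching its namespace, as in the list above:

```shell
# Scale all release-affected services at once (deployment name == namespace).
scale_all() {
  local replicas="$1" svc
  for svc in lendos-portal signing-portal pqs daml-http-json; do
    kubectl scale deployment "$svc" --replicas="$replicas" \
      --context="$CONTEXT" -n "$svc"
  done
}

# Usage:
# scale_all 0   # scale everything down before the upgrade
```

Note that scale-up uses different replica counts per service, so this helper only fits the scale-down direction as written.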

3.1.2 Optional: Restart Canton (Performance Optimization)​

Restarting Canton components frees up GC space and can improve upgrade performance:

# Delete pods (they will automatically restart)
kubectl delete pod -l app=canton-participant --context=$CONTEXT -n canton-participant
kubectl delete pod -l app=canton-domain --context=$CONTEXT -n canton-domain

# Wait for pods to come back up
kubectl wait --for=condition=ready pod -l app=canton-participant --context=$CONTEXT -n canton-participant --timeout=300s
kubectl wait --for=condition=ready pod -l app=canton-domain --context=$CONTEXT -n canton-domain --timeout=300s

3.1.3 Take RDS Snapshot​

CRITICAL: Always take a snapshot before proceeding. This is your rollback point.

  1. Log into AWS Console
  2. Navigate to RDS → Databases
  3. Select the appropriate Canton Aurora cluster:
    • Staging (lendos-stg): canton-0-1-54
    • Production (bxci-prod): canton-20241104-0324
  4. Actions → Take Snapshot
  5. Name: canton-upgrade-2025-4-0-YYYYMMDD-HHMM (no dots - use dashes)
  6. WAIT for snapshot to complete (Status: Available)

Verification:

# Verify snapshot status via CLI
aws rds describe-db-cluster-snapshots \
--db-cluster-snapshot-identifier canton-upgrade-2025-4-0-YYYYMMDD-HHMM \
--region us-east-1 \
--query 'DBClusterSnapshots[0].Status'

Do NOT proceed until status is "available".
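
Instead of polling by hand, the AWS CLI waiter blocks until the snapshot reaches Available (it polls on an interval and exits non-zero on timeout):

```shell
# Block until the cluster snapshot reaches "available".
wait_for_snapshot() {
  aws rds wait db-cluster-snapshot-available \
    --db-cluster-snapshot-identifier "$1" \
    --region us-east-1
}

# Usage:
# wait_for_snapshot canton-upgrade-2025-4-0-YYYYMMDD-HHMM && echo "Snapshot available - proceed"
```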

3.2 Execute DAML Upgrade​

Location: lendos-eks-workloads/charts/daml-upgrade/

cd lendos-eks-workloads/charts/daml-upgrade/

# Run upgrade script
# Format: ./run-upgrade.sh --from X.X.XX --to X.X.XX <environment>
# Environment: "staging" or "bxci" (production)

# Example for staging
./run-upgrade.sh --from 0.1.92 --to 0.1.93 staging

# Example for production
./run-upgrade.sh --from 0.1.92 --to 0.1.93 bxci

What Happens: The 7 Stages​

The script executes 7 stages automatically via Helm jobs:

  1. upload-dars - Uploads new DAR files to the ledger
  2. init-upgrade-coordinator - Initializes the upgrade coordinator contract
  3. initialize-upgraders - Creates upgrader contracts (runs in parallel)
  4. upgrade-consent - Obtains consent from all parties for the upgrade
  5. upgrade - Executes the actual contract migration (runs in parallel)
  6. cleanup-upgraders - Removes upgrader contracts (runs in parallel)
  7. cleanup-coordinator - Removes upgrade coordinator contract

Important: The Helm chart is idempotent. If a stage fails, you can rerun it:

# Rerun a specific stage
./run-upgrade.sh --from 0.1.92 --to 0.1.93 staging --stage upload-dars

Monitor Progress​

View all upgrade jobs:

kubectl get jobs \
--context=$CONTEXT \
-n lendos-portal \
-l io.lendos.app/current-version

# Watch for completion
kubectl get jobs \
--context=$CONTEXT \
-n lendos-portal \
-l io.lendos.app/current-version \
--watch

View logs for a specific stage:

# Example: upload-dars stage
kubectl logs \
--context=$CONTEXT \
-n lendos-portal \
-l io.lendos.app/upgrade-stage=upload-dars \
--follow

# Other stages: init-upgrade-coordinator, initialize-upgraders,
# upgrade-consent, upgrade, cleanup-upgraders, cleanup-coordinator
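
To skim all seven stages at once, a loop over the stage labels works. This is a convenience sketch using the stage names listed above:

```shell
# Print the tail of each upgrade stage's job logs, in stage order.
tail_stage_logs() {
  local stage
  for stage in upload-dars init-upgrade-coordinator initialize-upgraders \
      upgrade-consent upgrade cleanup-upgraders cleanup-coordinator; do
    echo "=== $stage ==="
    kubectl logs --context="$CONTEXT" -n lendos-portal \
      -l io.lendos.app/upgrade-stage="$stage" --tail=20
  done
}

# Usage:
# tail_stage_logs
```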

Expected Timeline:

  • Staging: 30-45 minutes for all stages
  • Production: 1-1.5 hours for all stages

3.3 Post-Upgrade Steps​

3.3.1 Sync ArgoCD Applications​

ArgoCD will deploy the new service images based on the updated values-tags.yaml.

Using ArgoCD UI:

  1. Navigate to ArgoCD dashboard
  2. Sync lendos-portal application
  3. Sync signing-portal application
  4. Wait for sync to complete (status: Healthy & Synced)

Using ArgoCD CLI:

# Sync both applications
argocd app sync lendos-portal --prune
argocd app sync signing-portal --prune

# Wait for sync to complete
argocd app wait lendos-portal --health
argocd app wait signing-portal --health

3.3.2 Scale Up Services​

Restore services to normal replica counts:

# Adjust replica counts based on environment (these are examples)
kubectl scale deployment pqs --replicas=1 --context=$CONTEXT -n pqs
kubectl scale deployment daml-http-json --replicas=1 --context=$CONTEXT -n daml-http-json
kubectl scale deployment lendos-portal --replicas=2 --context=$CONTEXT -n lendos-portal
kubectl scale deployment signing-portal --replicas=2 --context=$CONTEXT -n signing-portal

# Verify pods are starting (check each namespace)
kubectl get pods --context=$CONTEXT -n lendos-portal --watch
kubectl get pods --context=$CONTEXT -n signing-portal
kubectl get pods --context=$CONTEXT -n pqs
kubectl get pods --context=$CONTEXT -n daml-http-json

Note: If services were already scaled up by ArgoCD sync, they may already be running. Verify current state:

kubectl get deployments --context=$CONTEXT -n lendos-portal
kubectl get deployments --context=$CONTEXT -n signing-portal
kubectl get deployments --context=$CONTEXT -n pqs
kubectl get deployments --context=$CONTEXT -n daml-http-json

3.3.3 Validate Services​

Basic Validation (Current Process):

# Check all pods are running and ready
kubectl get pods --context=$CONTEXT -n lendos-portal
kubectl get pods --context=$CONTEXT -n signing-portal
kubectl get pods --context=$CONTEXT -n pqs
kubectl get pods --context=$CONTEXT -n daml-http-json

# Check backend logs for successful DAR loading
kubectl logs deployment/lendos-portal-backend --context=$CONTEXT -n lendos-portal --tail=100
kubectl logs deployment/signing-portal-backend --context=$CONTEXT -n signing-portal --tail=100

Key Success Indicators:

  • All pods show "Running" status with "Ready" containers
  • Backend logs show successful connection to ledger
  • Backend logs show new DARs are available
  • No error messages in startup logs

Known Gap: Current process relies on "backends started successfully" as primary validation. Future automation (Story #5828) will add:

  • PQS contract count comparison (before/after)
  • Automated health check verification
  • Frontend availability checks
  • More comprehensive validation
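
In the meantime, the pod-status portion of that validation is easy to script. This sketch only checks pod phase, not application health:

```shell
# Succeed only if every pod in the namespace reports Running.
check_ns_ready() {
  ! kubectl get pods --context="$CONTEXT" -n "$1" --no-headers \
    | awk '{print $3}' | grep -qv Running
}

# Usage:
# for ns in lendos-portal signing-portal pqs daml-http-json; do
#   check_ns_ready "$ns" && echo "$ns: OK" || echo "$ns: NOT READY"
# done
```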

3.4 Troubleshooting & Recovery​

Common Issues​

Issue: Upgrade stage fails

  1. Check logs for the failed stage:

    kubectl logs -l io.lendos.app/upgrade-stage=<stage-name> --context=$CONTEXT -n lendos-portal
  2. Identify the error (common: timeouts, connection issues, data validation)

  3. Fix the underlying issue (if possible)

  4. Rerun the specific stage:

    ./run-upgrade.sh --from X.X.XX --to X.X.XX <env> --stage <stage-name>

Issue: Services won't start after upgrade

  1. Check backend logs for DAR loading errors
  2. Verify upgrade completed all stages successfully
  3. Check Canton participant is healthy:
    kubectl logs deployment/canton-participant --context=$CONTEXT -n canton-participant --tail=100

Issue: Data validation fails

  1. Check PQS for contract counts (manual query)
  2. Coordinate with product team on specific data issues
  3. May require fix-forward approach with additional migration

Rollback Strategies​

Preferred: Fix-Forward

Because active DAML contracts live on the ledger and cannot be edited in place, fixing forward is preferable to rolling back:

  1. Identify the issue
  2. Create a fix in a hotfix branch
  3. Build and deploy the fix
  4. May require additional migration step if contract-related

Catastrophic Failure: RDS Restore

Use only if the upgrade is unrecoverable:

  1. Stop all services (prevent further writes)

    kubectl scale deployment lendos-portal --replicas=0 --context=$CONTEXT -n lendos-portal
    kubectl scale deployment signing-portal --replicas=0 --context=$CONTEXT -n signing-portal
    kubectl scale deployment pqs --replicas=0 --context=$CONTEXT -n pqs
    kubectl scale deployment daml-http-json --replicas=0 --context=$CONTEXT -n daml-http-json
    kubectl delete pod -l app=canton-participant --context=$CONTEXT -n canton-participant
  2. Restore RDS snapshot via AWS Console:

    • Navigate to snapshot taken in step 3.1.3
    • Actions → Restore Snapshot
    • Restore to the same cluster identifier
    • Wait for restore to complete (can take 15-30 minutes)
  3. Revert service image tags:

    • Revert values-tags.yaml PR in lendos-eks-workloads
    • Sync ArgoCD applications with previous tags
  4. Scale services back up:

    kubectl scale deployment pqs --replicas=1 --context=$CONTEXT -n pqs
    kubectl scale deployment daml-http-json --replicas=1 --context=$CONTEXT -n daml-http-json
    kubectl scale deployment lendos-portal --replicas=2 --context=$CONTEXT -n lendos-portal
    kubectl scale deployment signing-portal --replicas=2 --context=$CONTEXT -n signing-portal
  5. Verify services are healthy with previous version

Why this works: Service images themselves haven't fundamentally changed in a way that breaks compatibility - it's the DAML contracts that were upgraded. Rolling back the database restores the old contract state.


Section 4: Dev Environment (Special Case)​

The Dev environment follows a different process than Staging/Production.

Key Differences:

  • Does NOT run DAML upgrades
  • Manual sync-based deployment (not auto-deploy)
  • Data reseeding via data-loader-service pre-sync hook (idempotent)
  • Used for testing features before release branches are cut

Dev Deployment Process​

  1. Product requests sync (when they want to test merged features)

  2. Generate changelog of merged items since last sync

  3. Manually sync ArgoCD apps:

    • workload-lendos-eks-cluster-lendos-portal
    • workload-lendos-eks-cluster-signing-portal
  4. Image tags:

    • Typical approach: Sync to main branch tags (latest stable)
    • Alternative: Use specific tags if testing pre-release features
    • Format: deploy/dev-YYYY.R.H.BUILD (e.g., deploy/dev-2025.4.0.3392)
    • Current limitation: Both portals currently use the same tag, which can be problematic if services diverge. This needs improvement for independent service versioning.
  5. Verify deployment: confirm pods are running and both ArgoCD apps show Healthy & Synced

  6. Data reseeding: The data-loader-service runs as a pre-sync hook and reloads test data automatically (idempotent)


Section 5: Reference Information​

Services Involved in Release​

LendOS Portal:

  • Backend: Separate image tag (e.g., v0.0.829)
  • Frontend: Separate image tag (e.g., v0.0.567)
  • DAML DARs: Included in DAML upgrade

Signing Portal:

  • Backend: Separate image tag
  • Frontend: Separate image tag

Supporting Services:

  • pqs - Participant Query Store
  • daml-http-json - JSON API for DAML
  • canton-participant - Canton ledger participant node
  • canton-domain - Canton domain (sequencer) node
  • data-loader-service - Test data loader (dev environment only)

Version Formats​

Current State:

  • Release Version (CalVer): 2025.4.0 (in VERSION file)
  • Auto-generated Semver Tags: v0.0.829 (backend), v0.0.567 (frontend)
  • DAML Upgrade Images: v0.1.93.0.1.92.1-runner, v0.1.93.0.1.92.1-init

Desired Future State (per BRANCHING.md):

  • Everything CalVer: vYYYY.R.H
    • YYYY: Year (2025)
    • R: Release counter (1, 2, 3...)
    • H: Hotfix counter (0, 1, 2...)
  • RC Tags: v2025.4.0-rc.1, v2025.4.0-rc.2
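
A tag can be checked against the target format with a simple pattern. This is a sketch of the desired future format, including RC suffixes:

```shell
# Match vYYYY.R.H with an optional -rc.N suffix (desired future format).
is_calver_tag() {
  printf '%s' "$1" | grep -Eq '^v[0-9]{4}\.[0-9]+\.[0-9]+(-rc\.[0-9]+)?$'
}

# Examples:
# is_calver_tag v2025.4.0       -> true
# is_calver_tag v2025.4.0-rc.1  -> true
# is_calver_tag v0.0.829        -> false (legacy semver)
```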

Key Repositories​

lendos (Main Monorepo):

  • All service code
  • VERSION file at root
  • services/portal-daml-upgrade/ - DAML upgrade tooling
  • Branches: release/YYYY.R.H, upgrade/YYYY.R.H

lendos-eks-workloads:

  • Helm charts and deployment configs
  • values-tags.yaml files for staging/prod
  • charts/daml-upgrade/ - DAML upgrade helm chart + run-upgrade.sh

Environment Details​

Environment | Context | Canton Cluster | DAML Upgrades | URLs
----------- | ------- | -------------- | ------------- | ----
Dev | lendos-dev/SREOperatorAccess | (dev cluster) | No (data reload instead) | portal.dev.app.lendos.io, sign.dev.lendos.io
Staging | lendos-stg/SREOperatorAccess | canton-0-1-54 | Yes | (staging URLs)
Production | bxci-prod/SREOperatorAccess | canton-20241104-0324 | Yes | (production URLs)

Known Gaps & Future Improvements​

This runbook documents the current process as of 2025-11-10. Several improvements are planned:

Immediate (Epic Scope)​

  • Story #5822: Pre-flight validation checklist - COMPLETED (see Preflight Checks)
  • Story #5823: Upgrade to Daml 3.5.5 upgrade tool (single image)
  • Story #5832: Improve upgrade branch management (directories, no generated code)
  • Story #5824: Script service shutdown/startup operations (reduce manual errors)
  • Story #5825: CalVer RC build pipeline (build on release branches)
  • Story #5826: Promote RC artifacts (no build on main)
  • Story #5827: Automate values-tags.yaml updates
  • Story #5828: Post-upgrade validation script (PQS counts, health checks)

Future (Post-Epic)​

  • Automated release branch protection (scope control)
  • Release coordination dashboard
  • Standardize on CalVer everywhere
  • Ticket-based product approval workflow

Related Documentation​

  • BRANCHING.md - Gitflow process and CalVer format
  • services/portal-daml-upgrade/README.md - DAML upgrade tooling instructions
  • lendos-eks-workloads/charts/daml-upgrade/README.md - Helm chart documentation
  • docs/versioning-and-release.md - Planned future versioning workflow

Feedback & Improvements​

This runbook is a living document. If you encounter issues, unclear instructions, or have suggestions for improvement:

  1. Document the issue during your release execution
  2. Create a GitHub issue with label documentation
  3. Propose specific improvements based on real experience

Last Updated: 2025-11-10
Epic: #5818 - Enable Independent Release Execution
Stories: #5820 (DAML Upgrade), #5821 (Release Branch to Deployment)