Canton Data Consistency Validation Runbook
This runbook guides you through validating Canton ledger data consistency after incident recovery, component restarts, or any event that might affect data integrity.
Target Audience: SRE Engineers, DevOps Engineers
When to Use:
- After restarting Canton components (sequencer, mediator, participant)
- After "No Data Showing" incidents
- After database connection pool exhaustion events
- After any incident affecting the ledger or http-json service
- Before declaring an incident resolved
Prerequisites
- `kubectl` configured for the target environment
- Tailscale connected (for PQS access)
- PQS readonly password from 1Password
PQS Connection
# Production
psql "host=pqs-db-bxci user=readonly dbname=lendos_pqs sslmode=require"
# Staging
psql "host=pqs-db-stg user=readonly dbname=lendos_pqs sslmode=require"
# Dev
psql "host=pqs-db-dev user=readonly dbname=lendos_pqs sslmode=require"
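The three environments differ only in host, so a small helper avoids retyping the conninfo. This is a sketch: the hostnames are the ones listed above, and `pqs_conninfo` is a hypothetical function name, not part of any tooling.

```shell
# Print the psql conninfo string for a given environment (prod|staging|dev).
# Hostnames match the commands above; adjust if your cluster differs.
pqs_conninfo() {
  case "$1" in
    prod)    host="pqs-db-bxci" ;;
    staging) host="pqs-db-stg"  ;;
    dev)     host="pqs-db-dev"  ;;
    *) echo "usage: pqs_conninfo prod|staging|dev" >&2; return 1 ;;
  esac
  echo "host=$host user=readonly dbname=lendos_pqs sslmode=require"
}

# Usage: psql "$(pqs_conninfo prod)"
```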
Phase 1: Canton Component Health
1.1 Check All Pods Are Running
# Check Canton domain components
kubectl get pods -n canton-domain
# Check Canton participant
kubectl get pods -n canton-participant
# Check daml-http-json
kubectl get pods -n daml-http-json
Expected: All pods Running with 1/1 ready. Recent restarts are acceptable if recovering from an incident.
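The three namespace checks can be folded into one pass that prints only problem pods. A sketch, assuming the default `kubectl get pods` column layout (NAME READY STATUS RESTARTS AGE); `unhealthy_pods` is a hypothetical helper, not a kubectl feature:

```shell
# Print pods that are not Running with all containers ready.
# Reads `kubectl get pods` table output on stdin.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, r, "/")                      # READY column, e.g. "1/1"
    if ($3 != "Running" || r[1] != r[2]) print $1
  }'
}

# Check every Canton-related namespace (guarded so the sketch
# can be sourced on a machine without cluster access):
if command -v kubectl >/dev/null 2>&1; then
  for ns in canton-domain canton-participant daml-http-json; do
    echo "== $ns =="
    kubectl get pods -n "$ns" | unhealthy_pods
  done
fi
```

Empty output under each namespace header means all pods are healthy.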
1.2 Verify Sequencer Is Processing Requests
kubectl logs -n canton-domain deploy/canton-domain-sequencer --tail=50 | grep -E "sends request|subscribes from"
Expected: Recent timestamps showing:
'PAR::lendos-participant::...' sends request with id 'tick-...'
'MED::lendos::...' sends request with id 'tick-...'
1.3 Verify Mediator Is Finalizing Transactions
kubectl logs -n canton-domain deploy/canton-domain-mediator --tail=50 | grep -E "Phase [256]|verdict"
Expected: Recent entries showing:
Phase 2: Registered request
Phase 5: Received responses
Phase 6: Finalized request with verdict Approve
1.4 Verify Participant Is Connected to Domain
kubectl logs -n canton-participant deploy/canton-participant --tail=100 | grep -iE "connected|subscription|domain|caught up"
Expected:
- `Connected to domain` or similar
- `Subscription started`
- No `disconnected` or `error` messages
1.5 Check for ACS Commitment Errors
# Check for commitment mismatch errors (indicates data divergence)
kubectl logs -n canton-participant deploy/canton-participant --tail=500 | grep -iE "COMMITMENT_MISMATCH|ACS.*error"
Expected: No matches. Any COMMITMENT_MISMATCH errors indicate serious data consistency issues requiring escalation.
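Because any match here means escalation, it helps to turn the grep into an explicit verdict. A sketch; the pattern mirrors the command above, and `acs_error_count` is a hypothetical helper:

```shell
# Count commitment/ACS error lines in participant log text on stdin.
acs_error_count() {
  grep -ciE "COMMITMENT_MISMATCH|ACS.*error" || true   # grep -c exits 1 on zero matches
}

# Usage (guarded so the sketch runs without cluster access):
if command -v kubectl >/dev/null 2>&1; then
  n=$(kubectl logs -n canton-participant deploy/canton-participant --tail=500 | acs_error_count)
  if [ "$n" -gt 0 ]; then
    echo "ESCALATE: $n commitment/ACS error lines found"
  else
    echo "OK: no commitment mismatches"
  fi
fi
```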
1.6 Verify HTTP JSON API Is Responding
# Check for errors
kubectl logs -n daml-http-json deploy/daml-http-json --tail=100 | grep -iE "error|exception|timeout|pool"
Expected: No recent errors, connection timeouts, or pool exhaustion messages.
Phase 2: Prometheus Metrics (Grafana)
Check these metrics in Grafana to ensure the system is healthy:
| Metric | Query | Healthy Value |
|---|---|---|
| Sequencer lag | canton_sequencer_client_delay_seconds | < 5 seconds |
| HikariCP utilization | (HikariPool_1_pool_ActiveConnections / HikariPool_1_pool_MaxConnections) * 100 | < 70% |
| Pending connections | HikariPool_1_pool_PendingConnections | 0 |
| Connection acquisition time | HikariPool_1_pool_ConnectionAcquisitionTime_mean | < 100ms |
Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| HikariCP utilization | >70% | >90% |
| Pending connections | >0 | >5 |
| Acquisition time | >100ms | >500ms |
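The utilization thresholds can be applied mechanically when eyeballing raw connection counts. A sketch; the 70%/90% cutoffs come from the table above, and `hikari_status` is a hypothetical helper:

```shell
# Classify HikariCP pool utilization per the thresholds above.
# Args: active connection count, max connection count.
hikari_status() {
  active=$1; max=$2
  pct=$(( active * 100 / max ))          # integer percentage
  if   [ "$pct" -gt 90 ]; then echo "CRITICAL (${pct}%)"
  elif [ "$pct" -gt 70 ]; then echo "WARNING (${pct}%)"
  else                         echo "OK (${pct}%)"
  fi
}

# Usage: hikari_status 8 10
```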
Phase 3: PQS Data Consistency Validation
Connect to PQS (see Prerequisites) and run these checks.
3.1 Quick Health Check
SELECT check_data_quality_alert();
Expected: OK: No data quality issues found
If WARNING, proceed to detailed report.
3.2 Detailed Data Quality Report
SELECT entity_type, validation_type, issue_count
FROM data_quality_report()
WHERE issue_count > 0
ORDER BY issue_count DESC;
Issue Severity Guide:
| Validation Type | Meaning | Severity | Action |
|---|---|---|---|
| Multiple States | Entity in >1 state (e.g., active AND draft) | HIGH | Investigate immediately |
| Duplicate * State | Multiple contracts same ID/state | HIGH | Check archival logic |
| No Facilities | Active deal without facilities | MEDIUM | May be orphan data |
| Orphaned (*) | Missing parent relationship | MEDIUM | Check relationship integrity |
3.3 Check PQS Sync Position
SELECT ix, "offset" FROM __watermark;
Verification: Run twice, 30 seconds apart. The ix value should increase if transactions are happening.
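The two-sample check can be scripted end to end. A sketch, assuming `psql` access per Prerequisites; `PQS_CONNINFO` and `ix_verdict` are hypothetical names introduced here:

```shell
# Report whether the PQS watermark ix advanced between two samples.
ix_verdict() {
  if [ "$2" -gt "$1" ]; then echo "advancing ($1 -> $2)"
  else echo "STALLED at $1"
  fi
}

# Usage, sampling 30 seconds apart (skipped unless a conninfo is set):
if [ -n "${PQS_CONNINFO:-}" ]; then
  ix1=$(psql "$PQS_CONNINFO" -Atc 'SELECT ix FROM __watermark;')
  sleep 30
  ix2=$(psql "$PQS_CONNINFO" -Atc 'SELECT ix FROM __watermark;')
  ix_verdict "$ix1" "$ix2"
fi
```

A stalled `ix` is only meaningful if transactions are actually flowing; cross-check with 3.4 before escalating.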
3.4 Check Transaction Flow
SELECT
date_trunc('hour', effective_at) AS hour,
COUNT(*) AS tx_count
FROM __transactions
WHERE effective_at > NOW() - INTERVAL '4 hours'
GROUP BY date_trunc('hour', effective_at)
ORDER BY hour DESC;
Expected: Consistent transaction flow. Gaps during incident window are expected, but flow should resume after recovery.
Phase 4: Before/After Contract Count Comparison
This is the key validation step. Compare contract counts before the incident to current counts.
4.1 Compare Historical Timestamp to Current
Replace the timestamp with the time before the incident started (UTC):
SELECT * FROM compare_contract_counts('2025-11-28 09:00:00+00'::timestamptz);
Output:
snapshot | deals | facilities | loans | master_trades | sub_trades | locs
-----------------+-------+------------+-------+---------------+------------+------
Before (Nov 28) | 1050 | 770 | 175 | 92 | 152 | 89
Current | 1055 | 773 | 176 | 93 | 153 | 90
4.2 Using Interval
-- Compare with 2 hours ago
SELECT * FROM compare_contract_counts('2 hours'::interval);
-- Compare with 1 day ago
SELECT * FROM compare_contract_counts('1 day'::interval);
4.3 Compare Between Two Timestamps
For comparing two historical points (e.g., before and after a specific event window):
SELECT * FROM compare_contract_counts(
'2025-11-28 09:00:00+00'::timestamptz, -- from (before incident)
'2025-11-28 12:00:00+00'::timestamptz -- to (after recovery)
);
Output:
snapshot | deals | facilities | loans | master_trades | sub_trades | locs
-----------------+-------+------------+-------+---------------+------------+------
At Nov 28 09:00 | 1050 | 770 | 175 | 92 | 152 | 89
At Nov 28 12:00 | 1055 | 773 | 176 | 93 | 153 | 90
This is useful for:
- Comparing state before an incident vs after recovery (both historical)
- Validating that no contracts were lost during a maintenance window
- Auditing contract changes over a specific time period
4.4 Interpreting Results
| Scenario | Interpretation | Action |
|---|---|---|
| Counts equal or slightly higher | Normal - expected growth | OK |
| Counts significantly higher | Large batch operation occurred | Verify with product team |
| Counts lower | DATA LOSS | Escalate immediately |
| Counts much higher than expected | Possible duplicate contracts | Run data quality report |
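The decision table above can be expressed as a helper for quick triage. A sketch: the 10% cutoff for "significantly higher" is an illustrative assumption, not a documented threshold, and `count_verdict` is a hypothetical name:

```shell
# Interpret a before/after contract count per the table above.
# Args: before count, after count.
count_verdict() {
  before=$1; after=$2
  if [ "$after" -lt "$before" ]; then
    echo "ESCALATE: possible data loss ($before -> $after)"
  elif [ $(( (after - before) * 100 )) -gt $(( before * 10 )) ]; then
    # growth above an assumed 10% -- verify with the product team
    echo "VERIFY: unusually large growth ($before -> $after)"
  else
    echo "OK ($before -> $after)"
  fi
}
```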
Phase 5: Functional Verification
5.1 UI Verification
- Open LendOS Portal in browser
- Navigate to a page that loads ledger data (e.g., Servicing view)
- Verify data loads without "No Data Showing" error
- Check browser console (F12) for API errors
5.2 API Health Check (Optional)
# From within the cluster or via port-forward
curl -s http://daml-http-json.daml-http-json:7575/readyz
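After a restart it can take the service a moment to become ready, so a short retry loop is friendlier than a one-shot curl. A sketch; the URL matches the command above, and `ready_verdict`/`wait_for_ready` are hypothetical helpers:

```shell
# Map an HTTP status code from /readyz to a verdict.
ready_verdict() {
  case "$1" in
    200) echo "ready" ;;
    *)   echo "not ready (HTTP $1)" ;;
  esac
}

# Poll the readiness endpoint. Args: URL, attempts (default 5).
wait_for_ready() {
  url=$1; attempts=${2:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$(ready_verdict "$code")" = "ready" ]; then
      echo "ready after $((i + 1)) attempt(s)"; return 0
    fi
    i=$((i + 1)); sleep 5
  done
  echo "not ready after $attempts attempts"; return 1
}

# Usage: wait_for_ready http://daml-http-json.daml-http-json:7575/readyz
```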
Summary Checklist
Canton Components
- All Canton pods running (domain-manager, sequencer, mediator, participant)
- Sequencer processing tick requests (recent log entries)
- Mediator finalizing transactions (Phase 6 with Approve)
- Participant connected to domain
- No COMMITMENT_MISMATCH errors
Metrics (Grafana)
- Sequencer lag < 5s
- HikariCP pool < 70% utilization
- No pending connections
PQS Validation
- `check_data_quality_alert()` returns OK (or known issues)
- Watermark `ix` is advancing
- Transaction flow shows activity post-recovery
- Contract counts show no unexpected drops
Functional
- HTTP JSON API responding without errors
- UI loads data successfully
Escalation
If any of the following occur, escalate immediately:
- COMMITMENT_MISMATCH errors in participant logs
- Contract count drops in before/after comparison
- HIGH severity data quality issues (Multiple States, Duplicates)
- Watermark not advancing after 5+ minutes
- Persistent "No Data Showing" after all components show healthy
Escalation Path:
- Teams channel: Engineering
Related Documentation
Related PR: lendos-iac#168 (PQS historical query fix)