Canton Data Consistency Validation Runbook

This runbook guides you through validating Canton ledger data consistency after incident recovery, component restarts, or any event that might affect data integrity.

Target Audience: SRE Engineers, DevOps Engineers

When to Use:

  • After restarting Canton components (sequencer, mediator, participant)
  • After "No Data Showing" incidents
  • After database connection pool exhaustion events
  • After any incident affecting the ledger or http-json service
  • Before declaring an incident resolved

Related Issues: #6025, #6027


Prerequisites

  • kubectl configured for the target environment
  • Tailscale connected (for PQS access)
  • PQS readonly password from 1Password

PQS Connection

# Production
psql "host=pqs-db-bxci user=readonly dbname=lendos_pqs sslmode=require"

# Staging
psql "host=pqs-db-stg user=readonly dbname=lendos_pqs sslmode=require"

# Dev
psql "host=pqs-db-dev user=readonly dbname=lendos_pqs sslmode=require"
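The three connection strings differ only in host. If you are scripting these checks, a small helper (a sketch, with hosts taken from the commands above) can select the right one:

```shell
# Map an environment name to its PQS host (hosts as listed above).
pqs_host() {
  case "$1" in
    prod|production) echo "pqs-db-bxci" ;;
    stg|staging)     echo "pqs-db-stg" ;;
    dev)             echo "pqs-db-dev" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   psql "host=$(pqs_host prod) user=readonly dbname=lendos_pqs sslmode=require"
```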

Phase 1: Canton Component Health

1.1 Check All Pods Are Running

# Check Canton domain components
kubectl get pods -n canton-domain

# Check Canton participant
kubectl get pods -n canton-participant

# Check daml-http-json
kubectl get pods -n daml-http-json

Expected: All pods Running with 1/1 ready. Recent restarts are acceptable if recovering from an incident.
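Rather than eyeballing three namespaces, the readiness check can be scripted. This is a sketch that parses the default `kubectl get pods` column layout (an assumption about your kubectl version's output format):

```shell
# Print any pod that is not Running with all containers ready.
# Pipe in the output of `kubectl get pods -n <namespace>`.
check_pods_ready() {
  awk 'NR > 1 {
    split($2, r, "/")                       # READY column, e.g. "1/1"
    if ($3 != "Running" || r[1] != r[2]) {  # STATUS column
      print "NOT READY: " $1 " (" $2 " " $3 ")"
      bad = 1
    }
  } END { exit bad }'
}

# Usage:
#   for ns in canton-domain canton-participant daml-http-json; do
#     kubectl get pods -n "$ns" | check_pods_ready || echo "investigate $ns"
#   done
```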

1.2 Verify Sequencer Is Processing Requests

kubectl logs -n canton-domain deploy/canton-domain-sequencer --tail=50 | grep -E "sends request|subscribes from"

Expected: Recent timestamps showing:

  • 'PAR::lendos-participant::...' sends request with id 'tick-...'
  • 'MED::lendos::...' sends request with id 'tick-...'

1.3 Verify Mediator Is Finalizing Transactions

kubectl logs -n canton-domain deploy/canton-domain-mediator --tail=50 | grep -E "Phase [256]|verdict"

Expected: Recent entries showing:

  • Phase 2: Registered request
  • Phase 5: Received responses
  • Phase 6: Finalized request with verdict Approve

1.4 Verify Participant Is Connected to Domain

kubectl logs -n canton-participant deploy/canton-participant --tail=100 | grep -iE "connected|subscription|domain|caught up"

Expected:

  • Connected to domain or similar
  • Subscription started
  • No disconnected or error messages

1.5 Check for ACS Commitment Errors

# Check for commitment mismatch errors (indicates data divergence)
kubectl logs -n canton-participant deploy/canton-participant --tail=500 | grep -iE "COMMITMENT_MISMATCH|ACS.*error"

Expected: No matches. Any COMMITMENT_MISMATCH errors indicate serious data consistency issues requiring escalation.

1.6 Verify HTTP JSON API Is Responding

# Check for errors
kubectl logs -n daml-http-json deploy/daml-http-json --tail=100 | grep -iE "error|exception|timeout|pool"

Expected: No recent errors, connection timeouts, or pool exhaustion messages.


Phase 2: Prometheus Metrics (Grafana)

Check these metrics in Grafana to ensure the system is healthy:

| Metric | Query | Healthy Value |
|---|---|---|
| Sequencer lag | canton_sequencer_client_delay_seconds | < 5 seconds |
| HikariCP utilization | (HikariPool_1_pool_ActiveConnections / HikariPool_1_pool_MaxConnections) * 100 | < 70% |
| Pending connections | HikariPool_1_pool_PendingConnections | 0 |
| Connection acquisition time | HikariPool_1_pool_ConnectionAcquisitionTime_mean | < 100 ms |

Alert Thresholds

| Metric | Warning | Critical |
|---|---|---|
| HikariCP utilization | > 70% | > 90% |
| Pending connections | > 0 | > 5 |
| Acquisition time | > 100 ms | > 500 ms |

Phase 3: PQS Data Consistency Validation

Connect to PQS (see Prerequisites) and run these checks.

3.1 Quick Health Check

SELECT check_data_quality_alert();

Expected: OK: No data quality issues found

If the result is a WARNING, proceed to the detailed report in 3.2.
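If you are scripting this phase, the alert output can be branched on. A sketch (the exact WARNING wording is an assumption, so the match is kept loose):

```shell
# Classify the output of: SELECT check_data_quality_alert();
pqs_health_status() {
  case "$1" in
    OK:*)      echo "healthy" ;;
    *WARNING*) echo "run the detailed data quality report" ;;
    *)         echo "unexpected output: $1" ;;
  esac
}

# Usage (requires PQS access; production host shown):
#   result=$(psql "host=pqs-db-bxci user=readonly dbname=lendos_pqs sslmode=require" \
#     -tAc 'SELECT check_data_quality_alert();')
#   pqs_health_status "$result"
```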

3.2 Detailed Data Quality Report

SELECT entity_type, validation_type, issue_count
FROM data_quality_report()
WHERE issue_count > 0
ORDER BY issue_count DESC;

Issue Severity Guide:

| Validation Type | Meaning | Severity | Action |
|---|---|---|---|
| Multiple States | Entity in more than one state (e.g., active AND draft) | HIGH | Investigate immediately |
| Duplicate * State | Multiple contracts with the same ID/state | HIGH | Check archival logic |
| No Facilities | Active deal without facilities | MEDIUM | May be orphan data |
| Orphaned (*) | Missing parent relationship | MEDIUM | Check relationship integrity |

3.3 Check PQS Sync Position

SELECT ix, "offset" FROM __watermark;

Verification: Run twice, 30 seconds apart. The ix value should increase if transactions are happening.
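The two-reads-30-seconds-apart check can be wrapped in a small comparison helper. A sketch; the psql invocation in the comment assumes the production connection string from Prerequisites:

```shell
# Compare two reads of __watermark.ix taken ~30 seconds apart.
watermark_advancing() {
  # $1 = ix from the first read, $2 = ix from the second read
  if [ "$2" -gt "$1" ]; then
    echo "OK: watermark advancing ($1 -> $2)"
  else
    echo "STALLED: watermark still at $2"
  fi
}

# Wiring (requires PQS access):
#   q='SELECT ix FROM __watermark;'
#   c='host=pqs-db-bxci user=readonly dbname=lendos_pqs sslmode=require'
#   ix1=$(psql "$c" -tAc "$q"); sleep 30; ix2=$(psql "$c" -tAc "$q")
#   watermark_advancing "$ix1" "$ix2"
```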

3.4 Check Transaction Flow

SELECT
  date_trunc('hour', effective_at) AS hour,
  COUNT(*) AS tx_count
FROM __transactions
WHERE effective_at > NOW() - INTERVAL '4 hours'
GROUP BY date_trunc('hour', effective_at)
ORDER BY hour DESC;

Expected: Consistent transaction flow. Gaps during incident window are expected, but flow should resume after recovery.


Phase 4: Before/After Contract Count Comparison

This is the key validation step. Compare contract counts before the incident to current counts.

4.1 Compare Historical Timestamp to Current

Replace the timestamp with the time before the incident started (UTC):

SELECT * FROM compare_contract_counts('2025-11-28 09:00:00+00'::timestamptz);

Output:

    snapshot     | deals | facilities | loans | master_trades | sub_trades | locs
-----------------+-------+------------+-------+---------------+------------+------
 Before (Nov 28) |  1050 |        770 |   175 |            92 |        152 |   89
 Current         |  1055 |        773 |   176 |            93 |        153 |   90

4.2 Using an Interval

-- Compare with 2 hours ago
SELECT * FROM compare_contract_counts('2 hours'::interval);

-- Compare with 1 day ago
SELECT * FROM compare_contract_counts('1 day'::interval);

4.3 Compare Between Two Timestamps

For comparing two historical points (e.g., before and after a specific event window):

SELECT * FROM compare_contract_counts(
  '2025-11-28 09:00:00+00'::timestamptz,  -- from (before incident)
  '2025-11-28 12:00:00+00'::timestamptz   -- to (after recovery)
);

Output:

    snapshot     | deals | facilities | loans | master_trades | sub_trades | locs
-----------------+-------+------------+-------+---------------+------------+------
 At Nov 28 09:00 |  1050 |        770 |   175 |            92 |        152 |   89
 At Nov 28 12:00 |  1055 |        773 |   176 |            93 |        153 |   90

This is useful for:

  • Comparing state before an incident vs after recovery (both historical)
  • Validating that no contracts were lost during a maintenance window
  • Auditing contract changes over a specific time period

4.4 Interpreting Results

| Scenario | Interpretation | Action |
|---|---|---|
| Counts equal or slightly higher | Normal, expected growth | OK |
| Counts significantly higher | Large batch operation occurred | Verify with the product team |
| Counts lower | DATA LOSS | Escalate immediately |
| Counts much higher than expected | Possible duplicate contracts | Run the data quality report |
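The interpretation rules can be reduced to a per-column check. This sketch flags any drop outright and, as an arbitrary assumption, treats growth above 10% as "much higher than expected":

```shell
# Classify a before/after count pair from compare_contract_counts output.
# The 10% growth threshold is an assumption; tune it for your ledger.
count_delta_status() {
  before=$1; after=$2
  if [ "$after" -lt "$before" ]; then
    echo "DATA LOSS: escalate immediately"
  elif [ "$after" -gt $(( before + before / 10 )) ]; then
    echo "Much higher than expected: run the data quality report"
  else
    echo "OK: equal or modest growth"
  fi
}
```

Run it once per column, e.g. `count_delta_status 1050 1055` for the deals column.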

Phase 5: Functional Verification

5.1 UI Verification

  1. Open LendOS Portal in browser
  2. Navigate to a page that loads ledger data (e.g., Servicing view)
  3. Verify data loads without "No Data Showing" error
  4. Check browser console (F12) for API errors

5.2 API Health Check (Optional)

# From within the cluster or via port-forward
curl -s http://daml-http-json.daml-http-json:7575/readyz

Summary Checklist

Canton Components

  • All Canton pods running (domain-manager, sequencer, mediator, participant)
  • Sequencer processing tick requests (recent log entries)
  • Mediator finalizing transactions (Phase 6 with Approve)
  • Participant connected to domain
  • No COMMITMENT_MISMATCH errors

Metrics (Grafana)

  • Sequencer lag < 5s
  • HikariCP pool < 70% utilization
  • No pending connections

PQS Validation

  • check_data_quality_alert() returns OK (or known issues)
  • Watermark ix is advancing
  • Transaction flow shows activity post-recovery
  • Contract counts show no unexpected drops

Functional

  • HTTP JSON API responding without errors
  • UI loads data successfully

Escalation

If any of the following occur, escalate immediately:

  1. COMMITMENT_MISMATCH errors in participant logs
  2. Contract count drops in before/after comparison
  3. HIGH severity data quality issues (Multiple States, Duplicates)
  4. Watermark not advancing after 5+ minutes
  5. Persistent "No Data Showing" after all components show healthy

Escalation Path:

  • Teams channel: Engineering


Related PR: lendos-iac#168 (PQS historical query fix)