Canton Data Consistency Validation Runbook

This runbook guides you through validating Canton ledger data consistency after incident recovery, component restarts, or any event that might affect data integrity.

Target Audience: SRE Engineers, DevOps Engineers

When to Use:

  • After restarting Canton components (sequencer, mediator, participant)
  • After "No Data Showing" incidents
  • After database connection pool exhaustion events
  • After any incident affecting the ledger or http-json service
  • Before declaring an incident resolved

Related Issues: #6025, #6027


Prerequisites

  • kubectl configured for the target environment
  • Tailscale connected (for PQS access)
  • PQS readonly password from 1Password

PQS Connection

# Production
psql "host=pqs-db-bxci user=readonly dbname=lendos_pqs sslmode=require"

# Staging
psql "host=pqs-db-stg user=readonly dbname=lendos_pqs sslmode=require"

# Dev
psql "host=pqs-db-dev user=readonly dbname=lendos_pqs sslmode=require"
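The three connection strings differ only in host. If you are scripting these checks, a small helper (a sketch, with hosts taken from the commands above) can select the right one:

```shell
# Map an environment name to its PQS host (hosts as listed above).
pqs_host() {
  case "$1" in
    prod|production) echo "pqs-db-bxci" ;;
    stg|staging)     echo "pqs-db-stg" ;;
    dev)             echo "pqs-db-dev" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   psql "host=$(pqs_host prod) user=readonly dbname=lendos_pqs sslmode=require"
```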

Phase 1: Canton Component Health

1.1 Check All Pods Are Running

# Check Canton domain components
kubectl get pods -n canton-domain

# Check Canton participant
kubectl get pods -n canton-participant

# Check daml-http-json
kubectl get pods -n daml-http-json

Expected: All pods Running with 1/1 ready. Recent restarts are acceptable if recovering from an incident.
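Rather than eyeballing three namespaces, the readiness check can be scripted. This is a sketch that parses the default `kubectl get pods` column layout (an assumption about your kubectl version's output format):

```shell
# Print any pod that is not Running with all containers ready.
# Pipe in the output of `kubectl get pods -n <namespace>`.
check_pods_ready() {
  awk 'NR > 1 {
    split($2, r, "/")                       # READY column, e.g. "1/1"
    if ($3 != "Running" || r[1] != r[2]) {  # STATUS column
      print "NOT READY: " $1 " (" $2 " " $3 ")"
      bad = 1
    }
  } END { exit bad }'
}

# Usage:
#   for ns in canton-domain canton-participant daml-http-json; do
#     kubectl get pods -n "$ns" | check_pods_ready || echo "investigate $ns"
#   done
```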

1.2 Verify Sequencer Is Processing Requests

kubectl logs -n canton-domain deploy/canton-domain-sequencer --tail=50 | grep -E "sends request|subscribes from"

Expected: Recent timestamps showing:

  • 'PAR::lendos-participant::...' sends request with id 'tick-...'
  • 'MED::lendos::...' sends request with id 'tick-...'

1.3 Verify Mediator Is Finalizing Transactions

kubectl logs -n canton-domain deploy/canton-domain-mediator --tail=50 | grep -E "Phase [256]|verdict"

Expected: Recent entries showing:

  • Phase 2: Registered request
  • Phase 5: Received responses
  • Phase 6: Finalized request with verdict Approve

1.4 Verify Participant Is Connected to Domain

kubectl logs -n canton-participant deploy/canton-participant --tail=100 | grep -iE "connected|subscription|domain|caught up"

Expected:

  • Connected to domain or similar
  • Subscription started
  • No disconnected or error messages

1.5 Check for ACS Commitment Errors

# Check for commitment mismatch errors (indicates data divergence)
kubectl logs -n canton-participant deploy/canton-participant --tail=500 | grep -iE "COMMITMENT_MISMATCH|ACS.*error"

Expected: No matches. Any COMMITMENT_MISMATCH errors indicate serious data consistency issues requiring escalation.

1.6 Verify HTTP JSON API Is Responding

# Check for errors
kubectl logs -n daml-http-json deploy/daml-http-json --tail=100 | grep -iE "error|exception|timeout|pool"

Expected: No recent errors, connection timeouts, or pool exhaustion messages.


Phase 2: Prometheus Metrics (Grafana)

Check these metrics in Grafana to ensure the system is healthy:

| Metric | Query | Healthy Value |
|---|---|---|
| Sequencer lag | canton_sequencer_client_delay_seconds | < 5 seconds |
| HikariCP utilization | (HikariPool_1_pool_ActiveConnections / HikariPool_1_pool_MaxConnections) * 100 | < 70% |
| Pending connections | HikariPool_1_pool_PendingConnections | 0 |
| Connection acquisition time | HikariPool_1_pool_ConnectionAcquisitionTime_mean | < 100 ms |

Alert Thresholds

| Metric | Warning | Critical |
|---|---|---|
| HikariCP utilization | > 70% | > 90% |
| Pending connections | > 0 | > 5 |
| Acquisition time | > 100 ms | > 500 ms |

Phase 3: PQS Data Consistency Validation

Connect to PQS (see Prerequisites) and run these checks.

3.1 Quick Health Check

SELECT check_data_quality_alert();

Expected: OK: No data quality issues found

If the result is a WARNING, proceed to the detailed report in 3.2.
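If you are scripting this phase, the alert output can be branched on. A sketch (the exact WARNING wording is an assumption, so the match is kept loose):

```shell
# Classify the output of: SELECT check_data_quality_alert();
pqs_health_status() {
  case "$1" in
    OK:*)      echo "healthy" ;;
    *WARNING*) echo "run the detailed data quality report" ;;
    *)         echo "unexpected output: $1" ;;
  esac
}

# Usage (requires PQS access; production host shown):
#   result=$(psql "host=pqs-db-bxci user=readonly dbname=lendos_pqs sslmode=require" \
#     -tAc 'SELECT check_data_quality_alert();')
#   pqs_health_status "$result"
```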

3.2 Detailed Data Quality Report

SELECT entity_type, validation_type, issue_count
FROM data_quality_report()
WHERE issue_count > 0
ORDER BY issue_count DESC;

Issue Severity Guide:

| Validation Type | Meaning | Severity | Action |
|---|---|---|---|
| Multiple States | Entity in more than one state (e.g., active AND draft) | HIGH | Investigate immediately |
| Duplicate * State | Multiple contracts with the same ID/state | HIGH | Check archival logic |
| No Facilities | Active deal without facilities | MEDIUM | May be orphan data |
| Orphaned (*) | Missing parent relationship | MEDIUM | Check relationship integrity |

3.3 Check PQS Sync Position

SELECT ix, "offset" FROM __watermark;

Verification: Run twice, 30 seconds apart. The ix value should increase if transactions are happening.
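The two-reads-30-seconds-apart check can be wrapped in a small comparison helper. A sketch; the psql invocation in the comment assumes the production connection string from Prerequisites:

```shell
# Compare two reads of __watermark.ix taken ~30 seconds apart.
watermark_advancing() {
  # $1 = ix from the first read, $2 = ix from the second read
  if [ "$2" -gt "$1" ]; then
    echo "OK: watermark advancing ($1 -> $2)"
  else
    echo "STALLED: watermark still at $2"
  fi
}

# Wiring (requires PQS access):
#   q='SELECT ix FROM __watermark;'
#   c='host=pqs-db-bxci user=readonly dbname=lendos_pqs sslmode=require'
#   ix1=$(psql "$c" -tAc "$q"); sleep 30; ix2=$(psql "$c" -tAc "$q")
#   watermark_advancing "$ix1" "$ix2"
```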

3.4 Check Transaction Flow

SELECT
  date_trunc('hour', effective_at) AS hour,
  COUNT(*) AS tx_count
FROM __transactions
WHERE effective_at > NOW() - INTERVAL '4 hours'
GROUP BY date_trunc('hour', effective_at)
ORDER BY hour DESC;

Expected: Consistent transaction flow. Gaps during incident window are expected, but flow should resume after recovery.


Phase 4: Before/After Contract Count Comparison

This is the key validation step. Compare contract counts before the incident to current counts.

4.1 Compare Historical Timestamp to Current

Replace the timestamp with the time before the incident started (UTC):

SELECT * FROM compare_contract_counts('2025-11-28 09:00:00+00'::timestamptz);

Output:

    snapshot     | deals | facilities | loans | master_trades | sub_trades | locs
-----------------+-------+------------+-------+---------------+------------+------
 Before (Nov 28) |  1050 |        770 |   175 |            92 |        152 |   89
 Current         |  1055 |        773 |   176 |            93 |        153 |   90

4.2 Using an Interval

-- Compare with 2 hours ago
SELECT * FROM compare_contract_counts('2 hours'::interval);

-- Compare with 1 day ago
SELECT * FROM compare_contract_counts('1 day'::interval);

4.3 Compare Between Two Timestamps

For comparing two historical points (e.g., before and after a specific event window):

SELECT * FROM compare_contract_counts(
  '2025-11-28 09:00:00+00'::timestamptz,  -- from (before incident)
  '2025-11-28 12:00:00+00'::timestamptz   -- to (after recovery)
);

Output:

    snapshot     | deals | facilities | loans | master_trades | sub_trades | locs
-----------------+-------+------------+-------+---------------+------------+------
 At Nov 28 09:00 |  1050 |        770 |   175 |            92 |        152 |   89
 At Nov 28 12:00 |  1055 |        773 |   176 |            93 |        153 |   90

This is useful for:

  • Comparing state before an incident vs after recovery (both historical)
  • Validating that no contracts were lost during a maintenance window
  • Auditing contract changes over a specific time period

4.4 Interpreting Results

| Scenario | Interpretation | Action |
|---|---|---|
| Counts equal or slightly higher | Normal, expected growth | OK |
| Counts significantly higher | Large batch operation occurred | Verify with the product team |
| Counts lower | DATA LOSS | Escalate immediately |
| Counts much higher than expected | Possible duplicate contracts | Run the data quality report |
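The interpretation rules can be reduced to a per-column check. This sketch flags any drop outright and, as an arbitrary assumption, treats growth above 10% as "much higher than expected":

```shell
# Classify a before/after count pair from compare_contract_counts output.
# The 10% growth threshold is an assumption; tune it for your ledger.
count_delta_status() {
  before=$1; after=$2
  if [ "$after" -lt "$before" ]; then
    echo "DATA LOSS: escalate immediately"
  elif [ "$after" -gt $(( before + before / 10 )) ]; then
    echo "Much higher than expected: run the data quality report"
  else
    echo "OK: equal or modest growth"
  fi
}
```

Run it once per column, e.g. `count_delta_status 1050 1055` for the deals column.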

Phase 5: Functional Verification

5.1 UI Verification

  1. Open LendOS Portal in browser
  2. Navigate to a page that loads ledger data (e.g., Servicing view)
  3. Verify data loads without "No Data Showing" error
  4. Check browser console (F12) for API errors

5.2 API Health Check (Optional)

# From within the cluster or via port-forward
curl -s http://daml-http-json.daml-http-json:7575/readyz

Summary Checklist

Canton Components

  • All Canton pods running (domain-manager, sequencer, mediator, participant)
  • Sequencer processing tick requests (recent log entries)
  • Mediator finalizing transactions (Phase 6 with Approve)
  • Participant connected to domain
  • No COMMITMENT_MISMATCH errors

Metrics (Grafana)

  • Sequencer lag < 5s
  • HikariCP pool < 70% utilization
  • No pending connections

PQS Validation

  • check_data_quality_alert() returns OK (or known issues)
  • Watermark ix is advancing
  • Transaction flow shows activity post-recovery
  • Contract counts show no unexpected drops

Functional

  • HTTP JSON API responding without errors
  • UI loads data successfully

Escalation

If any of the following occur, escalate immediately:

  1. COMMITMENT_MISMATCH errors in participant logs
  2. Contract count drops in before/after comparison
  3. HIGH severity data quality issues (Multiple States, Duplicates)
  4. Watermark not advancing after 5+ minutes
  5. Persistent "No Data Showing" after all components show healthy

Escalation Path:

  • Teams channel: Engineering


Related PR: lendos-iac#168 (PQS historical query fix)