Validator Disaster Recovery

This section was copied from existing reviewed documentation. Source: docs/src/validator_operator/validator_disaster_recovery.rst Reviewers: Skip this section. Remove markers after final approval.

There are three ways to recover from disasters:

In simple cases where only a single node is affected but the overall network is still healthy, a Restore from backup is usually sufficient.
If a full backup is unavailable but an identities backup has been created, the balance of the validator can be recovered on a new validator.
If the global synchronizer breaks, the super validators will initiate a roll-forward Logical Synchronizer Upgrade to roll forward to a new physical synchronizer. Validators will need to initiate the procedure on their node based on the information communicated by the SVs.

A recent database backup is available, or:

An up-to-date identities backup is available, or:

The validator participant was using an external KMS to manage its keys and the KMS still retains those keys. (Note that recovering the validator from only KMS keys - i.e., without an identities backup or database backup -is an involved process that is not explicitly documented here.)

If neither of the above holds, it is not possible to recover the relevant participant secret keys to prove asset ownership.

Restoring a validator from backups

The entire node can be restored from backups as long as all of the following hold:

A database backup is available.
The database backup is less than 30 days old. Due to sequencer pruning, a participant that is more than 30 days behind will be unable to catch up on the synchronizer to become fully operational again.
If the backup was taken before the synchronizer underwent a logical synchronizer upgrade, then restoring the node from the backup will only be possible if synchronizer nodes on the old physical synchronizer are still available. If this is true, you must restore the node on the old physical synchronizer first so it can catch up and become fully operational on the new physical synchronizer.

If one of the above does not hold, it might still be possible to recover the node using the re-onboarding procedure discussed below. The following steps can be taken to restore a node from backups:

Scale down all components in the validator node to 0 replicas.
Restore the storage and DBs of all components from the backups. The exact process for this depends on the storage and DBs used by the components, and is not documented here.
Once all storage has been restored, scale up all components in the validator node back to 1 replica.

NOTE: Currently, you have to manually re-onboard any users that were onboarded after the backup was taken. If you are running a docker-compose deployment, you can restore the Postgres databases as follows:

Stop the validator and participant using ./stop.sh.
Wipe out the existing database volume: docker volume rm compose_postgres-splice.
Start only the postgres container: docker compose up -d postgres-splice
Check whether postgres is ready with: docker exec splice-validator-postgres-splice-1 pg_isready (rerun this command until it succeeds)
Restore the validator database (assuming validator_dump_file contains the filename of the dump from which you wish to restore): docker exec -i splice-validator-postgres-splice-1 psql -U cnadmin validator < $validator_dump_file
Restore the participant database (assuming participant_dump_file contains the filename of the dump from which you wish to restore, and migration_id contains the latest migration ID): docker exec -i splice-validator-postgres-splice-1 psql -U cnadmin participant-$migration_id < $participant_dump_file
Stop the postgres instance: docker compose down
Start your validator as usual

Recovery from an identities backup: Re-onboard a validator and recover balances of all users it hosts

In the case of a catastrophic failure of the validator node, some data owned by the validator and users it hosts can be recovered from the SVs. This data includes Canton Coin balance and CNS entries. This is achieved by deploying a new validator node with control over the original validator’s namespace key. The namespace key must be provided via an identities backup file. It is used by the new validator for migrating the parties hosted on the original validator to the new validator. SVs assist this process by providing information about all contracts known to them that the migrated parties are stakeholders of. The following steps assume that you have a backup of the identities of the validator, as created in the Backup of Node Identities section. In case you do not have such a backup but instead have a backup of the validator participant’s database, you can assemble an identities backup manually. To recover from the identities backup, we deploy a new validator with some special configuration described below. Refer to either the docker-compose deployment instructions or the kubernetes instructions depending on which setup you chose. Once the new validator is up and running, you should be able to login as the administrator and see its balance. Other users hosted on the validator will need to re-onboard, but their coin balance and CNS entries should be recovered and will accessible to users that have re-onboarded. In case of issues, please consult the troubleshooting section below.

This process preserves all party IDs and all contracts shared with the DSO party. This means that you must keep the same validator party hint and you do not need a new onboarding secret. If you get any errors about needing a new onboarding secret, double check your configuration instead of requesting a new secret.

Kubernetes Deployment

To re-onboard a validator in a Kubernetes deployment and recover the balances of all users it hosts, repeat the steps described in helm-validator-install for installing the validator app and participant. While doing so, please note the following:

Create a Kubernetes secret with the content of the identities backup file. Assuming you set the environment variable PARTICIPANT_BOOTSTRAP_DUMP_FILE to a backup file path, you can create the secret with the following command:

    kubectl create secret generic participant-bootstrap-dump \
        --from-file=content=${PARTICIPANT_BOOTSTRAP_DUMP_FILE} \
        -n validator

Uncomment the following lines in the standalone-validator-values.yaml file. This will specify a new participant ID for the validator. Replace put-some-new-string-never-used-before with a string that was never used before. Make sure to also adjust nodeIdentifier to match the same value.

# participantIdentitiesDumpImport:
#   secretName: participant-bootstrap-dump
#   # Make sure to also adjust nodeIdentifier to the same value
#   newParticipantIdentifier: put-some-new-string-never-used-before
# migrateValidatorParty: true

Docker-Compose Deployment

To re-onboard a validator in a Docker-compose deployment and recover the balances of all users it hosts, type:

./start.sh -s "&lt;SPONSOR_SV_URL&gt;" -o "" -p &lt;party_hint&gt; -m "&lt;MIGRATION_ID&gt;" -i "&lt;node_identities_dump_file&gt;" -P "&lt;new_participant_id&gt;" -w

where <node_identities_dump_file> is the path to the file containing the node identities backup, and <new_participant_id> is a new identifier to be used for the new participant. It must be one never used before. Note that in subsequent restarts of the validator, you should keep providing -P with the same <new_participant_id>.

Obtaining an Identities Backup from a Participant Database Backup

In case you do not have a usable identities backup but instead have a backup of the validator participant’s database, you can assemble an identities backup manually. Here is one possible way to do so:

Restore the database backup into a temporary postgres instance and deploy a temporary participant against that instance.
- See the section on restoring a validator from backups for pointers that match your deployment model.
- You only need to restore and scale up the participant, i.e., you can ignore the validator app and its database.
- In case the restored participant shuts down immediately due to failures, add the following additional configuration:
additionalEnvVars: - name: ADDITIONAL_CONFIG_EXIT_ON_FATAL_FAILURES value: canton.parameters.exit-on-fatal-failures = false
Open a Canton console to the temporary participant.

Run below commands in the opened console. This will store the backup into a local file (relative to the local directory from which you opened the console) called identities-dump.json.

import com.digitalasset.canton.topology.transaction.TopologyMapping
import com.digitalasset.canton.topology.store.TimeQuery
import java.util.Base64

val id = participant.id.toProtoPrimitive

// This line needs to be adapted if your participant stores keys in an external KMS
val keys = "[" + participant.keys.secret.list().filter(k => k.name.get.unwrap != "cometbft-governance-keys").map(key => s"{\"keyPair\": \"${Base64.getEncoder.encodeToString(participant.keys.secret.download(key.publicKey.fingerprint).toByteArray)}\", \"name\": \"${key.name.get.unwrap}\"}") .mkString(",") + "]"

val authorizedStoreSnapshot = Base64.getEncoder.encodeToString(participant.topology.transactions.export_topology_snapshot(timeQuery = TimeQuery.Range(from = None, until = None), filterMappings = Seq(TopologyMapping.Code.NamespaceDelegation, TopologyMapping.Code.OwnerToKeyMapping, TopologyMapping.Code.VettedPackages), filterNamespace = participant.id.namespace.toProtoPrimitive).toByteArray)

val combinedJson = s"""{ "id" : "$id", "keys" : $keys, "authorizedStoreSnapshot" : "$authorizedStoreSnapshot" }"""

// Write to file
import java.nio.file.{Files, Paths}
val dumpPath = Paths.get("identities-dump.json")
Files.writeString(dumpPath, combinedJson)

Note that above commands need to be adapted if your participant is configured to store keys in an external KMS.

Limitations and Troubleshooting

In some non-standard cases, the automated re-onboarding from key backup might not succeed in migrating (i.e., recovering) a party. Please check the logs of the validator for warnings or error entries that may give clues.

Parties not migrated automatically

The following types of parties will not be migrated by default:

Parties that are hosted on multiple participants. These may get unhosted from the original (failed) participant, but will remain hosted on any other participants.
External parties that are hosted on the validator. These may get unhosted from the original (failed) participant. Please refer to validator_recover_external_party for instructions on how to recover external parties.

In some cases you might want to force the migration attempt for a set of parties that were not migrated automatically. To do so, you can set the parties-to-migrate configuration option on your validator app. A migration will be attempted for each party that you pass to this option. The initialization of the validator app will be interrupted on the first failed migration attempt.

Troubleshooting failed ACS imports

Troubleshooting rejected topology snapshots

In rare cases, the re-onboarding process may fail at the ImportTopologySnapshot step because an OwnerToKeyMapping for the old participant ID has an insufficient number of signatures in the topology snapshot. This only affects validators that were originally onboarded on Splice 0.4.1 or earlier, which used a Canton version that did not require the mapped keys to co-sign OwnerToKeyMapping transactions. You can identify this issue by looking for the following messages in your participant logs:

Missing authorizers: ReferencedAuthorizations(extraKeys = &lt;key-id&gt;...)
Rejected transaction ... OwnerToKeyMapping(...) ... due to Not authorized

To work around this, follow these steps:

Start only the new participant (without the validator app). Do not wipe its state from the previous (failed) re-onboarding attempt.

Open a Canton console to the new participant and run the following commands to propose the corrected OwnerToKeyMapping. Replace the key ID prefixes with those from the rejected OwnerToKeyMapping in your participant logs, and replace the old participant ID with your actual old participant ID:

val keys = Seq("&lt;signing-key-id-prefix&gt;", "&lt;encryption-key-id-prefix&gt;").map(prefix =>
  participant.keys.public.list().filter(_.publicKey.id.toProtoPrimitive.startsWith(prefix)).head.publicKey)

val oldParticipantId = ParticipantId.fromProtoPrimitive("&lt;old-participant-id&gt;", "participant").toOption.get
val otk = OwnerToKeyMapping(member = oldParticipantId, keys = NonEmpty.from(keys).get)
participant.topology.owner_to_key_mappings.propose(otk, force = ForceFlag.AlienMember)

Start the validator app using your original identities dump configuration.

Recover the Coin balance of an external party

Roll Forward Logical Synchronizer Upgrade

In case the SVs communicate that they recover from a loss of the physical synchronizer, they will communicate the newPhysicalSynchronizerId and the sequencerSuccessors. Validators then need to:

Wait for their node to finish catching up to the latest transaction on the existing synchronizer. A good indicator for that is that you don’t see any new logs containing Processing event at in your participant INFO logs.
Initiate the roll forward LSU through a Canton console:

val existingPhysicalSynchronizerId = participant.synchronizers.list_connected().find(_.synchronizerAlias == "global").head.physicalSynchronizerId
participant.synchronizers.perform_manual_lsu(
  existingPhysicalSynchronizerId,
  newPhysicalSynchronizerId,
  upgradeTime = None,
  sequencerSuccessors,
)

Resolving ACS mismatches

Note that depending on how exactly the old synchronizer failed, validators may desynchronize if some validators have observed a transaction before the failure while others have not. In that case, the participant will produce ACS mismatches that should be resolved using the standard ACS mismatch resolution process after migrating to the new physical synchronizer.

Overview

Validator Deployment

Super Validator Deployment

Splice Fundamentals

Canton Console

Production Operations

Extension Synchronizers

Troubleshooting

Release Notes

Reference

Community Resources

Restoring a validator from backups

Recovery from an identities backup: Re-onboard a validator and recover balances of all users it hosts

Kubernetes Deployment

Docker-Compose Deployment

Obtaining an Identities Backup from a Participant Database Backup

Limitations and Troubleshooting

Parties not migrated automatically

Troubleshooting failed ACS imports

Troubleshooting rejected topology snapshots

Recover the Coin balance of an external party

Roll Forward Logical Synchronizer Upgrade

Resolving ACS mismatches

Overview

Validator Deployment

Super Validator Deployment

Splice Fundamentals

Canton Console

Production Operations

Extension Synchronizers

Troubleshooting

Release Notes

Reference

Community Resources

Documentation Index

​Restoring a validator from backups

​Recovery from an identities backup: Re-onboard a validator and recover balances of all users it hosts

​Kubernetes Deployment

​Docker-Compose Deployment

​Obtaining an Identities Backup from a Participant Database Backup

​Limitations and Troubleshooting

​Parties not migrated automatically

​Troubleshooting failed ACS imports

​Troubleshooting rejected topology snapshots

​Recover the Coin balance of an external party

​Roll Forward Logical Synchronizer Upgrade

​Resolving ACS mismatches

Restoring a validator from backups

Recovery from an identities backup: Re-onboard a validator and recover balances of all users it hosts

Kubernetes Deployment

Docker-Compose Deployment

Obtaining an Identities Backup from a Participant Database Backup

Limitations and Troubleshooting

Parties not migrated automatically

Troubleshooting failed ACS imports

Troubleshooting rejected topology snapshots

Recover the Coin balance of an external party

Roll Forward Logical Synchronizer Upgrade

Resolving ACS mismatches