Vault* System Issues#

Here is a list of common Vault* system issues and tips to resolve them.

Health Check#

A good way to tell whether Vault is healthy is to look at its pod readiness. All three instances should report 3/3 ready.

$ kubectl -n orch-platform get pods | grep -E 'vault-[0-9]+'
vault-0  3/3  Running  0  23h
vault-1  3/3  Running  0  23h
vault-2  3/3  Running  0  23h
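
For a more detailed view, you can also inspect the full status of an individual instance, which includes fields such as Initialized, Sealed, and HA Mode:

$ kubectl -n orch-platform exec -it vault-0 -- vault status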

Common Failures#

Vault is sealed#

Vault is sealed automatically after a pod or node restart. For security reasons, it must be manually unsealed by the operator.

Symptom#

Most of the time, a Vault pod showing 2/3 ready indicates that it is sealed.

$ kubectl -n orch-platform get pods | grep -E 'vault-[0-9]+'
vault-0  2/3  Running  0  23h
vault-1  3/3  Running  0  23h
vault-2  3/3  Running  0  23h

You can check the Vault status to confirm. $1 should be the name of the Vault pod, for example, vault-0.

$ kubectl -n orch-platform exec -it $1 -- vault status | grep Sealed
Sealed          true
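
For comparison, a healthy, unsealed instance reports false for the same field:

$ kubectl -n orch-platform exec -it $1 -- vault status | grep Sealed
Sealed          false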

Recovery method for environments without auto unseal#

Note

Each Vault instance needs to be unsealed separately.

Run the following command and supply the unseal key to unseal Vault. $1 should be the name of the Vault pod, for example, vault-0.

$ kubectl -n orch-platform exec -it $1 -- vault operator unseal

The unseal key must be securely stored during initial deployment.
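
Once every instance has been unsealed, all pods should return to 3/3 ready, which you can confirm with the same readiness check as above:

$ kubectl -n orch-platform get pods | grep -E 'vault-[0-9]+'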

Recovery method for environments with auto unseal#

A Vault pod should be unsealed automatically when it is restarted. You can restart the pod with the following command. $1 should be the name of the Vault pod, for example, vault-0.

kubectl -n orch-platform delete pod $1
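
If you want to follow the restart, you can watch the pods (-w is the standard kubectl watch flag) until the instance comes back 3/3 ready on its own:

kubectl -n orch-platform get pods -w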

Alternatively, you can use the recovery keys and unseal the instance using the same method described above.

kubectl -n orch-platform exec -it $1 -- vault operator unseal
Unseal Key (will be hidden):

The recovery key must be securely stored during initial deployment.

Catastrophic Failures#

Vault is unrecoverable#

In the event of a catastrophic failure, such as being unable to retrieve the unseal keys, you must wipe the Vault database and start over.

Warning

Doing this will result in the loss of critical data, such as the root PKI, and therefore should only be performed in extreme situations where Vault is unrecoverable. Consult a Platform team member before you proceed.

Recovery Method#

You first need to access the Postgres shell.

kubectl -n orch-database exec -it postgresql-0 -- psql -U orch-platform-vault_user \
-d orch-platform-vault

You will be prompted for a password, which can be found in the following secret.

kubectl -n orch-platform get secret vault-local-postgresql -o yaml | \
yq .data.PGPASSWORD | base64 -d
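
If you prefer to avoid the interactive password prompt, the two commands above can be combined. This is only a sketch, assuming the psql client in the postgresql-0 container honors the standard PGPASSWORD environment variable:

PGPASSWORD=$(kubectl -n orch-platform get secret vault-local-postgresql -o yaml | \
  yq .data.PGPASSWORD | base64 -d)
kubectl -n orch-database exec -it postgresql-0 -- \
  env PGPASSWORD="$PGPASSWORD" psql -U orch-platform-vault_user -d orch-platform-vault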

Vault uses two tables: vault_ha_locks and vault_kv_store. Once in the Postgres shell, run the following command to drop them.

orch-platform-vault=> drop table vault_ha_locks, vault_kv_store;
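
If you want to double-check before re-initializing, you can list the remaining relations from the same prompt (\dt is a standard psql meta-command, not specific to Vault):

orch-platform-vault=> \dt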

Run the following command to initialize Vault. Unlike unsealing, this only needs to be done on one of the Vault instances.

$ kubectl -n orch-platform exec -it vault-0 -- vault operator init \
-key-shares=1 -key-threshold=1
(unseal key and root token will be shown here)

Store the unseal key and root token in a secure location.
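
You can confirm that the new Vault is initialized with the status command. Note that with a Shamir seal the instance will still report Sealed true until it is unsealed with the key generated above:

$ kubectl -n orch-platform exec -it vault-0 -- vault status | grep -E 'Initialized|Sealed'
Initialized     true
Sealed          true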

Seal Migration#

It is possible to switch between Shamir seal (manual unseal) and AWS KMS seal (auto unseal) without the need to reinitialize Vault.

Step 1: Update Vault configuration#

The seal method is configured in the Helm values under .server.ha.config.

  • To use Shamir seal (manual unseal):

    server:
      ha:
        config: |
          ...
          seal "shamir" {
          }
          seal "awskms" {
            disabled = "true"
          }

    Shamir seal is the default, and typically no seal configuration is required to use it. However, when performing a seal migration, you must explicitly set disabled = "true" on the previous seal method.

  • To use AWS KMS seal (auto unseal):

    server:
      ha:
        config: |
          ...
          seal "awskms" {
          }

Note that the shamir section must be completely removed, rather than using disabled = "true", when switching to the AWS KMS seal.
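
The updated values then need to be applied to the Vault deployment. The following is only a sketch, assuming a Helm release named vault using the hashicorp/vault chart with the override stored in values.yaml; adjust the release, chart, and values file names to match your deployment:

helm -n orch-platform upgrade vault hashicorp/vault -f values.yaml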

Step 2: Kill all standby instances and let Kubernetes restart them#

kubectl -n orch-platform delete pod <standby1> <standby2> ...

You can confirm whether each instance is active or standby by running:

kubectl -n orch-platform exec -it <vault pod> -- vault status

Look for the HA Mode field.
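
For example, to print only the HA role of a given instance (the value will be either active or standby):

$ kubectl -n orch-platform exec -it vault-0 -- vault status | grep 'HA Mode'
HA Mode         standby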

Step 3: Unseal all restarted standby instances with migration flag#

kubectl -n orch-platform exec -it <standby1> -- vault operator unseal -migrate
# supply unseal key
kubectl -n orch-platform exec -it <standby2> -- vault operator unseal -migrate
# supply unseal key

Confirm that Vault is in seal migration mode by running:

kubectl -n orch-platform exec -it <vault pod> -- vault status

Look for the Seal Migration in Progress field.

Step 4: Step down the active instance#

One of the standby instances will take over and become active automatically.

kubectl -n orch-platform exec -it <active> -- vault login
# supply root token
kubectl -n orch-platform exec -it <active> -- vault operator step-down

Step 5: Finally, restart the original active instance#

kubectl -n orch-platform delete pod <active>
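
Once all pods are back to 3/3 ready, you can confirm that the migration took effect by checking the Seal Type field on any instance; it should now report the new seal method (for example, awskms after migrating to auto unseal):

$ kubectl -n orch-platform exec -it vault-0 -- vault status | grep 'Seal Type'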

Reference#