Edge Cluster Orchestrator Troubleshooting
=========================================

Prerequisite Knowledge
----------------------

.. note:: Before commencing troubleshooting, Intel recommends
   being familiar with the following documents:

- :doc:`../cluster_orch/arch/technology_stack` for the technology stack used

- :doc:`../cluster_orch/arch/architecture` for the architecture within Edge
  Cluster Orchestrator

- :doc:`../cluster_orch/arch/deployment` for the deployment of the
  Edge Cluster

These will aid greatly in providing a mental model of the cluster creation
process.

This guide provides troubleshooting steps for common issues encountered when
setting up a cluster within Edge Orchestrator. It uses specific ``kubectl``
commands to help you navigate the relevant logs.

A cluster typically takes a few minutes to deploy successfully, as shown
below:

.. image:: ./images/co/successful_clusterlist.png
   :alt: successful cluster list
   :width: 100%

However, if the cluster is stuck in the creation phase for longer than
**10 minutes**, use the **debugging commands** below to investigate why
creation has stalled.

Edge Orchestrator
-----------------

.. note::

   This guide is intended for experienced SRE / DevOps engineers who have been
   granted the permissions and roles needed to access the console, and who
   need to investigate further after reviewing the logs gathered from
   observability. See: :doc:`capture_logs`

   The following commands are executed against the Edge Orchestrator's
   Kubernetes\* cluster, so access to its ``KUBECONFIG`` is required.
   Similar log information can be obtained from the ``observability-admin``
   UI. See: :doc:`/user_guide/monitor_deployments/grafana_content`.
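
With ``KUBECONFIG`` pointing at the Edge Orchestrator cluster, it can help to
look at the Cluster API (CAPI) resources themselves before reading component
logs. The following is a minimal sketch, assuming the upstream CAPI resource
kinds ``clusters.cluster.x-k8s.io`` and ``machines.cluster.x-k8s.io``;
``<namespace>`` and ``<cluster name>`` are placeholders to replace with your
own values:

.. code:: shell

    # List CAPI Cluster and Machine objects in all namespaces with their phase
    kubectl get clusters.cluster.x-k8s.io -A
    kubectl get machines.cluster.x-k8s.io -A

    # Show conditions and recent events for one cluster
    kubectl describe clusters.cluster.x-k8s.io -n <namespace> <cluster name>

The ``PHASE`` column and the conditions in the ``describe`` output usually
indicate which component's logs to read next.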

.. code:: shell

    kubectl logs -n capi-operator-system -l app=cluster-api-operator

**context:** issues with the overall management of CAPI components.
Check for misconfigurations or missing resources.

.. code:: shell

    kubectl logs -n capr-system -l cluster.x-k8s.io/provider=bootstrap-rke2

**context:** issues during the bootstrap process of edge clusters,
such as clusters failing to initialize. Check for misconfigurations,
errors during the bootstrap process, and network issues.

.. code:: shell

    kubectl logs -n orch-cluster -l app=cluster-manager-cm

**context:** issues related to API calls failing or unexpected behaviour in
cluster management.

.. code:: shell

    kubectl logs -n orch-cluster -l app=cluster-manager-controller

**context:** issues relating to the CAPI template components required for
further cluster creation: ControlPlaneTemplate, MachineTemplate,
ClusterTemplate, and ClusterClass.

.. code:: shell

    kubectl logs -n orch-cluster -l app=southbound-api
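
When a label selector (``-l``) is used, ``kubectl logs`` returns only the last
10 lines per container by default, which can hide the original error. The
following is a minimal sketch of widening that window and filtering for likely
failures, reusing the namespace and label of the ``cluster-api-operator``
command in this section. The 30-minute window and the ``error|fail`` pattern
are illustrative assumptions, not values prescribed by this guide:

.. code:: shell

    # --tail=-1 lifts the 10-line limit, --since bounds the time window,
    # --prefix tags each line with the pod and container it came from
    kubectl logs -n capi-operator-system -l app=cluster-api-operator \
        --since=30m --tail=-1 --all-containers --prefix \
        | grep -iE 'error|fail'

The same flags can be applied to the other ``kubectl logs`` commands in this
section by swapping the namespace and label.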

Edge Node Logs
--------------

Prerequisite Knowledge to Access the Edge Node Logs
'''''''''''''''''''''''''''''''''''''''''''''''''''

.. note::
   To access the Edge Node logs, you need access to the observability
   dashboard. See:
   :doc:`../../../user_guide/monitor_deployments/grafana_content`

To access the logs for both the Cluster Agent and the RKE Server, select the
Edge Node Agent Log search, as shown below:

.. image:: ./images/co/edgeNodeAgent.png
   :alt: Edge Node logs
   :width: 100%

To narrow the results to the logs you are looking for, use the search
filter, as shown below for the RKE Server:

.. image:: ./images/co/rkelog.png
   :alt: rke logs
   :width: 100%

Additionally, issues where the connect-agent or Cluster Agent on the Edge
Node is not responding can be diagnosed by accessing the logs through the
dashboard:

.. image:: ./images/co/cluster-obs.png
   :alt: Cluster logs
   :width: 100%

Then search for the component name in the search filter, as displayed below:

.. image:: ./images/co/connectAgent.png
   :alt: connect Agent
   :width: 80%

For more information about these logs, see
:doc:`/user_guide/monitor_deployments/grafana_content`.

**context:** issues with ``rke2-server`` misconfiguration
or Kubernetes cluster installation
(check agent logs for more info as shown in the example for RKE).

**context:** issues with communication to Edge
Node Cluster Agent, such as network failure
or issues specific to the agent
(check agent logs for more info as shown above).

For more information, see
:doc:`/user_guide/monitor_deployments/grafana_content`.

Troubleshooting Edge Node Clusters Extensions Deployment
--------------------------------------------------------

Edge Node
'''''''''

Diagnosing extension issues typically requires ``kubectl`` access to the
cluster. To get it, download the kubeconfig of the target edge cluster from
the UI by following :doc:`/user_guide/set_up_edge_infra/accessing_clusters`.

.. note::

   If the ``connect-agent`` is not deployed and running, the Edge Cluster is
   not reachable and ``kubectl`` commands that use the kubeconfig will fail.

The kubeconfig file can then be used to interact with
the workload cluster from the Edge Orchestrator environment.
For example, to list all pods in all namespaces in the
Edge Node workload cluster (without extensions):

.. code:: shell

    kubectl --kubeconfig=kubeconfig.yaml get pods -A

.. image:: ./images/co/kubectl-listPods.png
   :alt: kubectl-list pods
   :width: 100%
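
To narrow the listing to pods that need attention, a field selector can filter
out healthy pods. This is a minimal sketch; ``<namespace>`` and ``<pod name>``
are placeholders for values taken from the listing:

.. code:: shell

    # Show only pods that are neither Running nor Succeeded
    kubectl --kubeconfig=kubeconfig.yaml get pods -A \
        --field-selector='status.phase!=Running,status.phase!=Succeeded'

    # Inspect events and container states for one of the listed pods
    kubectl --kubeconfig=kubeconfig.yaml describe pod -n <namespace> <pod name>

Pods that are crash-looping still report the ``Running`` phase, so also check
the ``RESTARTS`` column of the full listing.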

The kubeconfig file contains a short-lived token that is valid
for a maximum of 60 minutes from the web UI login.
When the token expires, log in again
and download a new kubeconfig for the cluster.

Example of the logs:

.. image:: ./images/co/short-lived-token-img.png
   :alt: short-lived-token-img
   :width: 100%

Because each downloaded kubeconfig file has the same
name, take care to use the most recent copy
to avoid the expired-token issue.
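
One way to reduce the risk of picking up a stale copy is to select the newest
file by modification time and check its age before using it. The following is
a minimal sketch, not a supported tool: the ``~/Downloads`` location, the
``kubeconfig*.yaml`` name pattern, and the ``newest_kubeconfig`` helper are
assumptions for illustration. The file age only approximates the token age,
because the 60-minute window starts at web UI login rather than at download:

.. code:: shell

    # Hypothetical helper: print the most recently modified kubeconfig*.yaml
    # in the directory given as the first argument (empty if none exists).
    newest_kubeconfig() {
        # ls -t sorts by modification time, newest first
        ls -t "$1"/kubeconfig*.yaml 2>/dev/null | head -n 1
    }

    dir="${HOME}/Downloads"    # assumed download location
    file="$(newest_kubeconfig "$dir")"

    if [ -z "$file" ]; then
        echo "no kubeconfig found in $dir"
    elif [ -n "$(find "$file" -mmin +60)" ]; then
        # find prints the path only if it was modified more than 60 minutes ago
        echo "$file is more than 60 minutes old; log in and download a new one"
    else
        kubectl --kubeconfig="$file" get pods -A
    fi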

In the deployments tab you will see which apps are failing:

.. image:: ./images/co/exstenion-list.png
   :alt: extension list
   :width: 40%

.. image:: ./images/co/failing-ext-deployments.png
   :alt: failing-ext-deployments

If the total number of apps reported is lower
than expected, check the following on the
orchestrator cluster:

- App Deployment Manager logs
- ``cattle-fleet-system`` ``gitjob`` logs
- app deployment jobs in the namespace allocated to the downstream cluster

.. image:: ./images/co/gitbjobcmd.png
   :alt: gitjob cmd
   :width: 100%
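
The ``gitjob`` and job checks in the list above could look like the following
when run against the orchestrator cluster. This is a sketch under assumptions:
the ``gitjob`` Deployment name follows the upstream Fleet default and may
differ in your installation, and ``<cluster namespace>`` is a placeholder for
the namespace allocated to the downstream cluster:

.. code:: shell

    # Confirm the gitjob workload name before reading its logs
    kubectl get deployments -n cattle-fleet-system

    # Read recent gitjob logs (the Deployment name is an assumption)
    kubectl logs -n cattle-fleet-system deployment/gitjob --tail=200

    # List app deployment jobs and their pods for the downstream cluster
    kubectl get jobs,pods -n <cluster namespace>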

To diagnose a failing app, use the downloaded
kubeconfig to access the pod logs for that app.
If the app does not appear to have any pods running,
you can try using helm:

.. code:: shell

    helm ls -A
    helm history -n <chart namespace> <release name>

These commands list the installed charts and show
the revision history of a release, which helps
identify charts that failed during installation.
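
If the release history alone is not conclusive, a few further standard helm
and kubectl commands can help. This is a sketch that assumes the downloaded
kubeconfig is in the current directory and is exported as ``KUBECONFIG``; the
placeholders are the same as in the helm commands above:

.. code:: shell

    # Use the downloaded edge cluster kubeconfig for helm and kubectl
    export KUBECONFIG=kubeconfig.yaml

    # List only releases in the failed state, across all namespaces
    helm ls -A --failed

    # Show the current status and chart notes of one release
    helm status -n <chart namespace> <release name>

    # Show events in the release namespace, sorted with the newest last
    kubectl get events -n <chart namespace> --sort-by=.lastTimestamp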

Testing Extension Changes
-------------------------

If you want to test changes that you believe will fix an extension, you will
first need a :doc:`/user_guide/package_software/deploy_packages` to apply your
changes to.

Extensions are currently stored here:
`https://github.com/open-edge-platform/cluster-extensions
<https://github.com/open-edge-platform/cluster-extensions>`_.

Once you have your deployment package, make the necessary changes you want to
test. Then, update the version in the ``applications.yaml`` and
``deployment-package.yaml`` files.

.. note::

   The version field in the ``applications.yaml`` file is an arbitrary value you
   can modify to avoid conflicts with other packages and does not need to match
   the chart version.

After updating the deployment package version in the
``deployment-package.yaml`` file, import the package into the UI under the
deployment packages tab.

.. image:: ./images/co/import-deployment-package.png
   :alt: import-deployment-package
   :width: 50%

Once the deployment package is imported, go to your failing deployment and
upgrade it:

.. image:: ./images/co/upgrade-failing-deployment.png
   :alt: upgrade-failing-deployment
   :width: 50%

Optionally, you can also delete and recreate
the cluster to try a start-to-finish installation
from a clean state.