AWS* Load Balancer#

Note

This page applies to cloud deployment only.

AWS Load Balancer Architecture#

[Figure: AWS load balancer architecture]

Health Check#

A good way to tell whether the load balancer is healthy is to curl the web-ui endpoint.

  • The service is healthy if it returns HTTP 200

  • It’s likely a load balancer issue if the connection times out

  • It’s likely a K8s service issue if it returns an HTTP status code other than 200
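The three cases above can be told apart from curl's exit code and the HTTP status it reports. The following is a minimal sketch rather than official tooling: the function names `classify_health` and `check_endpoint`, the 10-second timeout, and the handling of other curl failures are assumptions made for illustration.

```shell
#!/bin/sh
# classify_health (hypothetical helper): maps a curl exit code and an HTTP
# status code to one of the cases listed above.
classify_health() {
  curl_exit="$1"
  http_code="$2"
  if [ "$curl_exit" -eq 28 ]; then
    # curl exit code 28 means the operation timed out.
    echo "timeout: likely a load balancer issue"
  elif [ "$curl_exit" -ne 0 ]; then
    # Any other curl failure (DNS, TLS, refused connection) is outside the three cases.
    echo "curl failed (exit $curl_exit): check DNS or TLS"
  elif [ "$http_code" = "200" ]; then
    echo "healthy"
  else
    echo "HTTP $http_code: likely a K8s service issue"
  fi
}

# check_endpoint (hypothetical helper): curls a URL and classifies the result.
# The 10-second timeout is an assumed value; adjust it to your environment.
check_endpoint() {
  url="$1"
  http_code=$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$url")
  # $? here is the exit status of the curl command substitution above.
  classify_health "$?" "$http_code"
}
```

For example, `check_endpoint https://web-ui.edgeorchestration.intel.com` prints one of the messages above; substitute the web-ui hostname of your own deployment.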

Common Failures#

DNS is not resolved - (1)#

  • Symptom

    Normally, web-ui.edgeorchestration.intel.com should eventually resolve to the Elastic IP addresses of the load balancer.

    $ dig web-ui.edgeorchestration.intel.com
    (...)
    ;; ANSWER SECTION:
    web-ui.edgeorchestration.intel.com. 900 IN CNAME web-ui.edgeorchestration.espdprod.infra-host.com.
    web-ui.edgeorchestration.espdprod.infra-host.com. 300 IN CNAME edgeorchestration.espdprod.infra-host.com.
    edgeorchestration.espdprod.infra-host.com. 60 IN A 34.217.254.245
    edgeorchestration.espdprod.infra-host.com. 60 IN A 44.240.209.7
    (...)
    

    If the name does not resolve to an address, there is likely a DNS issue.

  • Recovery method

    The top-level *.edgeorchestration.intel.com domain is managed by Intel IT. You must contact Intel IT if it does not resolve.

    Check the public zone in Route 53 and make sure its records point to the correct load balancer instance.
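The symptom check above can be scripted with `dig +short`, which prints only the CNAME targets and addresses of the answer. This is a minimal sketch under stated assumptions: `has_a_record` and `check_dns` are hypothetical helper names, and only IPv4 (A) records are considered.

```shell
#!/bin/sh
# has_a_record (hypothetical helper): reads `dig +short` output on stdin and
# succeeds if at least one line is an IPv4 address, meaning the CNAME chain
# ended in an A record.
has_a_record() {
  grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
}

# check_dns (hypothetical helper): reports whether a name resolves to an address.
check_dns() {
  name="$1"
  if dig +short "$name" A | has_a_record; then
    echo "resolved"
  else
    echo "not resolved: likely a DNS issue"
  fi
}
```

For example, `check_dns web-ui.edgeorchestration.intel.com` should print `resolved` on a working deployment. Running it against each name in the CNAME chain shows which link is broken, which helps decide whether to contact Intel IT or to inspect the Route 53 public zone.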

Target group does not have any entry - (2)#

  • Symptom

    No target entries are registered at all in AWS Console > EC2 > Target Group > (select the target group of Traefik or SRE).

  • Recovery method

    Check the targetgroupbinding resources in the Kubernetes* platform. They should include the following entries. If they do not, check the Terraform log from the cloud installer.

    $ kubectl get targetgroupbinding -A
    NAMESPACE       NAME                       SERVICE-NAME               SERVICE-PORT   TARGET-TYPE   AGE
    argocd          argocd                     argocd-server              443            ip            24d
    orch-gateway    traefik-grpc               traefik                    443            ip            24d
    orch-gateway    traefik-https              traefik                    443            ip            24d
    gitea           gitea                      gitea-http                 443            ip            24d
    orch-boots      ingress-nginx-controller   ingress-nginx-controller   443            ip            24d
    

    If the entries are still missing, check the log of aws-load-balancer-controller in the kube-system namespace and verify the following:

    • The IAM role for the Amazon Elastic Kubernetes Service* (Amazon EKS*) node should be associated with a policy that grants access to load balancer-related resources.

    • A VPC endpoint should be set up to allow access to sts.amazonaws.com.
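Comparing the live bindings against the expected list can be automated. The sketch below assumes the five namespace/name pairs shown in the sample output above are the complete expected set for your deployment; `missing_bindings` is a hypothetical helper name.

```shell
#!/bin/sh
# missing_bindings (hypothetical helper): reads the output of
#   kubectl get targetgroupbinding -A --no-headers
# on stdin and prints every expected namespace/name pair that is absent.
# The expected list is copied from the sample output in this section and is
# an assumption about your deployment.
missing_bindings() {
  actual=$(awk '{print $1 "/" $2}')
  for want in \
    argocd/argocd \
    orch-gateway/traefik-grpc \
    orch-gateway/traefik-https \
    gitea/gitea \
    orch-boots/ingress-nginx-controller
  do
    printf '%s\n' "$actual" | grep -Fxq "$want" || echo "$want"
  done
}
```

Typical usage is `kubectl get targetgroupbinding -A --no-headers | missing_bindings`; empty output means all expected bindings exist. To inspect the controller log, something like `kubectl logs -n kube-system deployment/aws-load-balancer-controller` should work, assuming the Deployment carries the same name as the controller.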

Target group has entries but showing unhealthy - (3)(4)#

  • Symptom

    There are target entries in AWS Console > EC2 > Target Group > (select the target group of Traefik or SRE), but their status is unhealthy.

  • Recovery method

    Check the Network ACL (3) and the Security Group of the Kubernetes node (4).

    • Network ACL should have the following rules:

      Inbound: ALLOW port 443 from Intel network
      Outbound: ALLOW port 8443 to private subnet (Traefik)
      Outbound: ALLOW port 9141 to private subnet (SRE exporter)
      
    • Security Group should have the following rules:

      ALLOW port 8443 from public subnet
      ALLOW port 9141 from public subnet
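The same target status shown in the console can be read from the command line with `aws elbv2 describe-target-health`. The following is a sketch: `unhealthy_targets` is a hypothetical helper, and `TG_ARN` is a placeholder for the ARN of the Traefik or SRE target group, which is not given here.

```shell
#!/bin/sh
# unhealthy_targets (hypothetical helper): reads tab-separated rows of
#   <target id> <port> <state> <reason>
# on stdin and prints the rows whose state is not "healthy".
unhealthy_targets() {
  awk '$3 != "healthy" { print $1 ":" $2 " " $3 " (" $4 ")" }'
}

# Example invocation (TG_ARN is a placeholder you must set yourself):
#   aws elbv2 describe-target-health \
#     --target-group-arn "$TG_ARN" \
#     --query 'TargetHealthDescriptions[].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]' \
#     --output text | unhealthy_targets
```

A reason such as `Target.Timeout` suggests that health check traffic is being dropped before it reaches the pod, which is consistent with a missing Network ACL or Security Group rule for port 8443 or 9141.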
      

Troubleshooting Tips#

Kubernetes Service Port (5)#

There are three port settings in a Kubernetes Service that are often confused.

  • port: this setting affects two things and should be set to 443 in our case:

    • The outward-facing port of the load balancer when the Service type is LoadBalancer.

    • The port we use to access the service from other internal pods via https://<service>.<namespace>.svc when using Service type ClusterIP.

  • nodePort: the port that each Kubernetes node listens on when the Service type is NodePort.

  • targetPort: the port that the pod listens on:

    • Should be 8443 for Traefik.

    • Should be 9141 for SRE exporter.
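The port settings described above can be read back from a live Service with a kubectl JSONPath query. This sketch assumes the Traefik Service is named `traefik` in the `orch-gateway` namespace, as the targetgroupbinding output earlier on this page suggests; `check_ports` is a hypothetical helper. If targetPort is defined as a named port rather than a number, the comparison reports a mismatch and the container port of that name must be checked by hand.

```shell
#!/bin/sh
# check_ports (hypothetical helper): reads tab-separated rows of
#   <port> <targetPort> <nodePort>
# on stdin and prints "ok" if any row has port 443 and the expected targetPort.
check_ports() {
  expected_target="$1"
  awk -v want="$expected_target" '
    $1 == 443 && $2 == want { ok = 1 }
    END { print (ok ? "ok" : "mismatch") }'
}

# Example invocation (Service name and namespace are assumptions):
#   kubectl get service traefik -n orch-gateway \
#     -o jsonpath='{range .spec.ports[*]}{.port}{"\t"}{.targetPort}{"\t"}{.nodePort}{"\n"}{end}' \
#     | check_ports 8443
```

Pass `9141` instead of `8443` when checking the Service in front of the SRE exporter.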

Note

Currently, we use the ip TargetGroup type, so nodePort does not matter. nodePort will become relevant if we switch to the instance TargetGroup type in the future to incorporate Calico CNI.

Traefik ingressroute (6)#

The list of all Host SNI rules that the Traefik* software handles can be dumped from the ingressroute resources:

$ kubectl get ingressroute -A -o yaml | yq '.items[].spec.routes[].match'
Host(`app-service-proxy.demo.espd.infra-host.com`)
Host(`app-service-proxy.demo.espd.infra-host.com`) && PathPrefix(`/app-service-proxy`)
Host(`attest-node.demo.espd.infra-host.com`) && PathPrefix(`/`)
Host(`connect-gateway.demo.espd.infra-host.com`) && PathPrefix(`/kubernetes`)
Host(`connect-gateway.demo.espd.infra-host.com`) && PathPrefix(`/connect`)
Host(`observability-ui.demo.espd.infra-host.com`)
Host(`registry-oci.demo.espd.infra-host.com`) && (PathPrefix(`/api/`) || PathPrefix(`/service/`) || PathPrefix(`/v2/`) || PathPrefix(`/chartrepo/`) || PathPrefix(`/c/`))
Host(`registry-oci.demo.espd.infra-host.com`) && PathPrefix(`/`)
Host(`infra-node.demo.espd.infra-host.com`) && PathPrefix(`/`)
Host(`api.demo.espd.infra-host.com`) && PathPrefix(`/edge-orchestrator.intel.com/openapi.json`)
Host(`api.demo.espd.infra-host.com`) && PathPrefix(`/edge-orchestrator.intel.com/docs`)
Host(`cluster-orch-node.demo.espd.infra-host.com`) && PathPrefix(`/`)
Host(`gitea.demo.espd.infra-host.com`)
Host(`update-node.demo.espd.infra-host.com`) && PathPrefix(`/`)
Host(`api.demo.espd.infra-host.com`) && PathPrefix(`/v`)
Host(`api.demo.espd.infra-host.com`) && PathPrefix(`/v1/orgs`)
Host(`onboarding-node.demo.espd.infra-host.com`) && PathPrefix(`/`)
Host(`onboarding-stream.demo.espd.infra-host.com`) && PathPrefix(`/`)
Host(`observability-admin.demo.espd.infra-host.com`)
Host(`keycloak.demo.espd.infra-host.com`)
Host(`logs-node.demo.espd.infra-host.com`)
Host(`metrics-node.demo.espd.infra-host.com`) && (PathPrefix(`/prometheus/api/v1/query`) || PathPrefix(`/prometheus/api/v1/query_range`) || PathPrefix(`/api/v1/push`)|| PathPrefix(`/otlp/v1/metrics`))
Host(`vault.demo.espd.infra-host.com`)
Host(`release.demo.espd.infra-host.com`)
Host(`fleet.demo.espd.infra-host.com`)
Host(`telemetry-node.demo.espd.infra-host.com`) && PathPrefix(`/`)
Host(`tinkerbell-server.demo.espd.infra-host.com`)
(Host(`demo.espd.infra-host.com`)) && PathPrefix(`/boots`)
Host(`vnc.demo.espd.infra-host.com`)
(Host(`web-ui.demo.espd.infra-host.com`) || Host(`demo.espd.infra-host.com`)) && PathPrefix(`/`)
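Because several rules share a hostname, it can help to reduce the dump above to a unique list of hosts, for example to verify that each one has a DNS record. This is a minimal sketch; `unique_hosts` is a hypothetical helper name, and it relies on `grep -o`, which GNU and BSD grep both provide.

```shell
#!/bin/sh
# unique_hosts (hypothetical helper): reads Traefik match rules on stdin and
# prints each distinct hostname found inside a Host(`...`) matcher, sorted.
unique_hosts() {
  grep -o 'Host(`[^`]*`)' | sed 's/^Host(`//; s/`)$//' | sort -u
}

# Example invocation:
#   kubectl get ingressroute -A -o yaml \
#     | yq '.items[].spec.routes[].match' | unique_hosts
```

Each hostname in the result can then be checked with dig or curl as described in the sections above.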