Anthos Multi Cluster Ingress: intermittent connectivity and disappearing backend service

I'm running 2 private GKE clusters in europe-west2. I have a dedicated config cluster for MCI and a worker cluster for workloads. Both clusters are registered to the Anthos hub, and the ingress feature is enabled on the config cluster. In addition, the worker cluster runs the latest ASM, 1.12.2.

As far as MCI is concerned, my deployment is 'standard', i.e. based on the available docs (e.g. https://cloud.google.com/architecture/distributed-services-on-gke-private-using-anthos-service-mesh#configure-multi-cluster-ingress, the terraform-example-foundation repo, etc.).

Everything works, but I'm hitting an intermittent connectivity issue no matter how many times I redeploy the entire stack. My eyes are bleeding from staring at the logging dashboard, and I've run out of dots to connect.

I'm probing some endpoints exposed by my cluster. Most of the time they return 200, with the following logged under resource.type="http_load_balancer":

      {
       httpRequest: {
        latency: "0.081658s"
        remoteIp: "20.83.144.189"
        requestMethod: "GET"
        requestSize: "360"
        requestUrl: "https://foo.bar.io/"
        responseSize: "1054"
        serverIp: "100.64.72.136"
        status: 200
        ...
       }
       insertId: "10mjvz4e8g0nq"
       jsonPayload: {
        @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
        statusDetails: "response_sent_by_backend"
       }
       ...
       resource: {
        labels: {
         backend_service_name: "mci-4z8mmz-80-asm-ingress-mcs-istio"
         forwarding_rule_name: "mci-4z8mmz-fws-asm-ingress-mci-istio"
         project_id: "prj-foo-bar"
         target_proxy_name: "mci-4z8mmz-asm-ingress-mci-istio"
         url_map_name: "mci-4z8mmz-asm-ingress-mci-istio"
         zone: "global"
        }
        type: "http_load_balancer"
       }
       severity: "INFO"
       spanId: "2a986abfc69bef6f"
       timestamp: "2022-02-04T15:24:14.160642Z"
       ...
      }

At random intervals, anywhere between 1 and 5 hours apart, the probes start failing with 404 for a period of 5 to 10 minutes, and the following is logged:

      {
       httpRequest: {
        ...
        requestMethod: "GET"
        ...
        requestUrl: "https://foo.bar.io/"
        ...
        status: 404
        ...
       }
       insertId: "10mjvz4e8g0nq"
       jsonPayload: {
        @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
        statusDetails: "internal_error"
       }
       ...
       resource: {
        labels: {
         backend_service_name: ""
         forwarding_rule_name: "mci-4z8mmz-fws-asm-ingress-mci-istio"
         project_id: "prj-foo-bar"
         target_proxy_name: "mci-4z8mmz-asm-ingress-mci-istio"
         url_map_name: "mci-4z8mmz-asm-ingress-mci-istio"
         zone: "global"
        }
        type: "http_load_balancer"
       }
       severity: "WARNING"
       ...
      }

backend_service_name and serverIp disappear, and the external LB provisioned via MCI goes for an extended nap. If I try to access the endpoints in a browser during that period, I get a 404 and eventually a "connection was closed" error.
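For anyone wanting to pick these entries out of an exported log dump, here is a minimal Python sketch of the failure signature described above: status 404, statusDetails "internal_error", and an empty backend_service_name. The function name is_unrouted_404 and the dict layout (log entries already parsed into nested dicts) are assumptions made for illustration, not part of any Google tooling.

```python
def is_unrouted_404(entry: dict) -> bool:
    """Return True if a parsed load balancer log entry matches the
    failure signature: 404 + internal_error + no backend service."""
    http = entry.get("httpRequest", {})
    labels = entry.get("resource", {}).get("labels", {})
    details = entry.get("jsonPayload", {}).get("statusDetails", "")
    return (
        http.get("status") == 404
        and details == "internal_error"
        # An empty or missing label means the LB did not pick a backend.
        and not labels.get("backend_service_name")
    )


# Two trimmed-down entries modelled on the logs in the question.
healthy = {
    "httpRequest": {"status": 200},
    "jsonPayload": {"statusDetails": "response_sent_by_backend"},
    "resource": {"labels": {
        "backend_service_name": "mci-4z8mmz-80-asm-ingress-mcs-istio"}},
}
failing = {
    "httpRequest": {"status": 404},
    "jsonPayload": {"statusDetails": "internal_error"},
    "resource": {"labels": {"backend_service_name": ""}},
}

matches = [e for e in (healthy, failing) if is_unrouted_404(e)]
```

Checking all three fields together is deliberate: a 404 returned by the application itself would still carry a backend_service_name, so it would not be counted.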

I've searched logs far and wide and cannot find any leads.

Has anyone experienced a similar issue? Could this be a regional thing? I have yet to try deploying to another region.

Any info/links/ideas much appreciated.

Edit:

I also confirmed that the health checks are fine and there are no transitions. The pods never receive the requests, so the 404s are coming from the external LB.

1 Answer

I had the same (or a very similar) issue when using HTTPS with MultiClusterIngress.

Google support suggested using a literal static IP address for the annotation:

      networking.gke.io/static-ip: STATIC_IP_ADDRESS

Try using a literal IP address, e.g.

      34.102.201.47

instead of

      https://www.googleapis.com/compute/v1/projects/PROJECT_ID/global/addresses/ADDRESS_NAME

as described in https://cloud.google.com/kubernetes-engine/docs/how-to/multi-cluster-ingress#static.
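To check which form an existing manifest uses, a small Python sketch like the one below may help. It only tests whether the annotation value parses as an IP address; the helper names and the idea of passing the annotations in as a plain dict are assumptions for illustration, and the snippet does not talk to any cluster.

```python
import ipaddress

ANNOTATION = "networking.gke.io/static-ip"


def is_literal_ip(value: str) -> bool:
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value.strip())
    except ValueError:
        return False
    return True


def check_static_ip_annotation(annotations: dict) -> str:
    """Classify the static-ip annotation as 'missing', 'literal' or 'not-literal'."""
    value = annotations.get(ANNOTATION)
    if value is None:
        return "missing"
    return "literal" if is_literal_ip(value) else "not-literal"


# The two forms discussed above.
literal_form = {ANNOTATION: "34.102.201.47"}
url_form = {
    ANNOTATION: "https://www.googleapis.com/compute/v1/projects/"
                "PROJECT_ID/global/addresses/ADDRESS_NAME"
}
```

With these inputs, check_static_ip_annotation(literal_form) classifies the value as a literal address, while url_form is reported as not literal, which is the case support advised changing.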

If that doesn't solve the problem, try contacting Google support.