ML model pod keeps restarting in a Seldon deployment

I have this Seldon deployment:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: gs://seldon-models/mlflow/elasticnet_wine
        name: classifier
      name: default
      replicas: 1

The model is downloaded from the server successfully, but after a while the pods go into a crashloop state and restart over and over.

When I look at the logs there are no errors, because the logs start over on every restart, and all I can see is the Python packages being downloaded.

PS C:\Users\xxx\mlflow> kubectl logs -p -c wines-classifier model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Executing before-run script
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages
_libgcc_mutex-0.1    | 3 KB      | ########## | 100%
readline-7.0         | 324 KB    | ########## | 100%
ncurses-6.2          | 817 KB    | ########## | 100%
tbb4py-2020.0        | 209 KB    | ########## | 100%
scipy-1.1.0          | 13.2 MB   | ########## | 100%
zlib-1.2.11          | 103 KB    | ########## | 100%
xz-5.2.5             | 341 KB    | ########## | 100%
openssl-1.1.1g       | 2.5 MB    | ########## | 100%
mkl_fft-1.0.6        | 135 KB    | ########## | 100%
blas-1.0             | 6 KB      | ########## | 100%
pip-20.1.1           | 1.8 MB    | ########## | 100%
wheel-0.34.2         | 51 KB     | ########## | 100%
libffi-3.2.1         | 40 KB     | ########## | 100%
scikit-learn-0.19.1  | 3.9 MB    | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB   | ########## | 100%
sqlite-3.32.3        | 1.1 MB    | ########## | 100%
numpy-1.15.4         | 34 KB     | ########## | 100%
tk-8.6.10            | 3.0 MB    | ########## | 100%
libgcc-ng-9.1.0      | 5.1 MB    | ########## | 100%
setuptools-47.3.1    | 514 KB    | ########## | 100%
mkl_random-1.0.1     | 324 KB    | ########## | 100%
python-3.6.9         | 30.2 MB   | ########## | 100%
certifi-2020.6.20    | 156 KB    | ########## | 100%
numpy-base-1.15.4    | 3.4 MB    | ########## | 100%
intel-openmp-2019.4  | 729 KB    | ########## | 100%
libedit-3.1.20191231 | 167 KB    | ########## | 100%
libstdcxx-ng-9.1.0   | 3.1 MB    | ########## | 100%
tbb-2020.0           | 1.1 MB    | ########## | 100%
mkl-2018.0.3         | 126.9 MB  | #########  |  91%

Now, trying with the -p option suggested by @arghya-sadhu:

PS C:\Users\xxx\mlflow> kubectl logs -p model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp wines-classifier
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages
scikit-learn-0.19.1  | 3.9 MB    | ########## | 100%
ncurses-6.2          | 817 KB    | ########## | 100%
_libgcc_mutex-0.1    | 3 KB      | ########## | 100%
zlib-1.2.11          | 103 KB    | ########## | 100%
tbb4py-2020.0        | 209 KB    | ########## | 100%
setuptools-47.3.1    | 514 KB    | ########## | 100%
libedit-3.1.20191231 | 167 KB    | ########## | 100%
tbb-2020.0           | 1.1 MB    | ########## | 100%
xz-5.2.5             | 341 KB    | ########## | 100%
mkl_random-1.0.1     | 324 KB    | ########## | 100%
libgcc-ng-9.1.0      | 5.1 MB    | ########## | 100%
python-3.6.9         | 30.2 MB   | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB   | ########## | 100%
libffi-3.2.1         | 40 KB     | ########## | 100%
mkl-2018.0.3         | 126.9 MB  | ########## | 100%
libstdcxx-ng-9.1.0   | 3.1 MB    | ########## | 100%
readline-7.0         | 324 KB    | ########## | 100%
intel-openmp-2019.4  | 729 KB    | ########## | 100%
tk-8.6.10            | 3.0 MB    | ########## | 100%
pip-20.1.1           | 1.8 MB    | ########## | 100%
numpy-base-1.15.4    | 3.4 MB    | ########## | 100%
wheel-0.34.2         | 51 KB     | ########## | 100%
scipy-1.1.0          | 13.2 MB   | #########3 |  93%

And the pod description:

PS C:\Users\ivarea\repo\smartgraph\mlflow-v2> kubectl describe pod model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Name:         model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Namespace:    default
Priority:     0
Node:         mlops-control-plane/172.19.0.2
Start Time:   Thu, 25 Jun 2020 10:08:20 +0200
Labels:       app=model-a-wines-classifier-0-wines-classifier
              fluentd=true
              pod-template-hash=5b8bc7889d
              seldon-app=model-a-wines-classifier
              seldon-app-svc=model-a-wines-classifier-wines-classifier
              seldon-deployment-id=model-a
              version=wines-classifier
Annotations:  prometheus.io/path: /prometheus
              prometheus.io/scrape: true
Status:       Running
IP:           10.244.0.17
IPs:
  IP:           10.244.0.17
Controlled By:  ReplicaSet/model-a-wines-classifier-0-wines-classifier-5b8bc7889d
Init Containers:
  wines-classifier-model-initializer:
    Container ID:  containerd://6a3b158cf4218f8c177f6d18eb5d0387946bf9cc36f1173754b68a029483da8b
    Image:         gcr.io/kfserving/storage-initializer:0.2.2
    Image ID:      gcr.io/kfserving/storage-initializer@sha256:7a7d3cf4c5121a3e6bad0acc9e88bbdfa9c7f774d80bd64d8e35a84dcfef8890
    Port:          <none>
    Host Port:     <none>
    Args:
      gs://seldon-models/mlflow/model-a
      /mnt/models
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 25 Jun 2020 10:08:24 +0200
      Finished:     Thu, 25 Jun 2020 10:08:47 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /mnt/models from wines-classifier-provision-location (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Containers:
  wines-classifier:
    Container ID:   containerd://536753d25877994a17d1f1a63bbaf8717dc9180b80f061152688e4c8504c8468
    Image:          seldonio/mlflowserver_rest:0.5
    Image ID:       docker.io/seldonio/mlflowserver_rest@sha256:0fd54a0a314fafc82c490c91df0c4776be454702a307b4b76e12ed6958b4ee00
    Ports:          6000/TCP, 9000/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 25 Jun 2020 10:23:28 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 25 Jun 2020 10:19:09 +0200
      Finished:     Thu, 25 Jun 2020 10:20:41 +0200
    Ready:          False
    Restart Count:  7
    Liveness:       tcp-socket :http delay=60s timeout=1s period=5s #success=1 #failure=3
    Readiness:      tcp-socket :http delay=20s timeout=1s period=5s #success=1 #failure=3
    Environment:
      PREDICTIVE_UNIT_SERVICE_PORT:          9000
      PREDICTIVE_UNIT_ID:                    wines-classifier
      PREDICTIVE_UNIT_IMAGE:                 seldonio/mlflowserver_rest:0.5
      PREDICTOR_ID:                          wines-classifier
      PREDICTOR_LABELS:                      {"version":"wines-classifier"}
      SELDON_DEPLOYMENT_ID:                  model-a
      PREDICTIVE_UNIT_METRICS_SERVICE_PORT:  6000
      PREDICTIVE_UNIT_METRICS_ENDPOINT:      /prometheus
      PREDICTIVE_UNIT_PARAMETERS:            [{"name":"model_uri","value":"/mnt/models","type":"STRING"}]
    Mounts:
      /etc/podinfo from podinfo (rw)
      /mnt/models from wines-classifier-provision-location (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
  seldon-container-engine:
    Container ID:  containerd://938e8f7e3ac23355c8a7a475b71ab54b858aff5ca485f26b99feaba09bb60069
    Image:         docker.io/seldonio/seldon-core-executor:1.1.0
    Image ID:      docker.io/seldonio/seldon-core-executor@sha256:661173fcbc6cb4e9b56db353b19e97d04d9c086e9dc445217f84dc1721bdf894
    Ports:         8000/TCP, 8000/TCP, 5001/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      --sdep
      model-a
      --namespace
      default
      --predictor
      wines-classifier
      --http_port
      8000
      --grpc_port
      5001
      --transport
      rest
      --protocol
      seldon
      --prometheus_path
      /prometheus
    State:          Running
      Started:      Thu, 25 Jun 2020 10:08:51 +0200
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      100m
    Liveness:   http-get http://:8000/live delay=20s timeout=60s period=5s #success=1 #failure=3
    Readiness:  http-get http://:8000/ready delay=20s timeout=60s period=5s #success=1 #failure=3
    Environment:
      ENGINE_PREDICTOR:  <binary ommited>
      REQUEST_LOGGER_DEFAULT_ENDPOINT_PREFIX:  http://default-broker.
      SELDON_LOG_MESSAGES_EXTERNALLY:          false
    Mounts:
      /etc/podinfo from podinfo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  wines-classifier-provision-location:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-6vqwk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6vqwk
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                  From                          Message
  ----     ------     ----                 ----                          -------
  Normal   Scheduled  <unknown>            default-scheduler             Successfully assigned default/model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp to mlops-control-plane
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "gcr.io/kfserving/storage-initializer:0.2.2" already present on machine
  Normal   Created    15m                  kubelet, mlops-control-plane  Created container wines-classifier-model-initializer
  Normal   Started    15m                  kubelet, mlops-control-plane  Started container wines-classifier-model-initializer
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "seldonio/mlflowserver_rest:0.5" already present on machine
  Normal   Created    15m                  kubelet, mlops-control-plane  Created container wines-classifier
  Normal   Started    15m                  kubelet, mlops-control-plane  Started container wines-classifier
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "docker.io/seldonio/seldon-core-executor:1.1.0" already present on machine
  Normal   Created    14m                  kubelet, mlops-control-plane  Created container seldon-container-engine
  Normal   Started    14m                  kubelet, mlops-control-plane  Started container seldon-container-engine
  Warning  Unhealthy  14m (x8 over 14m)    kubelet, mlops-control-plane  Readiness probe failed: dial tcp 10.244.0.17:9000: connect: connection refused
  Warning  Unhealthy  28s (x171 over 14m)  kubelet, mlops-control-plane  Readiness probe failed: HTTP probe failed with statuscode: 503

How can I disable the restarts so that I can view the logs and see the actual error?

2 answers

The default liveness and readiness probes probably have timeouts that are too short to let the classifier container finish installing its dependencies. Before the container has come up, Kubernetes is already restarting it because it failed the liveness/readiness probes.

In my case, I had to add the following to the Seldon deployment declaration to increase the timeouts (you can, of course, tune the values):

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: ...
spec:
  name: ...
  predictors:
    - graph:
        ...
      name: ...
      replicas: ...
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                readinessProbe:
                  failureThreshold: 10
                  initialDelaySeconds: 120
                  periodSeconds: 30
                  successThreshold: 1
                  tcpSocket:
                    port: 9000
                  timeoutSeconds: 3
                livenessProbe:
                  failureThreshold: 10
                  initialDelaySeconds: 120
                  periodSeconds: 30
                  successThreshold: 1
                  tcpSocket:
                    port: 9000
                  timeoutSeconds: 3
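
To get a feel for the numbers, here is a rough sketch of how much time each probe configuration gives the container before it is restarted. It assumes the simplified estimate initialDelaySeconds + failureThreshold * periodSeconds; the helper name probe_budget is made up for this illustration, and real timing also depends on kubelet scheduling and the termination grace period. The default values are the ones shown in the describe output in the question (delay=60s, period=5s, #failure=3).

```shell
#!/bin/sh
# Rough upper bound (an approximation, not an exact kubelet formula) on how
# long a container can run while failing its liveness probe before it is
# restarted: initialDelaySeconds + failureThreshold * periodSeconds.
probe_budget() {
  # $1 = initialDelaySeconds, $2 = periodSeconds, $3 = failureThreshold
  echo $(( $1 + $2 * $3 ))
}

# Defaults reported by "kubectl describe pod" in the question.
default_budget=$(probe_budget 60 5 3)

# Values from the componentSpecs override above.
tuned_budget=$(probe_budget 120 30 10)

echo "default liveness budget: ${default_budget}s"
echo "tuned liveness budget:   ${tuned_budget}s"
```

Under this estimate the defaults allow a bit over a minute, which is the same order of magnitude as the roughly 92-second run between Started and Finished in the Last State section of the describe output, while the overridden values allow several minutes for Conda to finish creating the environment.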

Use the -p flag, as in the example command below, to check the logs of the previously terminated ruby (example) container in the web-1 (example) pod:

kubectl logs -p -c ruby web-1

Check the events with the kubectl get events command.

Use kubectl describe pod podname to check what might have caused the crashloop.
