「Kubernetes」- Controlling how Pods are assigned to nodes (scheduling)

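Problem Description

This note records how to schedule Pods in Kubernetes and how to solve common scheduling problems.

Solution

Pod scheduling (eviction, affinity, node selection, and so on) can be controlled with the following techniques: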

nodeSelector

Set the nodeSelector field in the Pod spec; the Pod is then placed only on nodes that carry the specified labels:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    gpu: "true"                                                                 # the scheduler only considers nodes labeled gpu=true

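We can also schedule a Pod onto one specific node:
1) every node has a unique label whose key is kubernetes.io/hostname and whose value is the node's actual hostname, so a Pod can be pinned to a single, known node;
2) however, if that node goes offline, the uniqueness of the label leaves the Pod unschedulable;
3) so we should avoid unique labels and instead use label selectors that match a logical group of nodes meeting certain criteria;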
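
To make the caveat above concrete, here is a minimal sketch of a Pod pinned to a single node through the built-in hostname label. The Pod name nginx-pinned and the hostname node-1 are hypothetical placeholders, not values from this note.

```yaml
# Sketch only: pins the Pod to one node via the built-in hostname label.
# "nginx-pinned" and "node-1" are hypothetical; substitute real values.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pinned
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    kubernetes.io/hostname: node-1   # if node-1 is offline, the Pod stays Pending
```

Because exactly one node can match, this form trades availability for determinism, which is why a shared label such as gpu=true is usually the safer choice.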

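nodeSelector is simple and cannot express complex requirements, so nodeAffinity was introduced. Older versions of the Kubernetes documentation suggested that nodeSelector might eventually be deprecated, although it is still supported today.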

nodeAffinity (Pod with Node)

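It is similar to nodeSelector: each Pod can define its own node affinity rules, which control how the Pod is scheduled across nodes.

These rules let you specify either hard requirements or preferences:
1) hard requirement: the node labels must match before the Pod can be scheduled onto the node;
2) preference: tells Kubernetes that a particular Pod prefers certain nodes, and Kubernetes tries to place the Pod on them. If that is not possible, the Pod is scheduled onto some other node;

Example of a hard requirement, where a node must have gpu=true to receive the Pod: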

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  affinity:
    nodeAffinity:
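      requiredDuringSchedulingIgnoredDuringExecution:       # only affects scheduling at creation time; if the node label is removed while the Pod runs, the Pod is not rescheduled
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - "true"                                        # verbose, but more powerful and expressive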
...

Example of a preference, where certain nodes are preferred:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80                                                              # prefer nodes in zone-cn
        preference:
          matchExpressions:
          - key: availability-zone
            operator: In
            values:
            - zone-cn
      - weight: 20                                                              # also prefer dedicated nodes, with a lower weight
        preference:
          matchExpressions:
          - key: share-type
            operator: In
            values:
            - dedicated
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0

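# Resulting priority order:
# highest: nodes that have both zone-cn and dedicated;
# next: nodes that have only zone-cn;
# then: nodes that have only dedicated;
# last: any node that has neither label;

# Q: In practice, even when matching nodes exist, not every Pod ends up on a node with the preferred labels. Does this contradict the node affinity priority?
# A: No. Other scheduling functions also influence placement, for example SelectorSpreadPriority (which spreads Pods so that a single node failure does not take down every replica);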
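
Hard requirements and preferences can also be combined in one Pod. The following is a minimal sketch that reuses the gpu and availability-zone labels from the examples above; the Pod name with-combined-affinity and the container name app are hypothetical.

```yaml
# Sketch only: the node MUST have gpu=true, and zone-cn nodes are preferred among those.
apiVersion: v1
kind: Pod
metadata:
  name: with-combined-affinity     # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:    # filter: only gpu=true nodes are candidates
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - "true"
      preferredDuringSchedulingIgnoredDuringExecution:   # score: rank the remaining candidates
      - weight: 80
        preference:
          matchExpressions:
          - key: availability-zone
            operator: In
            values:
            - zone-cn
  containers:
  - name: app                      # hypothetical container
    image: nginx
```

The required term acts as a filter and the preferred term only adds to a node's score, so a gpu=true node outside zone-cn can still be chosen when no zone-cn candidate fits.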

podAffinity (Pod with Pod)

Assigning Pods to Nodes | Kubernetes

Controls the placement relationship between Pods:
1) podAffinity places Pods close to each other;
2) podAntiAffinity keeps Pods away from each other;

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
...
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
...
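
The example above only covers podAffinity. For the opposite case, here is a sketch of podAntiAffinity that keeps replicas of the same application on different nodes. The Deployment name store, the replica count, and the redis image are illustrative assumptions.

```yaml
# Sketch only: no two Pods labeled app=store may share a node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: store                      # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: store
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"   # "different node" is judged by hostname
      containers:
      - name: store
        image: redis               # placeholder image
```

Since the rule is a hard requirement, a cluster with fewer schedulable nodes than replicas would leave the extra replicas Pending; the preferredDuringSchedulingIgnoredDuringExecution form relaxes that.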

Taint and Toleration

evict-pod (a kubectl plugin)

rajatjindal/kubectl-evict-pod: This plugin evicts the given pod and is useful for testing pod disruption budget rules
How to evict specific pods on the Kubernetes cluster – DEV Community

With this plugin we can evict a specific Pod, which is useful in certain special situations.

Scenario: Namespace, selecting nodes (node-selector)

kubernetes – How to assign a namespace to certain nodes? – Stack Overflow
Taint toleration configured on namespace level · Issue #77687 · kubernetes/kubernetes

We want every Pod in the ns-foo namespace to be scheduled onto the node whose hostname is hn-foo.

This requires the Admission Controller plugin named PodNodeSelector.

Test environment: Kubernetes v1.18.18

Step 1: Enable the PodNodeSelector plugin

# vim /etc/kubernetes/manifests/kube-apiserver.yaml
...
    - --enable-admission-plugins=NodeRestriction,PodNodeSelector
...

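# Notes:
# 1) we only add the PodNodeSelector option; NodeRestriction was already present;
# 2) kube-apiserver restarts automatically (the kubelet recreates the static Pod when the manifest changes)

Step 2: Add an annotation to the namespace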

apiVersion: v1
kind: Namespace
metadata:
 name: ns-foo
 annotations:
   scheduler.alpha.kubernetes.io/node-selector: kubernetes.io/hostname=hn-foo
spec: {}
status: {}

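Additional notes:
1) kubernetes.io/hostname=hn-foo is a node label; run kubectl get nodes --show-labels to list node labels;
2) once the annotation is in place, Pods newly created in the namespace automatically get nodeSelector: kubernetes.io/hostname=hn-foo;
3) Pods with that nodeSelector are then assigned to the node whose hostname is hn-foo;

Step 3: Delete the Pods to verify that the setting takes effect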
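
As an illustration of note 2, this is roughly what a Pod created in ns-foo is expected to look like after admission. The Pod name demo and the nginx image are hypothetical; only the namespace and the nodeSelector come from this scenario.

```yaml
# Sketch only: the Pod was submitted WITHOUT a nodeSelector;
# the PodNodeSelector plugin adds it from the namespace annotation.
apiVersion: v1
kind: Pod
metadata:
  name: demo                       # hypothetical name
  namespace: ns-foo
spec:
  containers:
  - name: demo
    image: nginx
  nodeSelector:
    kubernetes.io/hostname: hn-foo   # injected at admission time
```

As far as I understand the plugin, a Pod that already declares a conflicting nodeSelector for the same key is rejected rather than overridden, so workloads in the namespace should not set their own value for kubernetes.io/hostname.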

# kubectl delete --all pods --namespace=ns-foo
...

# kubectl describe pod -n ns-foo "<pod name>"
...
Node:         hn-foo/172.16.159.15
...
Node-Selectors:  kubernetes.io/hostname=hn-foo
...

Scenario: Namespace, dedicated nodes (taint/toleration)

Default Toleration at Namespace Level | by Zhimin Wen | Medium
Using Admission Controllers/PodTolerationRestriction
Taints and Tolerations | Kubernetes

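The scheduler.alpha.kubernetes.io/node-selector annotation pins Pods to nodes that carry a specific label, but Pods without that nodeSelector can still be scheduled onto the same nodes. We want these nodes to serve one specific Namespace only, so that Pods from other Namespaces are not scheduled onto them.

Solution: use Taint + Toleration together with the PodTolerationRestriction admission plugin and its namespace annotation.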

Step 1: PodTolerationRestriction

First, modify the API Server's Admission Controllers and enable the plugin with --enable-admission-plugins=PodTolerationRestriction.
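
A sketch of the corresponding change, assuming a kubeadm-style static Pod manifest at /etc/kubernetes/manifests/kube-apiserver.yaml and assuming NodeRestriction and PodNodeSelector are already enabled as in the previous scenario. Keep whatever plugins your cluster already lists and append the new one.

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (fragment, other flags omitted)
spec:
  containers:
  - command:
    - kube-apiserver
    # Append PodTolerationRestriction to the existing comma-separated list.
    - --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction
```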

Step 2: Taint

Next, add a Taint to the node:

kubectl taint nodes <node-name> dedicated=myApp:NoSchedule
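
For reference, a Pod can only be scheduled onto a node tainted this way if it carries a matching toleration. The sketch below writes that toleration by hand; the Pod name my-app and the nginx image are hypothetical.

```yaml
# Sketch only: tolerates the taint dedicated=myApp:NoSchedule from the command above.
apiVersion: v1
kind: Pod
metadata:
  name: my-app                     # hypothetical name
spec:
  containers:
  - name: my-app
    image: nginx
  tolerations:
  - key: "dedicated"               # must equal the taint key
    operator: "Equal"
    value: "myApp"                 # must equal the taint value when operator is Equal
    effect: "NoSchedule"
```

A toleration only permits scheduling onto the tainted node; it does not force it. Pinning the Pod there still needs a nodeSelector or node affinity.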

Step 3: defaultTolerations

Finally, add the defaultTolerations annotation to the namespace:

# The toleration key must match the taint key (dedicated) used in the kubectl taint command above.
# kubectl annotate namespace my-namespace \
   'scheduler.alpha.kubernetes.io/defaultTolerations'='[{"operator": "Exists", "effect": "NoSchedule", "key": "dedicated"}]'

# kubectl edit namespace my-namespace
kind: Namespace
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "dedicated"}]'
...

About DaemonSet behavior

Running background tasks on nodes automatically with daemonsets
Kubernetes/DaemonSet

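Because of how DaemonSets work, in this scenario the Toleration has to be added to the DaemonSet manually: DaemonSet Pods that live in another namespace do not receive this namespace's default tolerations, so without it they are kept off the tainted nodes.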
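
A minimal sketch of such a DaemonSet, assuming the taint key dedicated from the kubectl taint command above. The name node-agent, the kube-system namespace, and the nginx image are hypothetical placeholders.

```yaml
# Sketch only: lets the DaemonSet's Pods run on nodes tainted with key "dedicated".
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent                 # hypothetical name
  namespace: kube-system           # assumed namespace, without defaultTolerations
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Exists"         # tolerate the taint regardless of its value
        effect: "NoSchedule"
      containers:
      - name: node-agent
        image: nginx               # placeholder image
```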

References

Assign Pods to Nodes – Kubernetes v1.17
Assign Pods to Nodes – Kubernetes v1.16