「Prometheus」- High-Availability Monitoring Cluster, Cluster Monitoring

Problem Description

This note records how to deploy Prometheus Monitoring and how to resolve related problems.

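Solution

Additional Notes

The Prometheus Monitoring discussed here is centered on Kubernetes environments.
For non-containerized environments, traditional monitoring solutions have been proven in practice, so we believe they remain the better choice there.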

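Single-Cluster Monitoring

Single-cluster monitoring means that one Prometheus Monitoring instance monitors only one Kubernetes cluster, so multiple Kubernetes clusters require multiple Prometheus Monitoring instances.

In practice there are usually several clusters to monitor, and more will be added over time, so we skip the single-cluster discussion and focus on monitoring multiple Kubernetes clusters.

Multi-Cluster Monitoring (Research and Study)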

Multiple Kubernetes cluster monitoring with Prometheus | Sysrant
Monitoring a Multi-Cluster Environment Using Prometheus Federation and Grafana

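Option 1: kubernetes_sd_config/api_server

Use the api_server option of Prom's kubernetes_sd_config to connect directly to the other clusters.
1) Pros: simple and easy to deploy, with little configuration needed in the other clusters;
2) Cons: cross-cluster authentication must be configured separately; some metrics require components to be deployed in the monitored cluster, which this option cannot provide;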
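
A minimal sketch of what Option 1 might look like in prometheus.yml. The job name, API server address, and file paths are hypothetical placeholders, and the token and CA certificate would have to be issued by the remote cluster. The discovered targets must also be reachable over the network from the central Prom.

```yaml
scrape_configs:
  - job_name: remote-cluster-nodes            # hypothetical job name
    scheme: https
    kubernetes_sd_configs:
      - role: node
        # Address of the remote cluster's API server (placeholder).
        api_server: https://remote-cluster.example.com:6443
        # Credentials used for service discovery against the remote API server.
        tls_config:
          ca_file: /etc/prometheus/remote-cluster/ca.crt
        authorization:
          credentials_file: /etc/prometheus/remote-cluster/token
    # Discovery and scraping authenticate separately, so the
    # credentials are typically repeated for the scrape requests.
    tls_config:
      ca_file: /etc/prometheus/remote-cluster/ca.crt
    authorization:
      credentials_file: /etc/prometheus/remote-cluster/token
```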

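Option 2: Prometheus Federation

Use Prom federation: deploy a Prom service in each of the other clusters, then have a central Prom scrape them.
1) Pros: easy to deploy;
2) Cons: the data volume is large, especially for the central Prom, which has more metrics to scrape and therefore needs more time to do so (latency);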
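
As an illustration, the central Prom could scrape the /federate endpoint of each per-cluster Prom roughly as follows. The target hostnames and the match[] selector are assumptions made for this sketch; a narrower selector is one way to limit the data volume mentioned above.

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true          # keep the labels set by the source Prom
    metrics_path: /federate
    params:
      'match[]':
        # Only series matching these selectors are pulled (example selector).
        - '{job="kubernetes-nodes"}'
    static_configs:
      - targets:
          # Hypothetical addresses of the per-cluster Prom instances.
          - prometheus.cluster-a.example.com:9090
          - prometheus.cluster-b.example.com:9090
```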

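Option 3: Expose the /metrics Interface

Expose the /metrics endpoints of the monitored cluster so that Prom can scrape them.
1) Pros: very easy to deploy;
2) Cons: only workable at small scale; no service discovery; cannot be automated;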
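
A sketch of this option with a static target list, assuming a hypothetical application endpoint exposed outside the monitored cluster. Every target has to be listed by hand, which is why this approach does not scale.

```yaml
scrape_configs:
  - job_name: cluster-a-app                   # hypothetical job name
    metrics_path: /metrics
    static_configs:
      - targets:
          - app.cluster-a.example.com:8080    # hypothetical exposed endpoint
        labels:
          cluster: cluster-a                  # added manually to identify the source
```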

Our initial idea was:
1) deploy Prometheus Monitoring in each cluster;
2) use Grafana data sources, selecting a different data source in the Dashboard to display each cluster;
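
This idea could be sketched as a Grafana data source provisioning file with one entry per cluster. The data source names and URLs are hypothetical.

```yaml
apiVersion: 1
datasources:
  - name: prometheus-cluster-a      # hypothetical name shown in the Dashboard picker
    type: prometheus
    access: proxy
    url: http://prometheus.cluster-a.example.com:9090
  - name: prometheus-cluster-b
    type: prometheus
    access: proxy
    url: http://prometheus.cluster-b.example.com:9090
```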

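Option 4: Prometheus + Thanos Sidecar

Deploy Prometheus with a Thanos Sidecar in each cluster, and use Thanos Query to query each of them.

There is also an improved variant that is essentially the same: it introduces an additional Thanos Query component to make the query layer highly available.

Pros:
1) deployment is simple and places few requirements on the Observer Cluster;
2) external clusters can be queried directly; in essence this extends the query path;
3) data stays in the external clusters, so no centralized storage is needed;

Cons:
1) additional components are required, which brings additional cost;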
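
A rough sketch of the pieces involved, written as container argument fragments. The hostnames, paths, and label values are assumptions. Thanos uses external labels to tell Prometheus instances apart, and older Thanos releases use --store where this sketch uses --endpoint.

```yaml
# prometheus.yml in each monitored cluster: a unique external label per cluster.
global:
  external_labels:
    cluster: cluster-a                        # hypothetical label value
---
# Sidecar container running next to Prometheus in the monitored cluster.
- name: thanos-sidecar
  args:
    - sidecar
    - --tsdb.path=/prometheus                 # assumed Prometheus data directory
    - --prometheus.url=http://localhost:9090
    - --grpc-address=0.0.0.0:10901
---
# Query container in the observing cluster, fanning out to every Sidecar.
- name: thanos-query
  args:
    - query
    - --http-address=0.0.0.0:10902
    - --endpoint=sidecar.cluster-a.example.com:10901   # hypothetical addresses
    - --endpoint=sidecar.cluster-b.example.com:10901
```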

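Option 5: Prometheus with Remote Write

The Prometheus in each monitored cluster actively pushes metrics to the central cluster.

Prometheus pushes metrics through the Remote Write API;
a load balancer forwards each request to any Thanos Receiver instance;
the Hashring determines which Thanos Receiver instance stores the metrics;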
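
On the Prometheus side, the push could be configured roughly as below. The receiver hostname and the label value are hypothetical; port 19291 and the /api/v1/receive path are the Thanos Receive defaults for remote write.

```yaml
global:
  external_labels:
    cluster: cluster-a      # hypothetical label identifying the pushing cluster
remote_write:
  # Hypothetical load balancer address in front of the Thanos Receiver instances.
  - url: http://thanos-receive.central.example.com:19291/api/v1/receive
```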

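Pros:
1) suits scenarios where metrics cannot be pulled, since Prometheus is responsible for pushing them;
2) the target cluster needs little configuration; it only has to push metrics;

Cons:
1) requires more technical research and related learning;
2) the up metric is dropped when data is pushed;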

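Option 6: Thanos with long term storage

This builds on a Prometheus characteristic: it writes its data out as a block every two hours, and those blocks are shipped to long-term storage; data from the most recent two hours stays on local disk.
Query then reads from both Prometheus and the Long Term Storage.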
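
The long-term storage is described to Thanos through an object storage configuration file, passed with --objstore.config-file. A sketch for an S3-compatible backend, where the bucket name, endpoint, and credentials are placeholders:

```yaml
type: S3
config:
  bucket: thanos-metrics          # hypothetical bucket name
  endpoint: s3.example.com        # hypothetical S3-compatible endpoint
  access_key: <ACCESS_KEY>        # placeholder, supply your own
  secret_key: <SECRET_KEY>        # placeholder, supply your own
```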

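Summary of Options

In short, there is no single fixed solution; the right choice depends on the use case.

Finally, the author gives a diagram showing one possible combination.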

Prometheus Stack

Prometheus Operator

Prometheus Operator creates/configures/manages Prometheus clusters atop Kubernetes
Prometheus Operator – Running Prometheus on Kubernetes

The Prometheus Operator uses Kubernetes custom resources to simplify the deployment and configuration of Prometheus, Alertmanager, and related monitoring components.

# 07/18/2022 Star 7.2k

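# 09/22/2022 The Helm Chart of this repository has been moved to the prometheus-community/kube-prometheus-stack repository; https://github.com/prometheus-operator/kube-prometheus#installing

We tried to deploy with Prometheus Operator, but its Helm Chart has moved under prometheus-community/ and pulls in too many other components.

# 03/23/2023 We wanted to manage our Prometheus deployment through Prometheus Operator because Rook Ceph provides a monitoring solution based on Prometheus Operator (PrometheusRule); to use the monitoring that ships with Rook Ceph, we would need to deploy the Prometheus Operator service. In the end, since we have not run into a problem that only the Operator can solve, we are setting the Prometheus Operator approach aside for now, for the following reasons:
1) we ultimately need to understand how the components work in order to troubleshoot them;
2) the Prometheus Operator Helm Chart bundles a lot of content yet does not cover every scenario, which makes customization tedious;
3) some of the Prometheus Operator CRDs are only an abstraction over the configuration files, and some even share the same format;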
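
For reference, a minimal PrometheusRule resource looks roughly like this. The names, namespace, label, and alert are made up for illustration; the contents of spec.groups follow the same structure as a plain Prometheus rule file.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules             # hypothetical name
  namespace: monitoring           # hypothetical namespace
  labels:
    role: alert-rules             # assumed to match the ruleSelector of the Prometheus resource
spec:
  groups:
    - name: example
      rules:
        - alert: TargetDown
          expr: up == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Target {{ $labels.instance }} is down"
```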

kube-prometheus

prometheus-operator/kube-prometheus: Use Prometheus to monitor Kubernetes and applications running on Kubernetes

kube-prometheus provides example configurations for a complete cluster monitoring stack based on Prometheus and the Prometheus Operator. This includes deployment of multiple Prometheus and Alertmanager instances, metrics exporters such as the node_exporter for gathering node metrics, scrape target configuration linking Prometheus to various metrics endpoints, and example alerting rules for notification of potential issues in the cluster.

# 07/18/2022 Star 4.3k

We are currently taking a wait-and-see approach to Jsonnet and will not adopt it yet, so we do not deploy with kube-prometheus.

Helm Chart (prometheus-community/prometheus)

prometheus-community/helm-charts: Prometheus community Helm charts

The prometheus-community/kube-prometheus-stack helm chart provides a similar feature set to kube-prometheus. This chart is maintained by the Prometheus community. For more information, please see the chart’s readme

# 07/18/2022 Star 2.8k

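Our criteria for choosing a project are (1) its current star count and (2) its star growth trend, though these are really just excuses.

In the end, we chose to deploy the Prometheus service through the Helm Chart, and we also prefer to manage configuration through the Helm Chart.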
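
A sketch of a values.yaml override for the prometheus-community/prometheus chart. The values shown are examples only, and key names can differ between chart versions, so check them against the output of `helm show values prometheus-community/prometheus` before relying on them.

```yaml
# values.yaml (illustrative overrides, not a complete configuration)
server:
  retention: 15d                  # how long the server keeps local data
  persistentVolume:
    enabled: true
    size: 50Gi                    # example size
  global:
    scrape_interval: 1m
alertmanager:
  enabled: true
```

Such a file would typically be applied with `helm repo add prometheus-community https://prometheus-community.github.io/helm-charts` followed by `helm install prometheus prometheus-community/prometheus -f values.yaml`, where the release name `prometheus` is arbitrary.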