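「Pipeline Model」

# Phase 1
Debug info (), code linting (), environment setup (), application creation (), application release (), test execution ()
Known issues: the main problems in this phase are a disorganized Pipeline script structure, inconsistent naming, and hard-coded values.
# Phase 2
Debug info (), code linting (), environment setup (), application creation (), application release (), test execution ()
Task: refactor the Pipeline script.
Related links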
https://www.56dagong.com/info-6945.html[……]

2023-04-12 | k4nz

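「Jenkins Pipeline」- Disabling the automatic build after the first scan

Problem description
After a multibranch pipeline is created, Jenkins scans the repository automatically and triggers a first build as soon as the scan finishes. We do not want that build, so we need a way to disable the automatic build that follows the scan.
Solution
See the question "how to get $CAUSE in workflow": getBuildCauses returns the build causes, but it cannot identify a build that was triggered by a scan. (It is also possible that we simply did not find the right approach.)
Method 1: check the BUILD_NUMBER variable
At the start of the build, check BUILD_NUMBER; if BUILD_NUMBER == "1" holds, abort the build.
Note: BUILD_NUMBER is a string, so BUILD_NUMBER == 1 evaluates to false.
References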
Jenkins multibranch pipeline Scan without execution Jenkins/Pipeline Examples[……]

| k4nz

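「Jenkins Pipeline」- Stashing files for use in later builds

Problem description
In a Jenkins Pipeline, a build produces new files that later builds need to use.
However, the Jenkins build directory is neither always the same directory nor stable:
1) When a job is renamed, the build directory changes as well. Jenkins creates a new build directory named after the job instead of renaming the old one, so the old artifacts can no longer be read.
2) For the same job, Jenkins sometimes creates build directories such as JOB_NAME@2 or JOB_NAME@3 (we did not look into the internals), so a new build cannot read earlier artifacts from its current directory.
3) When artifacts are shared between projects, reading host directories "directly" is not portable: the build layout is a Jenkins implementation detail and may change in the future.
4) If the stages of a pipeline run on different nodes, files must also be shared between nodes, and operating on host directories directly is clearly not an option.
We should therefore use the artifact management facilities that Jenkins Pipeline provides to pass artifact files within one build or between builds.
Solution
Method 1: the stash/unstash steps
stash: Stash some files to be used later in the build
unstash: Restore files previously stashed
stash saves files; unstash restores files saved by stash. stash/unstash has the following limitations:
1) It only works within a single build, and stashes are discarded when the build ends. It is therefore suited to passing files between nodes or workspaces of the same build.
2) The preserveStashes() option keeps stashes readable after a restart, but this is still limited to a single build run.
3) It is suited only to small amounts of data (5-100 MB), because stashed files are compressed and compression consumes resources on the Master.
stash/unstash does not solve our problem (we need to pass artifacts between multiple builds of the same project), so we do not explore it further.
Method 2: the archiveArtifacts/copyArtifacts steps
archiveArtifacts - Jenkins Core
Copy Artifact | Jenkins plugin
archiveArtifacts archives artifacts and is a step that ships with Jenkins by default. copyArtifacts copies artifacts from other builds and is provided by the Copy Artifact plugin.
This approach covers most use cases; see groovy - How can I use the Jenkins Copy Artifacts Plugin from within the pipelines (j[……]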

| k4nz

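「Jenkins Pipeline」- Storing variables for use in the next build

Problem description
We want to store state (variables) in the current build so that the next build can use it.
This note records how to persist a variable in a Jenkins Pipeline and read it back in the next build.
Solution
In the current build, store the variable directly in env (the environment variables):

this.env["key"] = "value"

When the build finishes, Jenkins persists it automatically.
In the next build, read the value back from the previous build's environment variables:

// previousBuild is null for the first build, hence the safe navigation
def previousVars = this.currentBuild.previousBuild?.getBuildVariables()
println previousVars?.get("key")

References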
Accessing information from previous Jenkins pipeline run – Stack Overflow[……]

| k4nz

「Jenkins Pipeline」- java.io.NotSerializableException: java.util.regex.Matcher

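We hit this error again on 09/11/2020
Problem description
We ran into this problem again and found that the exception is not caused by the Matcher failing to match. The following code reproduces the error: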

pipeline {
    agent any
    stages {
        stage('xterm testing') {
            steps {
                script {
                    def pageContent = """
foo...xxx random string
mdate: Fri 11 Sep 2020 04:53:46 PM CST
bar...xxx random string
"""
                    def matcher = pageContent =~ /mdate: (?<date>.+)/
                    if (matcher.find()) {
                        ansiColor('xterm') {
                            echo "mdate = " + matcher.group('date')
                        }
                        echo "LLL"
                    }
                }
            }
        }
    }
}

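In this program the matcher does match, and echo prints mdate = Fri 11 Sep 2020 04:53:46 PM CST to the console, but echo "LLL" never runs. In other words, the problem occurs while ansiColor executes.
Cause
We do not really care about the root cause: finding it would mean digging into the ansiColor implementation, and the cost is too high. It is like a broken bowl: we do not wait for it to be mended before eating, we take another bowl.
Solution
A lucky guess: there is no exception that try...catch cannot handle (a comment is added to make the point):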

pipeline {
    agent any
    stages {
        stage('xterm testing') {
            steps {
                script {
                    def pageContent = """
foo...xxx random string
mdate: Fri 11 Sep 2020 04:53:4[……]

| k4nz

「Jenkins」- Note that 'frame-src' was not explicitly set, so 'default-src' is used as a fallback.

Problem description
In Jenkins, the test report page does not render (the page is blank). The browser console shows the following errors:

...
Refused to frame 'https://jenkins.example.com/' because it violates the following
Content Security Policy directive: "default-src 'none'". Note that 'frame-src' was
not explicitly set, so 'default-src' is used as a fallback.

Blocked script execution in '<URL>' because the document's frame is sandboxed and
the 'allow-scripts' permission is not set.
...

Software version: Jenkins 2.274 in Docker
Solution
Option 1: relax the policy (not recommended)
In the Script Console, run System.setProperty("hudson.model.DirectoryBrowserSupport.CSP", "") to work around the problem.
However, this relaxes the policy, which the official documentation discourages. It also had no effect for us, possibly because it has been deprecated in newer Jenkins versions.
Option 2: Resource Root URL (recommended)
The officially recommended approach:
1) bind an additional domain name to the Jenkins service (and configure Nginx to point it at Jenkins);
2) enter that address under Manage Jenkins / Configure System / .. / Resource Root URL.
References
Configuring Content Security Policy
javascript - Jenkins error - Blocked script execution in <URL>. because the document's frame is sandboxed and the 'allow-scripts' permission is not set - Stack Overflow
html - because the document's frame is sandboxed and the 'allow-scripts' permission is not set - Stack Overflow
javascript - Blocked script execution in bec[……]

| k4nz

「Jenkins Pipeline」- Excessively nested closures

Overview
When running a Jenkins Pipeline, the following error occurs:

java.lang.StackOverflowError: Excessively nested closures/functions at WorkflowScript.getProjectPath(WorkflowScript:16) - look for unbounded recursion - call depth: 1025
    at com.cloudbees.groovy.cps.impl.CpsFunction.invoke(CpsFunction.java:28)
    at com.cloudbees.groovy.cps.impl.CpsCallableInvocation.invoke(CpsCallableInvocation.java:40)
    at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:62)
    at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
    at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixName(FunctionCallBlock.java:77)
    at sun.reflect.GeneratedMethodAccessor345.invoke(Unknown Source)
    ...

(This error message was copied from Stack Overflow. It is the same scenario I ran into, but I compiled these notes afterwards, so my own original output is no longer available.)
This note describes the cause of the java.lang.StackOverflowError: Excessively nested closures/functions error and how to handle it.
Cause
It all comes down to unfamiliarity with the Groovy language. The offending code is:

package com.k4nz.tools

// I admit my code is not very Groovy-like
class Zim {

    private String outputFolder = "/tmp/build"

    public void getOutputFoler() {
        return this.outputFolder
    }
}

The problem lies in the getOutputFoler() method above. In Groovy[……]

| k4nz

「Jenkins」- No valid crumb was included in request for /ajaxExecutors

Problem description
In Jenkins 2.275, the following log messages appear:

# tail -f /var/log/jenkins/jenkins.log
…
2021-03-01 11:13:06.565+0000 [id=15] WARNING hudson.security.csrf.CrumbFilter#doFilter: No valid crumb was included in request for /ajaxBuildQueue by k4nz. Returning 403.
2021-03-01 11:13:10.571+0000 [id=11] WARNING hudson.security.csrf.CrumbFilter#doFilter: Found invalid crumb a433bb1e4447c0afe7ca04cda88bfd3c0ebddf300f0b52399cf2c47559152b1c. If you are calling this URL with a script, please use the API Token instead. More information: https://jenkins.io/redirect/crumb-cannot-be-used-for-script
…

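Cause
In newer versions, a CSRF token can only be used within the session that created it. That is, if Session A creates a CSRF token, only Session A can use it. This limits the impact if an attacker obtains a token: without the session information, the attacker still cannot use the token to make requests.
Some older scripts fetch a CSRF token from /crumbIssuer/api and then call other endpoints. Because such scripts carry no session information, actions protected by CSRF now fail.
The exception is when the scripts carry the session information or use an API token for their requests.
Solution
Under Manage Jenkins / Configure Global Security, disable the Prevent Cross Site Request Forgery exploits option. However, our Jenkins version is fairly new and we could not find this option.
Alternatively, as the document Upgrading to Jenkins LTS 2.176.x suggests, set the hudson.security.csrf.DefaultCrumbIssuer.EXCLUDE_SESSION_ID property to true. This is a system property, and it likewise turns off the binding between the CSRF token and the session. When starting Jenki[……]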

| k4nz

「Jenkins Pipeline」- expected to call xxx but wound up catching xxx

Problem description
In a Jenkins Pipeline, the console shows a message similar to the following:

expected to call org.jfrog.hudson.pipeline.common.types.ArtifactoryServer.download but wound up catching artifactoryDownload;
see: https://jenkins.io/redirect/pipeline-cps-method-mismatches/

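Cause
As the page Pipeline CPS Method Mismatches / Use of Pipeline steps from @NonCPS says:

but non-CPS-transformed code may not call CPS-transformed code.

Solution
We have seen this warning in the following scenarios:
Scenario 1: calling a Jenkins Pipeline step from a method annotated with @NonCPS triggers the message # 09/21/2020
Fix: remove the @NonCPS annotation from the method; any closures in it then have to be rewritten with plain syntax.
Scenario 2: calling a Jenkins Pipeline step inside eachLine triggers the message # 01/25/2021
Pipeline CPS Method Mismatches
We used the Jenkins Pipeline step sh inside String.eachLine, which triggered the message (this is similar to Scenario 1).
Fix: split the string with split("\\r?\\n") and process the lines with a for loop.
References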
Pipeline CPS Method Mismatches[……]

| k4nz

「Jenkins Pipeline」- hudson.remoting.ProxyException

Problem description
When running a Jenkins Pipeline build, the following error occurs:

…
[Pipeline] Start of Pipeline
[Pipeline] End of Pipeline
hudson.remoting.ProxyException: com.cloudbees.groovy.cps.impl.CpsCallableInvocation
Finished: FAILURE

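Cause
The error is caused by calling other methods from a constructor. Constructors are not CPS-transformed, so a constructor cannot call CPS-transformed methods (methods without the @NonCPS annotation). If a method has to be called from a constructor, annotate that method with @NonCPS.
In addition, member variables of other classes cannot be read in a constructor either.
Solution
Do not call CPS-transformed methods from a constructor, or annotate the called methods with @NonCPS.
References
jenkins shared library error com.cloudbees.groovy.cps.impl.CpsCallableInvocation
Constructor, using other methods of the class, causes "hudson.remoting.ProxyException: com.cloudbees.groovy.cps.impl.CpsCallableInvocation"
[JENKINS-26313] Add javadoc explaining why constructors can't be transformed #83[……]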

| k4nz

「Jenkins」- Common configuration and common problems

[……]

| k4nz

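「Grafana」- Data visualization platform

Problem description
In Prometheus (from Metrics to Insight):3 Setting Up Environment, alerting data is displayed in Grafana. We configure Grafana panels to present metrics in various forms, which gives us an intuitive view of the server's metrics.
We will learn how to use Grafana and organize our study notes. This chapter only briefly records each feature and does not cover every section in detail (reading the official documentation regularly is far better than merely translating it). Sub-chapters record how to use Grafana for specific tasks.
This note records how to use Grafana: mainly what we studied, recorded, and organized from the official documentation, plus some common configuration examples.
Solution
Most of the content here comes from the official documentation; we extract the key points of each chapter to form an overall picture of Grafana.
What's new: records new versions and the changes they bring => Installing and Upgrading/Version selection
Introduction to Grafana: a basic introduction to Grafana => Concepts and Fundamentals/Feature overview
Setup
- Install Grafana => Installing and Upgrading
- Configure Grafana => Maintenance, Administration/Configuration file
- Restart Grafana => Maintenance, Administration/Service restart
- Sign in to Grafana => Problems Solving and How-to/Signing in to Grafana
- Upgrade Grafana => Installing and Upgrading/Service upgrade
- Configure security => Security, User, Permission
- Set up Grafana monitoring => Log, Monitoring, Alerting
- Set up Grafana for high availability => High-availability Cluster
- Set up image rendering => Problems Solving and How-to/Image rendering
- Set up Grafana Live => Problems Solving and How-to/Grafana Live
- Enable diagnostics to troubleshoot Grafana =>[……]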

| k4nz

「Grafana」- Handling common problems

Custom branding (⇒ Setup/Enable custom branding)
Enable custom branding | Grafana documentation
The Enterprise and Cloud editions support this feature, which allows the login page and other parts of the UI to be customized.
Image rendering (⇒ Setup/Set up image rendering)
Set up image rendering | Grafana documentation
Grafana can render a Panel as an image and include images in alerts; rendered images are cleaned up periodically.
An Image Renderer or a remote rendering service is required to handle image-related work.
Monitoring the image renderer
Monitor image rendering with Prometheus.
Troubleshooting
Troubleshooting for common image rendering problems.
Grafana Live (⇒ Setup/Set up Grafana Live)
Set up Grafana Live | Grafana documentation
This feature lets clients receive real-time messages: over WebSocket, messages are pushed to the Grafana frontend as events occur and shown to the user.
$__interval vs. $__rate_interval
New in Grafana 7.2: $__rate_interval for Prometheus rate queries that just work
What's $interval mean in Grafana? - Stack Overflow
https://grafana.com/docs/grafana/latest/variables/variable-types/add-interval-variable/
what is the default grafana setting for $__rate_interval - Stack Overflow
$__interval
$__interval is a built-in automatic variable in Grafana, and is automatically set based on the time range.
However, as the time range changes, $__interval can become smaller than the Prometheus scrape interval, in which case no data is displayed.
$__rate_interval
As for its value: in the data source, Scrape interval (default=15s) is recommended to be set in line with the Prometheus Scrape Int[……]

| k4nz

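「Grafana」- Dashboard

Ways to build a Dashboard
Method 1: create a dashboard
Build your first dashboard | Grafana documentation
The text box variable is small, and no option to configure its width has been added yet:
The text box is too small! (variables) · Issue #29672 · grafana/grafana · GitHub
Variables: Make TextBox variable width adjustable. by STEELBADGE · Pull Request #29794 · grafana/grafana · GitHub
Method 2: import a dashboard
helm-charts/charts/grafana at main · grafana/helm-charts
Export and import | Grafana documentation
Provision dashboards and data sources | Grafana Labs
There are two ways to import dashboards:
1) import the dashboard configuration directly through the Web GUI;
2) otherwise, Dashboards can only be loaded automatically from a directory: Grafana scans the directory and loads the Dashboard configurations found there.
In the Web UI:
1) Import: left sidebar => + => Import
2) Export: in the Dashboard, click the Save Dashboard button (floppy disk icon) in the upper right corner.
In a Helm Chart:
1) Unpack the Chart: its dashboards/ directory holds Dashboard definitions, which are pulled into the deployment automatically.
2) Download the Dashboard and save the .json file into the dashboards/ directory.
3) Edit values.yaml and reference the file through dashboards.<name>.file: dashboard/dashboard.json.
4) Uncomment dashboardProviders.
5) Finally, run helm install/upgrade.
X) The Chart also supports other ways of obtaining Dashboards (for example, downloading over HTTP), but underneath it is still directory discovery.
Templating Panels
Grafana documentation/Manage library panels
A Panel displays data (charts). To compare data, we need to show the same chart in multiple Dashboards, that is, reuse the same Panel configuration.
In Grafana, through Panel Library[……]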

| k4nz

「Grafana」- Common errors

Troubleshooting (⇒ Setup/Enable diagnostics to troubleshoot Grafana)
Enable diagnostics to troubleshoot Grafana | Grafana documentation
Use Grafana's tracing and profiling to analyze and troubleshoot it.
Panel plugin not found: stat
Grafana panel plugin "stat" not found · Issue #233 · unifi-poller/unifi-poller
Problem: in a Dashboard, some views show the message Panel plugin not found: stat.
Solution: Either upgrade to Grafana 6.6+ or change the stat panel to singlestat.
Grafana logs the user out automatically after a long period of inactivity
Grafana Frontend Session Timeout - Grafana - Grafana Labs Community Forums
Increase how long the login session is kept:

auth:
  login_maximum_inactive_lifetime_duration: 2M
  login_maximum_lifetime_duration: 2M
  token_rotation_interval_minutes: 1000[……]

| k4nz

「Prometheus」- Monitoring, alerting, time series database (study notes)

Problem description
This note records how to install, configure, and use Prometheus, and how to handle common problems.
Solution
The official documentation is where our study starts, and both learning and usage are centered on it.
These notes follow the structure of the official documentation:
INTRODUCTION
- Overview => Concepts and Fundamentals
- First steps => 4 Solutions to Scenarios
- Comparison to alternatives => Concepts and Fundamentals
- FAQ => Concepts and Fundamentals
- Roadmap => Concepts and Fundamentals
- Design Documents => Concepts and Fundamentals
- Media => Concepts and Fundamentals
- Glossary => Concepts and Fundamentals
- Long-Term Support => Concepts and Fundamentals
CONCEPTS => Basic Concepts
PROMETHEUS
- Getting started => 4 Solutions to Scenarios
- Installation => Installing and Upgrading
- Configuration => Configuration
- Querying => PromQL (Prometheus)
- Storage => Storage
- Federation => Federation
- HTTP SD => HTTP SD
- Management API => Management API
- Migration => Backup, Recover, Migrate
- API Stability => Concepts and Fundamentals
- Feature flags => Concepts and Fundamentals
VISUALIZATION => Visualization
INSTRUMENTING => Exporters and Integrations
OPERATING
- Security => Security, User, Permission
- Integrations => 4 Solutions[……]

| k4nz

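「Prometheus」- Basic concepts (=> CONCEPTS)

Data model
Prom stores data as time series: streams of timestamped values that belong to the same metric and the same set of labeled dimensions. Besides stored time series, Prometheus may generate temporary derived time series as the result of queries.
Metric-Name and Label
Every time series is identified by its Metric-Name and its Labels (optional key-value pairs).
The Metric-Name names the metric, for example http_requests_total. It must match [a-zA-Z_:][a-zA-Z0-9_:]*; colons are reserved for user-defined recording rules and should not be used otherwise.
Labels give Prom its multi-dimensional data model:
1) For the same Metric-Name, different Label combinations identify a particular dimensional instance of the metric.
2) The query language lets us filter and aggregate along these dimensions.
3) Changing any Label value creates a new time series.
4) Label names must match [a-zA-Z_][a-zA-Z0-9_]*; Label names beginning with __ are reserved for internal use.
5) A Label with an empty value is equivalent to the Label not being defined.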
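The naming rules above can be checked mechanically. The following is a minimal Python sketch that only encodes the two patterns quoted in this note; the function names are hypothetical helpers, not part of any Prometheus client library, and other Prometheus versions may accept a wider character set.

```python
import re

# Patterns copied from the naming rules in this note.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name: str) -> bool:
    """True if the whole string matches the metric name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name: str) -> bool:
    """True if the whole string matches the label name pattern."""
    return LABEL_NAME_RE.fullmatch(name) is not None


def is_reserved_label_name(name: str) -> bool:
    """Label names starting with a double underscore are for internal use."""
    return name.startswith("__")


def series_identity(metric: str, labels: dict) -> tuple:
    """Hypothetical helper: build a comparable identity for a time series.

    Labels with an empty value are dropped, because an empty value is
    equivalent to the label not being defined.
    """
    kept = tuple(sorted((k, v) for k, v in labels.items() if v != ""))
    return (metric, kept)
```

For example, series_identity("up", {"job": "node", "env": ""}) and series_identity("up", {"job": "node"}) compare equal, while changing the value of job yields a different identity, that is, a different time series.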
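Samples
Samples form the actual time series data. Each sample consists of: 1) a float64 value; 2) a millisecond-precision timestamp.
Notation
A time series is usually written with the following notation:

<metric name>{<label name>=<label value>, ...}

Metric types
Metric type here means data type. The concept is used only on the client side; the Prom server does not use data types, and all data is stored as untyped time series.
Counter
A monotonically increasing metric whose value can only go up or be reset to zero on restart. Do not use it for values that regularly go up and down.
Gauge
A value that can go up and down arbitrarily, for example temperature or memory usage.
Histogram
A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
A histogram usually exposes several series:
1) <basename>_bucket{le="<upper inclusive bound>"}
2) <basename>_sum
3) <basename>_count
Use histogram_quantile() to calculate quantiles from histograms or even from aggregations of histograms. Histograms are also suitable for calculating an Apdex score. When operating on buckets, remember that [……]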

| k4nz

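「Prometheus」- Federation (=> PROMETHEUS/Federation)

Federation is about cooperation: it allows one Prometheus to scrape data from other Prometheus servers.
Use cases
Federation has several use cases. The common ones are building a scalable monitoring system and pulling metrics from another Prom.
Hierarchical federation
The servers form a tree: the top-level Prom collects metrics from the subordinate Prom servers and aggregates them.
Cross-service federation
The Prometheus server of one service is configured to scrape selected data from the Prometheus server of another service, so that alerting and queries can run against both data sets within a single server.
For example, one Prom scrapes system-level metrics and another scrapes application-level metrics. The latter may need system metrics as well, in which case it can scrape them from the former.
Configuration example
On the source server:
1) /federate is the endpoint to scrape;
2) match[] is the filter parameter.
On the destination server:
1) scrape the remote /federate endpoint;
2) set honor_labels so that the original labels are not overwritten.
A simple example:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s

    honor_labels: true
    metrics_path: '/federate'

    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'

    static_configs:
      - targets:
        - 'source-prometheus-1:9090'
        - 'source-prometheus-2:9090'
        - 'source-prometheus-3:9090'[……]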
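To see what the destination server actually requests, it can help to build the /federate URL by hand and open it in a browser or with curl. The following is a minimal Python sketch; federate_url is a hypothetical helper and the host name is the placeholder from the example above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url: str, matchers: list) -> str:
    """Build the URL scraped on the source server.

    Each selector becomes one repeated match[] query parameter,
    mirroring the params section of the scrape configuration.
    """
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


url = federate_url(
    "http://source-prometheus-1:9090",
    ['{job="prometheus"}', '{__name__=~"job:.*"}'],
)
```

The selectors are percent-encoded in the resulting URL; parse_qs(urlsplit(url).query) gives them back in readable form.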

| k4nz

「Prometheus」- HTTP SD

With file-based discovery (scanning the file system or receiving change notifications), Prom automatically loads the targets listed in files and monitors them.
Besides file-based discovery, Prom also offers HTTP SD: in short, the targets to monitor are returned over HTTP.
Features
HTTP SD has the following features:
1) periodic refresh: the endpoint is queried automatically;
2) JSON format;
3) HTTP/HTTPS;
4) TLS / Basic Auth / Auth Header / OAuth2.
Endpoint requirements
We have to implement the HTTP SD endpoint ourselves and expose it for Prom to query:
1) Content-Type: application/json
2) UTF-8
3) HTTP 200 response
4) if there is no data, return []
5) prometheus_sd_http_failures_total counts the failures
6) ...
Data structure

[
  {
    "targets": ["10.0.10.2:9100", "10.0.10.3:9100", "10.0.10.4:9100", "10.0.10.5:9100"],
    "labels": {
      "__meta_datacenter": "london",
      "__meta_prometheus_job": "node"
    }
  },
  {
    "targets": ["10.0.40.2:9100", "10.0.40.3:9100"],
    "labels": {
      "__meta_datacenter": "london",
      "__meta_prometheus_job": "alertmanager"
    }
  },
  {
    "targets": ["10.0.40.2:9093", "10.0.40.3:9093"],
    "labels": {
      "__meta_datacenter": "newyork",
      "__meta_prometheus_job": "alertmanager"
    }
  }
]
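An endpoint that satisfies the requirements listed above can be sketched with the Python standard library alone. This is only an illustration under assumptions: TARGET_GROUPS is a hypothetical in-memory inventory, and the port and the serve helper are made up for the example; a real endpoint would read targets from an inventory system.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical inventory; a real service would query a CMDB or similar.
TARGET_GROUPS = [
    {
        "targets": ["10.0.10.2:9100", "10.0.10.3:9100"],
        "labels": {"__meta_datacenter": "london", "__meta_prometheus_job": "node"},
    },
]


def render_body(groups) -> bytes:
    """Serialize target groups as UTF-8 JSON; no data yields an empty list."""
    return json.dumps(groups or []).encode("utf-8")


class SDHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_body(TARGET_GROUPS)
        self.send_response(200)  # HTTP 200 response
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and serve the target list on the given (assumed) port."""
    HTTPServer(("0.0.0.0", port), SDHandler).serve_forever()
```

Calling serve() starts the endpoint; the URL it listens on is then what an http_sd_configs entry on the Prom side would point at.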

-」[……]

| k4nz

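「Prometheus」- Management API

GET /-/healthy: health check.
GET /-/ready: readiness check (the server is ready to serve).
PUT /-/reload, POST /-/reload: reload the configuration and rule files; enabled with --web.enable-lifecycle.
PUT /-/quit, POST /-/quit: shut the server down (sending SIGTERM has the same effect); enabled with --web.enable-lifecycle.[……]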
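The endpoints above are plain HTTP calls, so they are easy to script. The following is a minimal Python sketch using only the standard library; lifecycle_request and call are hypothetical helper names, and the base URL is whatever address your Prom server listens on.

```python
from urllib.request import Request, urlopen

# HTTP method used for each management endpoint listed above.
_METHODS = {"healthy": "GET", "ready": "GET", "reload": "POST", "quit": "POST"}


def lifecycle_request(base_url: str, action: str) -> Request:
    """Build (but do not send) the request for one management endpoint."""
    if action not in _METHODS:
        raise ValueError(f"unknown management action: {action}")
    return Request(f"{base_url.rstrip('/')}/-/{action}", method=_METHODS[action])


def call(base_url: str, action: str, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    with urlopen(lifecycle_request(base_url, action), timeout=timeout) as response:
        return response.status
```

For instance, call("http://localhost:9090", "reload") asks a local server to reload its configuration; it raises an HTTPError if the lifecycle endpoints are not enabled.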

| k4nz

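「Prometheus」- Querying data: PromQL (=> PROMETHEUS/Querying)

Examples
Querying examples | Prometheus
Basic queries

http_requests_total
http_requests_total{job="apiserver", handler="/api/comments"}

# Results from the last five minutes
# Note: this data cannot be graphed directly, but it can be viewed in the expression browser.
http_requests_total{job="apiserver", handler="/api/comments"}[5m]

# Matching with RE2 regular expressions
http_requests_total{job=~".*server"}
http_requests_total{status!~"4.."}

Nested queries
That is, subqueries, which apply functions to process a metric further:

rate(http_requests_total[5m])[30m:1m]

Functions, operators, ...
For example:

# Per-second HTTP request rate over the last five minutes, summed per job:
sum by (job) (
  rate(http_requests_total[5m])
)

# The three biggest CPU consumers, grouped by app and proc
topk(3, sum by (app, proc) (rate(instance_cpu_time_ns[5m])))

Concepts and terminology (Basics)
Querying basics | Prometheus
PromQL is used to query data and lets users select and filter it. Query results can be shown as graphs or returned to external systems through the HTTP API.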
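As an illustration of the HTTP API path, here is a minimal Python sketch that builds an instant query URL for the /api/v1/query endpoint and flattens a response of result type vector. The helper names and the sample payload used to exercise it are assumptions made for this example; the payload is not output captured from a real server.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query; the expression goes in the query parameter."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload: dict) -> dict:
    """Map each label set of an instant vector result to its float value."""
    if payload.get("status") != "success":
        raise ValueError("query failed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    values = {}
    for series in data["result"]:
        key = tuple(sorted(series["metric"].items()))
        _timestamp, raw_value = series["value"]  # the value arrives as a string
        values[key] = float(raw_value)
    return values


# Hand-written sample payload, for illustration only.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "apiserver"}, "value": [1700000000, "12.5"]},
        ],
    },
}
```

Fetching instant_query_url(...) with any HTTP client and passing the decoded JSON to parse_vector gives a dictionary keyed by label set.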
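Expression language data types
In PromQL, an expression evaluates to one of four types:
1) Instant vector: a set of time series, each containing a single sample, all sharing the same timestamp
2) Range vector: a set of time series, each containing a range of data points over time
3) Scalar: a floating point value
4) String: a string value, currently unused
Depending on the use case (for example, graphing versus displaying the output of an expression), only some of these types are legal as the result of a user-specified expression. For example, an expression that returns an instant vector is the only type that can be graphed directly.
Literals
String literals are enclosed in single quotes, double quotes, or backticks and may contain special characters. Float literals use the usual floating point notation.
Time-Series Selectors
Instant vector selectors:

http_requests_total # match directly by name
http_requests_total{job="prometheus",group="canary[……]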

| k4nz

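「Prometheus」- Storage (=> PROMETHEUS/Storage)

Data can be stored locally or in a remote system.
Local storage
Samples are grouped into two-hour blocks.

./data
├── 01BKGV7JBM69T2G1BGBGM6KB12    # two hours
│   └── meta.json
├── 01BKGTZQ1SYQJTR4PB43C8PD98    # two hours
│   ├── chunks
│   │   └── 000001                # one segment per 512M
│   ├── tombstones                # records deleted through the API are kept here
│   ├── index                     # index of metric names and labels
│   └── meta.json
├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K    # two hours
│   └── meta.json
├── 01BKGV7JC0RY8A6MACW02A2PJD    # two hours
│   ├── chunks
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
├── chunks_head
│   └── 000001
└── wal
    ├── 000000002
    └── checkpoint.00000001
        └── 00000000

The current block for incoming samples is held in memory and is not yet persisted; it is protected by a write-ahead log (WAL), which is replayed when the server restarts. The WAL is stored in the wal/ directory in 128M segments. It contains raw data, so it is comparatively large. At least three WAL files are kept; busy servers keep more.
Local storage is not designed for large-scale scenarios. For reliability it relies on techniques such as RAID, and snapshots are used for backups. Besides local storage, the Remote R/W API is an option, but the remote storage systems need careful evaluation.
Compaction
The initial two-hour blocks are eventually compacted into longer blocks in the background.
Compaction creates larger blocks that contain data spanning up to 10% of the retention time, or 31 days, whichever is smaller.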
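The upper bound on block length follows directly from the rule above. Below is a small Python sketch of the arithmetic only; max_block_duration is a hypothetical helper, and the retention values are sample inputs, not recommendations.

```python
from datetime import timedelta


def max_block_duration(retention: timedelta) -> timedelta:
    """Largest block span: 10% of the retention time, capped at 31 days."""
    return min(retention / 10, timedelta(days=31))


# With 15 days of retention, blocks grow to at most 1.5 days.
short_retention = max_block_duration(timedelta(days=15))

# With 400 days of retention, 10% would be 40 days, so the 31-day cap applies.
long_retention = max_block_duration(timedelta(days=400))

# Number of initial two-hour blocks that fit into the 1.5-day maximum.
two_hour_blocks = short_retention / timedelta(hours=2)
```

This only computes the limit; how and when blocks are actually merged is decided by the server.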
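Operations
Prom provides several flags that control local storage; they do what their names say:
1) --storage.tsdb.path
2) --storage.tsdb.retention.time
3) --storage.tsdb.retention.size
If both time and size are set, whichever triggers first will be [……]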

| k4nz

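「Prometheus」- Data visualization (=> VISUALIZATION)

Expression browser
Expression browser | Prometheus
For browsing regular metrics and running quick queries, simply use http://localhost:9090/graph.
Grafana
Grafana | Prometheus
When integrating a monitoring system, we usually choose Grafana to display Prometheus data.
Console templates
Using Console Template - prometheus-book
Prometheus also provides the Console Template feature, which lets us build pages with Go templates. However, we have rarely seen this feature in use[……]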

| k4nz

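「Prometheus」- Backup, recovery, migration (=> PROMETHEUS/Migration)

Although we strive for stability and compatibility, breaking changes are unavoidable.
The official documentation provides a migration guide from 1.8 to 2.0 (Migration). We have no migration work at the moment, so we will come back to it when needed. # 07/26/2022 The guide mentions some new features that are not covered elsewhere in the documentation, so it is more than a simple migration guide.
Flags
The command-line flags have changed.
Alertmanager service discovery
Prometheus can discover Alertmanager instances through labels, or through static_config.
Recording rules and alerts
Alert Rules and Recording Rules are configured in YAML.
Storage
The data format is incompatible; the workaround is to read data from the old server through the Read API.
PromQL
Some features were removed from PromQL.
Miscellaneous
Prometheus runs as a non-root user; running it as root requires changes.
The Prometheus lifecycle endpoints (/-/reload) are disabled by default and must be enabled with --web.enable-lifecycle.[……]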

| k4nz

「Prometheus」- Deployment

Quick start (=> INTRODUCTION/First steps)
First steps | Prometheus
PROMETHEUS/Getting started
Quick start 01
The official INTRODUCTION/First steps document shows how to get started quickly; the details are not repeated here.
Configuration file:

global:
  scrape_interval: 15s # scrape every 15 seconds
  evaluation_interval: 15s # evaluate rules every 15 seconds

rule_files: # no rules are defined here
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090'] # by default, scrape the metrics exposed by Prom itself

Run the server:

./prometheus --config.file=prometheus.yml

Metric format (http://localhost:9090/metrics):

...
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 1
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
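The text format shown above is simple enough to read with a few lines of code. The following Python sketch handles only what appears in this sample (TYPE comments and samples with plain quoted label values); it is not a complete parser of the exposition format, and parse_exposition is a hypothetical helper name.

```python
import re

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text: str):
    """Return (types, samples) parsed from simple exposition text.

    types maps metric name to its declared type; samples is a list of
    (name, labels, value) tuples. Escapes and timestamps are not handled.
    """
    types = {}
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split(None, 3)
            if len(parts) == 4 and parts[1] == "TYPE":
                types[parts[2]] = parts[3]
            continue  # HELP lines and other comments are skipped
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this sketch does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return types, samples


TEXT = """\
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 1
promhttp_metric_handler_requests_total{code="500"} 0
"""

types, samples = parse_exposition(TEXT)
```

Each parsed sample corresponds to one time series: the metric name plus its label set.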

Run queries (http://localhost:9090/graph); the Graph tab displays the results as a graph.
To monitor more metrics or other services, the program itself must expose http/metrics[……]

| k4nz

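「Prometheus」- Configuration (=> PROMETHEUS/Configuration)

Configuration
Configuration | Prometheus
Server configuration is specified through command-line flags (./prometheus -h) or the configuration file (--config.file). Flags are mostly used for settings that rarely change.
Reloading the configuration: send SIGHUP or call the /-/reload HTTP endpoint (requires --web.enable-lifecycle). If the configuration file contains errors, it is not reloaded.
We do not go into the details of the configuration options here; we will look them up as specific scenarios require.
Rules: Recording and Alerting
Recording rules | Prometheus
Alerting rules | Prometheus
Prom supports two kinds of rules:
1) Record Rule: computes new metrics from existing ones.
2) Alert Rule: alerting rules. Whenever the alert expression yields one or more vector elements at a given point in time, the alert counts as active for the label sets of those elements.
The rule_files parameter references rule definitions (rules are defined in files).
The promtool command checks whether a rule file is defined correctly.
In a rule file:
1) Alert Rules and Record Rules live in groups.
2) Rules in the same group are evaluated in order.
Record Rule:

groups:
  - name: cpu-node
    rules:
      - record: job_instance_mode:node_cpu_seconds:avg_rate5m
        expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

Alert Rule:

groups:
  - name: example
    rules:
      - alert: HighRequestLatency
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency

  - name: example-02
    rules:
      # Alert for any instance that is unreachable for >5 minutes.
      - alert: InstanceDown
        expr:[……]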

| k4nz

「Prometheus」- Scrape Exporter

Scraping a target protected by Basic Auth
monitoring - Configure basic_auth for Prometheus Target - Stack Overflow

- job_name: 'myapp_health_checks'
  scrape_interval: 5m
  scrape_timeout: 30s
  static_configs:
    - targets: ['mywebsite.org']
  metrics_path: "/api/health"
  basic_auth:
    username: 'email@username.me'
    password: 'cfgqvzjbhnwcomplicatedpasswordwjnqmd'

Splitting the configuration file, file-based service discovery
Configuration | Prometheus
Using JSON file service discovery with Prometheus - Robust Perception | Prometheus Monitoring Experts
As hosts and exporters multiply, the prometheus.yml configuration file keeps growing until it becomes hard to maintain.
We can use file-based service discovery to split up prometheus.yml:

# /etc/prometheus/prometheus.yml
...
scrape_configs:
  - job_name: 'dummy-job' # This is a default value, it is mandatory.
    file_sd_configs:
      - files:
        - dummy/targets.yaml
...

# /etc/prometheus/dummy/targets.yaml
---
- targets:
    - "172.27.254.32:1987"
  labels:
    hostname: "my-hostname"
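Target files can also be generated by a script. File-based discovery accepts JSON as well as YAML, so the Python sketch below writes a JSON file using only the standard library. write_file_sd and the file name targets.json are assumptions for this example; the file is written to a temporary name first and then renamed, so that a reader never sees a half-written file.

```python
import json
import os
import tempfile


def write_file_sd(path: str, groups: list) -> None:
    """Write target groups to a file_sd JSON file via an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)  # atomic on the same file system


# Same content as the YAML example above, expressed as Python data.
GROUPS = [
    {"targets": ["172.27.254.32:1987"], "labels": {"hostname": "my-hostname"}},
]

# Demonstration in a scratch directory instead of /etc/prometheus.
with tempfile.TemporaryDirectory() as scratch:
    demo_path = os.path.join(scratch, "targets.json")
    write_file_sd(demo_path, GROUPS)
    with open(demo_path, encoding="utf-8") as handle:
        written = json.load(handle)
```

To use such a file, the files entry under file_sd_configs would list the .json path instead of the .yaml one.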

To split prometheus.yml quickly, we wrote the following Groovy script (it has to be adapted to your configuration file and most likely cannot be used as-is):

#!/usr/bin/groovy

import groovy.yaml.YamlSlurper

def configfile = "/etc/prometheus/prometheus.yml"
def yamlObject = new YamlSlurper().parseText(new File(configfile).[……]

| k4nz

「Prometheus」- Security and protection

System security (=> OPERATING/Security)
Security | Prometheus
The official Security document discusses every aspect of Prometheus security.[……]

| k4nz

「PROMETHEUS」- kubelet cAdvisor

cAdvisor collects statistics on the CPU, memory, file, and network usage of all containers running on a given node. The kubelet integrates cAdvisor to monitor resource usage and analyze container performance (cAdvisor operates at the Container level, not the Pod level).
Metrics
cadvisor/prometheus.md/Prometheus container metrics
Disk and file system
Bandwidth = irate(container_fs_(rw)_bytes_total{}[5m])
1) container_fs_reads_bytes_total: Counter, cumulative count of bytes read (bytes; diskIO)
2) container_fs_writes_bytes_total: Counter, cumulative count of bytes written (bytes; diskIO)
IOPS = irate(container_fs_(rw)_total{}[5m])
1) container_fs_reads_total: Counter, cumulative count of reads completed (diskIO)
2) container_fs_writes_total: Counter, cumulative count of writes completed (diskIO)
Latency = irate(container_fs_(rw)_seconds_total{}[5m]) / irate(container_fs_(rw)_total{}[5m])
1) container_fs_read_seconds_total: Counter, cumulative count of seconds spent reading (diskIO)
2) container_fs_write_seconds_total: Counter, cumulative count of seconds spent writing (seconds; diskIO)
Merged = irate(container_fs_(rw)_merged_total{}[5m])
1) container_fs_reads_merged_total: Counter, cumulative count of reads merged (diskIO)
2) container_fs_writes_merged_total: Counter, cumulative count of writes merged (diskIO)

container_blkio_device_usage_[……]

| k4nz

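「Kubernetes」- Prometheus Monitoring (Multiple Cluster)

Problem description
Multi-cluster availability: with multiple clusters, we have to deal with the availability of the monitoring system, massive storage, and cost. Large metric volume: we run multiple clusters, so the number of cluster metrics is huge, which raises Prometheus performance and high-availability concerns.
Solution
Architecture
The overall design is thanos receiver + prometheus hashmod:
1) Prometheus hashmod addresses the problem of very large clusters.
2) Thanos receiver suits the multi-cluster scenario.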
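The idea behind hashmod sharding is that every Prometheus instance hashes each target, takes the result modulo the number of shards, and keeps only the targets whose remainder equals its own shard number. The Python sketch below shows that idea only. It assumes an MD5 hash reduced to its low 64 bits, so it should not be relied on to reproduce the exact shard a real Prometheus server would pick; hashmod and shard_targets are hypothetical helper names.

```python
import hashlib


def hashmod(value: str, modulus: int) -> int:
    """Hash a label value and reduce it to a shard number in [0, modulus)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Assumption: use the last 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[8:], "big") % modulus


def shard_targets(addresses: list, shards: int) -> dict:
    """Assign every address to exactly one shard."""
    buckets = {index: [] for index in range(shards)}
    for address in addresses:
        buckets[hashmod(address, shards)].append(address)
    return buckets


# Hypothetical node exporter addresses split across three Prometheus shards.
ADDRESSES = [f"10.0.0.{host}:9100" for host in range(1, 21)]
BUCKETS = shard_targets(ADDRESSES, 3)
```

Because the assignment depends only on the address and the shard count, every instance reaches the same decision without coordination, and each target is scraped by exactly one shard.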

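Components
1) Prometheus: a metrics-based monitoring system and the preferred monitoring component for cloud native environments.
2) Alertmanager: mainly receives the alerts sent by Prometheus.
3) Grafana: a monitoring dashboard system, used here to display monitoring data.
4) Thanos query: implements the Prometheus API, aggregates the data provided by downstream components, and returns it to the querying client (such as Grafana).
5) Thanos receiver: implements the Prometheus remote write API, makes the received data available to Thanos query, and uploads it to object storage.
6) Thanos store: exposes the data in object storage to Thanos query, caches the TSDB index, and optimizes remote calls to object storage.
7) Thanos ruler: evaluates monitoring data for alerting, can also compute new metrics, makes these available to Thanos query, and uploads them to object storage for long-term retention.
8) Thanos compact: compacts and downsamples the data in object storage to speed up queries over long time ranges.
Related articles
Kubernetes Multi-Cluster Monitoring using Prometheus and Thanos | by VAIBHAV THAKUR | FAUN Publication
Kubernetes Multi-Cluster monitoring with Prometheus and Submariner | by Daniel Bachar | Medium
Monitoring Multiple Kubernetes Clusters | by Conor Nevin | THG Tech Blog | Medium
Multiple Kubernetes cluster monitoring with Prometheus | Sysrant
Create a Multi-Cluster Monitoring Dashboard with Thanos, Grafana and Prometheus |[……]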

| k4nz
