Prometheus collects monitoring data mainly through exporters. On Linux, node_exporter gathers host-level metrics such as CPU, memory, and disk usage.

Installation and configuration

Node Exporter is written in Go and has no third-party dependencies. Download it from prometheus.io/download, unpack the archive, and it is ready to run.

# Download node_exporter
cd /usr/local/src/
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar -xzf node_exporter-1.3.1.linux-amd64.tar.gz
mv node_exporter-1.3.1.linux-amd64 node_exporter 
## Run as a dedicated user
useradd prometheus
chown -R prometheus:prometheus ./node_exporter
su prometheus
./node_exporter/node_exporter

Register it as a system service so that it starts automatically at boot.

cat > /usr/lib/systemd/system/node_exporter.service <<EOF
#node_exporter.service
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/src/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
## Enable at boot and start
systemctl enable node_exporter.service
systemctl start node_exporter.service

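By default the exporter listens on port 9100. To change this, start it with the --web.listen-address flag, for example --web.listen-address=0.0.0.0:9100 with the port you want. It also provides a set of collector switches: --no-collector.<name> disables a collector you do not want, and --collector.<name> enables an additional collector that is off by default.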
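As a sketch of how these flags might be applied to the service defined above, the snippet below writes a systemd drop-in that overrides ExecStart. The port 9101, the choice of the hwmon and systemd collectors, and the file name override.conf are assumptions made for illustration; they are not part of the setup described in this article.

```shell
# Write the drop-in to a local file first so it can be reviewed.
# The heredoc is quoted so the trailing backslashes reach the file unchanged.
cat > override.conf <<'EOF'
[Service]
# An empty ExecStart= clears the command inherited from node_exporter.service
ExecStart=
ExecStart=/usr/local/src/node_exporter/node_exporter \
    --web.listen-address=0.0.0.0:9101 \
    --no-collector.hwmon \
    --collector.systemd
EOF

# Then, as root (hypothetical target path for the drop-in):
#   install -D -m 644 override.conf /etc/systemd/system/node_exporter.service.d/override.conf
#   systemctl daemon-reload && systemctl restart node_exporter.service
```

If the listening port changes, the target address in the Prometheus scrape configuration has to change with it.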

Once it is running, open ip:9100/metrics to see the host's current metrics.
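The endpoint returns plain text with one sample per line, so it can also be checked from a shell. The helper below is a minimal sketch: metric_value is a hypothetical name, and it only handles metrics without labels (labelled series such as node_cpu_seconds_total{...} would need a real parser).

```shell
# Print the value of an unlabelled metric from Prometheus text-format input.
# Usage: metric_value <metric_name> < metrics.txt
metric_value() {
  # Comment lines start with "#", so their first field never equals the name.
  awk -v name="$1" '$1 == name { print $2 }'
}

# Against a live exporter (address taken from the examples in this article):
#   curl -s http://192.168.16.230:9100/metrics | metric_value node_load1
```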

Scraping with Prometheus Server

To have Prometheus Server collect data from this node exporter, edit prometheus.yml and add the target under scrape_configs. There are several ways to do this. The simplest is to configure it directly in prometheus.yml:

scrape_configs:
  # Scrape node exporter metrics
  - job_name: 'linux_node'
    static_configs:
      - targets: ['192.168.16.230:9100']

Or use file-based service discovery:

# In prometheus.yml
scrape_configs:
  - job_name: "linux_node"
    file_sd_configs:
      - files:
        - static_config_linux.yml
# In static_config_linux.yml
- targets:
  # node_exporter address
  - '192.168.16.230:9100'

To restrict which metrics are fetched from a host at scrape time, combine this with the job's params setting.
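A sketch of what that could look like: node_exporter accepts collect[] URL parameters that limit a scrape to the named collectors. The three collectors listed here and the scratch file name linux_node_filtered.yml are assumptions for illustration; the job would be merged into prometheus.yml by hand.

```shell
# Write an example job to a scratch file for review.
cat > linux_node_filtered.yml <<'EOF'
scrape_configs:
  - job_name: 'linux_node'
    # Only these collectors are returned for each scrape of this job
    params:
      collect[]:
        - cpu
        - meminfo
        - filesystem
    static_configs:
      - targets: ['192.168.16.230:9100']
EOF

# The same filter can be tried by hand (-g stops curl from treating [] as a glob):
#   curl -g 'http://192.168.16.230:9100/metrics?collect[]=cpu&collect[]=meminfo'
```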

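After changing the configuration, restart Prometheus. With file-based discovery, targets added to the file are picked up automatically by a periodic rescan. The monitored nodes can then be seen in the Prometheus web UI under Status --> Targets.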
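As an alternative to a full restart, Prometheus can reload its configuration in place. The sketch below assumes that promtool is installed alongside the server, that the server listens on localhost:9090, and that it was started with --web.enable-lifecycle (the reload endpoint is disabled otherwise). reload_prometheus is a hypothetical helper name.

```shell
# Assumed server address; override by exporting PROM_URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Validate the config file first, and only ask the server to reload if it passes.
reload_prometheus() {
  promtool check config "${1:-prometheus.yml}" || return 1
  curl -fsS -X POST "$PROM_URL/-/reload"
}

# Usage:
#   reload_prometheus /path/to/prometheus.yml
# Without --web.enable-lifecycle, sending SIGHUP to the Prometheus process
# triggers the same reload.
```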

Data visualization

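Once the data is collected, Grafana is the usual choice for displaying it. In the plus-sign menu, click "Import", enter a dashboard ID, and confirm to import that dashboard template from the official site.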

A dashboard that presents Linux monitoring data well is "Node Exporter Full" (ID 1860), which shows the details of a single node.

Another one, "1 Node Exporter Dashboard 通用Job分組版" (ID 16098), shows overall metrics for all nodes as a list.

Alerting configuration

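Prometheus alerts are triggered by the server and then delivered by a separate Alertmanager service to a chosen destination, such as a DingTalk group, email, or WeCom. In prometheus.yml, specify the path to the alert rule files: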

rule_files:
 - ./rules/*yml
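For alerts to leave the server, prometheus.yml also has to say where Alertmanager is listening. A minimal sketch, assuming Alertmanager runs on the same host on its default port 9093; the scratch file name alerting_snippet.yml is hypothetical, and the block would be merged into prometheus.yml at the top level.

```shell
# Write the alerting section to a scratch file for review.
cat > alerting_snippet.yml <<'EOF'
alerting:
  alertmanagers:
    - static_configs:
        # Assumed Alertmanager address (9093 is its default port)
        - targets: ['localhost:9093']
EOF
```

Receivers such as email are configured in Alertmanager's own configuration file, not in prometheus.yml.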

An alert rule file (for example rules/linux.yml) looks like this:

groups:
- name: NodeStatsAlert
  rules:
  - alert: HighMemoryUsage
    expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 95
    for: 5m
    labels:
      severity: critical
    annotations:
      description: "{{ $labels.instance }} memory usage has stayed above 95% for 5 minutes"
      summary: "Memory usage too high"
      value: '{{ $value }}%'

If the configuration is valid, the rule appears under the Alerts menu.

Expressions for a few commonly used alerts:

Node down: up == 0
Memory, fraction of memory available: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
CPU, idle fraction averaged over 5 minutes: (avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))
Disk, fraction of each mounted filesystem in use: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes
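These expressions can be turned into rules in the same shape as the memory rule. The following is a sketch only: the thresholds (90%), the for durations, the alert names, the fstype filter, and the file name node_extra_rules.yml are all assumptions to be tuned per environment. The CPU and disk expressions are rewritten as percentages so they can be compared against a threshold.

```shell
# The heredoc is quoted so the shell does not expand $labels inside the templates.
cat > node_extra_rules.yml <<'EOF'
groups:
- name: NodeExtraAlerts
  rules:
  - alert: NodeDown
    expr: up{job="linux_node"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} is unreachable"
  - alert: HighCpuUsage
    # 100 minus the idle percentage averaged over 5 minutes
    expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} CPU usage above 90%"
  - alert: DiskAlmostFull
    # Pseudo filesystems are excluded; adjust the fstype filter as needed
    expr: 100 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes * 100 > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} {{ $labels.mountpoint }} above 90% full"
EOF

# If promtool is available, the file can be validated before it is loaded:
#   promtool check rules node_extra_rules.yml
```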

Expressions can be checked first in the Graph menu of the Prometheus server, or taken from the queries behind Grafana panels.
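The same check can be scripted against the server's HTTP API, which evaluates an expression and returns JSON. A minimal sketch, assuming the server is reachable at localhost:9090; prom_query is a hypothetical helper name.

```shell
# Assumed server address; override by exporting PROM_URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Evaluate a PromQL expression at the current time via /api/v1/query.
# --data-urlencode takes care of spaces, braces, and quotes in the expression.
prom_query() {
  curl -fsS -G --data-urlencode "query=$1" "$PROM_URL/api/v1/query"
}

# Usage:
#   prom_query 'up == 0'
#   prom_query 'node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes'
```

An empty result list for an alert expression means that no series currently matches the condition.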