Monitoring Setup

Set up monitoring for VMProber using Prometheus and Grafana.

Prerequisites

Prometheus installed and running
Grafana installed and running (optional)
VMProber running and accessible

Prometheus Configuration

1. Add VMProber to Prometheus

Edit prometheus.yml:

scrape_configs:
  - job_name: 'vmprober'
    scrape_interval: 30s
    scrape_timeout: 10s
    static_configs:
      - targets: ['vmprober:8429']
    metrics_path: '/metrics'

2. Reload Prometheus

# If using systemd
sudo systemctl reload prometheus

# Or send SIGHUP
kill -HUP <prometheus-pid>

3. Verify Scraping

Check Prometheus targets page: http://prometheus:9090/targets

Grafana Dashboard

1. Create Dashboard

Go to Grafana: http://grafana:3000
Create new dashboard
Add panels for key metrics

2. Key Panels

Success Rate

rate(vmprober_probe_success_total[5m]) / rate(vmprober_probe_attempts_total[5m])

Average RTT

rate(vmprober_probe_rtt_seconds_sum[5m]) / rate(vmprober_probe_rtt_seconds_count[5m])

95th Percentile RTT

histogram_quantile(0.95, rate(vmprober_probe_rtt_seconds_bucket[5m]))

Failure Rate by Target

rate(vmprober_probe_failure_total[5m]) / rate(vmprober_probe_attempts_total[5m])

Alerts

High Failure Rate

- alert: HighProbeFailureRate
  expr: |
    rate(vmprober_probe_failure_total[5m]) /
    rate(vmprober_probe_attempts_total[5m]) > 0.1
  for: 5m
  annotations:
    summary: "High probe failure rate for {{ $labels.target }}"

High Latency

- alert: HighProbeLatency
  expr: |
    histogram_quantile(0.95,
      rate(vmprober_probe_rtt_seconds_bucket[5m])
    ) > 1
  for: 5m
  annotations:
    summary: "High probe latency for {{ $labels.target }}"

Service Discovery

For Kubernetes, use ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vmprober
spec:
  selector:
    matchLabels:
      app: vmprober
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

Next Steps

Operations: Monitoring - Advanced monitoring
Reference: Metrics - All available metrics