
Monitoring Guide

This guide covers monitoring Bifrost using Prometheus metrics, Grafana dashboards, and health checks.

Bifrost exposes metrics in Prometheus format at the configured metrics endpoint.

```yaml
metrics:
  enabled: true
  listen: ":7090"
  path: "/metrics"
```
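A scrape of that endpoint returns plain-text lines in the Prometheus exposition format. As a sketch of what the payload looks like, a minimal parser over an invented sample (real values come from `GET http://localhost:7090/metrics`):

```python
# Minimal parser for the Prometheus text exposition format.
# The sample payload is illustrative, not real Bifrost output.
sample = """\
# HELP bifrost_connections_active Current active connections
# TYPE bifrost_connections_active gauge
bifrost_connections_active 42
bifrost_requests_total{method="GET",backend="direct",status="200"} 1234
"""

def parse_metrics(text):
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE metadata
        name_and_labels, value = line.rsplit(" ", 1)
        metrics[name_and_labels] = float(value)
    return metrics

parsed = parse_metrics(sample)
print(parsed["bifrost_connections_active"])  # 42.0
```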

Add to your prometheus.yml:

```yaml
scrape_configs:
  - job_name: 'bifrost-server'
    static_configs:
      - targets: ['bifrost-server:7090']
    scrape_interval: 15s
  - job_name: 'bifrost-client'
    static_configs:
      - targets: ['bifrost-client:7090']
    scrape_interval: 15s
```
Connection metrics:

| Metric | Type | Description |
| --- | --- | --- |
| `bifrost_connections_total` | Counter | Total connections handled |
| `bifrost_connections_active` | Gauge | Current active connections |
| `bifrost_connections_errors_total` | Counter | Total connection errors |

Request metrics:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `bifrost_requests_total` | Counter | method, backend, status | Total requests |
| `bifrost_request_duration_seconds` | Histogram | method, backend | Request duration |
| `bifrost_request_size_bytes` | Histogram | direction | Request/response size |

Backend metrics:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `bifrost_backend_connections_total` | Counter | backend | Connections per backend |
| `bifrost_backend_connections_active` | Gauge | backend | Active connections per backend |
| `bifrost_backend_healthy` | Gauge | backend | Backend health (1 = healthy, 0 = unhealthy) |
| `bifrost_backend_latency_seconds` | Histogram | backend | Backend response latency |

Traffic metrics:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `bifrost_bytes_total` | Counter | direction, backend | Total bytes transferred |
| `bifrost_bandwidth_bytes_per_second` | Gauge | direction | Current bandwidth usage |

Cache metrics:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `bifrost_cache_hits_total` | Counter | domain | Total cache hits |
| `bifrost_cache_misses_total` | Counter | domain, reason | Total cache misses |
| `bifrost_cache_bytes_served_total` | Counter | source | Bytes served (cache vs origin) |
| `bifrost_cache_storage_size_bytes` | Gauge | tier | Current storage size |
| `bifrost_cache_storage_entries` | Gauge | tier | Current entry count |
| `bifrost_cache_storage_usage_percent` | Gauge | tier | Storage usage percentage |
| `bifrost_cache_evictions_total` | Counter | tier, reason | Cache evictions |
| `bifrost_cache_operation_duration_seconds` | Histogram | operation | Cache operation latency |
| `bifrost_cache_active_rules` | Gauge | — | Number of active cache rules |
| `bifrost_cache_active_presets` | Gauge | — | Number of enabled presets |

Runtime metrics:

| Metric | Type | Description |
| --- | --- | --- |
| `bifrost_uptime_seconds` | Gauge | Server uptime |
| `bifrost_goroutines` | Gauge | Number of goroutines |
| `bifrost_memory_bytes` | Gauge | Memory usage |
```promql
# Request rate per second
rate(bifrost_requests_total[5m])

# Average request duration
rate(bifrost_request_duration_seconds_sum[5m]) / rate(bifrost_request_duration_seconds_count[5m])

# Error rate
rate(bifrost_connections_errors_total[5m]) / rate(bifrost_connections_total[5m])

# Active connections by backend
bifrost_backend_connections_active

# Bandwidth usage (MB/s)
rate(bifrost_bytes_total[5m]) / 1024 / 1024

# Unhealthy backends
bifrost_backend_healthy == 0

# Cache hit rate
bifrost_cache_hits_total / (bifrost_cache_hits_total + bifrost_cache_misses_total)

# Cache storage usage
bifrost_cache_storage_usage_percent{tier="memory"}
bifrost_cache_storage_usage_percent{tier="disk"}

# Bytes saved by cache (vs fetching from origin)
rate(bifrost_cache_bytes_served_total{source="cache"}[1h])
```
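To make the ratio queries concrete, here is the same arithmetic on invented counter samples (Prometheus performs the equivalent division over counter rates server-side):

```python
# Illustrative numbers, not real Bifrost output.
connections_total = 2000
connection_errors = 40
cache_hits = 950
cache_misses = 50

# Same arithmetic as the "Error rate" and "Cache hit rate" queries above.
error_rate = connection_errors / connections_total
hit_rate = cache_hits / (cache_hits + cache_misses)

print(f"error rate: {error_rate:.1%}")      # error rate: 2.0%
print(f"cache hit rate: {hit_rate:.1%}")    # cache hit rate: 95.0%
```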

The included Docker Compose file starts Grafana with Prometheus:

```sh
cd docker
docker-compose up -d grafana prometheus
```

Access Grafana at http://localhost:3000 (default: admin/admin).

  1. Go to Configuration → Data Sources
  2. Add data source → Prometheus
  3. URL: `http://prometheus:9090` (the Prometheus server itself, not Bifrost's metrics endpoint)
  4. Save & Test
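The data source can also be provisioned from a file instead of the UI. A sketch (the file path and data source name are Grafana conventions, not Bifrost requirements), assuming Prometheus listens on its default port 9090:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```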
```json
{
  "title": "Active Connections",
  "type": "stat",
  "targets": [{
    "expr": "bifrost_connections_active",
    "legendFormat": "Connections"
  }]
}
```

```json
{
  "title": "Request Rate",
  "type": "graph",
  "targets": [{
    "expr": "rate(bifrost_requests_total[5m])",
    "legendFormat": "{{method}} - {{backend}}"
  }]
}
```

```json
{
  "title": "Backend Health",
  "type": "table",
  "targets": [{
    "expr": "bifrost_backend_healthy",
    "format": "table",
    "instant": true
  }]
}
```

```json
{
  "title": "Request Latency",
  "type": "heatmap",
  "targets": [{
    "expr": "rate(bifrost_request_duration_seconds_bucket[5m])",
    "format": "heatmap"
  }]
}
```

```sh
curl http://localhost:7082/api/v1/health
```

Response:

```json
{
  "status": "healthy",
  "time": "2024-01-15T10:30:00Z"
}
```

Status values:

  • healthy - All backends healthy
  • degraded - Some backends unhealthy
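A monitoring script can map those status values onto exit codes. A sketch that parses the response shown above (the sample body is hardcoded here; in practice it would come from the `/api/v1/health` request):

```python
import json

# Sample response body from GET /api/v1/health (hardcoded for illustration).
body = '{"status": "healthy", "time": "2024-01-15T10:30:00Z"}'

def health_exit_code(response_body):
    """Map the health status to a Nagios-style exit code."""
    status = json.loads(response_body).get("status")
    return {"healthy": 0, "degraded": 1}.get(status, 2)  # anything else -> 2

print(health_exit_code(body))  # 0
```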
```sh
curl http://localhost:7082/api/v1/backends
```

Response:

```json
[
  {
    "name": "direct",
    "type": "direct",
    "healthy": true,
    "stats": {
      "total_connections": 1234,
      "active_connections": 5,
      "bytes_sent": 1048576,
      "bytes_received": 2097152
    }
  }
]
```
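The per-backend stats can be summed for a quick fleet overview. A sketch using the sample response above:

```python
import json

# Sample body from GET /api/v1/backends (copied from the response above).
body = """
[
  {"name": "direct", "type": "direct", "healthy": true,
   "stats": {"total_connections": 1234, "active_connections": 5,
             "bytes_sent": 1048576, "bytes_received": 2097152}}
]
"""

backends = json.loads(body)
unhealthy = [b["name"] for b in backends if not b["healthy"]]
total_bytes = sum(b["stats"]["bytes_sent"] + b["stats"]["bytes_received"]
                  for b in backends)

print(unhealthy)    # []
print(total_bytes)  # 3145728
```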
Docker Compose:

```yaml
healthcheck:
  test: ["CMD", "wget", "-q", "--spider", "http://localhost:7090/metrics"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 5s
```
Kubernetes:

```yaml
livenessProbe:
  httpGet:
    path: /api/v1/health
    port: 7082
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /api/v1/health
    port: 7082
  initialDelaySeconds: 5
  periodSeconds: 5
```

Create alerts.yml:

```yaml
groups:
  - name: bifrost
    rules:
      # High error rate
      - alert: BifrostHighErrorRate
        expr: rate(bifrost_connections_errors_total[5m]) / rate(bifrost_connections_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on Bifrost"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Backend down
      - alert: BifrostBackendDown
        expr: bifrost_backend_healthy == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Bifrost backend {{ $labels.backend }} is down"

      # High latency
      - alert: BifrostHighLatency
        expr: histogram_quantile(0.95, rate(bifrost_request_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on Bifrost"
          description: "P95 latency is {{ $value | humanizeDuration }}"

      # No connections
      - alert: BifrostNoConnections
        expr: bifrost_connections_active == 0
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "No active connections on Bifrost"

      # High connection count
      - alert: BifrostHighConnections
        expr: bifrost_connections_active > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection count"
          description: "{{ $value }} active connections"
```
Alerts can also be created directly in Grafana:

  1. Edit a panel
  2. Go to the Alert tab
  3. Create an alert rule
  4. Set conditions and notifications

```yaml
logging:
  level: info
  format: json
  output: stdout  # or file path
```
| Level | Description |
| --- | --- |
| `debug` | Verbose debugging information |
| `info` | Normal operational messages |
| `warn` | Warning conditions |
| `error` | Error conditions |
```yaml
# docker-compose.yml
services:
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  bifrost-server:
    logging:
      driver: loki
      options:
        loki-url: "http://localhost:3100/loki/api/v1/push"
        labels: "app=bifrost,service=server"

volumes:
  loki-data:
```

With the Loki driver in place, configure Bifrost to write JSON logs to stdout:

```yaml
logging:
  format: json
  output: stdout
```

Use Filebeat to ship logs to Elasticsearch:

```yaml
# filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
```

Metrics:

  1. Scrape interval: 15-30 seconds for most use cases
  2. Retention: Keep at least 15 days of metrics
  3. Labels: Avoid high-cardinality labels (e.g., user IDs)

Alerting:

  1. Start simple: Begin with basic alerts, add more as needed
  2. Avoid alert fatigue: Only alert on actionable issues
  3. Document runbooks: Link alerts to troubleshooting guides

Dashboards:

  1. Overview first: Start with high-level health metrics
  2. Drill-down: Allow navigation to detailed views
  3. Time ranges: Support common ranges (1h, 6h, 24h, 7d)

Logging:

  1. Structured logs: Use JSON format for parsing
  2. Correlation IDs: Include request IDs for tracing
  3. Log rotation: Prevent disk space issues
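The structured-log and correlation-ID practices combine in a few lines. A sketch (field names are illustrative, not Bifrost's actual log schema):

```python
import json
import uuid

def log_event(level, message, request_id, **fields):
    """Emit one JSON log line carrying a correlation id (illustrative schema)."""
    record = {"level": level, "msg": message, "request_id": request_id, **fields}
    print(json.dumps(record))  # one line per event, machine-parseable
    return record

# Generate one correlation id per request and attach it to every log line.
rid = str(uuid.uuid4())
rec = log_event("info", "request completed", rid, backend="direct", status=200)
```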