
    Observability

    Metrics, tracing, and logging across InferaDB services.

    Prometheus Metrics

    Each service exposes a /metrics endpoint in Prometheus exposition format.
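    As a rough illustration of what a scraper sees, the sketch below parses a small exposition-format payload into (name, labels, value) tuples. The metric names come from the tables on this page; the sample values, the label names `method` and `path`, and the `/check` path are assumptions made up for the example, not confirmed output of any InferaDB service.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload: values and labels are invented for illustration.
EXAMPLE = """\
# HELP inferadb_checks_total Total authorization checks performed
# TYPE inferadb_checks_total counter
inferadb_checks_total 1027
inferadb_api_requests_total{method="POST",path="/check"} 980
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(EXAMPLE):
        print(name, labels, value)
```

    In practice a Prometheus server or an existing client library would do this parsing; the sketch is only meant to show the shape of the data behind the tables that follow.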

    Engine Metrics

    Authorization

    Metric                           Type       Description
    inferadb_checks_total            Counter    Total authorization checks performed
    inferadb_check_duration_seconds  Histogram  Authorization check latency
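    Latency percentiles are normally derived from a histogram's cumulative buckets rather than read directly. The sketch below estimates a quantile by linear interpolation inside the bucket that contains the target rank, which is the idea behind PromQL's histogram_quantile() for classic histograms. The bucket boundaries and counts are hypothetical; this page does not state which buckets inferadb_check_duration_seconds uses.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            width = upper_bound - lower_bound
            # Assume observations are spread evenly inside the bucket.
            return lower_bound + width * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative buckets, in seconds: (le, count).
BUCKETS = [(0.001, 600), (0.005, 900), (0.01, 980), (math.inf, 1000)]

if __name__ == "__main__":
    for q in (0.5, 0.95, 0.99):
        print(q, histogram_quantile(q, BUCKETS))
```

    The estimate is only as precise as the bucket layout: a quantile that lands in the +Inf bucket can be reported no more accurately than the highest finite boundary.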

    Cache

    Metric                       Type     Description
    inferadb_cache_hits_total    Counter  Cache hits
    inferadb_cache_misses_total  Counter  Cache misses
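    Because both cache metrics are cumulative counters, a hit ratio is usually computed from the increase between two scrapes rather than from the raw totals. A minimal sketch, assuming two readings of each counter; the function name and the sample numbers are illustrative only:

```python
def hit_ratio(hits_before, misses_before, hits_after, misses_after):
    """Hit ratio over the interval between two scrapes of the cache counters."""
    hits = hits_after - hits_before
    misses = misses_after - misses_before
    if hits < 0 or misses < 0:
        # A counter went backwards, which usually means the process restarted,
        # so treat the later reading as the increase since the restart.
        hits, misses = hits_after, misses_after
    lookups = hits + misses
    return hits / lookups if lookups else None


if __name__ == "__main__":
    # 900 new hits and 100 new misses between the two scrapes.
    print(hit_ratio(1000, 200, 1900, 300))
```

    Returning None when there were no lookups avoids reporting a misleading 0% or 100% for an idle interval.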

    Storage

    Metric                                   Type       Description
    inferadb_storage_read_duration_seconds   Histogram  Storage read latency
    inferadb_storage_write_duration_seconds  Histogram  Storage write latency
    inferadb_replication_lag_seconds         Gauge      Replication lag from leader

    API

    Metric                       Type     Description
    inferadb_api_requests_total  Counter  Total API requests by method and path
    inferadb_api_errors_total    Counter  Total API errors by status code

    Auth Metrics

    Metric                                Type       Description
    inferadb_auth_attempts_total          Counter    Authentication attempts
    inferadb_auth_success_total           Counter    Successful authentications
    inferadb_auth_failure_total           Counter    Failed authentications
    inferadb_auth_duration_seconds        Histogram  Authentication processing time
    inferadb_jwks_cache_hits_total        Counter    JWKS cache hits
    inferadb_jwks_cache_misses_total      Counter    JWKS cache misses
    inferadb_jwt_validation_errors_total  Counter    JWT validation errors by reason

    Scrape Configuration

    Prometheus scrape config for a Docker deployment:

    scrape_configs:
      - job_name: inferadb-engine
        static_configs:
          - targets: ["engine:8080"]
      - job_name: inferadb-control
        static_configs:
          - targets: ["control:9090"]
      - job_name: inferadb-ledger
        static_configs:
          - targets: ["ledger:50051"]
    

    For Kubernetes, enable the ServiceMonitor in the Helm chart:

    engine:
      serviceMonitor:
        enabled: true
        interval: 15s
    

    OpenTelemetry Tracing

    Traces are exported via OTLP, spanning the full request lifecycle (API ingestion through evaluation and response).

    Configuration

    Standard OpenTelemetry environment variables:

    Variable                     Default                   Description
    OTEL_EXPORTER_OTLP_ENDPOINT  (none)                    OTLP collector endpoint (e.g., http://otel-collector:4317)
    OTEL_SERVICE_NAME            inferadb-engine           Service name in traces
    OTEL_TRACES_SAMPLER          parentbased_traceidratio  Sampling strategy
    OTEL_TRACES_SAMPLER_ARG      1.0                       Sampling rate (0.0 to 1.0)
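    To make the default sampler concrete: parentbased_traceidratio means a child span follows its parent's sampling decision, and only root spans are sampled by ratio. The sketch below models that logic using the threshold comparison some OpenTelemetry SDKs apply to the low 64 bits of the trace ID. The function names are hypothetical, and the exact ratio algorithm is SDK-specific, so treat this as an approximation rather than what the InferaDB services run.

```python
# Hypothetical helper names; only the sampler name and its ratio argument
# come from the OTEL_* variables in the table.
TRACE_ID_LOW_64_MASK = (1 << 64) - 1


def ratio_sampled(trace_id, ratio):
    """Deterministic decision: keep a trace if its low 64 bits fall under the bound."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_LOW_64_MASK) < bound


def parent_based_sampled(trace_id, ratio, parent_sampled=None):
    """parent_sampled is None for a root span, else the parent's sampled flag."""
    if parent_sampled is not None:
        return parent_sampled  # child spans follow the parent's decision
    return ratio_sampled(trace_id, ratio)


if __name__ == "__main__":
    trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736  # arbitrary example ID
    print(parent_based_sampled(trace_id, 1.0))         # root span, ratio 1.0
    print(parent_based_sampled(trace_id, 1.0, False))  # parent was not sampled
```

    Because the decision depends only on the trace ID and the ratio, every service that sees the same root trace ID reaches the same verdict, which keeps traces from being partially recorded.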

    Example

    docker run -p 8080:8080 -p 8081:8081 \
      -e OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
      -e OTEL_SERVICE_NAME=inferadb-engine \
      inferadb/inferadb-engine:latest
    

    Traces are compatible with any OTLP-capable backend, such as Jaeger, Tempo, Honeycomb, or Datadog.

    Structured Logging

    Log levels are controlled per-module via RUST_LOG:

    # Set global level to info, with debug for the evaluator
    RUST_LOG=info,inferadb_core::evaluator=debug
    
    # Trace-level logging for auth (verbose)
    RUST_LOG=info,inferadb_auth=trace
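
    The directive syntax above can be read as a most-specific-prefix match: a bare level sets the global default, and a target=level pair overrides it for that module and everything beneath it. The sketch below is a simplified model of that resolution, in the style of common Rust log filters; it ignores bare module names and other syntax those filters accept, and the submodule name `inferadb_core::evaluator::walk` is hypothetical.

```python
LEVELS = {"off": 0, "error": 1, "warn": 2, "info": 3, "debug": 4, "trace": 5}


def parse_directives(spec):
    """Split 'info,inferadb_auth=trace' into {target_prefix: max_level}."""
    directives = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            target, level = part.split("=", 1)
        else:
            target, level = "", part  # bare level sets the global default
        directives[target] = LEVELS[level.lower()]
    return directives


def enabled(directives, target, level):
    """True if a record from `target` at `level` passes the filter."""
    best = None
    for prefix in directives:
        # The empty prefix (global default) matches every target.
        if target.startswith(prefix):
            if best is None or len(prefix) > len(best):
                best = prefix  # the longest matching prefix wins
    if best is None:
        return False
    return LEVELS[level.lower()] <= directives[best]


if __name__ == "__main__":
    rules = parse_directives("info,inferadb_core::evaluator=debug")
    print(enabled(rules, "inferadb_core::evaluator::walk", "debug"))
    print(enabled(rules, "inferadb_api::handler", "debug"))
```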
    

    Log Format

    Each log line is a JSON object:

    {
      "timestamp": "2026-03-24T10:15:30.123Z",
      "level": "INFO",
      "target": "inferadb_api::handler",
      "message": "check completed",
      "vault_id": "v_abc123",
      "duration_ms": 1.8,
      "result": "ALLOW",
      "span_id": "a1b2c3d4e5f6"
    }
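
    Since every log line is a self-contained JSON object, ad hoc analysis needs little more than a JSON parser. The sketch below groups "check completed" records by result and reports a count and mean duration. It assumes one object per physical line; the field names follow the example above, while the "DENY" result value and the sample records are assumptions for illustration.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Group 'check completed' records by result: count and mean duration_ms."""
    durations = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines mixed into the stream
        if not isinstance(record, dict):
            continue
        if record.get("message") != "check completed":
            continue
        durations[record.get("result")].append(float(record.get("duration_ms", 0.0)))
    return {
        result: {"count": len(values), "mean_ms": sum(values) / len(values)}
        for result, values in durations.items()
    }


# Hypothetical records; "DENY" is an assumed counterpart to "ALLOW".
SAMPLE_LINES = [
    '{"level": "INFO", "message": "check completed", "result": "ALLOW", "duration_ms": 1.8}',
    '{"level": "INFO", "message": "check completed", "result": "ALLOW", "duration_ms": 2.2}',
    '{"level": "INFO", "message": "check completed", "result": "DENY", "duration_ms": 0.5}',
    "not json at all",
]

if __name__ == "__main__":
    print(summarize(SAMPLE_LINES))
```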
    

    Audit Logging

    Security events are logged and persisted to the Ledger:

    Event                     Description
    AuthenticationSuccess     Successful token validation
    AuthenticationFailure     Failed authentication attempt
    ScopeViolation            Request exceeded the token’s granted scopes
    TenantIsolationViolation  Attempt to access data outside the token’s vault

    These events are always logged at WARN or ERROR level regardless of the configured log level.

    Grafana Dashboards

    Pre-built Grafana dashboards:

    • Engine Overview — Check rate, latency percentiles, cache hit ratio, error rate
    • Ledger Health — Raft leader status, write latency, replication lag, snapshot status
    • Authentication — Auth success/failure rate, JWKS cache performance, JWT error breakdown
    • Tenant Activity — Per-vault check volume, write rate, and cache efficiency

    The dashboards are available as JSON files in the repository. Import them directly, or load them through Grafana dashboard provisioning.