Monitoring & Observability

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5168

    #1

    Metrics tell us something is wrong. Logs tell us why. We need both. This post covers how I set up the full observability stack for ASTRING: Prometheus and Grafana for metrics, Fluent Bit and Loki for logs, and Alertmanager for alerts.


    Metrics: Prometheus and Grafana

    Why Prometheus

    Prometheus is the standard for Kubernetes monitoring. It scrapes /metrics endpoints from our services and stores everything as time series data. PromQL lets us query and aggregate across that data. It also handles alerting rules, which I'll get to later.
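
    To make the scrape model concrete, here is a minimal sketch of the text a /metrics endpoint returns in the Prometheus text exposition format. The metric name http_requests_total and its labels are made up for illustration and are not taken from ASTRING; a real service would normally use a Prometheus client library rather than formatting this by hand.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(
    name: str,
    help_text: str,
    samples: list[tuple[dict[str, str], float]],
) -> str:
    """Render one counter metric as a /metrics response body.

    Each sample is a (labels, value) pair; every distinct label set
    becomes its own time series once Prometheus scrapes it.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric, for illustration only.
    body = render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [
            ({"method": "GET", "status": "200"}, 42),
            ({"method": "GET", "status": "500"}, 3),
        ],
    )
    print(body, end="")
```

    With a series like this scraped, a PromQL expression such as rate(http_requests_total[5m]) would give the per-second request rate over the last five minutes for each label combination.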


    The easiest way to get the full stack running on Kubernetes is kube-prometheus-stack — it bundles Prometheus, Grafana, Alertmanager, and a set of pre-built dashboards and alerting rules for Kubernetes components.


    Installing kube-prometheus-stack





    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update

    helm install prometheus prometheus-community/kube-prometheus-stack \
      --namespace prometheus \
      --create-namespace \
      --values values.yaml







    A minimal values.yaml to get started:






    grafana:
      adminUser: admin
      adminPassword: your_password
      service:
        type: NodePort
        nodePort: 30000







    Once it's running, Grafana comes with pre-built dashboards for cluster health, node resource usage, pod performance, and Kubernetes component metrics. I actively use these — mostly for checking memory and CPU trends across the cluster.


    Logs: Why Not ELK

    The standard alternative to what I'm using is the ELK stack: Elasticsearch, Logstash, Kibana. It's powerful but heavy. Elasticsearch builds a full-text index over everything it ingests, which means significant memory and CPU overhead even at low log volumes. On a 3-node cluster with limited resources, running Elasticsearch alongside everything else didn't make sense. It also adds Kibana as a separate UI, which means maintaining two dashboard tools: Kibana for logs and Grafana for metrics.


    Loki takes a different approach — it indexes only metadata (labels like pod name, namespace, container), not the full log content. The logs themselves are stored compressed in object storage. This makes it much lighter to run and cheaper to store. Since it's built by Grafana Labs, it integrates directly into Grafana as a data source — same dashboard for metrics and logs.
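
    The difference is easier to see in a toy model. The sketch below is not how Loki is implemented; it only illustrates the idea that the index holds label sets, and log content is scanned only after the labels have narrowed things down to a few streams. The class name and the label values are hypothetical.

```python
from collections import defaultdict


class ToyLogStore:
    """Toy label-indexed log store: labels are indexed, log lines are not."""

    def __init__(self) -> None:
        # Key: the stream's label set. Value: its raw log lines, unindexed.
        self._streams: dict[frozenset, list[str]] = defaultdict(list)

    def push(self, labels: dict[str, str], line: str) -> None:
        """Append a line to the stream identified by its exact label set."""
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict[str, str], contains: str = "") -> list[str]:
        """Select streams by label first, then scan only those streams' lines."""
        wanted = set(selector.items())
        results: list[str] = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:
                results.extend(line for line in lines if contains in line)
        return results


if __name__ == "__main__":
    store = ToyLogStore()
    store.push({"namespace": "prod", "pod": "api-1"}, "GET /health 200")
    store.push({"namespace": "prod", "pod": "api-1"}, "error: timeout")
    store.push({"namespace": "dev", "pod": "api-2"}, "error: boom")
    print(store.query({"namespace": "prod"}, contains="error"))
```

    The index stays small because it grows with the number of distinct label sets, not with the volume of log text.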


    Fluent Bit runs as a DaemonSet on every node, tails container log files, and ships them to Loki. It's lightweight by design, built for high-throughput log forwarding without consuming much memory.


    Setting Up Loki

    Loki stores logs in S3-compatible object storage — I use Cloudflare R2.






    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo update

    helm install loki grafana/loki \
      --namespace logging \
      --create-namespace \
      -f values.yaml







    The important parts of values.yaml:






    deploymentMode: SingleBinary

    loki:
      commonConfig:
        replication_factor: 1
      ingester:
        chunk_encoding: snappy
      querier:
        max_concurrent: 2
      schemaConfig:
        configs:
          - from: "2024-06-01"
            index:
              period: 24h
              prefix: loki_index_
            object_store: s3
            schema: v13
            store: tsdb
      storage:
        bucketNames:
          admin:
          chunks:
          ruler:
        s3:
          accessKeyId:
          secretAccessKey:
          s3: s3://:@/
          s3ForcePathStyle: false
          insecure: false
        type: s3

    singleBinary:
      replicas: 1
      resources:
        limits:
          cpu: 3
          memory: 3Gi
        requests:
          cpu: 2
          memory: 1Gi
      extraEnv:
        - name: GOMEMLIMIT
          value: 2750MiB

    minio:
      enabled: false







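    SingleBinary mode runs everything in one pod, which is suitable for a small cluster. chunk_encoding: snappy compresses logs before storing them in R2, which reduces storage costs. GOMEMLIMIT sets a soft memory limit for the Go runtime: as the process approaches 2750MiB, the garbage collector works harder, which helps keep the pod under its 3Gi limit instead of being OOM-killed. It is the same container-awareness issue as GOMAXPROCS, but for memory.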
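
    The 2750MiB value is roughly 90% of the 3Gi limit (3Gi is 3072MiB). As a rough sketch of that arithmetic, the helper below derives a GOMEMLIMIT string from a Kubernetes memory limit. The 10% headroom is my assumption about a reasonable default, not a rule from the Go or Loki documentation, and the function names are hypothetical.

```python
# Binary suffixes used by Kubernetes memory quantities, expressed in MiB.
UNITS_MIB = {"Mi": 1, "Gi": 1024}


def to_mib(quantity: str) -> int:
    """Convert a Kubernetes quantity such as '3Gi' or '512Mi' to MiB."""
    for suffix, factor in UNITS_MIB.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {quantity!r}")


def gomemlimit_for(limit: str, headroom_pct: int = 10) -> str:
    """Return a GOMEMLIMIT value that leaves headroom_pct percent of the limit free.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    return f"{to_mib(limit) * (100 - headroom_pct) // 100}MiB"


def fraction_of_limit(gomemlimit_mib: int, limit: str) -> float:
    """Return how much of the pod limit a given GOMEMLIMIT (in MiB) occupies."""
    return round(gomemlimit_mib / to_mib(limit), 3)


if __name__ == "__main__":
    print(to_mib("3Gi"))                   # 3072
    print(gomemlimit_for("3Gi"))           # 2764MiB
    print(fraction_of_limit(2750, "3Gi"))  # 0.895
```

    The headroom matters because GOMEMLIMIT is a soft limit that covers memory managed by the Go runtime, so the process can still exceed it briefly.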


    Setting Up Fluent Bit

    Fluent Bit runs as a DaemonSet — one pod per node, tailing all container logs at /var/log/containers/*.log and forwarding to Loki.






    helm repo add fluent https://fluent.github.io/helm-charts
    helm repo update

    helm install fluent-bit fluent/fluent-bit \
      --namespace logging \
      -f values.yaml







    The important parts of values.yaml:






    args:
      - -e
      - /fluent-bit/bin/out_grafana_loki.so
      - --workdir=/fluent-bit/etc
      - --config=/fluent-bit/etc/conf/fluent-bit.conf

    config:
      inputs: |
        [INPUT]
            Name tail
            Tag kube.*
            Path /var/log/containers/*.log
            multiline.parser docker, cri
            Mem_Buf_Limit 5MB
            Skip_Long_Lines On

      outputs: |
        [Output]
            Name grafana-loki
            Match kube.*
            Url ${FLUENT_LOKI_URL}
            TenantID foo
            Labels {job="fluent-bit"}
            LabelKeys level,app
            BatchWait 1
            BatchSize 1001024
            LineFormat json
            LogLevel info
            AutoKubernetesLabels true

    env:
      - name: FLUENT_LOKI_URL
        value: http://loki-gateway.logging.svc.clus...ki/api/v1/push

    image:
      repository: grafana/fluent-bit-plugin-loki
      tag: main-e2ed1c0







    AutoKubernetesLabels true automatically attaches Kubernetes metadata (pod name, namespace, container name) as Loki labels, which makes filtering logs in Grafana much more useful. LabelKeys level,app promotes those specific record fields into Loki stream labels. As far as I can tell from the plugin documentation, all other fields stay in the log line itself, which LineFormat json writes out as JSON.


    Connecting Loki to Grafana

    In Grafana, add Loki as a data source:

    1. Go to Configuration → Data Sources → Add data source
    2. Select Loki
    3. Set URL to http://loki-gateway.logging.svc.cluster.local
    4. Click Save & Test


    Now logs are queryable in the Explore tab using LogQL, and we can build dashboards that combine metrics from Prometheus and logs from Loki in the same view.
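
    The same LogQL also works outside Grafana through Loki's HTTP API. Below is a small sketch that builds a query_range URL for the {job="fluent-bit"} label set in the Fluent Bit config. The "error" filter and the timestamps are illustrative, and the function name is hypothetical. It only builds the URL and sends nothing.

```python
from urllib.parse import urlencode


def loki_query_range_url(
    base_url: str,
    logql: str,
    start_ns: int,
    end_ns: int,
    limit: int = 100,
) -> str:
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Select the fluent-bit stream, then keep only lines containing "error".
    query = '{job="fluent-bit"} |= "error"'
    url = loki_query_range_url(
        "http://loki-gateway.logging.svc.cluster.local",
        query,
        start_ns=1_700_000_000_000_000_000,
        end_ns=1_700_000_600_000_000_000,
    )
    print(url)
```

    If multi-tenancy is enabled (the Fluent Bit output sets TenantID foo), requests to this endpoint would also need the matching X-Scope-OrgID header.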


    Current State

    The full observability stack running on the cluster:
    • Prometheus — scraping metrics from all services and Kubernetes components
    • Grafana — dashboards for cluster health, pod performance, and logs
    • Alertmanager — firing alerts to Telegram on pod crashes, high memory, and disk usage
    • Loki — storing logs in Cloudflare R2
    • Fluent Bit — collecting and forwarding logs from every node


    I actively use Grafana for both metrics and logs. When something goes wrong, the Telegram alert arrives first, then I open Grafana to dig into what happened.



