Monitoring & Observability

**MyrinNew** · 04-19-2026, 11:50 PM

Metrics tell us something is wrong. Logs tell us why. We need both. This post covers how I set up the full observability stack for ASTRING: Prometheus and Grafana for metrics, Fluent Bit and Loki for logs, and Alertmanager for alerting.

Metrics: Prometheus and Grafana

Why Prometheus

Prometheus is the standard for Kubernetes monitoring. It scrapes /metrics endpoints from our services and stores everything as time series data. PromQL lets us query and aggregate across that data. It also handles alerting rules, which I'll get to later.

The easiest way to get the full stack running on Kubernetes is kube-prometheus-stack — it bundles Prometheus, Grafana, Alertmanager, and a set of pre-built dashboards and alerting rules for Kubernetes components.
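
As a rough sketch of how a service ends up being scraped with this stack: kube-prometheus-stack runs the Prometheus Operator, which discovers scrape targets through ServiceMonitor resources. The example assumes a hypothetical Service labelled app: astring-api in a hypothetical astring namespace, with a named port http that serves /metrics. The release: prometheus label assumes the Helm release is called prometheus and that the chart's default selector behaviour is unchanged.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: astring-api          # hypothetical name
  namespace: astring         # hypothetical namespace
  labels:
    release: prometheus      # assumed Helm release name; lets the default selector find this object
spec:
  selector:
    matchLabels:
      app: astring-api       # hypothetical label on the Service to scrape
  endpoints:
    - port: http             # name of the Service port (assumed)
      path: /metrics
      interval: 30s          # illustrative scrape interval
```

With no namespaceSelector set, the ServiceMonitor only matches Services in its own namespace.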

Installing kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus \
  --create-namespace \
  --values values.yaml

A minimal values.yaml to get started:

grafana:
  adminUser: admin
  adminPassword: your_password
  service:
    type: NodePort
    nodePort: 30000
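
To keep the admin password out of values.yaml, the Grafana subchart can read the credentials from an existing Secret instead. A minimal sketch, assuming a Secret named grafana-admin (a hypothetical name) has already been created in the same namespace with admin-user and admin-password keys:

```yaml
grafana:
  admin:
    existingSecret: grafana-admin   # hypothetical Secret, created separately
    userKey: admin-user             # key in the Secret holding the username
    passwordKey: admin-password     # key in the Secret holding the password
  service:
    type: NodePort
    nodePort: 30000
```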

Once it's running, Grafana comes with pre-built dashboards for cluster health, node resource usage, pod performance, and Kubernetes component metrics. I actively use these — mostly for checking memory and CPU trends across the cluster.

Logs: Why Not ELK

The standard alternative to what I'm using is the ELK stack (Elasticsearch, Logstash, Kibana). It's powerful but heavy. By default, Elasticsearch builds a full-text index of everything it ingests, which means significant memory and CPU overhead even at low log volumes. On a 3-node cluster with limited resources, running Elasticsearch alongside everything else didn't make sense. It also adds Kibana as a separate UI, which means maintaining two dashboard tools: Kibana for logs and Grafana for metrics.

Loki takes a different approach — it indexes only metadata (labels like pod name, namespace, container), not the full log content. The logs themselves are stored compressed in object storage. This makes it much lighter to run and cheaper to store. Since it's built by Grafana Labs, it integrates directly into Grafana as a data source — same dashboard for metrics and logs.

Fluent Bit runs as a DaemonSet on every node, tails container log files, and ships them to Loki. It's lightweight by design, built for high-throughput log forwarding without consuming much memory.

Setting Up Loki

Loki stores logs in S3-compatible object storage — I use Cloudflare R2.

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki \
  --namespace logging \
  --create-namespace \
  -f values.yaml

The important parts of values.yaml:

deploymentMode: SingleBinary

loki:
  commonConfig:
    replication_factor: 1
  ingester:
    chunk_encoding: snappy
  querier:
    max_concurrent: 2
  schemaConfig:
    configs:
      - from: "2024-06-01"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v13
        store: tsdb
  storage:
    bucketNames:
      admin:
      chunks:
      ruler:
    s3:
      accessKeyId:
      secretAccessKey:
      s3: s3://:@/
      s3ForcePathStyle: false
      insecure: false
    type: s3

singleBinary:
  replicas: 1
  resources:
    limits:
      cpu: 3
      memory: 3Gi
    requests:
      cpu: 2
      memory: 1Gi
  extraEnv:
    - name: GOMEMLIMIT
      value: 2750MiB

minio:
  enabled: false

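SingleBinary mode runs every Loki component in one pod, which suits a small cluster. chunk_encoding: snappy compresses log chunks before they are written to R2, which reduces storage costs. GOMEMLIMIT sets a soft memory limit for the Go runtime, so the garbage collector works harder as the process approaches it, which helps keep Loki within the pod's memory limit. It's the same issue as GOMAXPROCS but for memory: the Go runtime doesn't pick up the container's memory limit on its own. 2750MiB is roughly 90% of the 3Gi limit (3072MiB), which leaves some headroom.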

Setting Up Fluent Bit

Fluent Bit runs as a DaemonSet — one pod per node, tailing all container logs at /var/log/containers/*.log and forwarding to Loki.

helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  -f values.yaml

The important parts of values.yaml:

args:
  - -e
  - /fluent-bit/bin/out_grafana_loki.so
  - --workdir=/fluent-bit/etc
  - --config=/fluent-bit/etc/conf/fluent-bit.conf

config:
  inputs: |
    [INPUT]
        Name tail
        Tag kube.*
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On

  outputs: |
    [Output]
        Name grafana-loki
        Match kube.*
        Url ${FLUENT_LOKI_URL}
        TenantID foo
        Labels {job="fluent-bit"}
        LabelKeys level,app
        BatchWait 1
        BatchSize 1001024
        LineFormat json
        LogLevel info
        AutoKubernetesLabels true

env:
  - name: FLUENT_LOKI_URL
    value: http://loki-gateway.logging.svc.clus...ki/api/v1/push

image:
  repository: grafana/fluent-bit-plugin-loki
  tag: main-e2ed1c0

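AutoKubernetesLabels true automatically attaches Kubernetes metadata (pod name, namespace, container name) as Loki labels, which makes filtering logs in Grafana much more useful. LabelKeys level,app is meant to promote those specific fields into Loki stream labels, while every other key stays in the log line itself (serialized as JSON here because of LineFormat json). One caveat: as far as I can tell from the plugin documentation, LabelKeys is ignored when AutoKubernetesLabels is true, so in this config the Kubernetes labels are probably the ones that actually take effect.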

Connecting Loki to Grafana

In Grafana, add Loki as a data source:

Go to Configuration → Data Sources → Add data source
Select Loki
Set URL to http://loki-gateway.logging.svc.cluster.local
Click Save & Test

Now logs are queryable in the Explore tab using LogQL, and we can build dashboards that combine metrics from Prometheus and logs from Loki in the same view.
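
The same data source can also be provisioned from the kube-prometheus-stack values.yaml instead of through the UI. This is a sketch under a few assumptions: it uses the chart's grafana.additionalDataSources value, and it assumes Loki is running with multi-tenancy enabled, in which case Grafana has to send an X-Scope-OrgID header matching the TenantID set in the Fluent Bit output (foo here). If multi-tenancy is off, the jsonData and secureJsonData parts can be dropped.

```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki-gateway.logging.svc.cluster.local
      jsonData:
        httpHeaderName1: X-Scope-OrgID   # tenant header, only needed with multi-tenancy
      secureJsonData:
        httpHeaderValue1: foo            # assumed to match TenantID in the Fluent Bit output
```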

Current State

The full observability stack running on the cluster:

Prometheus — scraping metrics from all services and Kubernetes components
Grafana — dashboards for cluster health, pod performance, and logs
Alertmanager — firing alerts to Telegram on pod crashes, high memory, and disk usage
Loki — storing logs in Cloudflare R2
Fluent Bit — collecting and forwarding logs from every node
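
For illustration, an alert of the "pod crashes" kind can be written as a PrometheusRule. This is a sketch, not the exact rule running on the cluster: the rule name, threshold, and durations are made up, the metric comes from kube-state-metrics (which the chart installs), and the release: prometheus label assumes the default rule selector. The chart's bundled rules already cover many similar cases.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: astring-alerts            # hypothetical name
  namespace: prometheus
  labels:
    release: prometheus           # assumed Helm release name; lets the default selector find this object
spec:
  groups:
    - name: pods
      rules:
        - alert: PodRestartingFrequently   # hypothetical alert name
          # More than 3 container restarts within 15 minutes (illustrative threshold)
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```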

I actively use Grafana for both metrics and logs. When something goes wrong, the Telegram alert fires first, then I open Grafana to dig into what happened.
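
Routing alerts to Telegram can be sketched in the kube-prometheus-stack values.yaml roughly like this. The bot token and chat ID are placeholders, the grouping and timing values are illustrative, and the "null" receiver is kept on the assumption that the chart's default route for the Watchdog alert still points at it.

```yaml
alertmanager:
  config:
    route:
      receiver: telegram                  # default receiver for everything not matched elsewhere
      group_by: ["alertname", "namespace"]
      group_wait: 30s
      repeat_interval: 4h
    receivers:
      - name: "null"                      # assumed to be referenced by the chart's default routes
      - name: telegram
        telegram_configs:
          - bot_token: "<bot-token>"      # placeholder; better kept in a Secret
            chat_id: 123456789            # placeholder numeric chat ID
            send_resolved: true
```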
