Observability Strategies
Observability strategies are approaches and methodologies used to design, implement, and maintain systems that are observable. Observability is the ability to understand the internal state of a system based on its external outputs, such as logs, metrics, and traces. Effective observability strategies enable organizations to monitor, troubleshoot, and optimize their systems efficiently.
1. Category
1.1. USE
The USE (Utilization, Saturation, Errors) method is a framework for analyzing infrastructure performance.
- Utilization
  The amount of a resource currently in use (e.g., CPU, memory, disk I/O, or network).
- Saturation
  The degree to which extra work is queued because a resource is near capacity (e.g., load average, CPU run queue, or disk I/O queue).
- Errors
  The number of failures observed (e.g., software, hardware, or network errors).
1.1.1. CPU
PromQL queries for CPU in Grafana dashboards using Prometheus and Node Exporter.
Note
The metrics available may depend on the exact version of the Node Exporter and the enabled collectors. Check the metrics exposed by the Node Exporter by accessing its /metrics endpoint.
-
Utilization
-
Queries
-
CPU Utilization
The percentage of CPU time that is not idle, which can be used as a proxy for CPU utilization.
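A minimal sketch of this query, assuming the Node Exporter cpu collector is enabled and that a 5-minute rate window suits your scrape interval. It mirrors the expression used by the alert rule in this section.

```promql
# Percentage of CPU time not spent idle, averaged across all cores of each instance.
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```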
-
-
Alerts
-
High CPU Utilization
The metric represents the percentage of CPU utilization. High values can indicate potential CPU saturation, which may lead to degraded performance or system instability.
- alert: HighCPUUtilization
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU utilization (instance {{ $labels.instance }})"
    description: "CPU utilization is above 80% for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
-
Saturation
Note
The metrics are just proxies for saturation and may not indicate an actual saturation state. Always interpret these metrics in the context of the system performance and other indicators.
-
Queries
-
CPU Saturation (System Load)
The system load average over 1 minute periods divided by the number of CPU cores can be used as a proxy for CPU saturation. A load average of 1.0 per CPU core indicates that the CPU is fully utilized.
-
CPU Saturation (Load Average)
The system load average over 1 minute periods can be used as a proxy for CPU saturation. A load average of 1.0 per CPU core indicates that the CPU is fully utilized.
-
CPU Saturation (Run Queue Length)
The number of tasks in the runnable or uninterruptible state. A high run queue length can indicate CPU saturation.
-
CPU Saturation (I/O Wait)
The fraction of time the CPU is idle while waiting for outstanding I/O operations to complete. High I/O wait usually points to a storage bottleneck rather than a busy CPU, so treat it as a weak proxy for CPU saturation.
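The four saturation proxies above could be expressed roughly as follows. Each expression is a separate panel query. The 5-minute window is an assumption, and node_procs_running and node_procs_blocked are used here as stand-ins for run queue length.

```promql
# System load: 1-minute load average divided by the number of CPU cores.
node_load1 / on (instance) count by (instance) (node_cpu_seconds_total{mode="idle"})

# Load average: raw 1-minute load average.
node_load1

# Run queue length: runnable processes, and processes blocked waiting on I/O.
node_procs_running
node_procs_blocked

# I/O wait: fraction of CPU time spent idle while waiting for I/O.
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))
```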
-
-
Alerts
-
High CPU Load Average
The metric represents the system load average over 1 minute periods. High values compared to the number of available CPU cores can indicate potential CPU saturation, which may lead to degraded performance or system instability.
- alert: HighCPULoadAverage
  expr: node_load1 > on(instance) count by(instance) (node_cpu_seconds_total{mode="idle"})
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU load average (instance {{ $labels.instance }})"
    description: "CPU load average is above the number of CPU cores for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
-
Errors
Note
The metrics are just proxies for errors and may not indicate an actual error state. Always interpret these metrics in the context of the system performance and other indicators.
-
Queries
-
CPU Errors (Context Switches)
Context switches can be a sign of CPU contention or other issues. A high rate of context switches may indicate that the CPU is experiencing performance problems.
-
CPU Errors (Steal Time)
In virtualized environments, steal time refers to the time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor. High steal time can indicate that the host machine is overloaded.
-
CPU Errors (Soft Lockups)
Soft lockups occur when a CPU is stuck in a loop and cannot schedule other tasks. A high rate of soft lockups can indicate CPU issues.
-
CPU Errors (Guest Time)
In virtualized environments, this refers to the time a CPU was busy executing a virtual CPU for guest operating systems under the control of the Linux kernel. High guest time can indicate that the virtual machine is experiencing performance issues.
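A hedged sketch of the context switch, steal time, and guest time queries, assuming the default cpu and stat collectors and a 5-minute window. No soft lockup query is sketched because that would depend on a custom metric, for example one derived from kernel logs.

```promql
# Context switches per second.
rate(node_context_switches_total[5m])

# Steal time: fraction of CPU time taken by the hypervisor for other guests.
avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m]))

# Guest time: CPU seconds per second spent running guest virtual CPUs.
sum by (instance) (rate(node_cpu_guest_seconds_total[5m]))
```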
-
-
Alerts
-
High CPU Context Switches
The metric represents the rate of CPU context switches. High values can indicate potential CPU contention or other issues that may lead to degraded performance or system instability.
- alert: HighCPUContextSwitches
  expr: rate(node_context_switches_total[5m]) > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU context switches (instance {{ $labels.instance }})"
    description: "CPU context switches are above 1000 per second for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
1.1.2. Memory
PromQL queries for Memory in Grafana dashboards using Prometheus and Node Exporter.
Note
The metrics available may depend on the exact version of the Node Exporter and the enabled collectors. Check the metrics exposed by the Node Exporter by accessing its /metrics endpoint.
-
Utilization
-
Queries
-
Memory Utilization
The percentage of memory in use, calculated as the total memory minus the free memory, cached memory, buffers, and reclaimable memory, divided by the total memory.
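Two possible forms of this query, assuming a Linux host where the meminfo collector exposes these gauges. The first follows the description literally. The second relies on the kernel's MemAvailable estimate, as the alert rule in this section does.

```promql
# Total minus free, cached, buffers, and reclaimable slab, as a percentage of total.
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_SReclaimable_bytes) / node_memory_MemTotal_bytes * 100

# Shorter form based on the kernel's MemAvailable estimate.
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
```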
-
-
Alerts
-
High Memory Utilization
The metric represents the percentage of memory in use. High values can indicate potential memory saturation, which may lead to degraded performance or system instability.
- alert: HighMemoryUtilization
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory utilization (instance {{ $labels.instance }})"
    description: "Memory utilization is above 80% for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
-
Saturation
Note
Measuring memory saturation can be complex and often requires taking multiple factors into account. Depending on the system and workload, different metrics might be more or less relevant.
-
Queries
-
Memory Saturation (SWAP)
Swap usage serves as a proxy for how much extra work is queued because memory is short. High swap usage can indicate memory pressure and potential performance issues.
-
Memory Saturation (Page Swap Activity)
The rate of page swaps can indicate memory pressure. High rates of page swapping can lead to performance degradation.
-
Memory Saturation (Memory Buffer Size)
The amount of memory used for kernel buffers. Buffer memory is reclaimable, so interpret it alongside swap activity: a shrinking buffer pool combined with rising swap usage can indicate memory pressure.
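The three memory saturation proxies above might look like this, assuming the meminfo and vmstat collectors are enabled. Each expression is a separate query, and the swap percentage returns no useful value on hosts without swap configured.

```promql
# Swap: percentage of swap space in use.
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100

# Page swap activity: pages swapped in and out per second.
rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])

# Memory buffer size in bytes.
node_memory_Buffers_bytes
```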
-
-
Alerts
-
High Memory Swap Saturation
The metric represents the percentage of swap memory in use. High values can indicate potential memory pressure, which may lead to degraded performance or system instability.
- alert: HighMemorySwapSaturation
  expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory swap saturation (instance {{ $labels.instance }})"
    description: "Memory swap saturation is above 50% for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
-
Errors
-
Queries
-
Memory Errors (OOM Kills)
The metric represents the number of Out-Of-Memory (OOM) kills that have occurred. OOM kills happen when the system runs out of memory and the kernel terminates processes to free up memory. High values can indicate potential memory issues.
-
Memory Errors (Memory Page Faults)
The metric represents the rate of major page faults. High values can indicate potential memory issues, such as insufficient memory or memory leaks.
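A sketch of both memory error queries, assuming the vmstat collector exposes the oom_kill and pgmajfault fields on your kernel and Node Exporter version.

```promql
# OOM kills observed over the last 5 minutes.
increase(node_vmstat_oom_kill[5m])

# Major page faults per second.
rate(node_vmstat_pgmajfault[5m])
```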
-
-
Alerts
-
OOM Kills Detected
The metric represents the number of Out-Of-Memory (OOM) kills that have occurred. OOM kills happen when the system runs out of memory and the kernel terminates processes to free up memory. High values can indicate potential memory issues.
-
High Memory Page Faults (Potential Memory Issues)
The metric represents the rate of major page faults. High values can indicate potential memory issues, such as insufficient memory or memory leaks.
- alert: HighMemoryPageFaults
  expr: rate(node_vmstat_pgmajfault[5m]) > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory page faults (instance {{ $labels.instance }})"
    description: "Memory page faults are above 1000 per second for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
1.1.3. Network
PromQL queries for Network in Grafana dashboards using Prometheus and Node Exporter.
Note
The metrics available may depend on the exact version of the Node Exporter and the enabled collectors. Check the metrics exposed by the Node Exporter by accessing its /metrics endpoint.
-
Utilization
-
Queries
-
Traffic Receive
The rate of data received on a network interface.
-
Traffic Transmit
The rate of data transmitted on a network interface.
-
Traffic Bandwidth
Measured as the amount of data sent or received per unit time. High rates can indicate that the network interface is heavily utilized.
-
Packet Rate
The rate of packets sent or received can be indicative of network utilization.
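The four utilization queries above could be sketched as follows. Excluding the loopback device and using a 5-minute window are assumptions; adjust the device filter to match your interfaces.

```promql
# Traffic Receive: bytes received per second, per interface.
rate(node_network_receive_bytes_total{device!="lo"}[5m])

# Traffic Transmit: bytes transmitted per second, per interface.
rate(node_network_transmit_bytes_total{device!="lo"}[5m])

# Traffic Bandwidth: combined throughput in bits per second.
(rate(node_network_receive_bytes_total{device!="lo"}[5m]) + rate(node_network_transmit_bytes_total{device!="lo"}[5m])) * 8

# Packet Rate: packets received and transmitted per second.
rate(node_network_receive_packets_total{device!="lo"}[5m]) + rate(node_network_transmit_packets_total{device!="lo"}[5m])
```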
-
-
Alerts
-
High Network Utilization
The metric represents the combined receive and transmit throughput of a network interface in MiB/s. High values can indicate that the network interface is heavily utilized.
- alert: HighNetworkUtilization
  expr: sum by(instance, device) (rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m])) / 1024 / 1024 > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High network utilization (instance {{ $labels.instance }}, device {{ $labels.device }})"
    description: "Network utilization is above 100 MiB/s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
-
Saturation
-
Queries
-
Network Saturation (Traffic Drop)
The rate of dropped received and transmitted packets, used as a proxy for network saturation.
-
Network Saturation (Transmit Queue Length)
Indicates if the network interface is saturated. A growing queue length can indicate higher saturation levels.
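A sketch of the two saturation queries. One caveat is an assumption worth checking: node_network_transmit_queue_length is understood to report the interface's configured transmit queue length (tx_queue_len) rather than the live backlog, so verify what it measures on your hosts before relying on it.

```promql
# Traffic drop: packets dropped per second on receive and transmit.
rate(node_network_receive_drop_total[5m]) + rate(node_network_transmit_drop_total[5m])

# Transmit queue length, per interface.
node_network_transmit_queue_length
```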
-
-
Alerts
-
High Network Dropped Packets
The metric represents the number of network packets that were dropped. High values can indicate network congestion or hardware issues.
- alert: HighNetworkDroppedPackets
  expr: sum by(instance, device) (rate(node_network_receive_drop_total[5m]) + rate(node_network_transmit_drop_total[5m])) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High network dropped packets (instance {{ $labels.instance }}, device {{ $labels.device }})"
    description: "Network dropped packets are above 10 per second for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
-
Errors
-
Queries
-
Network Errors (Traffic Errors)
Error counts can be a sign of network problems.
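A minimal sketch of this query, assuming the netdev collector and a 5-minute window. It matches the expression used by the alert rule in this section, without the aggregation.

```promql
# Receive and transmit errors per second, per interface.
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])
```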
-
-
Alerts
-
High Network Errors
The metric represents the number of network errors that have occurred. High values can indicate network hardware issues, configuration problems, or other network-related errors.
- alert: HighNetworkErrors
  expr: sum by(instance, device) (rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High network errors (instance {{ $labels.instance }}, device {{ $labels.device }})"
    description: "Network errors are above 10 per second for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
1.1.4. Disk I/O
PromQL queries for Disk I/O in Grafana dashboards using Prometheus and Node Exporter.
Note
The metrics available may depend on the exact version of the Node Exporter and the enabled collectors. Check the metrics exposed by the Node Exporter by accessing its /metrics endpoint.
-
Utilization
-
Queries
-
Disk I/O Utilization
The percentage of time a disk device is busy servicing I/O.
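A minimal sketch of this query, assuming the diskstats collector is enabled. node_disk_io_time_seconds_total counts the seconds a device spent with I/O in progress, so its per-second rate is the busy fraction.

```promql
# Percentage of time each disk device was busy servicing I/O.
rate(node_disk_io_time_seconds_total[5m]) * 100
```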
-
-
Alerts
-
High Disk I/O Utilization
The metric represents the percentage of time a disk device is busy servicing I/O. High values can indicate that the disk is heavily utilized, which may lead to performance degradation.
- alert: HighDiskIOUtilization
  expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High disk I/O utilization (instance {{ $labels.instance }})"
    description: "Disk I/O utilization is above 80% for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
-
Saturation
Note
The metrics are just proxies for saturation and may not indicate an actual saturation state. Always interpret these metrics in the context of the system performance and other indicators.
-
Queries
-
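Disk I/O Saturation (I/O Weighted Time)
The weighted time spent doing I/O, which grows with the number of requests in flight. Its rate approximates the average I/O queue length and so the degree to which extra work is queued.
-
Disk I/O Saturation (I/O Wait Time)
The average time a read or write operation takes to complete, including time spent waiting in the queue.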
-
Disk I/O Saturation (I/O Service Time)
The time it takes to service I/O requests can indicate saturation. A high service time can mean that the disk is saturated with requests.
-
Disk I/O Saturation (I/O Queue Operations)
The metric represents the number of I/O operations currently in progress for a given disk. A high value may suggest a high load or potentially a performance issue.
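The four saturation proxies above might be queried as follows, assuming the diskstats collector. Each expression is a separate query. The ratio expressions return NaN for devices with no completed operations in the window, and the service time formula is only a rough approximation.

```promql
# I/O weighted time: approximates the average number of requests queued or in flight.
rate(node_disk_io_time_weighted_seconds_total[5m])

# I/O wait time: average seconds per completed read and per completed write.
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])
rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m])

# I/O service time: busy time divided by completed operations.
rate(node_disk_io_time_seconds_total[5m]) / (rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m]))

# I/O queue operations: I/O operations currently in progress.
node_disk_io_now
```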
-
-
Alerts
-
High Disk I/O Wait
The metric represents the fraction of CPU time spent idle while waiting for outstanding I/O operations. High values can indicate that the disk is experiencing high latency or contention.
- alert: HighDiskIOWait
  expr: avg by(instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High disk I/O wait (instance {{ $labels.instance }})"
    description: "Disk I/O wait is above 20% for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
-
Errors
Note
The metrics are just proxies for errors and may not indicate an actual error state. Always interpret these metrics in the context of the system performance and other indicators.
-
Queries
-
Disk IO Errors (Read Errors)
Read metric with a status label indicating failure, or another dedicated error metric provided by the node exporter.
-
Disk IO Errors (Write Errors)
Write metrics with a status label indicating failure, or another dedicated error metric provided by the node exporter.
-
-
Alerts
-
Low Disk Read Rate (Potential Disk Issues)
The metric represents the number of disk read operations completed per second. Low values can indicate potential disk issues, but they are also normal on an idle or underutilized disk, so tune the threshold to the expected workload.
- alert: LowDiskReadRate
  expr: rate(node_disk_reads_completed_total[5m]) < 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Low disk read rate (instance {{ $labels.instance }})"
    description: "Disk read rate is below 100 ops/sec for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
Low Disk Write Rate (Potential Disk Issues)
The metric represents the number of disk write operations completed per second. Low values can indicate potential disk issues, but they are also normal on an idle or underutilized disk, so tune the threshold to the expected workload.
- alert: LowDiskWriteRate
  expr: rate(node_disk_writes_completed_total[5m]) < 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Low disk write rate (instance {{ $labels.instance }})"
    description: "Disk write rate is below 100 ops/sec for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
1.1.5. Storage
PromQL queries for Storage in Grafana dashboards using Prometheus and Node Exporter.
Note
The metrics available may depend on the exact version of the Node Exporter and the enabled collectors. Check the metrics exposed by the Node Exporter by accessing its /metrics endpoint.
-
Utilization
-
Queries
-
Storage Utilization
The percentage of disk space used on the filesystem.
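A minimal sketch of this query, assuming the filesystem collector. The fstype filter that excludes tmpfs and overlay mounts is an assumption; adjust it to the filesystems you care about.

```promql
# Percentage of filesystem space in use, per mount point.
100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100)
```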
-
-
Alerts
-
Low Disk Space
The metric represents the available disk space as a percentage of the total disk space. Low values can indicate that the disk is running out of space, which may lead to performance degradation or system instability.
- alert: LowDiskSpace
  expr: node_filesystem_avail_bytes{mountpoint="/", fstype !~ "tmpfs|shmpfs|overlay"} / node_filesystem_size_bytes{mountpoint="/", fstype !~ "tmpfs|shmpfs|overlay"} * 100 < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Low disk space (instance {{ $labels.instance }})"
    description: "Disk space is below 20% for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
-
Saturation
-
Queries
-
Storage Saturation
The metric represents the percentage of time a disk device is busy servicing I/O. High values can indicate that the disk is heavily utilized, which may lead to performance degradation.
-
-
Alerts
-
Low Free File Nodes (Potential Inode Exhaustion)
The metric represents the percentage of free inodes on the filesystem. Low values can indicate potential inode exhaustion, which may lead to the inability to create new files.
- alert: LowFreeInodes
  expr: node_filesystem_files_free{mountpoint="/", fstype !~ "tmpfs|shmpfs|overlay"} / node_filesystem_files{mountpoint="/", fstype !~ "tmpfs|shmpfs|overlay"} * 100 < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Low free inodes (instance {{ $labels.instance }})"
    description: "Free inodes are below 10% for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
-
Errors
-
Queries
-
Storage Errors
The metric represents the number of disk or filesystem errors that have occurred. High values can indicate potential hardware issues, configuration problems, or other storage-related errors.
-
-
Alerts
-
Disk Filesystem Error (Potential Permissions or Configuration Issues)
The node_filesystem_device_error metric is 1 when Node Exporter failed to read statistics for a filesystem and 0 otherwise. A non-zero value can indicate permission problems, an unresponsive mount, configuration problems, or other storage-related errors.
- alert: DiskFilesystemError
  expr: node_filesystem_device_error{mountpoint="/", fstype !~ "tmpfs|shmpfs|overlay"} > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Disk/Filesystem error (instance {{ $labels.instance }})"
    description: "Disk or filesystem errors have occurred for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-
1.2. RED
RED (Rate, Errors, Duration) is a method for monitoring the health and performance of request-driven services.
- Rate
  The number of requests received per second, for example HTTP requests per second, network I/O rate, or transactions per second.
- Errors
  The rate of failing requests, expressed as the percentage of requests that fail or as the number of errors per second.
- Duration
  The time taken to process a request, for example the average, the 95th percentile, or the maximum request duration.
1.2.1. HTTP
-
Rate
-
Queries
-
HTTP Rate
The rate of incoming HTTP requests per second.
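A sketch of this query under a stated assumption: http_requests_total is a hypothetical counter name, since the actual metric depends on how the service is instrumented. Substitute whatever your instrumentation library exposes.

```promql
# Incoming HTTP requests per second, per job.
# http_requests_total is a hypothetical metric name.
sum by (job) (rate(http_requests_total[5m]))
```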
-
-
-
Errors
-
Queries
-
HTTP Error Ratio (5xx)
Percentage of HTTP 5xx errors over a period of time.
-
HTTP Error Rate (5xx)
The metric represents the rate of HTTP 5xx errors over a period of time.
-
HTTP Error Rate (4xx)
The metric represents the rate of HTTP 4xx errors over a period of time.
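The three error queries above could look like this, assuming a hypothetical http_requests_total counter that carries the response code in a status label. Both the metric and the label name vary between instrumentation libraries.

```promql
# HTTP Error Ratio (5xx): percentage of requests that returned a 5xx status.
sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) * 100

# HTTP Error Rate (5xx): 5xx responses per second.
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

# HTTP Error Rate (4xx): 4xx responses per second.
sum by (job) (rate(http_requests_total{status=~"4.."}[5m]))
```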
-
-
-
Duration
-
Queries
-
HTTP Duration
The amount of time taken to process requests, usually visualized as the 95th or 99th percentile.
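A sketch of a percentile query, assuming request durations are recorded in a Prometheus histogram with the hypothetical name http_request_duration_seconds. histogram_quantile needs the le label preserved in the aggregation.

```promql
# 95th percentile request duration in seconds, per job.
# http_request_duration_seconds_bucket is a hypothetical metric name.
histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
```

Changing 0.95 to 0.99 gives the 99th percentile from the same buckets.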
-
-
1.3. The Four Golden Signals
The Four Golden Signals are a set of key monitoring metrics recommended by Google in Site Reliability Engineering (SRE) practices. They provide a high-level overview of distributed system health and performance.
Note
The signals are considered golden because they offer a broad yet detailed view of a system's health and performance from the perspective of end-users. They help SREs and DevOps teams to quickly diagnose and address issues, ensuring that the system remains reliable and available for users.
- Latency
  The time it takes to service a request. It's important to distinguish between the latency of successful requests and the latency of failed requests. Usually visualized as the 95th or 99th percentile.
- Traffic
  The amount of demand being placed on the system. For a web service this is often HTTP requests per second; for an audio streaming system, network I/O rate or concurrent sessions; for a key-value storage system, transactions and retrievals per second. It can also be measured as the number of concurrently active users.
- Errors
  The rate of failed requests, whether they fail explicitly (for example, an HTTP 4xx or 5xx status code), implicitly (for example, a success response with the wrong content), or by policy (for example, a response slower than an agreed threshold). Typically measured as a fraction of total requests.
- Saturation
  The degree to which the system is overloaded. For example, CPU saturation can be measured by the percentage of time the CPU is busy, while disk saturation can be measured by the percentage of time the disk is busy servicing I/O requests.
1.3.1. Microservice
-
Latency
-
Queries
-
HTTP
The time taken to respond to requests.
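Since the latency signal distinguishes successful from failed requests, one way to sketch it is with two percentile queries. This assumes a hypothetical http_request_duration_seconds histogram that carries a status label, which many instrumentation libraries do not add by default.

```promql
# 99th percentile latency of requests that did not return a 5xx status.
# http_request_duration_seconds_bucket and its status label are hypothetical.
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])))

# 99th percentile latency of requests that returned a 5xx status.
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])))
```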
-
-
-
Traffic
-
Queries
-
HTTP
The rate of requests per second.
-
Network
Network I/O rate over the last minute.
-
Transaction
Transactions per second over the last minute.
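The three traffic queries above might be sketched as follows. http_requests_total and transactions_total are hypothetical metric names, and a 1-minute window assumes a scrape interval of 30 seconds or less, so that each window contains at least two samples.

```promql
# HTTP: requests per second.
# http_requests_total is a hypothetical metric name.
sum(rate(http_requests_total[1m]))

# Network: bytes received and transmitted per second over the last minute.
sum by (instance) (rate(node_network_receive_bytes_total[1m]) + rate(node_network_transmit_bytes_total[1m]))

# Transaction: transactions per second over the last minute.
# transactions_total is a hypothetical metric name.
sum(rate(transactions_total[1m]))
```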
-
-
-
Errors
-
Queries
-
HTTP
Percentage of HTTP 5xx errors over the last 5 minutes.
-
HTTP
Rate of HTTP 5xx errors over the last minute.
-
HTTP
Rate of HTTP 4xx errors over the last minute.
-
-
-
Saturation
-
Queries
-
CPU
The metric represents the percentage of CPU utilization. High values can indicate potential CPU saturation, which may lead to degraded performance or system instability.
-
-
2. References
- Google SRE book.
- Prometheus Exporters page.
- Grafana Dashboards Best Practice page.
- Grafana Play The Four Golden Signals page.