In Part one of our three-part series on monitoring Kubernetes, we dove deep into the Kubernetes architecture to get a firm foundation on how Kubernetes works. Understanding what an application does and how it functions is critical to monitoring it effectively. Now it’s time to journey a bit deeper into each of those components and understand what you need to keep a close eye on.
What It Means to Monitor an Application
When monitoring any application, you always need to be able to answer two critical questions: What is happening? And why is it happening? Access to machine data from your applications lets you address these questions, ensuring you can quickly identify issues and take action to remedy them. This machine data comes in two forms: metrics and logs.
Time series metrics tell you what is happening: for example, how much load is on the system or how much of each resource is currently being consumed. Metrics are like the indicators in your car, such as the check engine light that comes on when there is a problem. They give you insight into the system’s current behavior and can be an early warning sign of issues to come.
Logs tell you why something is happening. They provide a record of the events that are occurring in your application. They can be informative, providing context around the actions the app is taking, or they can be errors indicating why something is breaking. Logs are like the results of the diagnostic tool your mechanic plugs in to determine why the check engine light is on.
Machine Data From Kubernetes And What It Orchestrates
Like any application, Kubernetes produces a comprehensive set of machine data that enables you to always answer the what and the why when monitoring it. You also need machine data for the applications running inside Kubernetes. Let’s dive into the critical data you need to collect from every Kubernetes cluster.
Common Metrics
Kubernetes is written in Go and exposes some essential metrics about the Go runtime. These metrics help you keep an eye on the state of the Go processes behind each component. There are also critical metrics related to etcd: multiple components interact with etcd, and keeping an eye on those interactions gives you insight into potential etcd issues. Below are some of the top Go runtime stats and common etcd-related metrics exposed by most Kubernetes components, followed by a sample of what the raw output looks like.
| Metric | Components | Description |
| --- | --- | --- |
| go_gc_duration_seconds | All | A summary of the GC invocation durations. |
| go_threads | All | Number of OS threads created. |
| go_goroutines | All | Number of goroutines that currently exist. |
| etcd_helper_cache_hit_count | API Server, Controller Manager | Counter of etcd helper cache hits. |
| etcd_helper_cache_miss_count | API Server, Controller Manager | Counter of etcd helper cache misses. |
| etcd_request_cache_add_latencies_summary | API Server, Controller Manager | Latency in microseconds of adding an object to the etcd cache. |
| etcd_request_cache_get_latencies_summary | API Server, Controller Manager | Latency in microseconds of getting an object from the etcd cache. |
| etcd_request_latencies_summary | API Server, Controller Manager | Etcd request latency summary in microseconds for each operation and object type. |
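To make the format concrete, here is an illustrative, heavily truncated sample of the raw Prometheus-format output you would see from a component’s /metrics endpoint; the values shown are made up.

```
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 84
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 16
# HELP etcd_helper_cache_hit_count Counter of etcd helper cache hits.
# TYPE etcd_helper_cache_hit_count counter
etcd_helper_cache_hit_count 1.284e+06
```

Each metric comes with HELP and TYPE lines describing it, followed by one or more samples, which is what makes this format straightforward to scrape and parse.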
Kubernetes Control Plane
As we learned, the Kubernetes Control Plane is the engine that powers Kubernetes. It consists of multiple parts working together to orchestrate your containerized applications. Each piece serves a specific function and exposes its own set of metrics for monitoring its health. To effectively monitor the Control Plane, you need visibility into each component’s health and state.
API Server
The API Server provides the front end for the Kubernetes cluster and is the central point with which all components interact. The following table presents the top metrics you need for clear visibility into the state of the API Server.
| Metric | Description |
| --- | --- |
| apiserver_request_count | Count of apiserver requests broken out for each verb, API resource, client, and HTTP response contentType and code. |
| apiserver_request_latencies | Response latency distribution in microseconds for each verb, resource, and subresource. |
Etcd
Etcd is the backing store for Kubernetes: a consistent and highly available key-value store where all of the data representing the state of the Kubernetes cluster resides. The following are some of the top metrics to watch in etcd.
| Metric | Description |
| --- | --- |
| etcd_server_has_leader | 1 if a leader exists, 0 if not. |
| etcd_server_leader_changes_seen_total | Number of leader changes. |
| etcd_server_proposals_applied_total | Number of proposals that have been applied. |
| etcd_server_proposals_committed_total | Number of proposals that have been committed. |
| etcd_server_proposals_pending | Number of proposals that are pending. |
| etcd_server_proposals_failed_total | Number of proposals that have failed. |
| etcd_debugging_mvcc_db_total_size_in_bytes | Actual size of database usage after a history compaction. |
| etcd_disk_backend_commit_duration_seconds | Latency distributions of commit calls made by the backend. |
| etcd_disk_wal_fsync_duration_seconds | Latency distributions of fsync calls made by the WAL. |
| etcd_network_client_grpc_received_bytes_total | Total number of bytes received from gRPC clients. |
| etcd_network_client_grpc_sent_bytes_total | Total number of bytes sent to gRPC clients. |
| grpc_server_started_total | Total number of gRPC requests started on the server. |
| grpc_server_handled_total | Total number of gRPC requests handled on the server. |
Scheduler
The Scheduler watches the Kubernetes API for newly created pods and determines which node should run them. It makes this decision based on the data it has available, including the collective resource availability of the nodes as well as the resource requirements of each pod. Monitoring scheduling latency ensures you have visibility into any delays the Scheduler is facing.
| Metric | Description |
| --- | --- |
| scheduler_e2e_scheduling_latency_microseconds | The end-to-end scheduling latency, which is the sum of the scheduling algorithm latency and the binding latency. |
Controller Manager
The Controller Manager is a daemon that embeds all the various control loops that run to ensure the desired state of your cluster is met. It watches the API Server and takes action based on the difference between the current state and the desired state. It’s important to keep an eye on the requests it makes to your cloud provider to ensure the Controller Manager can orchestrate successfully. Currently, these metrics are available for AWS, GCE, and OpenStack.
| Metric | Description |
| --- | --- |
| cloudprovider_*_api_request_duration_seconds | The latency of the cloud provider API call. |
| cloudprovider_*_api_request_errors | Cloud provider API request errors. |
Kube-State-Metrics
Kube-State-Metrics is a Kubernetes add-on that provides insights into the state of Kubernetes. It watches the Kubernetes API and generates various metrics, so you know what is currently running. Metrics are generated for just about every Kubernetes resource, including pods, deployments, daemonsets, and nodes. Numerous metrics are available; below are some of the key ones, followed by an example pod spec showing where the resource request and limit values come from.
| Metric | Description |
| --- | --- |
| kube_pod_status_phase | The current phase of the pod. |
| kube_pod_container_resource_limits_cpu_cores | Limit on CPU cores that can be used by the container. |
| kube_pod_container_resource_limits_memory_bytes | Limit on the amount of memory that can be used by the container. |
| kube_pod_container_resource_requests_cpu_cores | The number of CPU cores requested by the container. |
| kube_pod_container_resource_requests_memory_bytes | The amount of memory, in bytes, requested by the container. |
| kube_pod_container_status_ready | 1 if the container is ready, 0 if it is in a not-ready state. |
| kube_pod_container_status_restarts_total | Total number of restarts of the container. |
| kube_pod_container_status_terminated_reason | The reason that the container is in a terminated state. |
| kube_pod_container_status_waiting | Whether the container is in a waiting state. |
| kube_daemonset_status_desired_number_scheduled | The number of nodes that should be running the pod. |
| kube_daemonset_status_number_unavailable | The number of nodes that should be running the pod but are not able to. |
| kube_deployment_spec_replicas | The number of desired pod replicas for the Deployment. |
| kube_deployment_status_replicas_unavailable | The number of unavailable replicas per Deployment. |
| kube_node_spec_unschedulable | Whether a node can schedule new pods or not. |
| kube_node_status_capacity_cpu_cores | The total CPU resources available on the node. |
| kube_node_status_capacity_memory_bytes | The total memory resources available on the node. |
| kube_node_status_capacity_pods | The number of pods the node can schedule. |
| kube_node_status_condition | The current status of the node. |
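Several of the metrics above, such as kube_pod_container_resource_requests_cpu_cores and kube_pod_container_resource_limits_memory_bytes, simply surface what you declare in your pod specs. As a point of reference, here is a minimal, hypothetical pod spec (the names and values are illustrative) whose requests and limits would show up in those metrics:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout-service               # hypothetical pod name
spec:
  containers:
    - name: checkout
      image: example.com/checkout:1.0  # placeholder image
      resources:
        requests:
          cpu: 250m                    # surfaces as kube_pod_container_resource_requests_cpu_cores
          memory: 256Mi                # surfaces as kube_pod_container_resource_requests_memory_bytes
        limits:
          cpu: 500m                    # surfaces as kube_pod_container_resource_limits_cpu_cores
          memory: 512Mi                # surfaces as kube_pod_container_resource_limits_memory_bytes
```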
Node Components
As we learned in Part 1 of our journey into Kubernetes, the Nodes of a Kubernetes cluster are made up of multiple parts, and as such you have numerous pieces to monitor.
Kubelet
Keeping a close eye on the Kubelet ensures that the Control Plane can always communicate with the node the Kubelet is running on. In addition to the common Go runtime metrics, the Kubelet exposes some internals about its operations that are good to track.
| Metric | Description |
| --- | --- |
| kubelet_running_container_count | The number of containers that are currently running. |
| kubelet_runtime_operations | The cumulative number of runtime operations, broken out by operation type. |
| kubelet_runtime_operations_latency_microseconds | The latency of each operation by type, in microseconds. |
Node Metrics
Visibility into the standard host metrics of a node ensures you can monitor the health of each node in your cluster and avoid downtime caused by an issue with a particular node. You need visibility into all aspects of the node, including CPU and memory consumption, system load, filesystem activity, and network activity.
Container Metrics
Monitoring all of the Kubernetes metrics is just one piece of the puzzle. It is imperative that you also have visibility into the containerized applications that Kubernetes is orchestrating. At a minimum, you need access to the resource consumption of those containers. The Kubelet retrieves container metrics from cAdvisor, a tool that analyzes the resource usage of containers and makes those metrics available. These include the standard resource metrics such as CPU, memory, file system, and network usage.
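One quick way to see the Kubelet and cAdvisor metrics for yourself, assuming your credentials allow it, is to read them through the API Server’s node proxy with kubectl (substitute one of your own node names):

```
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics/cadvisor
```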
As we can see, there are many vital metrics inside your Kubernetes cluster. These metrics ensure you can always answer what is happening, not only in Kubernetes and its components but also in the applications running inside it.
Logs
Logs are how we answer why something is happening. They provide a record of what the code is doing and the actions it is taking. Kubernetes delivers a wealth of logging for each of its components, giving you insight into the decisions it is making. All of your containerized workloads also generate logs of their own. Access to these logs ensures you have comprehensive visibility to monitor and troubleshoot your applications.
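Out of the box, you can already inspect an individual workload’s logs ad hoc with kubectl; the pod name here is hypothetical:

```
kubectl logs checkout-service-5d8f7c9b4-x2kqz
kubectl logs -f checkout-service-5d8f7c9b4-x2kqz   # stream the logs live
```

That works well for spot checks on a single pod, but monitoring and troubleshooting a whole cluster requires collecting these logs centrally, which brings us to the next question.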
How Do I Collect This Data?
Now that we understand what machine data is available to us, how do we get to this data? The good news is that Kubernetes makes most of this data readily available; you just need the right tool to gather and view it.
Collecting Logs
As containers run inside Kubernetes, their log files are written to the node the container is running on. Every node in your cluster will have the logs from every container that runs on that node. We developed an open-source FluentD plugin that runs on every node in your cluster as a DaemonSet. The plugin is responsible for reading those logs and securely sending them to Sumo Logic, and it also enriches the logs with valuable metadata from Kubernetes.
When you create objects in Kubernetes, you can assign custom key-value pairs, called labels, to each of those objects. These labels help you organize and track additional information about each object. For example, you might have a label that represents the application name, the environment the pod is running in, or which team owns the resource. Labels are entirely flexible and can be defined however you need. Our FluentD plugin ensures those labels get captured along with the logs, giving you continuity between the resources you have created and the log files they are producing.
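For example, a pod might carry labels like the following; the label keys and values here are purely illustrative, and you are free to define your own:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout-service          # hypothetical pod name
  labels:
    app: checkout                 # the application name
    environment: production       # the environment the pod runs in
    team: payments                # the team that owns this resource
spec:
  containers:
    - name: checkout
      image: example.com/checkout:1.0   # placeholder image
```

When this pod’s logs are collected, those same label values ride along as metadata, so you can search and filter by app, environment, or team.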
The plugin can be used on any Kubernetes cluster, whether you use a managed service like Amazon EKS or run the cluster entirely on your own. Using it is as simple as deploying a DaemonSet to Kubernetes. We provide a default configuration that works out of the box with nearly any Kubernetes cluster, along with a rich set of configuration options so you can fine-tune it to your needs.
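To give a sense of what that looks like, here is a minimal, hypothetical DaemonSet for a node-level log collector. It is not the plugin’s actual manifest (the names, image, and paths are placeholders), but it shows the general pattern of running one collector pod per node and mounting the node’s log directory:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector                  # hypothetical name
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: collector
          image: example.com/fluentd-collector:1.0   # placeholder image
          volumeMounts:
            - name: varlog
              mountPath: /var/log      # read container log files written on the node
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```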
Collecting Metrics
Every component of Kubernetes exposes its metrics in a Prometheus format. The running processes behind those components serve up the metrics on an HTTP URL that you can access. For example, the Kubernetes API Server serves its metrics at https://$API_HOST:443/metrics. You simply need to scrape these various URLs to obtain the metrics.
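For example, if your credentials allow it, you can view the API Server’s own metrics directly with kubectl:

```
kubectl get --raw /metrics
```

The output is the same Prometheus text format shown earlier, which is exactly what a scraper consumes.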
We developed a new tool specifically designed to ingest Prometheus-formatted metrics into Sumo Logic. This tool can run standalone as a script on a host, or it can run as a container. It can ingest data from any target that provides Prometheus-formatted metrics, including those from Kubernetes. And today, we are open-sourcing that tool for all of our customers to use to start ingesting Prometheus-formatted metrics.
The Sumo Logic Prometheus Scraper can be configured to point at multiple targets serving Prometheus metrics. It supports including or excluding specific metrics and gives you full control over the metadata you send to Sumo Logic. That metadata support ensures that, when it comes to Kubernetes, you can capture on your metrics the same valuable metadata you have on your logs. You can use this metadata when searching through your logs and your metrics, and use them together for a unified experience when navigating your machine data.
Next Steps
Great, now that we know how to get this data into Sumo Logic, let’s see these tools in action and start making this data work for us to gain profound insights into everything in our Kubernetes clusters.
In the third and final post in this series, we will go through the steps to use these tools to collect all of this data into Sumo Logic and show how Sumo Logic can help you monitor and troubleshoot everything in your Kubernetes Cluster.