Certain questions about Kubernetes seem to come up again and again:
- What’s up with this init container stuff?
- What’s a CNI plugin?
- Why is Kubernetes complaining about pods not finishing initialisation?
Kubernetes is a complex system with a simple overall purpose: run user workloads in a way that permits the authors of the workloads to not care (much) about the messy details of the hardware underneath. The workload authors are supposed to be able to just focus on Pods and Services; in turn, Kubernetes is meant to arrange things such that workloads get mapped to Pods, Pods get deployed on Nodes, and the network in between looks flat and transparent.
This is simple to state, but extremely complex to implement in practice. (This is an area where Kubernetes is doing a great job of making things complex for the Kubernetes implementors so that they can be easier for the users – nicely done!) Under the hood, Kubernetes is leaning heavily on a number of technologies to make all this happen.
Clusters and cgroups and Pods...
The first major area that Kubernetes has to manage is actually running the workloads within the cluster. It relies heavily on OS-level isolation mechanisms for this:
- Clusters are composed of one or more Nodes, which are (possibly virtualised) machines. For this article, we’ll be talking about Linux Nodes.
- Since different Nodes are different machines (virtual or physical), everything on one Node is isolated from all other Nodes.
- Pods are composed of one or more containers, all of which are isolated from one another within the same Node using Linux cgroups and namespaces.
It’s worth noting that Linux itself runs at the Node level. Pods and containers don’t have distinct copies of the operating system, which is why isolation between them is such a big deal.
This multi-layer approach gives Kubernetes a way to orchestrate which workloads run where within the cluster, and to keep track of resource availability and consumption: workload containers are mapped to Pods, Pods are scheduled onto Nodes, and Nodes are all connected to a network.
Deployments, ReplicaSets, DaemonSets, etc., are all bookkeeping mechanisms for figuring out exactly which Pods get scheduled onto which Nodes, but the fundamental scheduling mechanism is the same across all of them.
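To make the "track resources, then place Pods on Nodes" idea concrete, here is a deliberately tiny sketch in Python. It is not how the real Kubernetes scheduler works: the Node and Pod classes, the resource figures, and the "most free CPU wins" rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    cpu_millicores: int   # CPU the Pod asks for (hypothetical figure)
    memory_mib: int       # memory the Pod asks for (hypothetical figure)


@dataclass
class Node:
    name: str
    cpu_millicores: int   # total CPU the Node can hand out
    memory_mib: int       # total memory the Node can hand out
    pods: list = field(default_factory=list)

    def free_cpu(self):
        return self.cpu_millicores - sum(p.cpu_millicores for p in self.pods)

    def free_memory(self):
        return self.memory_mib - sum(p.memory_mib for p in self.pods)


def schedule(pod, nodes):
    """Place the Pod on the fitting Node with the most free CPU.

    Returns the chosen Node, or None when no Node has room.
    """
    candidates = [
        n for n in nodes
        if n.free_cpu() >= pod.cpu_millicores and n.free_memory() >= pod.memory_mib
    ]
    if not candidates:
        return None  # nothing fits, so the Pod cannot be placed yet
    best = max(candidates, key=lambda n: n.free_cpu())
    best.pods.append(pod)  # bookkeeping: the Node now has less free capacity
    return best
```

Calling schedule() repeatedly shows the bookkeeping at work: each placement reduces the chosen Node's free capacity, which changes where the next Pod can land.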
Kubernetes Networking
The other major area that Kubernetes has to manage is the network. Kubernetes requires that Pods see a network that is flat and transparent: every Pod must be able to directly communicate with every other Pod, whether on the same Node or not. This implies that each Pod has to have its own IP address, which I’ll call a Pod IP in a fit of originality.
(Technically, any container within a Pod must be able to talk to containers in other Pods – but these IP addresses exist at the Pod level, not the container level. Multiple containers in one Pod share the same IP address.)
You could write a workload to use Pod IPs directly to talk to other workloads, but it’s not a good idea: Pod IPs change as Pods go up and down. Instead, we generally refer to workloads using a Kubernetes Service. Services are actually fairly complex (even though I’m glossing over headless Services and such here!):
- A Service causes a DNS entry to be allocated, so that workloads can refer to the Service using a name.
- The Service also allocates a Cluster IP address for the Service, which refers only to this Service and is distinct from any other IP address in the cluster. (I’m calling it a Cluster IP to reinforce the idea that it is not tied to a single Pod.)
- The Service also defines a selector, which defines which Pods will be matched with the Service.
- Finally, the Service collects the Pod IP addresses of all the Pods it matches, and keeps track of them as its endpoints.
When a workload tries to connect to the Cluster IP belonging to a Service, by default Kubernetes will pick one of the Service’s endpoints, and route the connection there (remember that the Service endpoints are Pod IP addresses). In this way, Services do simple load balancing at the connection level.
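The selector-to-endpoints-to-connection flow can be sketched in a few lines of Python. This is a toy model, not Kubernetes code: the Pod dictionaries, the IP addresses, and the use of a random choice per connection are assumptions for illustration.

```python
import random


def matches(selector, labels):
    """A Pod matches when every selector key/value pair appears in its labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints_for(selector, pods):
    """Collect the Pod IPs of all Pods whose labels match the selector."""
    return [pod["ip"] for pod in pods if matches(selector, pod["labels"])]


def pick_endpoint(endpoints, rng=random):
    """Choose one endpoint for a new connection (random choice is an assumption)."""
    if not endpoints:
        raise ConnectionError("Service has no endpoints")
    return rng.choice(endpoints)


# Hypothetical Pods; names, labels and addresses are made up.
pods = [
    {"name": "web-1", "ip": "10.42.0.5", "labels": {"app": "web"}},
    {"name": "web-2", "ip": "10.42.1.7", "labels": {"app": "web"}},
    {"name": "db-1", "ip": "10.42.1.9", "labels": {"app": "db"}},
]
web_endpoints = endpoints_for({"app": "web"}, pods)
```

Note that the choice happens once per connection, not once per request: everything sent over an established connection goes to the same Pod, which is why this counts as connection-level load balancing.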
It should be fairly apparent from all this that there is a lot of networking magic happening in Kubernetes. This is all handled by the low-level firewall built into the Linux kernel.
IPTables
The Linux kernel contains a fairly powerful mechanism to examine network traffic at the packet level and make decisions about what to do with each packet. This might involve letting the packet continue on unchanged, altering the packet, redirecting the packet, or even dropping the packet entirely. I’m going to refer to the whole of this mechanism as IPTables. Technically, that’s the name of an early implementation, but we tend to use it to refer to the whole mechanism around Buoyant, so I’ll stick with it. (And if you think this sounds like something you might do with eBPF, you’re right! This is an area where eBPF shines, although many implementations of this mechanism predate eBPF and don’t depend on it.)
What this means is that Kubernetes can - and does - use IPTables to handle the complex, dynamic routing that it needs for network traffic within the cluster. For example, IPTables can catch a packet being sent to a Service’s Cluster IP address and rewrite it to instead go to a specific Pod IP address; likewise, it can know the difference between a Pod IP address on the same Node as the sender and a Pod IP address on a different Node, and manage routing to make everything work out. In turn, this requires Kubernetes to constantly update the IPTables rules as Pods are created and destroyed.
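Here is a rough Python model of that "catch a packet for a Cluster IP and rewrite it to a Pod IP" behaviour. It is only a sketch of the idea: the ServiceNat class, its method names, and the addresses are hypothetical, and none of this is real IPTables syntax or kernel behaviour.

```python
from dataclasses import dataclass, replace
import random


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    dst_port: int


class ServiceNat:
    """Toy stand-in for destination-rewriting rules (hypothetical, not iptables)."""

    def __init__(self):
        # (cluster_ip, port) -> list of (pod_ip, pod_port) backends
        self.rules = {}

    def sync(self, cluster_ip, port, backends):
        """Replace the rule for one Service; call this whenever its Pods change."""
        self.rules[(cluster_ip, port)] = list(backends)

    def process(self, packet, rng=random):
        backends = self.rules.get((packet.dst_ip, packet.dst_port))
        if backends is None:
            return packet  # not addressed to a Service: pass through unchanged
        if not backends:
            return None    # Service without Pods: drop (a simplification)
        pod_ip, pod_port = rng.choice(backends)
        # Rewrite only the destination; the sender's address is left alone.
        return replace(packet, dst_ip=pod_ip, dst_port=pod_port)
```

The sync() method is the part that has to run constantly: every time a Pod behind a Service appears or disappears, the backend list for that Cluster IP changes, and the rules have to be rewritten to match.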
The Container Network Interface
The specific way that a given Kubernetes implementation needs to update the network configuration depends on the details of the implementation. The Container Network Interface, or CNI, is a standard that tries to provide a uniform interface for implementors to request network-configuration changes, to make Kubernetes easier to port.
A critical aspect of the CNI is that it allows for CNI plugins, which - in turn - can permit swapping out the network layer even while keeping the rest of the Kubernetes implementation the same. For example, k3d uses Flannel as its networking layer by default, but it’s easy to switch it to use Calico instead.
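To give a feel for what a CNI plugin looks like from the outside, here is a minimal Python sketch. As I understand the CNI specification, a plugin is an executable that receives its instructions through environment variables such as CNI_COMMAND, CNI_NETNS and CNI_IFNAME, reads a JSON network configuration on stdin, and prints a JSON result on stdout. The function name handle_cni_call and the hard-coded address are assumptions for illustration, and the function takes the environment and stdin as arguments so it can be tried without a real container runtime.

```python
import json


def handle_cni_call(env, stdin_text):
    """Return the JSON-serialisable result a plugin would print on stdout."""
    command = env["CNI_COMMAND"]
    if command == "VERSION":
        return {"cniVersion": "1.0.0", "supportedVersions": ["1.0.0"]}

    config = json.loads(stdin_text)  # network configuration supplied by the runtime
    if command == "ADD":
        # A real plugin would create an interface inside the network namespace
        # at env["CNI_NETNS"] and allocate an address for it. This sketch only
        # reports a hard-coded, made-up address.
        return {
            "cniVersion": config["cniVersion"],
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
            "ips": [{"address": "10.42.0.5/24", "interface": 0}],
        }
    if command == "DEL":
        # A real plugin would tear the interface down; nothing to report here.
        return None
    raise ValueError(f"unsupported CNI_COMMAND: {command}")
```

Because the contract is just "environment variables in, JSON out", the runtime doesn't need to know whether Flannel, Calico, or something else is doing the work, which is what makes the network layer swappable.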
Kubernetes Pod Startup
Putting it all together, here’s how Kubernetes handles things when a Pod starts.
- Find a Node to run the new Pod.
- Execute any CNI plugins defined by the Node, in the context of the new Pod. Fail if any don’t work.
- Execute any init containers defined for the new Pod, in order. Fail if any don’t work.
- Start all the containers defined by the Pod.
When starting the Pod’s containers, it’s important to note that they will be started in the order defined by the Pod’s spec, but that normally Kubernetes will not wait for a given container before proceeding to the next container. However, if the container defines a postStart hook, Kubernetes will start the container, then run the postStart hook to completion, before starting the next container.
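The startup sequence described above can be written down as a short Python sketch. This is a simplified model rather than anything Kubernetes actually runs: the start_pod function, the run callback, and the flattened Pod dictionary (a real Pod spec puts the hook under lifecycle.postStart) are all assumptions for illustration, and hook failures are ignored.

```python
def start_pod(pod, node_cni_plugins, run):
    """Walk through Pod startup in order and return a list of what happened.

    `run(name)` is a hypothetical callback that returns True on success.
    """
    events = []

    # CNI plugins come from the Node, not from the Pod.
    for plugin in node_cni_plugins:
        if not run(plugin):
            return events + [f"failed: cni {plugin}"]
        events.append(f"cni {plugin}")

    # Init containers run one at a time; each must succeed before the next.
    for init in pod.get("initContainers", []):
        if not run(init):
            return events + [f"failed: init {init}"]
        events.append(f"init {init}")

    # Regular containers start in spec order, without waiting for each other...
    for container in pod["containers"]:
        events.append(f"start {container['name']}")
        hook = container.get("postStart")
        if hook is not None:
            # ...unless a postStart hook is defined, in which case the hook
            # runs to completion before the next container is started.
            run(hook)
            events.append(f"postStart {container['name']}")

    return events
```

For a Pod with one init container and a first container that defines a postStart hook, the returned list makes the ordering visible: CNI plugins, then init containers, then the first container and its hook, and only then the second container.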