Kubernetes – init containers, CNI and more

Certain questions about Kubernetes seem to come up again and again:

  • What’s up with this init container stuff?
  • What’s a CNI plugin?
  • Why is Kubernetes complaining about pods not finishing initialisation?

Kubernetes is a complex system with a simple overall purpose: run user workloads in a way that permits the authors of the workloads to not care (much) about the messy details of the hardware underneath. The workload authors are supposed to be able to just focus on Pods and Services; in turn, Kubernetes is meant to arrange things such that workloads get mapped to Pods, Pods get deployed on Nodes, and the network in between looks flat and transparent.

This is simple to state, but extremely complex to implement in practice. (This is an area where Kubernetes is doing a great job of making things complex for the Kubernetes implementors so that they can be easier for the users – nicely done!) Under the hood, Kubernetes is leaning heavily on a number of technologies to make all this happen.

Clusters and cgroups and Pods...

The first major area that Kubernetes has to manage is actually running the workloads within the cluster. It relies heavily on OS-level isolation mechanisms for this:

  • Clusters are composed of one or more Nodes, which are (possibly virtualised) machines. For this article, we’ll be talking about Linux Nodes.

  • Since different Nodes are different machines (virtual or physical), everything on one Node is isolated from all other Nodes.

  • Pods are composed of one or more containers, all of which are isolated from one another within the same Node using Linux cgroups and namespaces.

  • It’s worth noting that Linux itself runs at the Node level. Pods and containers don’t have distinct copies of the operating system, which is why isolation between them is such a big deal.

This multi-layer approach gives Kubernetes a way to orchestrate which workloads run where within the cluster, and to keep track of resource availability and consumption: workload containers are mapped to Pods, Pods are scheduled onto Nodes, and Nodes are all connected to a network.

Deployments, ReplicaSets, DaemonSets, etc., are all bookkeeping mechanisms for figuring out exactly which Pods get scheduled onto which Nodes, but the fundamental scheduling mechanism is the same across all of them.
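For example, here’s a minimal sketch of a Deployment (the name, labels, and image are all made up for illustration). It asks Kubernetes to keep three replicas of one Pod template running; the ReplicaSet machinery underneath does the bookkeeping of creating the Pods, which the scheduler then places onto Nodes:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-workload            # hypothetical name
    spec:
      replicas: 3                       # keep three copies of the Pod running
      selector:
        matchLabels:
          app: example-workload
      template:                         # the Pod template the ReplicaSet stamps out
        metadata:
          labels:
            app: example-workload
        spec:
          containers:
            - name: main
              image: example/workload:1.0   # hypothetical image
              ports:
                - containerPort: 8080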

Kubernetes Networking

The other major area that Kubernetes has to manage is the network. Kubernetes requires that Pods see a network that is flat and transparent: every Pod must be able to directly communicate with every other Pod, whether on the same Node or not. This implies that each Pod has to have its own IP address, which I’ll call a Pod IP in a fit of originality.

(Technically, any container within a Pod must be able to talk to containers in other Pods – but these IP addresses exist at the Pod level, not the container level. Multiple containers in one Pod share the same IP address.)
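A quick (hypothetical) way to see this for yourself: ask two containers in the same Pod for their IP address, and you’ll get the same answer. The Pod and container names below are made up; hostname -i prints the Pod IP in both cases:

    # Both containers in the same Pod report the same Pod IP.
    kubectl exec example-pod -c first-container -- hostname -i
    kubectl exec example-pod -c second-container -- hostname -i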

You could write a workload to use Pod IPs directly to talk to other workloads, but it’s not a good idea: Pod IPs change as Pods go up and down. Instead, we generally refer to workloads using a Kubernetes Service. Services are actually fairly complex, even glossing over headless Services and such (there’s an example manifest after this list):

  • A Service causes a DNS entry to be allocated, so that workloads can refer to the Service using a name.

  • Each Service is also allocated a Cluster IP address, which refers only to that Service and is distinct from any other IP address in the cluster. (I’m calling it a Cluster IP to reinforce the idea that it is not tied to a single Pod.)

  • The Service also defines a selector, which determines which Pods will be matched with the Service.

  • Finally, the Service collects the Pod IP addresses of all the Pods it matches, and keeps track of them as its endpoints.
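Putting those pieces together, here’s a sketch of a Service manifest (all names and ports are, again, made up for illustration). The selector matches Pods labeled app: example-workload; Kubernetes allocates a Cluster IP for it, sets up a DNS name of the form example-service.<namespace>.svc.cluster.local, and tracks the Pod IPs of matching Pods as the Service’s endpoints:

    apiVersion: v1
    kind: Service
    metadata:
      name: example-service        # hypothetical name
    spec:
      selector:
        app: example-workload      # match any Pod carrying this label
      ports:
        - port: 80                 # the port exposed on the Cluster IP
          targetPort: 8080         # the port the Pod's containers listen on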

When a workload tries to connect to the Cluster IP belonging to a Service, by default Kubernetes will pick one of the Service’s endpoints, and route the connection there (remember that the Service endpoints are Pod IP addresses). In this way, Services do simple load balancing at the connection level.

It should be fairly apparent from all this that there is a lot of networking magic happening in Kubernetes. This is all handled by the low-level firewall built into the Linux kernel.


IPTables

The Linux kernel contains a fairly powerful mechanism to examine network traffic at the packet level and make decisions about what to do with each packet. This might involve letting the packet continue on unchanged, altering the packet, redirecting the packet, or even dropping the packet entirely. I’m going to refer to the whole of this mechanism as IPTables – technically, that’s the name of an early implementation, but we tend to use it to refer to the whole mechanism around Buoyant, so I’ll stick with it. (And if you think this sounds like something you might do with eBPF, you’re right! This is an area where eBPF shines, although many implementations of this mechanism predate eBPF and don’t depend on it.)

What this means is that Kubernetes can – and does – use IPTables to handle the complex, dynamic routing that it needs for network traffic within the cluster. For example, IPTables can catch a packet being sent to a Service’s Cluster IP address and rewrite it to instead go to a specific Pod IP address; likewise, it can tell the difference between a Pod IP address on the same Node as the sender and a Pod IP address on a different Node, and manage routing to make everything work out. In turn, this requires Kubernetes to constantly update the IPTables rules as Pods are created and destroyed.
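To make that concrete, here’s a simplified sketch of the kind of rules kube-proxy installs in its iptables mode. The chain names and IP addresses are made up for illustration (real chain names carry hash suffixes), but the shape – a per-Service chain that picks an endpoint at random, and per-endpoint chains that DNAT to a Pod IP – is the real mechanism:

    # Traffic to the Service's Cluster IP (10.96.0.10:80, illustrative)
    # jumps to a per-Service chain.
    iptables -t nat -A KUBE-SERVICES -d 10.96.0.10/32 -p tcp --dport 80 \
      -j KUBE-SVC-EXAMPLE

    # The Service chain picks one of two endpoints at random per connection...
    iptables -t nat -A KUBE-SVC-EXAMPLE -m statistic --mode random \
      --probability 0.5 -j KUBE-SEP-POD1
    iptables -t nat -A KUBE-SVC-EXAMPLE -j KUBE-SEP-POD2

    # ...and each endpoint chain rewrites the destination to a specific Pod IP.
    iptables -t nat -A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 10.244.1.5:8080
    iptables -t nat -A KUBE-SEP-POD2 -p tcp -j DNAT --to-destination 10.244.2.7:8080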

The Container Network Interface

The specific way that a given Kubernetes implementation needs to update the network configuration depends on the details of that implementation. The Container Network Interface, or CNI, is a standard that tries to provide a uniform interface for implementors to request network-configuration changes, to make Kubernetes easier to port.
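On a Node, a CNI network configuration is just a small JSON file (conventionally under /etc/cni/net.d/) naming the plugin to run and its parameters. This sketch uses the standard bridge plugin with host-local IP allocation; the network name and subnet are illustrative, not taken from any real cluster:

    {
      "cniVersion": "1.0.0",
      "name": "example-net",
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "subnet": "10.244.0.0/24"
      }
    }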

A critical aspect of the CNI is that it allows for CNI plugins, which – in turn – permit swapping out the network layer while keeping the rest of the Kubernetes implementation the same. For example, k3d uses Flannel as its networking layer by default, but it’s easy to switch it to use Calico instead.
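Here’s a rough sketch of what that swap looks like (the flags and the Calico manifest URL are from memory and version-dependent, so treat this as an illustration rather than a recipe): create the k3d cluster with Flannel disabled, then install Calico into it.

    # Create a cluster with the default Flannel backend (and k3s's network
    # policy controller) turned off; these are k3s flags passed through k3d.
    k3d cluster create demo \
      --k3s-arg "--flannel-backend=none@server:*" \
      --k3s-arg "--disable-network-policy@server:*"

    # Install Calico as the CNI plugin instead (URL/version illustrative).
    kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml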

Kubernetes Pod Startup

Putting it all together, here’s how Kubernetes handles things when a Pod starts.

  1. Find a Node to run the new Pod.
  2. Execute any CNI plugins defined by the Node, in the context of the new Pod. Fail if any don’t work.
  3. Execute any init containers defined for the new Pod, in order. Fail if any don’t work.
  4. Start all the containers defined by the Pod.

When starting the Pod’s containers, it’s important to note that they will be started in the order defined by the Pod’s spec, but that normally Kubernetes will not wait for a given container before proceeding to the next one. However, if a container defines a postStart hook, Kubernetes will start that container, run its postStart hook to completion, and only then start the next container.
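As a concrete (and entirely hypothetical) example, the Pod below runs one init container to completion before either app container starts, and gives the first app container a postStart hook, so the second container won’t be started until that hook finishes:

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod                 # hypothetical name
    spec:
      initContainers:
        - name: init-step               # must exit successfully before the
          image: busybox:1.36           # app containers below are started
          command: ["sh", "-c", "echo initialising; sleep 2"]
      containers:
        - name: first
          image: example/app:1.0        # hypothetical image
          lifecycle:
            postStart:                  # runs after 'first' starts, before
              exec:                     # 'second' is started
                command: ["sh", "-c", "echo first container is up"]
        - name: second
          image: example/sidecar:1.0    # hypothetical image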

