Optimizing the construction of the VM ecosystem with KubeVirt

Two months ago, we were thrilled to share insights in the article “Best Practices for Migrating VM Clusters to KubeVirt 1.0.” As previously mentioned, we have selected AlmaLinux and Kubernetes 1.28 as the foundation for virtualization, employing cgroup v2 for resource isolation. Before moving to the production phase, we encountered additional challenges, particularly related to Kubernetes, containerd, and specific issues within KubeVirt. Therefore, in this second article, our goal is to share practical experiences and insights gained before the deployment of KubeVirt into a production environment.

Latest Developments

KubeVirt containerizes the trusted virtualization layer of QEMU and libvirt, enabling the management of VMs as standard Kubernetes resources. This approach offers users a more flexible, scalable, and contemporary solution for virtual machine management. As the project progresses, we’ve identified specific misconceptions, configuration errors, and opportunities to enhance KubeVirt functionality, especially in the context of utilizing Kubernetes 1.28 and containerd. The details are outlined below:

kubernetes

  • kubelet read-only port

To reduce the attack surface exposed to pods and containers, we no longer open the kubelet's insecure read-only port 10255 by default in clusters running Kubernetes 1.26 or later. Instead, the kubelet serves only the authenticated port 10250.
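
For reference, this can be expressed in the KubeletConfiguration. The following is a minimal sketch, assuming the kubelet reads its configuration from /var/lib/kubelet/config.yaml as in the examples later in this article:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
readOnlyPort: 0            # 0 disables the insecure read-only port 10255
authentication:
  webhook:
    enabled: true          # authenticate requests to the secure port 10250 against the API server
authorization:
  mode: Webhook            # authorize requests via SubjectAccessReview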

  • service account token expiration

To enhance data security, Kubernetes 1.21 enables the BoundServiceAccountTokenVolume feature by default. This feature gives service account tokens a bounded validity period, automatically renews them before expiration, and invalidates tokens once the associated pods are deleted. If you use client-go version 11.0.0 or later (or 0.15.0 or later), client-go automatically reloads service account tokens from disk to pick up the renewed token.
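
For illustration, a bound, time-limited token can also be requested explicitly through a projected volume. The sketch below is an assumption-laden example (the pod name, audience, and expiry are placeholders), not our production manifest:

apiVersion: v1
kind: Pod
metadata:
  name: token-demo
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    volumeMounts:
    - name: bound-token
      mountPath: /var/run/secrets/tokens
  volumes:
  - name: bound-token
    projected:
      sources:
      - serviceAccountToken:
          path: token
          expirationSeconds: 3600   # placeholder expiry; the kubelet rotates the token before it lapses
          audience: api             # placeholder audience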

  • securing controller-manager and scheduler metrics

Secure serving on port 10257 is now enabled for kube-controller-manager (configurable via --secure-port). Delegated authentication and authorization are configured using the same flags as for aggregated API servers. Without configuration, the secure port only allows access to /healthz. (#64149, @sttts)

Added secure port 10259 to the kube-scheduler (enabled by default) and deprecated the old insecure port 10251. Without further flags, self-signed certificates are created in memory on startup. (#69663, @sttts)
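
To scrape metrics from these secure ports, a request must carry credentials that are authorized for the /metrics non-resource URL. A hedged sketch, assuming a "prometheus" ServiceAccount bound to a suitable ClusterRole and run from a control-plane node:

# Issue a short-lived token for the ServiceAccount (Kubernetes 1.24+)
TOKEN=$(kubectl -n monitoring create token prometheus)
# Query the authenticated metrics endpoints of kube-scheduler and kube-controller-manager
curl -sk -H "Authorization: Bearer ${TOKEN}" https://127.0.0.1:10259/metrics | head
curl -sk -H "Authorization: Bearer ${TOKEN}" https://127.0.0.1:10257/metrics | head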

containerd

  • private registry

Modify your config.toml file (usually located at /etc/containerd/config.toml) as shown below:

version = 2

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"

In the containerd registry configuration, a registry host namespace is the directory under config_path, named after the registry host name or IP address (with an optional port), in which containerd looks for a hosts.toml file. When pulling an image, the image reference typically has the following format:

pull [registry_host_name|IP address][:port][/v2][/org_path]<image_name>[:tag|@DIGEST]

The registry host namespace part is [registry_host_name|IP address][:port]. For example, the directory structure for docker.io looks like this:

$ tree /etc/containerd/certs.d
/etc/containerd/certs.d
└── docker.io
    └── hosts.toml

Alternatively, you can use the _default registry host namespace as a fallback if no other namespace matches.
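
For reference, a hosts.toml for docker.io that redirects pulls through a mirror could look like the following; the mirror URL is a placeholder for this sketch:

# /etc/containerd/certs.d/docker.io/hosts.toml
server = "https://registry-1.docker.io"

[host."https://mirror.example.com"]
  capabilities = ["pull", "resolve"]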

  • systemd cgroup

While containerd and Kubernetes default to using the legacy cgroupfs driver for managing cgroups, it is recommended to utilize the systemd driver on systemd-based hosts to adhere to the “single-writer” rule of cgroups.

To configure containerd to use the systemd driver, add the following option in /etc/containerd/config.toml:

version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true

In addition to configuring containerd, you need to set the kubelet's cgroup driver to "systemd" in the KubeletConfiguration, which is typically found at /var/lib/kubelet/config.yaml:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: "systemd"

  • [community issue] containerd startup hangs when /etc is read-only

We observed that, after updating containerd from v1.6.21 to v1.6.22 on hosts where /etc is mounted read-only, the systemd service failed to start. Closer inspection during debugging revealed that containerd never fully initialized (the "containerd successfully booted in …" message was missing) and never sent the sd_notify READY=1 event.
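
The symptom can be confirmed from the unit state and the journal; a small diagnostic sketch:

systemctl status containerd                                        # the unit stays in "activating" while systemd waits for READY=1
journalctl -u containerd --no-pager | grep "successfully booted"   # no match on the affected version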

  • migrating from docker to containerd

You have to configure the KubeletConfiguration to use the containerd endpoint. The KubeletConfiguration is typically located at /var/lib/kubelet/config.yaml:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
containerRuntimeEndpoint: "unix:///run/containerd/containerd.sock"

Because /var/lib/docker is mounted on a separate disk in our environment, switching to containerd also means relocating containerd's root directory onto that disk.
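
A minimal sketch of the relocation in /etc/containerd/config.toml, assuming the data disk is mounted at a hypothetical /data path:

version = 2
root = "/data/containerd"     # image and snapshot data, previously under /var/lib/containerd
state = "/run/containerd"     # runtime state (default, unchanged)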

kubevirt

  • containerDisk data persistence
    The containerDisk feature provides the ability to store and distribute VM disks in the container image registry. containerDisks can be assigned to VMs in the disks section of the VirtualMachineInstance spec. containerDisks are ephemeral storage devices that can be assigned to any number of active VirtualMachineInstances. We persist containerDisk data locally through incremental backups.
  • hostDisk support for the qcow2 format
  • hostDisk support for hostPath capacity expansion

Storage Solution

VM Image Storage Solution

In KubeVirt, the original virtual machine image file is placed under the /disk path of the container base image and then pushed to the image repository for use in virtual machine creation.

Example: we could inject a local VirtualMachineInstance disk into a container image:

cat << END > Dockerfile
FROM scratch
ADD --chown=107:107 almalinux.qcow2 /disk/
END

docker build -t kubevirt/almalinux:latest .
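
The resulting image can then be referenced from a VMI as a containerDisk. A minimal sketch (the VMI name and memory size are placeholders):

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: almalinux-vmi
spec:
  domain:
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
    resources:
      requests:
        memory: 1Gi
  volumes:
  - name: containerdisk
    containerDisk:
      image: kubevirt/almalinux:latest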

When a virtual machine is started, a VirtualMachineInstance (VMI) custom resource is created, recording the specified virtual machine image name. After the VMI is created, the virt-controller generates a corresponding virt-launcher pod for it. This pod comprises three containers: a compute container hosting the virt-launcher process, a container-disk container responsible for managing the storage of the virtual machine image, and a guest-console-log container. The imageName of the container-disk container corresponds to the virtual machine image name recorded in the VMI. Once the virt-launcher pod is created, the kubelet pulls the container-disk image and starts the container-disk container. During startup, container-disk continuously monitors the disk_0.sock file under the -copy-path, with the sock file mapped to the path /var/run/kubevirt/container-disk/{vmi-uuid}/ on the host machine through a hostPath volume.

To retrieve the necessary information during virtual machine creation, the virt-handler pod uses hostPID, making the host machine's pids and mount details visible inside the virt-handler container. During the virtual machine creation process, virt-handler identifies the pid of the container-disk process by referencing the VMI's disk_0.sock file. It then determines the disk number of the container-disk container's root disk from /proc/{pid}/mountinfo. Next, by cross-referencing that disk number with the host machine's mount information, it pinpoints the physical location of the container-disk root disk. Finally, it constructs the path of the virtual machine image file (/disk/disk.qcow2), resolves the actual storage location (sourceFile) of the original virtual machine image on the host machine, and mounts the sourceFile to the targetFile for subsequent use as a backingFile during virtual machine creation.

Host Disk Storage

A hostDisk volume type provides the ability to create or use a disk image located somewhere on a node. It works similarly to a hostPath volume in Kubernetes and provides two usage types:

  • DiskOrCreate: if a disk image does not exist at the given location, one is created
  • Disk: a disk image must exist at the given location

To use hostDisk volumes, you need to enable the HostDisk feature gate; a usage sketch is shown below.
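
A minimal hostDisk sketch, assuming a hypothetical image path /data/disk.img on the node:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-host-disk
spec:
  domain:
    devices:
      disks:
      - name: host-disk
        disk:
          bus: virtio
    resources:
      requests:
        memory: 1Gi
  volumes:
  - name: host-disk
    hostDisk:
      capacity: 1Gi
      path: /data/disk.img
      type: DiskOrCreate     # create the raw image if it does not already exist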

Currently, the hostDisk feature has some limitations: expansion is only supported when the disk is used through a Persistent Volume Claim (PVC), and the disk format is limited to raw files.

Details regarding the above will be elaborated in the Feature Expansion section.

Feature Expansion

Support VM static expansion

We expose a synchronous interface for expanding CPU and memory while the VM is stopped. The CPU hotplug feature was introduced in KubeVirt v1.0, making it possible to configure a VM workload to allow adding or removing virtual CPUs while the VM is running. Although the current version supports online expansion, we still opt for static expansion, primarily due to the temporary nature of our VMs. The challenge is that if node resources are insufficient after resizing, the VM will not start.
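
A hedged sketch of the static path (the VM name and sizes are placeholders): patch the stopped VM's template, then start it again.

# Resize CPU and memory on a stopped VM, then boot it with the new shape
kubectl patch vm almalinux-vm --type merge -p \
  '{"spec":{"template":{"spec":{"domain":{"cpu":{"cores":4},"memory":{"guest":"8Gi"}}}}}}'
virtctl start almalinux-vm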

hostDisk support for qcow2 and online expansion

The current hostDisk implementation has some limitations: expansion is only supported through Persistent Volume Claims (PVC), and the disk format is limited to raw. To support qcow2 and online expansion, we made minor adjustments to all of the relevant components.

cold migration

We refrain from employing live migration capabilities due to their complexity and several limitations in our specific scenario. Instead, with data locally persisted and VMs scheduled in a fixed manner, we utilize cold migration through the rsync command.
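
A simplified sketch of that flow; the VM name, disk path, and target node are placeholders, and the disk path assumes our locally persisted hostDisk layout:

virtctl stop almalinux-vm                                   # shut the VM down cleanly
rsync -aH --sparse /data/vm-disks/almalinux-vm/ node02:/data/vm-disks/almalinux-vm/
# pin the VM to the target node, then start it there
kubectl patch vm almalinux-vm --type merge -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"node02"}}}}}'
virtctl start almalinux-vm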

Others

In addition to the enhanced features mentioned earlier, we have integrated support for both static and dynamic addition or removal of host disks for virtual machines, password reset capabilities, pass-through of physical machine disks, and addressed various user requirements to deliver a more versatile and comprehensive usage experience.

Conclusion

KubeVirt simplifies running virtual machines on Kubernetes, making it as easy as managing containers. It provides a cloud-native approach to managing virtual machines. KubeVirt addresses the challenge of unifying the management of virtual machines and containers, effectively harnessing the strengths of both. However, there is still a long way to go in practice.

https://github.com/k8snetworkplumbingwg/multus-cni/issues/1132

https://segmentfault.com/a/1190000040926384/en

https://www.alibabacloud.com/help/en/ack/product-overview/solution-to-serviceaccount-token-expiration-after-upgrading-122-version

https://github.com/containerd/containerd/issues/9139

https://github.com/containerd/containerd/blob/main/docs/cri/config.md

https://www.cncf.io/blog/2023/09/22/best-practices-for-transitioning-vm-clusters-to-kubevirt-1-0/

https://kubevirt.io/user-guide/virtualmachines/disks_and_volumes/#hostdisk