Stopping systemd workloads in OpenShift
Are you running systemd-based workloads? Then this article could be of interest: we are going to see how workloads based on systemd can be stopped gracefully on OpenShift.
We will work hands-on with a simple systemd workload that runs an nginx service, compare how it behaves when stopped with Podman versus OpenShift, and finally see how to overcome the OpenShift limitation by using container lifecycle hooks.
Prerequisites
- Podman is installed in your environment.
- The OpenShift client (`oc`) is installed in your environment.
- You have access to an OpenShift cluster.
You can install a single-node OpenShift cluster using kcli or Code Ready Containers.
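For example, a minimal sketch of bringing up a local single-node cluster with Code Ready Containers, assuming the `crc` binary is already downloaded and you have a pull secret from the Red Hat console:

```
# One-time host setup for Code Ready Containers
crc setup

# Start the local single-node cluster (it will ask for your pull secret)
crc start

# Print the login credentials for the new cluster
crc console --credentials
```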
Updates:
- The behavior described here is happening in current OpenShift releases, but it will be fixed in 4.10 (verified on OpenShift 4.10.0-ci-20220107).
- Here is the CRI-O change that fixes this situation: https://github.com/cri-o/cri-o/pull/5366
Defining the workload
We are going to use the following simple `Dockerfile.stopsignal-systemd` to build our workload.
```
FROM quay.io/fedora/fedora:35

RUN dnf -y install procps nginx \
    && dnf clean all \
    && systemctl enable nginx

EXPOSE 80

# https://docs.docker.com/engine/reference/builder/#stopsignal
# https://www.freedesktop.org/software/systemd/man/systemd.html#SIGRTMIN+3
STOPSIGNAL SIGRTMIN+3

ENTRYPOINT ["/sbin/init"]
```
The `STOPSIGNAL` instruction is not needed by Podman: because the container process is systemd, Podman already detects that the signal to be sent by `podman stop` should be `SIGRTMIN+3`.
Now we build:
```
export IMG="quay.io/avisied0/demos:stopsignal-systemd"
podman build -t "${IMG}" -f Dockerfile.stopsignal-systemd .
```
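If you are curious which stop signal ended up recorded in the image metadata, you can inspect the freshly built image; a quick optional check:

```
# Dump the image metadata and look for the recorded stop signal (SIGRTMIN+3)
podman image inspect "${IMG}" | grep -i stopsignal
```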
Running the container with Podman
First, let's see what happens to the workload when it is run and stopped with Podman (or Docker):
```
CONTAINER_ID=$( podman run -it -d "${IMG}" )
podman logs --follow "${CONTAINER_ID}" &
podman stop "${CONTAINER_ID}"
```
And we get a result like the below:
```
[ OK ] Removed slice Slice /system/getty.
[ OK ] Removed slice Slice /system/modprobe.
[ OK ] Stopped target Graphical Interface.
[ OK ] Stopped target Multi-User System.
[ OK ] Stopped target Login Prompts.
[ OK ] Stopped target Timer Units.
[ OK ] Stopped dnf makecache --timer.
[ OK ] Stopped Daily rotation of log files.
[ OK ] Stopped Daily Cleanup of Temporary Directories.
.
.
.
[ OK ] Stopped target Swaps.
[ OK ] Reached target System Shutdown.
[ OK ] Reached target Unmount All Filesystems.
[ OK ] Reached target Late Shutdown Services.
         Starting System Halt...
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
All filesystems, swaps, loop devices, MD devices and DM devices detached.
Halting system.
Exiting container.
[1]+  Done                    podman logs --follow "${CONTAINER_ID}"
```
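Systemd runs its full shutdown sequence and the container exits. Optionally, confirm the exit code and clean up the stopped container; a small optional check:

```
# Exit code of the stopped container (expected to be 0 after a clean shutdown)
podman inspect --format '{{.State.ExitCode}}' "${CONTAINER_ID}"

# Remove the stopped container
podman rm "${CONTAINER_ID}"
```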
What about OpenShift?
Now let's try our workload on OpenShift; you will need an OpenShift cluster or a single-node OpenShift (you can get one by using kcli or Code Ready Containers).
- Push the image to your image registry:

  ```
  # Previously IMG was defined as below:
  # export IMG="quay.io/avisied0/demos:stopsignal-systemd"
  podman push "${IMG}"
  ```

  Ensure the repository is public so that the cluster can pull the image.
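  If you prefer to keep the repository private instead, one option is a pull secret linked to the service account created below; a rough sketch, assuming quay.io credentials (the secret name and credentials are placeholders):

  ```
  # Create a pull secret with your registry credentials (placeholder values)
  oc create secret docker-registry quay-pull-secret \
      --docker-server=quay.io \
      --docker-username=<your-user> \
      --docker-password=<your-token>

  # Let the workload's service account use it for pulling images
  oc secrets link runasanyuid quay-pull-secret --for=pull
  ```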
- Access your cluster as a cluster admin and create a new project:

  ```
  oc login -u kubeadmin https://api.crc.testing:6443
  oc new-project stopsignal
  ```
- Create a service account with the necessary permissions for creating and running the workload; that is, the `edit` role and the `anyuid` SecurityContextConstraint:

  ```
  oc create serviceaccount runasanyuid
  oc adm policy add-scc-to-user anyuid -z runasanyuid --as system:admin
  oc adm policy add-role-to-user edit -z runasanyuid --as system:admin
  ```
- Create the Pod from the following `pod-stopsignal-systemd.yaml` file:

  ```
  apiVersion: v1
  kind: Pod
  metadata:
    name: stopsignal-systemd
    labels:
      app: nginx
  spec:
    serviceAccountName: runasanyuid
    automountServiceAccountToken: false
    containers:
      - name: nginx
        image: quay.io/avisied0/demos:stopsignal-systemd
        imagePullPolicy: Always
        command: ["/sbin/init"]
        tty: true
        securityContext:
          privileged: false
  ```
- Create the workload using the new service account:

  ```
  oc create -f pod-stopsignal-systemd.yaml --as system:serviceaccount:stopsignal:runasanyuid
  oc get all
  ```
- Print and follow the logs in the background:

  ```
  oc logs pod/stopsignal-systemd -f --as system:serviceaccount:stopsignal:runasanyuid &
  ```
- Try to stop the workload:

  ```
  oc delete -f pod-stopsignal-systemd.yaml --as system:serviceaccount:stopsignal:runasanyuid
  ```
We get something like the below in the log output, but systemd and the pod are still running:
```
pod "systemd-nginx" deleted

systemd-nginx login:
systemd v249.7-2.fc35 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization podman.
Detected architecture x86-64.
```
We can see that systemd does not begin the stop sequence as it did with Podman. This is because OpenShift did not translate the `STOPSIGNAL` instruction specified in the Dockerfile, so the container receives the default `SIGTERM`, which does not make systemd shut down (this will be fixed in OpenShift 4.10). After the pod's termination grace period (30 seconds by default) the kubelet falls back to `SIGKILL`, so the workload eventually dies, but not gracefully. To work around this situation we will use container lifecycle hooks to explicitly send `SIGRTMIN+3` to PID 1 (systemd).
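To confirm that the pod keeps running during its termination grace period, you can watch it from another terminal; an optional check:

```
# The pod lingers in Terminating state instead of shutting down right away
oc get pods -w --as system:serviceaccount:stopsignal:runasanyuid
```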
Isolating the problem
Let's see whether this happens only for SIGRTMIN+3 or for any signal specified via the `STOPSIGNAL` instruction. To investigate that, we will use the following `Dockerfile.stopsignal-demo`:
```
FROM quay.io/fedora/fedora:35

COPY demo-signal.sh /demo-signal.sh
RUN chmod a+x /demo-signal.sh

STOPSIGNAL SIGINT

CMD ["/demo-signal.sh"]
```
The `demo-signal.sh` script should have execute permission. Its content is:
```
#!/bin/bash

function trap_signal {
    local signal="$1"
    echo -e "\nExiting by ${signal}" >&2
    exit 0
}

for signal in SIGINT SIGTERM SIGUSR1 "SIGRTMIN+3"
do
    trap "trap_signal '${signal}'" "${signal}"
done

while true; do
    echo -n "."
    sleep 1
done
```
Update: Script updated based on PR at: https://github.com/avisiedo/freeipa-kustomize/blob/idmocp-331-stopping-with-kind-and-podman/incubator/013-signalstop/demo-signal.sh
Finally, we define a workload with the following `pod-stopsignal-demo.yaml`:
```
apiVersion: v1
kind: Pod
metadata:
  name: stopsignal-demo
  labels:
    app: stopsignals
spec:
  automountServiceAccountToken: false
  containers:
    - name: main
      image: quay.io/avisied0/demos:stopsignal-demo
      imagePullPolicy: Always
      command:
        - /demo-signal.sh
      tty: true
      securityContext:
        privileged: false
```
Build the image and push:
```
export IMG="quay.io/avisied0/demos:stopsignal-demo"
podman build -t "${IMG}" -f Dockerfile.stopsignal-demo .
podman push "${IMG}"
```
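Before moving to OpenShift, you can quickly verify the trap behaviour locally with Podman; an optional sanity check using the image we just built:

```
# podman stop sends the image's STOPSIGNAL (SIGINT here), which the script traps
CONTAINER_ID=$( podman run -d "${IMG}" )
podman stop "${CONTAINER_ID}"
podman logs "${CONTAINER_ID}"   # expected to end with: Exiting by SIGINT
podman rm "${CONTAINER_ID}"
```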
And we try the scenario on OpenShift:
```
oc create -f pod-stopsignal-demo.yaml --as system:serviceaccount:stopsignal:runasanyuid
oc logs pod/stopsignal-demo -f --as system:serviceaccount:stopsignal:runasanyuid &
oc delete -f pod-stopsignal-demo.yaml --as system:serviceaccount:stopsignal:runasanyuid
```
We get the output below:
```
pod "stopsignal-demo" deleted
............
Exiting by SIGINT
```
When `SIGINT` is specified in the `STOPSIGNAL` instruction in the Dockerfile, OpenShift sends the SIGINT signal to the pod when we delete the resource.
However, when the signal is specified as a numeric value, `STOPSIGNAL 37` (RTMIN+3), OpenShift sends SIGTERM instead of the expected `SIGRTMIN+3` indicated in the Dockerfile.
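For reference, this is what the numeric variant of the `STOPSIGNAL` line looks like; a sketch (37 corresponds to SIGRTMIN+3 on Linux x86_64 with glibc, where SIGRTMIN is 34):

```
# Same Dockerfile.stopsignal-demo, but with the stop signal given numerically
STOPSIGNAL 37
```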
Update:
Another test was made with an OpenShift 4.10 CI build on Wed Jan 5, 2022 and it worked as expected, sending `SIGRTMIN+3` to the container workload. So this will be fixed in future releases.
Solution: container lifecycle hooks
- Create `pod-stopsignal-lifecycle.yaml` with the content below:

  ```
  apiVersion: v1
  kind: Pod
  metadata:
    name: stopsignal-lifecycle
    labels:
      app: nginx
  spec:
    serviceAccountName: runasanyuid
    containers:
      - name: nginx
        image: quay.io/avisied0/demos:stopsignal-systemd
        imagePullPolicy: Always
        command: ["/sbin/init"]
        tty: true
        securityContext:
          privileged: false
        lifecycle:                                  # (1)
          preStop:                                  # (2)
            exec:                                   # (3)
              command: ["kill", "-RTMIN+3", "1"]    # (4)
  ```
  - (1) The lifecycle hooks for that container.
  - (2) A `preStop` hook is called before stopping the container.
  - (3) It will be an `exec` command.
  - (4) The command to be executed; the executable must exist in the container (the hook does not run through a shell, so shell built-ins are not available).
- And we try again:

  ```
  oc create -f pod-stopsignal-lifecycle.yaml --as=system:serviceaccount:stopsignal:runasanyuid
  oc logs pod/stopsignal-lifecycle -f --as=system:serviceaccount:stopsignal:runasanyuid &
  oc delete -f pod-stopsignal-lifecycle.yaml --as=system:serviceaccount:stopsignal:runasanyuid
  ```
And the log output immediately shows the below:
```
pod "systemd-nginx" deleted

systemd-nginx login:
[ OK ] Removed slice Slice /system/getty.
[ OK ] Removed slice Slice /system/modprobe.
[ OK ] Stopped target Graphical Interface.
[ OK ] Stopped target Multi-User System.
[ OK ] Stopped target Login Prompts.
[ OK ] Stopped target Timer Units.
[ OK ] Stopped dnf makecache --timer.
[ OK ] Stopped Daily rotation of log files.
[ OK ] Stopped Daily Cleanup of Temporary Directories.
[ OK ] Closed Process Core Dump Socket.
         Stopping Console Getty...
         Stopping The nginx HTTP and reverse proxy server...
         Stopping User Login Management...
[ OK ] Stopped Console Getty.
         Stopping Permit User Sessions...
[ OK ] Stopped User Login Management.
[ OK ] Stopped Permit User Sessions.
systemd v249.7-2.fc35 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization podman.
Detected architecture x86-64.
[ OK ] Stopped The nginx HTTP and reverse proxy server.
[ OK ] Stopped target Network is Online.
[ OK ] Stopped target Host and Network Name Lookups.
[ OK ] Stopped target Remote File Systems.
         Stopping Home Area Activation...
         Stopping Network Name Resolution...
[ OK ] Stopped Network Name Resolution.
[ OK ] Stopped Home Area Activation.
         Stopping Home Area Manager...
[ OK ] Stopped Home Area Manager.
[ OK ] Stopped target Basic System.
[ OK ] Stopped target Path Units.
[ OK ] Stopped Dispatch Password …ts to Console Directory Watch.
[ OK ] Stopped Forward Password R…uests to Wall Directory Watch.
[ OK ] Stopped target Slice Units.
[ OK ] Removed slice User and Session Slice.
[ OK ] Stopped target Socket Units.
         Stopping D-Bus System Message Bus...
[ OK ] Stopped D-Bus System Message Bus.
[ OK ] Closed D-Bus System Message Bus Socket.
[ OK ] Stopped target System Initialization.
[ OK ] Stopped target Local Verity Protected Volumes.
[ OK ] Stopped Update is Completed.
[ OK ] Stopped Rebuild Dynamic Linker Cache.
[ OK ] Stopped Rebuild Journal Catalog.
         Stopping Record System Boot/Shutdown in UTMP...
[ OK ] Stopped Record System Boot/Shutdown in UTMP.
[ OK ] Stopped Create Volatile Files and Directories.
[ OK ] Stopped target Local File Systems.
         Unmounting /etc/hostname...
         Unmounting /etc/hosts...
         Unmounting /etc/resolv.conf...
         Unmounting /run/lock...
         Unmounting /run/secrets/kubernetes.io/serviceaccount...
         Unmounting Temporary Directory /tmp...
         Unmounting /var/log/journal...
[ OK ] Stopped Create System Users.
[FAILED] Failed unmounting /etc/hosts.
[FAILED] Failed unmounting /run/lock.
[FAILED] Failed unmounting /run/sec…/kubernetes.io/serviceaccount.
         Unmounting /run/secrets...
[FAILED] Failed unmounting /etc/resolv.conf.
[FAILED] Failed unmounting Temporary Directory /tmp.
[FAILED] Failed unmounting /var/log/journal.
[FAILED] Failed unmounting /etc/hostname.
[FAILED] Failed unmounting /run/secrets.
[ OK ] Stopped target Swaps.
[ OK ] Reached target System Shutdown.
[ OK ] Reached target Unmount All Filesystems.
[ OK ] Reached target Late Shutdown Services.
         Starting System Halt...
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
All filesystems, swaps, loop devices, MD devices and DM devices detached.
Halting system.
Exiting container.
```
Wrap up
In this article we have seen that:
- systemd workloads need `SIGRTMIN+3` to be stopped gracefully.
- OpenShift does not send the signal specified in the container image (via the `STOPSIGNAL` instruction). It does starting with OpenShift 4.10.
- Until the fix is available, we can use a container lifecycle hook to interact with the workload when the container is stopped. For this scenario, we use the `kill` binary (which must exist in the container) to send `SIGRTMIN+3` to PID 1 (systemd).
Updates:
- The reason the `STOPSIGNAL` instruction is not honored in OpenShift is that the signal name RTMIN+3 is not properly parsed. There is already a fix for this situation (this PR), which will be included in OpenShift 4.10. Until that version is released, the workaround above does the job.