Security and hardening options for systemd service units
A common and reliable pattern in service unit files could be:
[Service]
MemoryDenyWriteExecute=yes
NoNewPrivileges=yes
PrivateBPF=yes
PrivateDevices=yes
PrivateIPC=yes
PrivateTmp=yes
ProtectClock=yes
ProtectControlGroups=yes
ProtectHome=read-only
ProtectHostname=yes
ProtectKernelLogs=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectProc=noaccess
ProcSubset=pid
ProtectSystem=strict
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
LockPersonality=yes
SystemCallArchitectures=native
But there's so much more you can do. Here are some security-related options, with excerpts from and links to their corresponding documentation in systemd's manual pages:
[Unit]
- ConditionPathIsEncrypted: Checks that the underlying file system's backing block device is encrypted using dm-crypt/LUKS.
- ConditionSecurity: May be used to check whether the given security technology is enabled on the system. Supported values are selinux, apparmor, tomoyo, smack, ima, audit, uefi-secureboot, tpm2, cvm and measured-uki. The test may be negated by prepending an exclamation mark.
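As a minimal sketch of how these two conditions might be combined: the path /var/lib/example is a hypothetical placeholder, and the choice of SELinux is only an assumption for illustration. When a condition is not met, the unit is skipped rather than marked as failed.

```ini
[Unit]
# Only start when SELinux is enabled on the host
ConditionSecurity=selinux
# Only start when this (hypothetical) directory lives on a
# dm-crypt/LUKS-backed file system
ConditionPathIsEncrypted=/var/lib/example
```

Writing ConditionSecurity=!selinux instead would negate the first check.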
[Service]
Capabilities
- AmbientCapabilities: Controls which capabilities to include in the ambient capability set for the executed process. Takes a whitespace-separated list of capability names, e.g. CAP_SYS_ADMIN, CAP_DAC_OVERRIDE, CAP_SYS_PTRACE.
- CapabilityBoundingSet: Controls which capabilities to include in the capability bounding set for the executed process. See capabilities(7) for details. Takes a whitespace-separated list of capability names, e.g. CAP_SYS_ADMIN, CAP_DAC_OVERRIDE, CAP_SYS_PTRACE.
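One way these two settings are often paired, sketched here for a non-root service that needs to bind to a port below 1024. The user name example is hypothetical; CAP_NET_BIND_SERVICE is the capability described in capabilities(7).

```ini
[Service]
# Hypothetical unprivileged account
User=example
# Grant only the capability needed to bind to privileged ports
AmbientCapabilities=CAP_NET_BIND_SERVICE
# Prevent the process from ever holding any other capability
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
```

A service that needs no capabilities at all can assign the empty string to CapabilityBoundingSet= to clear the bounding set.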
Devices
- DeviceAllow: Control access to specific device nodes by the executed processes. Takes two space-separated strings: a device node specifier followed by a combination of r, w, m to control reading, writing, or creation of the specific device nodes by the unit (mknod), respectively. When access to all physical devices should be disallowed, PrivateDevices= may be used instead (see below).
- DevicePolicy: Control the policy for allowing device access. It accepts:
  - strict: only allow types of access that are explicitly specified
  - closed: in addition, allows access to standard pseudo devices including /dev/null, /dev/zero, /dev/full, /dev/random, and /dev/urandom
  - auto (default): in addition, allows access to all devices if no explicit DeviceAllow= is present
- PrivateDevices: If set, sets up a new /dev/ mount for the executed processes and only adds API pseudo devices such as /dev/null, /dev/zero or /dev/random (as well as the pseudo TTY subsystem) to it, but no physical devices such as /dev/sda, system memory /dev/mem, system ports /dev/port and others. This is useful to turn off physical device access by the executed process.
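A rough sketch of a service that needs exactly one physical device. The device node /dev/ttyUSB0 is an assumption chosen for illustration. PrivateDevices=yes is deliberately left out here, since it would hide physical devices from the unit entirely.

```ini
[Service]
# Standard pseudo devices (/dev/null, /dev/urandom, ...) stay available
DevicePolicy=closed
# Allow reading and writing one (hypothetical) serial adapter, but not mknod
DeviceAllow=/dev/ttyUSB0 rw
```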
Filesystem
- ProtectHome: Takes a boolean argument or the special values read-only or tmpfs. If true, the directories /home/, /root, and /run/user are made inaccessible and empty for processes invoked by this unit. If set to read-only, the three directories are made read-only instead. If set to tmpfs, temporary file systems are mounted on the three directories in read-only mode. The value tmpfs is useful to hide home directories not relevant to the processes invoked by the unit, while still allowing necessary directories to be made visible when listed in BindPaths= or BindReadOnlyPaths=. Setting this to yes is mostly equivalent to setting the three directories in InaccessiblePaths=. Similarly, read-only is mostly equivalent to ReadOnlyPaths=, and tmpfs is mostly equivalent to TemporaryFileSystem= with :ro.
- ProtectSystem: Takes a boolean argument or the special values full or strict. If true, mounts the /usr and /boot directories read-only for processes invoked by this unit. If set to full, the /etc directory is mounted read-only, too. If set to strict the entire file system hierarchy is mounted read-only, except for the API file system subtrees /dev, /proc and /sys (protect these directories using PrivateDevices=, ProtectKernelTunables=, ProtectControlGroups=). This setting ensures that any modification of the vendor-supplied operating system (and optionally its configuration, and local mounts) is prohibited for the service. It is recommended to enable this setting for all long-running services, unless they are involved with system updates or need to modify the operating system in other ways. If this option is used, ReadWritePaths= may be used to exclude specific directories from being made read-only. This setting is implied if DynamicUser= is set.
- RestrictFileSystems: Restricts the set of filesystems processes of this unit can open files on. Takes a space-separated list of filesystem names, such as ext4 or tmpfs. Any filesystem listed is made accessible to the unit's processes, access to filesystem types not listed is prohibited (allow-listing). If the first character of the list is "~", the effect is inverted: access to the filesystems listed is prohibited (deny-listing). If the empty string is assigned, access to filesystems is not restricted.
- UMask: Controls the file mode creation mask.
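The interplay between these options might look like the following sketch. All paths are hypothetical placeholders, and the UMask value is just one conservative choice.

```ini
[Service]
# Whole file system hierarchy read-only for the service...
ProtectSystem=strict
# ...except for its own (hypothetical) state directory
ReadWritePaths=/var/lib/example
# Hide all home directories behind empty read-only tmpfs mounts...
ProtectHome=tmpfs
# ...but expose one (hypothetical) directory read-only
BindReadOnlyPaths=/home/example/public
# New files are only accessible to the owner
UMask=0077
```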
Kernel
- ProtectKernelLogs: If set, access to the kernel log ring buffer will be denied. It is recommended to turn this on for most services that do not need to read from or write to the kernel log ring buffer. Enabling this option removes CAP_SYSLOG from the capability bounding set for this unit, and installs a system call filter to block the syslog(2) system call. The kernel exposes its log buffer to userspace via /dev/kmsg and /proc/kmsg. If enabled, these are made inaccessible to all the processes in the unit.
- ProtectKernelModules: If set, explicit module loading will be denied. This allows turning off module load and unload operations on modular kernels. It is recommended to turn this on for most services that do not need special file systems or extra kernel modules to work. Enabling this option removes CAP_SYS_MODULE from the capability bounding set for the unit, and installs a system call filter to block module system calls; /usr/lib/modules is also made inaccessible. For this setting the same restrictions regarding mount propagation and privileges apply as for ReadOnlyPaths= and related calls. Note that limited automatic module loading due to user configuration or kernel mapping tables might still happen as a side effect of requested user operations, both privileged and unprivileged.
- ProtectKernelTunables: If set, kernel variables accessible through /proc/sys/, /sys/, /proc/sysrq-trigger, /proc/latency_stats, /proc/acpi, /proc/timer_stats, /proc/fs and /proc/irq will be made read-only and /proc/kallsyms as well as /proc/kcore will be inaccessible to all processes of the unit.
Linux Security Modules
Mandatory Access Control
- AppArmorProfile: Takes a profile name as argument. The process executed by the unit will switch to this profile when started. Profiles must already be loaded in the kernel, or the unit will fail. If prefixed by "-", all errors will be ignored. This setting has no effect if AppArmor is not enabled.
- SELinuxContext: Set the SELinux security context of the executed process. If set, this will override the automated domain transition. However, the policy still needs to authorize the transition. This directive is ignored if SELinux is disabled.
- SmackProcessLabel: Takes a SMACK64 security label as argument. The process executed by the unit will be started under this label and SMACK will decide whether the process is allowed to run or not, based on it. The process will continue to run under the label specified here unless the executable has its own SMACK64EXEC label, in which case the process will transition to run under that label. When not specified, the label that systemd is running under is used. This directive is ignored if SMACK is disabled.
Namespaces
RestrictNamespaces: Restricts access to Linux namespace functionality for the processes of this unit. Either takes a boolean argument, or a space-separated list of namespace type identifiers. If true, access to any kind of namespacing is prohibited. Otherwise, a space-separated list of namespace type identifiers must be specified, consisting of any combination of: cgroup, ipc, net, mnt, pid, user, uts, and time. By prepending the list with a single tilde character ("~") the effect may be inverted: only the listed namespace types will be made inaccessible, all unlisted ones are permitted (deny-listing). If the empty string is assigned, the default namespace restrictions are applied, which is equivalent to false.
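To illustrate the two list forms described above, here is a small sketch of an allow-list and a deny-list variant. Which namespace types a real service needs is an assumption made for illustration only.

```ini
[Service]
# Allow-list: only network and IPC namespaces may be created or entered
RestrictNamespaces=net ipc

# Deny-list alternative (use instead of the line above):
# everything is permitted except user and mount namespaces
#RestrictNamespaces=~user mnt
```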
Cgroup
ProtectControlGroups: Takes a boolean argument or the special values private or strict. If true, the Linux Control Groups (cgroups(7)) hierarchies accessible through /sys/fs/cgroup/ will be made read-only to all processes of the unit. If set to private, the unit will run in a cgroup namespace with a private writable mount of /sys/fs/cgroup/. If set to strict, the unit will run in a cgroup namespace with a private read-only mount of /sys/fs/cgroup/. Note private and strict are downgraded to false and true respectively unless the system is using the unified control group hierarchy and the kernel supports cgroup namespaces.
Clock
ProtectClock: If set, writes to the hardware clock or system clock will be denied. It is recommended to turn this on for most services that do not need to modify the clock. Enabling this option removes CAP_SYS_TIME and CAP_WAKE_ALARM from the capability bounding set for this unit, installs a system call filter to block calls that can set the clock, and DeviceAllow=char-rtc r is implied. This ensures /dev/rtc* are made read-only to the service. If this setting is on, but the unit doesn't have the CAP_SYS_ADMIN capability, NoNewPrivileges=yes is implied.
IPC
PrivateIPC: Takes a boolean argument. If true, sets up a new IPC namespace for the executed processes. Each IPC namespace has its own set of System V IPC identifiers and its own POSIX message queue file system. This is useful to avoid name clash of IPC identifiers.
Mount
- PrivateMounts: If set, the processes of this unit will be run in their own private file system (mount) namespace with all mount propagation from the processes towards the host's main file system namespace turned off. This means any file system mount points established or removed by the unit's processes will be private to them and not be visible to the host.
- PrivateTmp: Takes a boolean argument, or disconnected. If enabled, a new file system namespace will be set up for the executed processes, and /tmp/ and /var/tmp/ directories inside it are not shared with processes outside of the namespace, plus all temporary files created by a service in these directories will be removed after the service is stopped. For this setting, the same restrictions regarding mount propagation and privileges apply as for ReadOnlyPaths= and related calls, see below. This setting is useful to secure access to temporary files of the process, but makes sharing between processes via /tmp/ or /var/tmp/ impossible. If set to yes, the backing storage of the private temporary directories will remain on the host's /tmp/ and /var/tmp/ directories. If disconnected, the directories will be backed by a completely new tmpfs instance, meaning that the storage is fully disconnected from the host namespace.
- ReadWritePaths, ReadOnlyPaths, InaccessiblePaths, ExecPaths, NoExecPaths: Sets up a new file system namespace for executed processes. These options may be used to limit the access a process has to the file system. Each setting takes a space-separated list of paths relative to the host's root directory (i.e. the system running the service manager).
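As a sketch of how the path-based settings can be layered, assuming a hypothetical daemon at /usr/bin/example with hypothetical configuration and secrets directories:

```ini
[Service]
# Private /tmp and /var/tmp, cleaned up when the service stops
PrivateTmp=yes
# Configuration can be read but not modified
ReadOnlyPaths=/etc/example
# This directory is hidden from the service entirely
InaccessiblePaths=/srv/secrets
# Nothing may be executed...
NoExecPaths=/
# ...except the daemon itself and the shared libraries it loads
ExecPaths=/usr/bin/example /usr/lib /usr/lib64
```

The exact set of ExecPaths= a given program needs depends on the distribution's library layout, so the list above should be read as a starting point rather than a complete answer.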
Network
PrivateNetwork: If set, sets up a new network namespace for the executed processes and configures only the loopback network device "lo" inside it. No other network devices will be available to the executed process. This is useful to turn off network access by the executed process.
PID
- ProtectProc: Takes one of noaccess, invisible, ptraceable or default (which it defaults to). When set, this controls the hidepid= mount option of the "procfs" instance for the unit that controls which directories with process metainformation (/proc/PID) are visible and accessible: when set to noaccess the ability to access most of other users' process metadata in /proc/ is taken away for processes of the service. When set to invisible processes owned by other users are hidden from /proc/. If ptraceable all processes that cannot be ptrace()'ed by a process are hidden to it. This option is implemented via file system namespacing, and thus cannot be used with services that shall be able to install mount points in the host file system hierarchy. Note that the root user is unaffected by this option, so to be effective it has to be used together with User= or DynamicUser=yes, and also without the CAP_SYS_PTRACE capability, which also allows a process to bypass this feature. It cannot be used for services that need to access metainformation about other users' processes.
- ProcSubset: Takes one of all (the default) and pid. If pid, all files and directories not directly associated with process management and introspection are made invisible in the /proc/ file system configured for the unit's processes. This controls the subset= mount option of the "procfs" instance for the unit.
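Because the root user is unaffected by ProtectProc=, a working setup needs a non-root identity. A minimal sketch, using DynamicUser=yes as one of the two options the text mentions:

```ini
[Service]
# Run as a transient unprivileged user so hidepid= actually applies
DynamicUser=yes
# Hide processes owned by other users from /proc/
ProtectProc=invisible
# Expose only the per-process parts of /proc/
ProcSubset=pid
# Make sure CAP_SYS_PTRACE (which bypasses the above) cannot be held
CapabilityBoundingSet=~CAP_SYS_PTRACE
```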
User
PrivateUsers: If enabled, sets up a new user namespace for the executed processes and configures a user and group mapping. It accepts:
- yes or self: a minimal user and group mapping is configured that maps the root user and group as well as the unit's own user and group to themselves and everything else to the nobody user and group. This is useful to securely detach the user and group databases used by the unit from the rest of the system, and thus to create an effective sandbox environment. All files, directories, processes, IPC objects and other resources owned by users/groups not equaling root or the unit's own will stay visible from within the unit but appear owned by the nobody user and group.
- identity: user namespacing is set up with an identity mapping for the first 65536 UIDs/GIDs. Any UIDs/GIDs above 65536 will be mapped to the nobody user and group, respectively. While this does not provide UID/GID isolation, since all UIDs/GIDs are chosen identically it does provide process capability isolation, and hence is often a good choice if proper user namespacing with distinct UID maps is not appropriate.
- full: user namespacing is set up with an identity mapping for all UIDs/GIDs. In addition, for system services, it allows the unit to call setgroups() system calls (by setting /proc/pid/setgroups to allow). Similar to identity, this does not provide UID/GID isolation, but it does provide process capability isolation. If this mode is enabled, all unit processes are run without privileges in the host user namespace (regardless of whether the unit's own user/group is root or not). Specifically this means that the process will have zero process capabilities on the host's user namespace, but full capabilities within the service's user namespace. Settings such as CapabilityBoundingSet= will affect only the latter, and there's no way to acquire additional capabilities in the host's user namespace.
- managed: a transient, dynamically allocated range of 65536 UIDs/GIDs is allocated for the unit, and a UID/GID mapping is assigned to the unit's process so the UID/GID 0 from inside the unit maps to the first UID/GID of the allocated mapping. Note that in this mode the UID/GID the service process will run as is different depending if looking from the host side (where it will be a high, dynamically assigned UID) or from inside the unit (where it will be 0). Also note that this mode will enable file system UID mapping for the file systems this service accesses, mapping the "foreign" UID range on disk to the selected dynamic UID range at runtime.
UTS (hostname)
ProtectHostname: If set, sets up a new UTS namespace for the executed processes. In addition, changing hostname or domainname is prevented.
Seccomp
- LockPersonality: If set, locks down the personality(2) system call so that the kernel execution domain may not be changed from the default or the personality selected with the Personality= directive. This may be useful to improve security, because odd personality emulations may be poorly tested and a source of vulnerabilities.
- SystemCallArchitectures: Takes a space-separated list of architecture identifiers to include in the system call filter. If this setting is used, processes of this unit will only be permitted to call native system calls, and system calls of the specified architectures. The special identifier native implicitly maps to the native architecture of the system (or more precisely: to the architecture the system manager is compiled for).
- SystemCallFilter: Takes a space-separated list of system call names or system call groups. If this setting is used, system calls executed by the unit processes except for the listed ones will result in the system call being denied (allow-listing). If the first character of the list is "~", the effect is inverted: only the listed system calls will be denied (deny-listing). This option may be specified more than once, in which case the filter masks are merged. If the empty string is assigned, the filter is reset, all prior assignments will have no effect. The default action when a system call is denied is to terminate the processes with a SIGSYS signal. This can be changed using SystemCallErrorNumber=.
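The merging behaviour of repeated SystemCallFilter= lines can be sketched as follows. The groups @system-service, @privileged and @resources are predefined system call sets shipped with systemd; whether this particular combination suits a given service is an assumption that has to be tested.

```ini
[Service]
# Only native system calls, no 32-bit compatibility ABI
SystemCallArchitectures=native
# Start from an allow-list of calls typical services need
SystemCallFilter=@system-service
# Then remove privileged and resource-tuning calls from that allow-list
SystemCallFilter=~@privileged @resources
# Return EPERM to the caller instead of killing it with SIGSYS
SystemCallErrorNumber=EPERM
```

Returning an error number tends to make failures easier to diagnose, at the cost of letting a misbehaving process keep running.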
Miscellaneous
BPF
PrivateBPF: If set, mount a private instance of the BPF filesystem on /sys/fs/bpf/, effectively hiding the host bpffs which contains information about loaded programs and maps. Otherwise, if ProtectKernelTunables= is set, the instance from the host is inherited but mounted read-only.
IPC
RemoveIPC: If set, all System V and POSIX IPC objects owned by the user and group the processes of this unit are run as are removed when the unit is stopped. This setting only has an effect if at least one of User=, Group= and DynamicUser= are used. It has no effect on IPC objects owned by the root user. Specifically, this removes System V semaphores, as well as System V and POSIX shared memory segments and message queues. If multiple units use the same user or group the IPC objects are removed when the last of these units is stopped.
Memory
MemoryDenyWriteExecute: If set, attempts to create memory mappings that are writable and executable at the same time, or to change existing memory mappings to become executable, or mapping shared memory segments as executable are prohibited. Specifically, a system call filter is added that rejects mmap(2) system calls with both PROT_EXEC and PROT_WRITE set, mprotect(2) or pkey_mprotect(2) system calls with PROT_EXEC set and shmat(2) system calls with SHM_EXEC set. Note that this option is incompatible with programs and libraries that generate program code dynamically at runtime, including JIT execution engines, executable stacks, and the code "trampoline" feature of various C compilers. This option improves service security, as it makes it harder for software exploits to change running code dynamically. However, the protection can be circumvented if the service can write to a filesystem which is not mounted with noexec (such as /dev/shm), or if it can use memfd_create(). This can be prevented by making such file systems inaccessible to the service (e.g. InaccessiblePaths=/dev/shm) and installing further system call filters (SystemCallFilter=~memfd_create).
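Putting the setting together with the two mitigations named above gives a fragment along these lines. It is a sketch only: a service that legitimately uses /dev/shm or memfd_create() would break under it.

```ini
[Service]
# Reject writable+executable memory mappings
MemoryDenyWriteExecute=yes
# Close the bypass via a writable file system not mounted noexec
InaccessiblePaths=/dev/shm
# Close the bypass via anonymous memory-backed files
SystemCallFilter=~memfd_create
```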
Networking
- IPAddressAllow, IPAddressDeny: Turn on network traffic filtering for IP packets sent and received over AF_INET and AF_INET6 sockets. Both directives take a space-separated list of IPv4 or IPv6 addresses, each optionally suffixed with an address prefix length in bits after a "/" character. If the suffix is omitted, the address is considered a host address, i.e. the filter covers the whole address (32 bits for IPv4, 128 bits for IPv6).
- RestrictAddressFamilies: Restricts the set of socket address families accessible to the processes of this unit. Takes none, or a space-separated list of address family names to allow-list, such as AF_UNIX, AF_INET or AF_INET6, see address_families(7) for all possible options. When none is specified, then all address families will be denied. When prefixed with "~" the listed address families will be applied as deny list, otherwise as allow list.
- RestrictNetworkInterfaces: Takes a list of space-separated network interface names. This option restricts the network interfaces that processes of this unit can use. By default, processes can only use the network interfaces listed (allow-list). If the first character of the rule is "~", the effect is inverted: the processes can only use network interfaces not listed (deny-list).
- SocketBindAllow, SocketBindDeny: Configures restrictions on the ability of unit processes to invoke bind(2) on a socket. Both allow and deny rules may be defined that restrict which addresses a socket may be bound to.
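A sketch of how these network restrictions might be stacked for a service that listens on one TCP port and talks only to a local subnet. The subnet 192.0.2.0/24 (a documentation range) and port 8080 are placeholders chosen for illustration.

```ini
[Service]
# Only IPv4/IPv6 sockets; no AF_UNIX, AF_NETLINK, AF_PACKET, ...
RestrictAddressFamilies=AF_INET AF_INET6
# Drop all IP traffic by default...
IPAddressDeny=any
# ...except loopback and one (hypothetical) subnet; allow entries win over deny
IPAddressAllow=localhost 192.0.2.0/24
# Forbid bind(2) everywhere...
SocketBindDeny=any
# ...except on one (hypothetical) TCP port
SocketBindAllow=tcp:8080
```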
Privileges
- NoNewPrivileges: If set, ensures that the service process and all its children can never gain new privileges through execve() (e.g. via setuid or setgid bits, or filesystem capabilities). This is the simplest and most effective way to ensure that a process and its children can never elevate privileges again. Defaults to false, but certain settings override this and ignore the value of this setting. This is the case when SystemCallFilter=, SystemCallArchitectures=, RestrictAddressFamilies=, RestrictNamespaces=, PrivateDevices=, ProtectKernelTunables=, ProtectKernelModules=, MemoryDenyWriteExecute=, RestrictRealtime=, RestrictSUIDSGID=, DynamicUser= or LockPersonality= are specified. Note that even if this setting is overridden by them, systemctl shows the original value of this setting. See also: No New Privileges Flag.
- RestrictSUIDSGID: If set, any attempts to set the set-user-ID (SUID) or set-group-ID (SGID) bits on files or directories will be denied (for details on these bits see inode(7)).
- SecureBits: Controls the secure bits set for the executed process. Takes a space-separated combination of options from the following list: keep-caps, keep-caps-locked, no-setuid-fixup, no-setuid-fixup-locked, noroot, and noroot-locked.
Scheduler
RestrictRealtime: If set, any attempts to enable realtime scheduling in a process of the unit are refused. This restricts access to realtime task scheduling policies such as SCHED_FIFO, SCHED_RR or SCHED_DEADLINE. See sched(7) for details about these scheduling policies. Realtime scheduling policies may be used to monopolize CPU time for longer periods of time, and may hence be used to lock up or otherwise trigger Denial-of-Service situations on the system.