Preventing outages with pkill's new --require-handler flag

A little while ago, I wrote an article about the dangers of signals in production, where I detailed a particularly frustrating outage in Meta production caused by the removal of a SIGHUP handler. At Meta, we have a service called LogDevice. As many other daemons do, LogDevice had a signal handler registered for SIGHUP, with the goal being to reopen the log file after rotation. All pretty innocuous stuff.

One day, the team cleaned up their codebase and removed some unused features, including the code that responded to SIGHUP signals, which was now unneeded as it was handled in another way. They thought they had also removed all callsites which sent SIGHUP.

Everything worked fine for a few weeks until, suddenly, service nodes started dropping at an alarming rate in the middle of the night, and a LogDevice outage occurred. As it turned out, there was a logrotate configuration that was still configured to send SIGHUP to the process after rotating logs. With the handler now removed, the default behaviour for SIGHUP in the kernel kicked in instead – to immediately terminate the process.

Why are signals so problematic?

The fundamental issue is that many signals, including the widely used SIGHUP, have a default disposition of terminating the process. This reflects the literal meaning embedded in SIGHUP's name – "hangup" – which indicates the terminal connection was severed, making process continuation unnecessary. This original meaning remains valid today, and is used in places like remote SSH sessions where disconnection triggers process termination.

SIGHUP has, however, also started moonlighting with an additional purpose: requesting applications to reload their configuration or rotate their logs. This second meaning emerged because daemon processes typically run detached from any controlling terminal. Since these background services wouldn't naturally receive terminal disconnection signals, application authors repurposed SIGHUP. Unfortunately, these mixed signals about SIGHUP's meaning create a confusing interface with competing interpretations.

This exact scenario has caused multiple production incidents at Meta, and I've heard similar stories from colleagues at other companies. It's particularly insidious because:

  1. The change that removes the signal handler often seems harmless
  2. Testing rarely catches it because the signal isn't triggered until much later
  3. The failure can occur weeks or months after the code change
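
To make the idea of a disposition concrete, here is a minimal sketch of how a process can ask the kernel how it currently treats a given signal. The names disposition and on_signal are made up for illustration, and this is not code from LogDevice or procps-ng:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: classify how the calling process currently treats
 * `signo`. Returns "default", "ignored", "caught", or "error" if the
 * sigaction() query itself fails. */
static const char *disposition(int signo) {
    struct sigaction current;

    /* Passing NULL as the new action only queries the existing one. */
    if (sigaction(signo, NULL, &current) < 0) {
        return "error";
    }
    /* A three-argument handler installed with SA_SIGINFO counts as caught. */
    if (current.sa_flags & SA_SIGINFO) {
        return "caught";
    }
    if (current.sa_handler == SIG_DFL) {
        return "default";
    }
    if (current.sa_handler == SIG_IGN) {
        return "ignored";
    }
    return "caught";
}

/* Hypothetical no-op handler, used only to demonstrate the "caught" state. */
static void on_signal(int signo) {
    (void)signo;
}
```

In a daemon whose SIGHUP handler has been removed, disposition(SIGHUP) would typically report "default", and for SIGHUP the default action is to terminate the process. After something like signal(SIGHUP, on_signal), it reports "caught" instead.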

Mitigating with --require-handler

As part of my work to mitigate some of the dangers around signals, I've added a new flag to pkill called --require-handler (or -H for short). This flag ensures that signals are only sent to processes that have actually registered a handler for that signal.

For an example of how it can avoid these kinds of incidents, let's look at a typical logrotate configuration that you might find in the wild:

/var/log/cron
/var/log/maillog
/var/log/messages
/var/log/secure
/var/log/spooler
{
    sharedscripts
    postrotate
        /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    endscript
}

This pattern is extremely common in system administration scripts. It reads a PID from a file and sends a signal directly. However, it has no way to verify whether the process actually has a handler for that signal before sending it.

We can combine the new pkill -H flag with pkill's existing -F flag (which reads PIDs from a file) to create a much safer alternative:

/var/log/cron
/var/log/maillog
/var/log/messages
/var/log/secure
/var/log/spooler
{
    sharedscripts
    postrotate
        pkill -H -HUP -F /var/run/syslogd.pid 2> /dev/null || true
    endscript
}

This version only sends a SIGHUP signal if the process has a SIGHUP handler, preventing accidental termination. It's also a lot simpler than the original shell command, with fewer moving parts and subshells.

How does it work?

So how does pkill detect whether a process has a signal handler registered? The answer lies in the Linux procfs interface, specifically in /proc/[pid]/status.

If you examine this file for any extant process, you'll find several signal-related fields:

% grep '^Sig' /proc/self/status
SigQ:   0/254456
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000400

The key field here is SigCgt (that is, "signals caught"), which shows which signals have userspace handlers registered in this process. These fields are hexadecimal bitmaps where each bit represents a signal number. If bit N-1 is set, it means there's a handler for signal N. From man 5 proc_pid_status:

SigBlk, SigIgn, SigCgt: Masks (expressed in hexadecimal) indicating signals being blocked, ignored, and caught (see signal(7)).

So to interpret these fields, we first decode them from hexadecimal, and then check the relevant bit position. For SIGUSR1, for example, we can see that the signal number is 10:

% kill -l USR1
10

So since we have a userspace signal handler if bit N-1 is set, we should check if bit 9 is set.
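
As a worked example of the bit arithmetic, here is a minimal sketch that turns a SigCgt-style mask into a list of signal numbers. The function name caught_signals is hypothetical and the code is only illustrative. Applied to the sample value above, 0000000000000400, it reports that bit 10 is set, which corresponds to signal 11:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: decode a SigCgt-style hexadecimal mask and store the
 * number of every signal whose bit is set into `out` (at most `max` entries).
 * Returns how many entries were stored, or -1 if `hex` is not valid hex. */
static int caught_signals(const char *hex, int *out, int max) {
    char *end;
    unsigned long long mask;
    int count = 0;

    errno = 0;
    mask = strtoull(hex, &end, 16);
    if (errno || end == hex || *end != '\0') {
        return -1;
    }
    /* Bit N-1 corresponds to signal N, so signal numbers start at 1. */
    for (int signo = 1; signo <= 64; signo++) {
        if ((mask & (1ULL << (signo - 1))) && count < max) {
            out[count++] = signo;
        }
    }
    return count;
}
```

A mask of 0000000000000201 would decode to signals 1 and 10, that is, a process with handlers for both SIGHUP and SIGUSR1.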

That's exactly how this new feature works, too. When you call pkill -H, we read the bitmap from /proc/[pid]/status and check if the bit corresponding to the signal is set:

static unsigned long long unhex(const char *restrict in) {
    unsigned long long ret;
    char *rem;
    errno = 0;
    ret = strtoull(in, &rem, 16);
    if (errno || *rem != '\0') {
        xwarnx(_("not a hex string: %s"), in);
        return 0;
    }
    return ret;
}

static int match_signal_handler(const char *restrict sigcgt,
                                const int signal) {
    return sigcgt &&
           (((1UL << (signal - 1)) & unhex(sigcgt)) != 0);
}
src/pgrep.c from procps-ng
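
The excerpt above starts from a SigCgt string that has already been extracted. As an illustration of that earlier step, here is a minimal sketch of pulling the value out of a single line of /proc/[pid]/status. The helper parse_sigcgt_line is a hypothetical name, and this is not how procps-ng itself reads the field:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if `line` is the "SigCgt:" line of a status file,
 * copy its hexadecimal value into `out` and return 0. Returns -1 if the
 * line is a different field, has no value, or the value does not fit. */
static int parse_sigcgt_line(const char *line, char *out, size_t outlen) {
    static const char key[] = "SigCgt:";
    const size_t keylen = sizeof(key) - 1;
    const char *value;
    size_t len;

    if (strncmp(line, key, keylen) != 0) {
        return -1;
    }
    /* Skip the tab or spaces after the key, then keep only hex digits. */
    value = line + keylen;
    value += strspn(value, " \t");
    len = strspn(value, "0123456789abcdefABCDEF");
    if (len == 0 || len >= outlen) {
        return -1;
    }
    memcpy(out, value, len);
    out[len] = '\0';
    return 0;
}
```

In practice you would call this on each line read with fgets() from /proc/[pid]/status until it returns 0, and then pass the resulting string to a bit check like the one above.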

Let's see this in action with a concrete example. First, let's create a simple program that installs a SIGHUP handler:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static volatile sig_atomic_t hup;
void handler(int sig) {
    (void)sig;
    hup = 1;
}

int main() {
    struct sigaction sa = {.sa_handler = handler};
    if (sigaction(SIGHUP, &sa, 0) < 0) {
        exit(1);
    }

    FILE *f = fopen("/tmp/handler.pid", "w");
    if (!f) {
        exit(1);
    }
    if (fprintf(f, "%d\n", getpid()) < 0) {
        exit(1);
    }
    fclose(f);

    for (;;) {
        pause();
        if (hup) {
            hup = 0;
            printf("Reloading...\n");
        }
    }
}

If we compile and run this program, we can then check if it has a SIGHUP handler using our new flag:

$ grep SigCgt /proc/12345/status
SigCgt: 0000000000000001

$ pkill -H -HUP -F /tmp/handler.pid
$ # Process is still alive and handled the signal

$ pkill -H -USR1 -F /tmp/handler.pid
$ # No match (no USR1 handler), so no signal sent

Now let's try a process without a handler:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    FILE *f = fopen("/tmp/nohandler.pid", "w");
    if (!f) {
        exit(1);
    }
    if (fprintf(f, "%d\n", getpid()) < 0) {
        exit(1);
    }
    fclose(f);
    pause();
}

And test it:

$ grep SigCgt /proc/12346/status
SigCgt: 0000000000000000

$ pkill -H -HUP -F /tmp/nohandler.pid
$ # No match, nothing happens
$ pkill -HUP -F /tmp/nohandler.pid
$ # Process terminated due to unhandled SIGHUP

Real-world implementation

pkill -H is most valuable in system management contexts where signals are traditionally used for control, such as:

  1. When telling a service to reopen or rotate its logs
  2. When asking a service to reload its config
  3. When controlling long-running processes through signals

For example, you can audit your /etc/logrotate.d/ directory and update any postrotate scripts that send signals from something like this:

/var/log/nginx/*.log {
    daily
    rotate 14
    # ...
    postrotate
        kill -HUP $(cat /var/run/nginx.pid)
    endscript
}

…to something like this:

/var/log/nginx/*.log {
    daily
    rotate 14
    # ...
    postrotate
        pkill -H -HUP -F /var/run/nginx.pid || true
    endscript
}

Due diligence

In general, when removing a signal handler from your application, it pays to do some spelunking and dig through any code or configuration that might send that signal to your process, such as logrotate postrotate scripts like the ones shown above.

Here is a small bpftrace program which can also help. Pass it a pid on the command line, and it will tell you which signals are being sent to that pid, and by which process:

#!/usr/bin/env bpftrace

BEGIN {
    if ($# < 1) {
        printf("Usage: ./sigcheck.bt PID\n");
        exit();
    }
    @target = $1;
    printf("Tracing signals sent to PID %d, ^C to end\n", @target);
}

tracepoint:signal:signal_generate
/ args->pid == @target /
{
    @sig_count[args->sig]++;
    /* pid/comm are from the _sending_ pid */
    printf("from pid=%-6d comm=%-16s sig=%d\n", pid, comm, args->sig);
}

END {
    printf("\nSignal summary for PID %d:\n", @target);
    print(@sig_count);
    clear(@target);
    clear(@sig_count);
}

$ sudo ./sigcheck.bt 882492
Attaching 3 probes...
Tracing signals sent to PID 882492, ^C to end
from pid=1      comm=systemd          sig=1
from pid=885998 comm=kill             sig=10
from pid=1      comm=systemd          sig=1
^C
Signal summary for PID 882492:
@sig_count[1]: 2
@sig_count[10]: 1

This program can help you find signals being sent that you might not be expecting before you land a change to remove a signal handler.

Using pkill -H adds a safety net, but it's still best to clean up all signal senders when removing a handler, and to prefer other kinds of IPC (like varlink or similar) for signalling state changes where possible. All in all, pkill -H provides a simple but effective safeguard against one of the most common signal-related problems in production environments. This isn't a silver bullet for all signal-related issues – signals still have many other problems, as detailed in my previous article – but for systems where signals can't be entirely avoided, this flag adds a meaningful layer of protection.

pkill -H was released as part of procps-ng 4.0.3, which should be in most distributions now. While signals might be deeply entrenched in Unix and Linux, that doesn't mean we can't make them safer to use, and pkill -H is one step towards that goal. For new applications, I still recommend using more explicit IPC mechanisms when possible, but for every system where that's not possible, using pkill -H is a great way to eliminate an entire class of outage.

Many thanks to Craig for reviewing the patch, and Diego and Michael for providing suggestions on this post.