Preventing outages with pkill's new --require-handler flag

tl;dr: pkill now has a --require-handler (or -H) flag which only sends the signal to processes that have a userspace handler installed for that signal. Using it in places like logrotate scripts means that a missed signal sender results in a visible, non-fatal failure instead of unexpectedly terminating a process that no longer handles the signal.


A little while ago, I wrote an article about the dangers of signals in production, where I detailed a particularly frustrating outage in Meta production caused by the removal of a SIGHUP handler. At Meta, we have a service called LogDevice. As many other daemons do, LogDevice had a signal handler registered for SIGHUP, with the goal being to reopen the log file after rotation. All pretty innocuous stuff.

One day, the team cleaned up their codebase and removed some unused features, including the code that responded to SIGHUP signals, which was now unneeded as it was handled in another way. They thought they had also removed all callsites which sent SIGHUP.

Everything worked fine for a few weeks until, suddenly, service nodes started dropping at an alarming rate in the middle of the night, and a LogDevice outage occurred. As it turned out, there was a logrotate configuration that was still configured to send SIGHUP to the process after rotating logs. With the handler now removed, the default behaviour for SIGHUP in the kernel kicked in instead – to immediately terminate the process.

Importantly, this wasn't just on one machine, but across a whole cluster. Services, of course, must be designed to handle restarts or machines or racks falling offline, but when thousands of servers terminate at once, things become far more complicated. Replicas need to be promoted to primaries en masse, connection pools collapse, cache coherency breaks down, and eventually, you simply don't have enough capacity to run your service properly.

Why are signals so problematic?

The fundamental issue is that many signals, including the widely used SIGHUP, have a default disposition of terminating the process. This reflects the literal meaning embedded in SIGHUP's name – "hangup" – which indicates the terminal connection was severed, making continued process execution unnecessary. This original meaning remains valid today, and is used in places like remote SSH sessions where disconnection triggers process termination.
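
You can see this default disposition in action with any process that doesn't install a handler (sleep here is just a convenient stand-in):

$ sleep 60 &
$ kill -HUP $!
$ # sleep is gone: terminated by SIGHUP's default disposition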

SIGHUP has, however, also started moonlighting with an additional purpose: requesting applications to reload their configuration or rotate their logs. This mixed interpretation emerged because daemon processes typically run detached from any controlling terminal; since these background services would never naturally receive a terminal disconnection signal, application authors repurposed SIGHUP for their own ends. Unfortunately, these mixed signals about SIGHUP's meaning create a confusing interface with competing interpretations.

This exact scenario has caused multiple production incidents at Meta, and I've heard similar stories from colleagues at other companies. It's particularly insidious because:

  1. The change that removes the signal handler often seems harmless
  2. Testing rarely catches it because the signal isn't triggered until much later
  3. The failure can occur weeks or months after the code change

Mitigating with --require-handler

As part of my work to mitigate some of the dangers around signals, I've added a new flag to pkill called --require-handler (or -H for short). This flag ensures that signals are only sent to processes that have actually registered a handler for that signal.

For an example of how it can avoid these kinds of incidents, let's look at a typical logrotate configuration that you might find in the wild:

/var/log/cron
/var/log/maillog
/var/log/messages
/var/log/secure
/var/log/spooler
{
    sharedscripts
    postrotate
        /bin/kill -HUP `cat /var/run/syslogd.pid` || true
    endscript
}

This pattern is extremely common in system administration scripts. It reads a PID from a file and sends a signal directly. However, it has no way to verify whether the process actually has a handler for that signal before sending it.

We can combine the new pkill -H flag with pkill's existing -F flag (which reads PIDs from a file) to create a much safer alternative:

/var/log/cron
/var/log/maillog
/var/log/messages
/var/log/secure
/var/log/spooler
{
    sharedscripts
    postrotate
        pkill -H -HUP -F /var/run/syslogd.pid
    endscript
}

This version only sends a SIGHUP signal if the process has a SIGHUP handler, preventing accidental termination. It's also a lot simpler than the original shell command, with fewer moving parts and subshells.

It's worth also noting that pkill -H doesn't silently hide the problem. When no process with a matching handler is found, it returns exit code 1, which means your log rotation or other scripts will explicitly fail, triggering alerts in monitoring systems. Instead of an unexpected process termination that might go undetected until service degradation occurs, you now have a specific, actionable error that points directly to the root cause: a missing signal handler.
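
If you want that failure to be even more explicit in your scripts, you can check pkill's exit status yourself. A minimal sketch (the pid file path, daemon name, and syslog priority are all illustrative):

if ! pkill -H -HUP -F /var/run/myservice.pid; then
    # No matching process had a SIGHUP handler; log it so the failure is easy to trace
    logger -p daemon.err "logrotate: no SIGHUP handler found for myservice"
fi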

To be clear: pkill -H is not intended to replace proper engineering practices like thorough auditing of signal senders, rigorous testing, canaries, or careful code reviews. Rather, it provides an additional safety net for when those processes inevitably miss something in complex systems. The goal is to transform an uncontrolled, catastrophic failure (termination of a service process widely across a fleet) into a controlled, traceable one (a failed log rotation that triggers an alert).

At the scale that we, Google, and similar companies operate, even "perfect" solutions are fundamentally asymptotic, and we have to mitigate accordingly.

Isn't this just papering over the cracks?

One counterargument I've seen is basically this:

If the handler is missing, shouldn't we let the process crash so that monitoring detects it?

The problem isn't a single process crashing; we can tolerate that. The problem is the cascading failure when thousands of processes crash simultaneously. Consider what happens when a fleet-wide signal sender, like logrotate, fires against processes that no longer have a handler:

  1. Thousands of processes receive the signal at roughly the same time
  2. Each one is immediately terminated by the kernel's default disposition, with no chance to drain or shut down gracefully
  3. The mass restart cascades: primaries must be re-elected, caches are cold, connection pools collapse, and the remaining capacity is overwhelmed

Whereas, in the pkill -H case:

  1. No signal is sent to processes without a handler, so nothing terminates
  2. pkill exits non-zero, so the log rotation (or other) job fails visibly
  3. Monitoring flags the failed job, and the stale signal sender can be investigated and removed calmly

Quite a difference. This shifts the risk, if a signal sender is missed, from a catastrophic, fleet-wide outage to a localised operational failure that can be debugged during business hours. At scale, this is the difference between a midnight, all-hands-on-deck outage and a high priority task. I think I speak for anyone who works on these systems when I say I'd much prefer the latter :-)

pkill -H doesn't hide the problem; it transforms it into something more manageable. The failure is still detected and fixed, just without the midnight wakeup call and frantic incident management.

How does it work?

So how does pkill detect whether a process has a signal handler registered? The answer lies in the Linux procfs interface, specifically in /proc/[pid]/status.

If you examine this file for any extant process, you'll find several signal-related fields:

% grep '^Sig' /proc/self/status
SigQ:   0/254456
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000400

The key field here is SigCgt (that is, "signals caught"), which shows which signals have userspace handlers registered in this process. These fields are hexadecimal bitmaps where each bit represents a signal number. If bit N-1 is set, it means there's a handler for signal N. From man 5 proc_pid_status:

SigBlk, SigIgn, SigCgt: Masks (expressed in hexadecimal) indicating signals being blocked, ignored, and caught (see signal(7)).

So to interpret these, we first decode them from hexadecimal, and then check the relevant bit position. For SIGUSR1, for example, we can see the signal number is 10:

% kill -l USR1
10

Since a userspace handler for signal N corresponds to bit N-1 being set, for SIGUSR1 we should check whether bit 9 is set.
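
As a rough shell sketch (not how pkill itself does it), you can decode SigCgt for an arbitrary PID and signal number like this:

#!/bin/bash
# Sketch: does PID $1 have a userspace handler for signal number $2?
pid=$1
signum=$2
mask=$(awk '/^SigCgt:/ {print $2}' "/proc/$pid/status")
# Bit N-1 of the mask corresponds to signal N.
if (( (16#$mask >> (signum - 1)) & 1 )); then
    echo "PID $pid has a handler for signal $signum"
else
    echo "PID $pid has no handler for signal $signum"
fi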

That's exactly how this new feature works, too. When you call pkill -H, we read the bitmap from /proc/[pid]/status and check if the bit corresponding to the signal is set:

static unsigned long long unhex(const char *restrict in) {
    unsigned long long ret;
    char *rem;
    errno = 0;
    ret = strtoull(in, &rem, 16);
    if (errno || *rem != '\0') {
        xwarnx(_("not a hex string: %s"), in);
        return 0;
    }
    return ret;
}

static int match_signal_handler(const char *restrict sigcgt,
                                const int signal) {
    return sigcgt &&
           (((1UL << (signal - 1)) & unhex(sigcgt)) != 0);
}
src/pgrep.c from procps-ng
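
As a standalone illustration of that matching logic (a simplified sketch without procps-ng's xwarnx/_() error reporting, not the code as it ships), you can feed it a SigCgt string directly:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static unsigned long long unhex(const char *in) {
    /* Parse the hexadecimal mask as read from /proc/[pid]/status */
    return strtoull(in, NULL, 16);
}

static int match_signal_handler(const char *sigcgt, const int signal) {
    /* Bit N-1 set means a userspace handler is registered for signal N */
    return sigcgt && (((1UL << (signal - 1)) & unhex(sigcgt)) != 0);
}

int main(void) {
    /* SigCgt as it appears for a process whose only handler is SIGHUP */
    const char *sigcgt = "0000000000000001";
    printf("SIGHUP handled: %d\n", match_signal_handler(sigcgt, SIGHUP));   /* prints 1 */
    printf("SIGUSR1 handled: %d\n", match_signal_handler(sigcgt, SIGUSR1)); /* prints 0 */
    return 0;
}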

Let's see this in action with a concrete example. First, let's create a simple program that installs a SIGHUP handler:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static volatile sig_atomic_t hup;
void handler(int sig) {
    (void)sig;
    hup = 1;
}

int main() {
    struct sigaction sa = {.sa_handler = handler};
    if (sigaction(SIGHUP, &sa, 0) < 0) {
        exit(1);
    }

    FILE *f = fopen("/tmp/handler.pid", "w");
    if (!f) {
        exit(1);
    }
    if (fprintf(f, "%d\n", getpid()) < 0) {
        exit(1);
    }
    fclose(f);

    for (;;) {
        pause();
        if (hup) {
            hup = 0;
            printf("Reloading...\n");
        }
    }
}

If we compile and run this program, we can then check if it has a SIGHUP handler using our new flag:
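
For example (handler.c is an illustrative filename; the PID 12345 below is just whatever the process happens to get):

$ cc -o handler handler.c
$ ./handler &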

$ grep SigCgt /proc/12345/status
SigCgt: 0000000000000001

$ pkill -H -HUP -F /tmp/handler.pid
$ # Process is still alive and handled the signal

$ pkill -H -USR1 -F /tmp/handler.pid
$ # No match (no USR1 handler), so no signal sent

Now let's try a process without a handler:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    FILE *f = fopen("/tmp/nohandler.pid", "w");
    if (!f) {
        exit(1);
    }
    if (fprintf(f, "%d\n", getpid()) < 0) {
        exit(1);
    }
    fclose(f);
    pause();
}

And test it:

$ grep SigCgt /proc/12346/status
SigCgt: 0000000000000000

$ pkill -H -HUP -F /tmp/nohandler.pid
$ # No match, nothing happens
$ pkill -HUP -F /tmp/nohandler.pid
$ # Process terminated due to unhandled SIGHUP

Real world implementation

pkill -H is most valuable in system management contexts where signals are traditionally used for control, such as:

  1. When telling a service to reopen or rotate its logs
  2. When asking a service to reload its config
  3. When controlling long running processes through signals

For example, you can audit your /etc/logrotate.d/ directory and update any postrotate scripts that send signals from something like this:

/var/log/nginx/*.log {
    daily
    rotate 14
    # ...
    postrotate
        kill -HUP $(cat /var/run/nginx.pid)
    endscript
}

…to something like this:

/var/log/nginx/*.log {
    daily
    rotate 14
    # ...
    postrotate
        pkill -H -HUP -F /var/run/nginx.pid
    endscript
}

With this, when there is no signal handler registered, the pkill command will exit with a non-zero status code, causing your logrotate job to fail. This failure becomes a useful signal in itself: you can leverage this behaviour to find signal senders that are no longer needed and investigate the best course of action.
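
Where that failure surfaces depends on how logrotate is run. On a systemd-based host, for example, you might spot it with something like the following (the unit name varies by distribution):

$ systemctl --failed | grep -i logrotate
$ journalctl -u logrotate.service --since today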

Due diligence

In general, when removing a signal handler from your application, it pays to do some spelunking and dig through any code or configuration that might send that signal to your process. This might include:

  1. logrotate configurations and other cron jobs or scripts that send signals from postrotate-style hooks
  2. service management and deployment tooling, like systemd units, init scripts, or health checkers
  3. your own codebase and its dependencies, looking for calls to kill() or raise(), or shellouts to kill and pkill

Here is a small bpftrace program which can also help. Pass it a pid on the command line, and it will tell you which signals are being sent to that pid, and by which process:

#!/usr/bin/env bpftrace

BEGIN {
    if ($# < 1) {
        printf("Usage: ./sigcheck.bt PID\n");
        exit();
    }
    @target = $1;
    printf("Tracing signals sent to PID %d, ^C to end\n", @target);
}

tracepoint:signal:signal_generate
/ args->pid == @target /
{
    @sig_count[args->sig]++;
    /* pid/comm are from the _sending_ pid */
    printf("pid=%-6d comm=%-16s sig=%d\n", pid, comm, args->sig);
}

END {
    printf("\nSignal summary for PID %d:\n", @target);
    print(@sig_count);
    clear(@target);
    clear(@sig_count);
}

$ sudo ./sigcheck.bt 882492
Attaching 3 probes...
Tracing signals sent to PID 882492, ^C to end
from pid=1      comm=systemd          sig=1
from pid=885998 comm=kill             sig=10
from pid=1      comm=systemd          sig=1
^C
Signal summary for PID 882492:
@sig_count[1]: 2
@sig_count[10]: 1

This program can help you find signals being sent that you might not be expecting before you land a change to remove a signal handler.

Using pkill -H adds a safety net, but it's still ideal to clean up all signal senders when removing a handler, and to prefer other kinds of IPC (like varlink or similar) for signalling state changes where possible. All in all, pkill -H provides a simple but effective defence-in-depth safeguard against one of the most common signal-related problems in production environments. This isn't a silver bullet for all signal-related issues – signals still have many other problems, as detailed in my previous article – but for systems where signals can't be entirely avoided, this flag adds a meaningful layer of protection.

In terms of eliminating the straggling signal senders in our production environment, we've found that a combined strategy works best:

  1. Audit and remove known signal senders (code, logrotate configurations, cron jobs, and the like) when a handler is removed, using tools like the bpftrace script above to catch anything unexpected
  2. Convert remaining senders to pkill -H, so that a missed one fails safely instead of terminating the process
  3. Alert on those failures, so that stale senders get cleaned up rather than lingering indefinitely

This multi-layered approach helps us balance our production safety needs without producing long-term cruft in the form of unused signals.

pkill -H was released as part of procps-ng 4.0.3, which should be in most distributions now. While signals might be deeply entrenched in Unix and Linux, that doesn't mean we can't make them safer to use, and pkill -H is one step towards that goal. For new applications, I still recommend using more explicit IPC mechanisms when possible, but for the many systems where that's not practical, using pkill -H is a great way to eliminate this class of outage entirely.

Many thanks to Craig for reviewing the patch, and Diego and Michael for providing suggestions on this post.