tmpfs inode corruption: introducing inode64
A few months ago I got a report from a service owner that their service was crashing when trying to access certain files stored in tmpfs. This service provides a way to do things like builds on niche architectures and platforms, and as part of that it retrieves and stores a large cache of required files for the requisite job in
/dev/shm, preventing the need to then later fetch the same files for other jobs with the same dependencies.
As part of this lifecycle, this service stores a mapping from each file's inode number to its own metadata relevant to the data retrieved. Simple enough, and on the face of it this seems fine: generally the contract the kernel provides is that as long as the device, inode number, and generation are the same, this is guaranteed to point to the same data during the lifetime of the kernel.
This is made even simpler by the fact that tmpfs doesn't support inode generations –
```
% : > /dev/shm/test
% ./generation /dev/shm/test
ioctl: Inappropriate ioctl for device
```
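The `generation` helper used above isn't shown; a minimal sketch of such a tool (my reconstruction, not the original) just wraps the FS_IOC_GETVERSION ioctl:

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Return the inode generation for path, or -1 on error. tmpfs doesn't
 * implement FS_IOC_GETVERSION, so the ioctl fails with ENOTTY, which
 * perror() reports as "Inappropriate ioctl for device". */
int get_generation(const char *path)
{
    int fd, gen;

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    if (ioctl(fd, FS_IOC_GETVERSION, &gen) < 0) {
        perror("ioctl");
        close(fd);
        return -1;
    }
    close(fd);
    return gen;
}
```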
As such, that just leaves device and inode number to distinguish identity, and since we're always using /dev/shm as the base with no descendant mountpoints, storing the inode number alone should be enough.
Production says no
So far, everything sounds good, but it was not so in production. On machines with high uptimes, the team responsible for this service was seeing intermittent failures that appeared to be caused by inode number collisions:
```
[root@service ~]# cd /dev/shm/svc
[root@service svc]# stat ApplyTuple.h
  File: 'ApplyTuple.h'
  Size: 7270        Blocks: 16         IO Block: 4096
Device: 14h/20d     Inode: 3924537249  Links: 5
Access: (0555/-r-xr-xr-x)  Uid: (    0/    root)
Access: 2020-12-19 09:30:01.378353516 -0800
Modify: 2020-12-19 09:30:01.378353516 -0800
Change: 2020-12-19 09:30:01.378353516 -0800
 Birth: -
[root@service svc]# stat PriorityUnboundedQueueSet.h
  File: 'PriorityUnboundedQueueSet.h'
  Size: 3302        Blocks: 8          IO Block: 4096
Device: 14h/20d     Inode: 3924537249  Links: 3
Access: (0555/-r-xr-xr-x)  Uid: (    0/    root)
Access: 2020-12-19 09:29:53.007252542 -0800
Modify: 2020-12-19 09:27:23.370451978 -0800
Change: 2020-12-19 09:27:29.575526501 -0800
 Birth: -
[root@service svc]# cksum ApplyTuple.h PriorityUnboundedQueueSet.h
2624198482 7631 ApplyTuple.h
3677425165 5658 PriorityUnboundedQueueSet.h
```
Well, this is not good. Notice that the reported inode number for both of these files is 3924537249, but we not only have two clearly distinct inodes – for example, inodes store the modification times, and those don't line up – but also completely different data entirely. Clearly something has gone quite wrong.
Volatile inode numbers
On non-volatile filesystems, inode numbers are typically finite and often have some kind of implementation-defined semantic meaning (e.g. implying where the inode is stored on disk). Take for example ext4's bytes-per-inode ratio, which is configurable with
mkfs.ext4 -i. This behaviour means that it's typically very hard to get inode number collisions without significant filesystem corruption.
On the other hand, tmpfs doesn't have such limits, since it doesn't actually need to store inode data outside of memory, and thus doesn't need to worry about inode placement. In reality, a tmpfs inode is very bare bones, and the inode number (for example) is just taken from a global incrementing counter,
get_next_ino(). The code backing that counter is literally as simple as this:
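The kernel source block isn't reproduced here, but fs/inode.c's get_next_ino() boils down to something like the following userspace sketch (thread-local state standing in for per-CPU variables; the names match the kernel, the code is paraphrased rather than verbatim):

```c
#include <stdatomic.h>

#define LAST_INO_BATCH 1024

/* Each CPU (here: thread) has its own last_ino; the shared atomic
 * counter is only touched once per batch of 1024 allocations. Note
 * that everything is unsigned int, i.e. 32 bits wide on x86_64. */
static _Thread_local unsigned int last_ino;
static atomic_uint shared_last_ino;

unsigned int get_next_ino(void)
{
    unsigned int res = last_ino;

    /* On a batch boundary, reserve another 1024 numbers from the
     * shared counter: this is the only cross-CPU synchronisation. */
    if ((res & (LAST_INO_BATCH - 1)) == 0)
        res = atomic_fetch_add(&shared_last_ino, LAST_INO_BATCH);

    res++;
    /* Never hand out inode number 0. */
    if (!res)
        res++;
    last_ino = res;
    return res;
}
```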
What this says is essentially this:

- Each CPU has its own last_ino variable, with its own distinct value.
- To allocate an inode number, take the value of this CPU's last_ino and increment it.
- Assuming you're on a multiprocessing (SMP) system – which you are, because it's 2021 – try to do some batching so that CPUs don't have to synchronise too often. Each CPU gets a batch of 1024 slots, and then it has to come back to get another 1024, and so on.
Looks pretty simple, right? Unfortunately, there's something awry that you might miss if you just skim over the code and don't look closely at the function and variable types: they're all
unsigned int. On an x86_64 machine, an
unsigned int is 32 bits wide, and it's pretty trivial to exceed 2^32 increments of the get_next_ino() counter if you're creating a lot of files. We also don't try to reuse inode numbers from freed inodes (and we'll come back to that a little later).
If you're a kernel developer, you probably already know that the kernel actually has a special type for inode numbers,
ino_t, and on x86_64 machines, it's 64 bits wide. As such, a trivial fix might simply be to change
last_ino to be of type
ino_t. When I tested this, it appeared to work with no obvious ramifications, but there was a comment above which gave some pause:
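The comment in question (from fs/inode.c, quoted roughly from memory, so treat the wording as approximate):

```c
/*
 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
 * error if st_ino won't fit in target struct field. Use 32bit counter
 * here to attempt to avoid that.
 */
```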
So the problem is that if you have a 32-bit executable compiled without _FILE_OFFSET_BITS=64 on a system with a 64-bit wide ino_t, stat() will simply fail with EOVERFLOW for inode numbers greater than 2^32. As such, we need a compatibility option for those who need to preserve this legacy behaviour, while still allowing more modern systems to avoid this issue.
Intermediate storage issues
There is also one more problem: as I mentioned earlier, get_next_ino() is a global pool – it's also used for things like sockets, pipes, and other objects that are disconnected from physical storage on a backing filesystem. While the original issue was largely isolated to tmpfs, changing get_next_ino() as a whole meant that the ramifications would be much harder to predict.
For example, we have code in prod that tries to produce a mapping of sockets to a list of pids which use them. In order to do this, we use the netlink interface to query for sockets in a particular state we care about (say,
LISTEN), and then iterate /proc/pid/fd to find the process or processes which have this socket open.
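As a hypothetical sketch (not our production code), the /proc side of that mapping looks something like this:

```shell
# Print the pids of processes holding a socket with the given inode
# number: each /proc/<pid>/fd entry for a socket is a symlink whose
# target reads as "socket:[<ino>]".
pids_for_socket_inode() {
    local ino="$1" fd
    for fd in /proc/[0-9]*/fd/*; do
        if [ "$(readlink "$fd" 2>/dev/null)" = "socket:[$ino]" ]; then
            fd="${fd#/proc/}"
            printf '%s\n' "${fd%%/*}"
        fi
    done
}
```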
The net subsystem stores some of its information in its own structs for reference later. One of the things which it stores is the inode number for the socket:
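The code in question (paraphrased from net/ipv4/inet_diag.c's socket-fill path) does roughly:

```c
/* Fill in the netlink reply with the socket's inode number. */
r->idiag_inode = sock_i_ino(sk);
```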
If you just skim this, it looks fine. But what if we actually look at the type of the field it's stored in? struct inet_diag_msg explicitly stores the inode number (idiag_inode) as a 32-bit wide unsigned integer, resulting in a similar problem to the one described earlier with a 32-bit userland, but with no nice error handling this time. Instead, we simply overflow entirely. This means that this strategy no longer works, because stat() (using ino_t) and netlink (using __u32) no longer agree about the socket's inode number.
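To make the disagreement concrete, here's an illustrative (hypothetical) example of what __u32 storage does to an inode number past 2^32:

```c
#include <stdbool.h>
#include <stdint.h>

/* What a __u32 netlink field does to a 64-bit inode number: silent
 * truncation modulo 2^32, with no EOVERFLOW to warn anyone. */
uint32_t netlink_view_of(uint64_t ino)
{
    return (uint32_t)ino;
}

/* Compare the stat() view (full ino_t) with the netlink view. */
bool views_agree(uint64_t ino)
{
    return (uint64_t)netlink_view_of(ino) == ino;
}
```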
At this point, it became pretty clear that changing get_next_ino() itself was not going to work – there were too many callsites already depending on its behaviour, and there are even callsites like the example here that transcend into userspace. No, if this was to be fixed in a way that could be accepted upstream, one would have to change how tmpfs handles inode numbering itself, and move it away from get_next_ino() entirely.
After quite a bit of back and forth, my patches in "tmpfs: per-superblock i_ino support" and "tmpfs: support 64-bit inums per-sb" were merged into kernel 5.9.
In essence, the patches move tmpfs' inode allocation away from the get_next_ino() shared pool in favour of a per-superblock allocator, and add support for 64-bit inode numbers via the inode64 tmpfs mount option (or on by default by enabling CONFIG_TMPFS_INODE64 when compiling the kernel).
We now use ino_t everywhere, and for legacy users we simply emulate the old behaviour by manually wrapping around when we reach UINT_MAX. This also allows us to catch these cases when they happen and print a warning for the system administrator to act on, suggesting a move to inode64.
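As an illustrative sketch (names and details simplified, not the exact kernel code), the legacy non-inode64 path behaves like this:

```c
#include <limits.h>

/* Simplified model of the legacy (non-inode64) tmpfs behaviour: the
 * per-superblock counter is a full ino_t, but it wraps manually once
 * it would exceed UINT_MAX so that 32-bit stat() users keep working.
 * The real kernel also warns at the wrap point, suggesting inode64. */
unsigned long long next_ino_legacy(unsigned long long *next_ino)
{
    unsigned long long ino = (*next_ino)++;

    if (ino > UINT_MAX) {
        /* Wrap: restart numbering from 1, skipping inode 0. */
        *next_ino = 2;
        ino = 1;
    }
    return ino;
}
```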
Arch Linux also moved to enable this by default from kernel 5.9-ARCH onwards, which means this feature has been tested widely on real workloads. Debian has also followed suit for 5.14+. Thanks, Arch and Debian folks!
- This issue has likely been the root cause of a non-trivial number of unreproducible or transient failures in systems that depend on inode numbers being stable. Reproducing it is heavily dependent on system state: it requires a workload which practises heavy file creation in tmpfs, and it typically manifests as transient errors which go away when retried. As such, it has likely been a significant cause of hard-to-pin-down unreliability in such systems for quite some time.
- If you rely on tmpfs inode numbering in production, I strongly recommend mounting it with inode64 or compiling your prod kernels with CONFIG_TMPFS_INODE64 in order to get 64-bit inode numbering for tmpfs.
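In practice, that means either remounting with the option directly (`mount -o remount,inode64 /dev/shm`) or setting it persistently, e.g. in fstab (illustrative):

```
# /etc/fstab: give /dev/shm 64-bit inode numbers (kernel 5.9+)
tmpfs  /dev/shm  tmpfs  defaults,inode64  0  0
```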