tmpfs inode corruption: introducing inode64
A few months ago I got a report from a service owner that their service was crashing when trying to access certain files stored in tmpfs. This service provides a way to do things like builds on niche architectures and platforms, and as part of that it retrieves and stores a large cache of required files for the requisite job in /dev/shm, avoiding the need to fetch the same files again later for other jobs with the same dependencies.
As part of this lifecycle, this service stores a mapping from each file's inode number to its own metadata about the retrieved data. Simple enough, and on the face of it this seems fine: generally, the contract the kernel provides is that, as long as the device, inode number, and generation are the same, they are guaranteed to refer to the same data during the lifetime of the kernel.
This is made even simpler by the fact that tmpfs doesn't support inode generations – ioctl(FS_IOC_GETVERSION) returns ENOTTY:
% : > /dev/shm/test
% ./generation /dev/shm/test
ioctl: Inappropriate ioctl for device
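The generation helper used above isn't reproduced here, but a minimal sketch of what such a program might look like is (treat the details as illustrative):

/* generation.c: ask the kernel for a file's inode generation. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>

int main(int argc, char *argv[])
{
	int fd, generation;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* tmpfs has no generation support, so this fails with ENOTTY. */
	if (ioctl(fd, FS_IOC_GETVERSION, &generation) < 0) {
		perror("ioctl");
		return 1;
	}

	printf("%d\n", generation);
	return 0;
}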
As such, that just leaves device and inode number to distinguish identity, and since we're always using /dev/shm as the base with no descendant mountpoints, storing the inode number alone should be enough.
Production says no
So far, everything sounds good, but it was not so in production. On machines with high uptimes, the team responsible for this service was seeing intermittent failures that appeared to be caused by inode number collisions:
[root@service ~]# cd /dev/shm/svc
[root@service svc]# stat ApplyTuple.h
File: 'ApplyTuple.h'
Size: 7270 Blocks: 16 IO Block: 4096
Device: 14h/20d Inode: 3924537249 Links: 5
Access: (0555/-r-xr-xr-x) Uid: ( 0/ root)
Access: 2020-12-19 09:30:01.378353516 -0800
Modify: 2020-12-19 09:30:01.378353516 -0800
Change: 2020-12-19 09:30:01.378353516 -0800
Birth: -
[root@service svc]# stat PriorityUnboundedQueueSet.h
File: 'PriorityUnboundedQueueSet.h'
Size: 3302 Blocks: 8 IO Block: 4096
Device: 14h/20d Inode: 3924537249 Links: 3
Access: (0555/-r-xr-xr-x) Uid: ( 0/ root)
Access: 2020-12-19 09:29:53.007252542 -0800
Modify: 2020-12-19 09:27:23.370451978 -0800
Change: 2020-12-19 09:27:29.575526501 -0800
Birth: -
[root@service svc]# cksum ApplyTuple.h PriorityUnboundedQueueSet.h
2624198482 7270 ApplyTuple.h
3677425165 3302 PriorityUnboundedQueueSet.h
Well, this is not good. Notice that the reported inode number for both of these files is 3924537249, but we not only have two clearly distinct inodes – for example, inodes store the modification times, and those don't line up – but also completely different data. Clearly something has gone quite wrong.
Volatile inode numbers
On non-volatile filesystems, inode numbers are typically finite and often have some kind of implementation-defined semantic meaning (e.g. implying where the inode is located on disk). Take for example ext4's bytes-per-inode ratio, which is configurable with mkfs.ext4 -i. This behaviour means that it's typically very hard to get inode number collisions without significant filesystem corruption.
On the other hand, tmpfs doesn't have such limits, since it doesn't actually need to store inode data outside of memory, and thus doesn't need to worry about inode placement. In reality, a tmpfs inode is very bare bones, and the inode number (for example) is just taken from a global incrementing counter, get_next_ino(). The code backing that counter – here as it appears in fs/inode.c – is literally as simple as this:
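#define LAST_INO_BATCH 1024
static DEFINE_PER_CPU(unsigned int, last_ino);

unsigned int get_next_ino(void)
{
	unsigned int *p = &get_cpu_var(last_ino);
	unsigned int res = *p;

#ifdef CONFIG_SMP
	/* Refill this CPU's batch of 1024 numbers from the shared counter. */
	if (unlikely((res & (LAST_INO_BATCH - 1)) == 0)) {
		static atomic_t shared_last_ino;
		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);

		res = next - LAST_INO_BATCH;
	}
#endif

	res++;
	/* get_next_ino should not provide a 0 inode number */
	if (unlikely(!res))
		res++;
	*p = res;
	put_cpu_var(last_ino);
	return res;
}
EXPORT_SYMBOL(get_next_ino);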
What this says is essentially this:
- Each CPU has its own last_ino variable, with its own distinct value.
- Get the value of this CPU's last_ino variable.
- Assuming you're on a multiprocessing (SMP) system – which you are, because it's 2021 – try to do some batching so that CPUs don't have to synchronise too often. Each CPU gets a batch of 1024 slots, and then it has to come back to get another 1024, and so on.
Looks pretty simple, right? Unfortunately, there's something awry that you might miss if you just skim over the code and don't look closely at the function and variable types: they're all unsigned int. On an x86_64 machine, an unsigned int is 32 bits wide, and it's pretty trivial to exceed 2^32 increments of the get_next_ino() counter if you're creating a lot of files. We also don't try to reuse inode numbers from freed inodes (we'll come back to that a little later).
If you're a kernel developer, you probably already know that the kernel actually has a special type for inode numbers, ino_t, and on x86_64 machines it's 64 bits wide. As such, a trivial fix might simply be to change get_next_ino and last_ino to be of type ino_t. When I tested this, it appeared to work with no obvious ramifications, but there was a comment above it in fs/inode.c which gave some pause:
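/*
 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
 * error if st_ino won't fit in target struct field. Use 32bit counter
 * here to attempt to avoid that.
 */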
So the problem is that, if you have a 32-bit executable compiled without _FILE_OFFSET_BITS=64 on a system with a 64-bit wide ino_t, glibc's stat() will simply fail with EOVERFLOW for inode numbers that don't fit in 32 bits. As such, we need a support option for those who need to preserve this legacy behaviour, while still allowing more modern systems to avoid the issue.
Intermediate storage issues
There is also one more problem: I mentioned earlier that get_next_ino() is a global pool – we also use it for things like sockets, pipes, and other things that are disconnected from physical storage on the backing filesystem. While the original issue was largely isolated to tmpfs, changing get_next_ino() as a whole meant that some ramifications would be harder to predict.
For example, we have code in prod that tries to produce a mapping of sockets to a list of pids which use them. In order to do this, we use the netlink interface to query for sockets in a particular state we care about (say, LISTEN), and then iterate over /proc/pid/fd to find the process or processes which have this socket open.
The net subsystem stores some of its information in its own structs for reference later. One of the things which it stores is the inode number for the socket – the relevant assignment in net/ipv4/inet_diag.c looks roughly like this (heavily trimmed):
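static int inet_sk_diag_fill(struct sock *sk, /* ... */)
{
	struct inet_diag_msg *r = nlmsg_data(nlh);	/* nlh: the netlink header */

	/* ... */
	r->idiag_uid = from_kuid_munged(user_ns, sock_i_uid(sk));
	r->idiag_inode = sock_i_ino(sk);
	/* ... */
}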
If you just skim this, it looks fine. But what about if we actually look at the type of idiag_inode in struct inet_diag_msg, as defined in include/uapi/linux/inet_diag.h?
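struct inet_diag_msg {
	__u8	idiag_family;
	__u8	idiag_state;
	__u8	idiag_timer;
	__u8	idiag_retrans;

	struct inet_diag_sockid id;

	__u32	idiag_expires;
	__u32	idiag_rqueue;
	__u32	idiag_wqueue;
	__u32	idiag_uid;
	__u32	idiag_inode;	/* the socket's inode number, squeezed into 32 bits */
};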
…oh dear. struct inet_diag_msg explicitly stores the inode number as a 32-bit wide unsigned integer, resulting in a similar problem to the one described earlier with a 32-bit userland, but with no nice error handling this time – we simply overflow entirely. This means that this code strategy no longer works, because stat() (using ino_t) and netlink (using __u32) no longer agree about the socket's inode number:
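Here's a contrived, self-contained illustration of that disagreement – the inode number is hypothetical, but the truncation is exactly what happens:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* A hypothetical inode number just past 2^32, as stat()'s ino_t sees it. */
	uint64_t st_ino = (1ULL << 32) + 42;

	/* The same number after being stored in inet_diag_msg's __u32 field. */
	uint32_t idiag_inode = (uint32_t)st_ino;

	printf("stat() says:  %llu\n", (unsigned long long)st_ino);
	printf("netlink says: %u\n", idiag_inode);	/* prints 42 */
	return 0;
}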
At this point, it became pretty clear that changing get_next_ino itself was not going to work – there were too many callsites already depending on its behaviour, and there are even callsites like the example here whose effects extend into userspace. No, if this was to be fixed in a way that could be accepted upstream, one would have to change how tmpfs handles inode numbering itself, and get it away from get_next_ino() entirely.
Resolution
After quite a bit of back and forth, my patches in "tmpfs: per-superblock i_ino support" and "tmpfs: support 64-bit inums per-sb" were merged into kernel 5.9.
In essence, the patches move tmpfs' inode allocation away from the get_next_ino shared pool, in favour of our own superblock-based system, and add support for 64-bit inodes using the inode64 tmpfs mount option (or allow having it on by default by enabling CONFIG_TMPFS_INODE64 when compiling the kernel).
We now use ino_t everywhere, and for legacy users we simply emulate the old behaviour by doing manual wraparounds when we reach UINT_MAX. This also allows us to catch these cases when they happen and print a warning for the system administrator to act on, suggesting a move to inode64.
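A simplified sketch of the idea – this is not the exact mm/shmem.c code, and the naming and locking are elided:

/* Per-superblock state, standing in for the real struct shmem_sb_info. */
struct tmpfs_sb_sketch {
	ino_t next_ino;		/* per-sb counter, starts at 1; locked in the real code */
	bool full_inums;	/* true when mounted with inode64 */
};

static ino_t tmpfs_next_ino_sketch(struct tmpfs_sb_sketch *sbinfo)
{
	ino_t ino = sbinfo->next_ino++;

	if (!sbinfo->full_inums && unlikely(ino > UINT_MAX)) {
		/*
		 * Legacy mode: warn the administrator that inode numbers have
		 * wrapped, then emulate the old get_next_ino() behaviour.
		 */
		pr_warn("inode number overflow on tmpfs, consider inode64\n");
		sbinfo->next_ino = 1;
		ino = sbinfo->next_ino++;
	}
	return ino;
}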
Arch Linux moved to enable this by default from kernel 5.9-ARCH onwards, which means this feature has been tested widely on real workloads. Debian has also followed suit for 5.14+. Thanks, Arch and Debian folks!
In conclusion:
- This issue has likely been the root cause of a non-trivial number of unreproducible or transient failures in systems that depend on inode numbering. Reproducing it is heavily dependent on system state: it requires a workload which practises heavy file creation in tmpfs, and it typically manifests as transient errors which go away when retried. As such, it has probably been a hard-to-pin-down source of unreliability in such systems for quite some time.
- If you rely on tmpfs inode numbering in production, I strongly recommend mounting it with inode64 (e.g. mount -o remount,inode64 /dev/shm) or compiling your prod kernels with CONFIG_TMPFS_INODE64 in order to get 64-bit inode numbering for tmpfs.