"1195725856" and other mysterious numbers
Last week was the final week for this half's performance review at Facebook, where we write summaries of work and impact we and our peers had over the last half year. Naturally, that can only mean one thing: the entire company trends towards peak levels of procrastination, doing literally anything and everything to avoid the unspeakable horror of having to write a few paragraphs of text.
My personal distraction of choice a few days before the deadline was looking at lines like this, spamming from some hosts serving NFS traffic:
RPC: fragment too large: 1195725856 RPC: fragment too large: 1212498244
Let's take a look at the kernel code responsible for generating this warning –
grepping for "fragment too large" show it comes from
So we're erroring out because we got passed some message which is beyond
sv_max_mesg, cool. But where does this come from? Looking at
svc_sock_reclen shows the following:
ntohl converts a uint from network byte ordering to the host's byte ordering.
RPC_FRAGMENT_SIZE_MASK results in only some of the
data being retained, and looking at the definition show us how many bits that
Ok, so we will only keep the first 31 bits and zero out the rest, since
That means that these numbers come from the first four bytes of the fragment,
omitting the final highest bit, which is reserved to record whether the
fragment is the last one for this record (see
The fact that the error happens so early in fragment parsing in particular got
me thinking that the fragment may not be protocol-confirming in the first
place, since it's not like we got very far in processing at all, not even past
the first four bytes. So what are these first four bytes, then? Looking at
the numbers in hex shows something interesting:
These are all really tightly clustered, generally from 0x40 to 0x50, which
implies there might actually be some semantic meaning per-byte. And since these
char-sized, here's a guess about what might be encoded in them…
Oh dear. Somebody is sending HTTP requests to NFS RPC, but at least we are outright rejecting the fragments instead of actually allocating/dirtying a gigabyte of memory.
Next up was finding out who's actually sending these requests.
shows NFS is listening on the default port, 2049, so we can set up a trap with
tcpdump like so:
tcpdump -i any -w trap.pcap dst port 2049 # ...wait for logs to appear again, then ^C... tcpdump -qX -r trap.pcap | less +/HEAD
From here, it was pretty easy to catch the origin of these requests by tracing back to the origin host and service using the captured pcap data. After that one can coordinate with the team responsible to work out what's actually going on here, and avoid these errant packets being sent out in the first place. As a bonus, you also get to learn more about parts of infrastructure you might otherwise not interact with, which is always cool. :-)
Funnily enough, if you Google for these numbers you can find tons of threads
with people encountering them in the wild. Maybe we should start printing ASCII
in future in some of the error paths hit when all character values are between
0x0 and 0x7F, I'm sure it would help a lot of people realise what's going on
much more quickly. Maybe I'll send a patch upstream to do that in
svc_tcp_recv_record and a few other places in the kernel that directly parse
the first few data bytes from packets as an integer, let's see.
Here's a trivial script that can generate a bunch of other integers for HTTP that might be of interest:
And the results:
As expected, if you Google for most of these numbers, you can find an endless supply of questions mentioning them in error messages (some previously unidentified examples which I replied to: 1, 2, 3).
Hopefully this post will help people find their real problem – using the wrong protocol – more quickly in future. In particular, if your affected application crashed due to ENOMEM/"Out of memory", please especially consider submitting a patch to clamp the size to some reasonable maximum :-)