Linux memory management at scale
(This post is also available in Japanese.)
As part of my work on the cgroup2 project, I spend a lot of time talking with engineers about controlling resources across Linux systems. One thing that has become increasingly clear to me through these conversations is that many engineers – and even senior SREs – hold a number of common misconceptions about Linux memory management, and this may prevent the services and systems they support from running as reliably or efficiently as they could.
As such, I wrote a talk that goes into some of these misconceptions, explaining why things are more nuanced than they might seem when it comes to memory. I also go over how to compose more reliable and scalable systems using this knowledge, how we manage systems within Facebook, and how you can apply the same ideas to improve your own systems.
I had the privilege of presenting this talk at SREcon, and I hope you'll find it useful. Please feel free to e-mail me with any questions or comments.
Key timestamps
I recommend watching the whole talk, since each section helps set up the next, but here are some key takeaways:
- 2:18: Resource control is important: you need it for both reliability and efficiency
- 6:34: Limiting just one resource in isolation may actually make things worse
- 7:28: Resource control is much more complicated than it seems
- 12:56: Being "reclaimable" isn't a guarantee, caches and buffers don't act like free memory, even though many people think they do
- 14:54: We measure RSS and pretend it's meaningful because it's easy to measure, not because it measures anything useful
- 16:12: Swap matters, even on machines with huge amounts of memory
- 18:59: The OOM killer is often not your friend in an OOM situation, and probably doesn't work in the way you expect
- 22:10: Different types of memory reclaim and why they matter
- 25:05: How to know if a system is running out of memory (you can't just look at MemAvailable or MemFree + Buffers + Cached; see the sketch after this list)
- 29:30: How we detect emerging OOMs before the OOM killer
- 30:49: Determining a usable metric for I/O resource isolation
- 34:42: Limiting things generally doesn't work well, so let's create protections instead
- 37:21: Putting all of these primitives together to create an efficient, high availability system
- 46:09: Results from Facebook production
- 48:03: Using some of these new concepts to help improve Android
- 48:53: How to practically make use of the advice in this talk yourself
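To make the point at 25:05 more concrete, here is a minimal sketch of reading one signal of this kind: the kernel's pressure stall information (PSI) exposed at /proc/pressure/memory, which reports how much time tasks spend stalled waiting on memory rather than how many bytes happen to be "free". This is only an illustration, not the tooling we run in production; it assumes a kernel with PSI support (4.20+ with CONFIG_PSI enabled), and the 10% threshold is an arbitrary example value, not a recommendation.

```python
#!/usr/bin/env python3
"""Sketch: gauge memory health from pressure stalls instead of MemFree/MemAvailable.

Assumes a kernel with PSI support (4.20+, CONFIG_PSI=y). The threshold below
is purely illustrative, not a recommended production value.
"""

PSI_MEMORY = "/proc/pressure/memory"


def read_memory_pressure(path=PSI_MEMORY):
    """Parse /proc/pressure/memory into {"some": {...}, "full": {...}}.

    Each inner dict maps avg10/avg60/avg300 to a float percentage of time
    spent stalled on memory over that window.
    """
    pressure = {}
    with open(path) as f:
        for line in f:
            kind, *fields = line.split()
            values = dict(field.split("=") for field in fields)
            pressure[kind] = {k: float(v) for k, v in values.items() if k != "total"}
    return pressure


if __name__ == "__main__":
    p = read_memory_pressure()
    # "some": at least one task was stalled on memory during the window.
    # "full": all non-idle tasks were stalled at once, i.e. pure lost time.
    some10 = p["some"]["avg10"]
    full10 = p.get("full", {}).get("avg10", 0.0)
    print(f"memory pressure over last 10s: some={some10}% full={full10}%")
    if full10 > 10.0:  # arbitrary example threshold
        print("sustained full memory pressure: the workload is likely thrashing")
```

A machine can show very little "free" memory and be perfectly healthy, or show plenty of MemAvailable while already thrashing; a stall-based signal like this reflects whether the workload is actually paying for memory contention, which is why it is a better starting point than the meminfo arithmetic people usually reach for.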