Troubleshooting an out-of-memory problem
I’ve had a problem for a while with the server that runs both this site and Boxes and Arrows: it would suddenly race into an out-of-memory situation, and the so-called oom-killer would start shooting down processes seemingly at random, which would quickly take the server down. Meanwhile, the load would be so high that logging in to try to figure out what was going on was impossible. The only recourse was to reboot the server.
The problem was amplified by its irregularity. It would happen sometimes twice in a row, sometimes 2-3 weeks would pass between occurrences. The first time it ever happened was when I was vacationing on Italy’s Amalfi coast with no reception on my cell phone. That took all my sites down for 24 hours. Ouch.
It took me quite some time and effort to figure out the root of the problem, so I’d like to share how I did it and hopefully save someone else some cycles.
My first confusion was with the Linux cache. I noticed that when updatedb, the command that updates the locate database, was running, memory consumption would shoot up. A kind soul on #debian explained to me how to use the free command to tell cache apart from real memory use:

```
$ free -m
             total       used       free     shared    buffers     cached
Mem:          2027       1810        217          0        132       1157
-/+ buffers/cache:        520       1506
Swap:         1027         55        972
```
What this means is that the box has 2 GB of RAM, of which 1.8 GB is in use. But if you subtract the amount of RAM used for disposable caching, it turns out only 520 MB is actually used. That’s pretty healthy. So we’re not starved for RAM per se.
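That 520 MB figure can be recovered from the raw Mem: line by subtracting buffers and cached from used. Here’s a minimal sketch using the sample numbers above; the awk field positions assume the older procps layout shown (newer versions of free merge buffers and cache into a single buff/cache column):

```shell
# Recompute "used minus buffers/cache" from the Mem: line.
# Field layout assumed: total used free shared buffers cached.
sample='Mem:          2027       1810        217          0        132       1157'
echo "$sample" | awk '/^Mem:/ { print $3 - $6 - $7 }'
# 1810 - 132 - 1157 = 521, matching free's 520 up to per-column MB rounding.
```

On a live box you would pipe `free -m` in instead of the sample line.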
Next, he suggested I set up a cron job to monitor RAM consumption, so I could go back in afterwards and find the bad process. At first, however, I set it up to log only at 5-minute intervals. That turned out to be too coarse; it completely missed the incident (though I couldn’t know that for a fact at the time). I then changed it to log every minute.
Here’s the crontab entry:
```
*/1 * * * * root /usr/local/bin/memstat
```
And here’s the script:
```
date >> /var/lib/memstat/memstat.log
vmstat >> /var/lib/memstat/memstat.log
ps ax --cols 200 -o size,pid,start,user,rss,resident,share,vsize,command O-s | head >> /var/lib/memstat/memstat.log
free -m >> /var/lib/memstat/memstat.log
```
I also logged a couple other things, taking a shotgun approach, but the above turned out to be the essential part.
Now the next time it happened, I could actually see the process swelling. One minute it was normal, the next it was huge, and the next again, it was gone, killed by the oom-killer. That’s how fast it went. Amazing.
It turned out to be one of my application processes. Good, that meant it was likely to be something I had some control over. The next question, of course, was finding out what caused it. My theory was that it was probably triggered by some request, given that there were no signs of a leak, and that it happened so quickly and so irregularly.
What I did in the end was correlate with the server logs. That’s where the date command in my memstat script came in handy. Based on the timestamp, I could find the same 2-minute window in my application log file and see what was going on, most notably which requests were coming in.
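Pulling out that window is a one-line grep once you have the timestamp. A sketch, with a made-up log format and a made-up 03:41–03:43 window (your application’s log path and timestamp format will differ):

```shell
#!/bin/sh
# Build a tiny sample application log to demonstrate the window grep.
cat > /tmp/sample_app.log <<'EOF'
Jul 14 03:40:12 Completed in 120ms
Jul 14 03:41:55 Processing SearchController#index
Jul 14 03:44:02 Completed in 95ms
EOF
# Extract everything logged between 03:41 and 03:43 inclusive.
grep -E ' 03:4[123]:' /tmp/sample_app.log
```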
My first mistake was to look at completed requests. That was dumb, because the offending request was precisely the one likely never to finish. The second time around, I looked at requests that never finished. Bingo!
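One way to spot a request that never finished: in a log where each process logs its requests serially, an unfinished request is a "started" line with no "completed" line before the next "started" line. A sketch with awk, assuming a hypothetical Rails-style log format (Processing/Completed line prefixes are an illustration, not my actual log):

```shell
#!/bin/sh
# Sample log: the second search request never logs a Completed line.
cat > /tmp/requests.log <<'EOF'
Processing SearchController#index (q=foo)
Completed in 80ms
Processing SearchController#index (q=bar & baz)
Processing PostsController#show
Completed in 40ms
EOF
# Print every Processing line that wasn't followed by a Completed line.
awk '
  /^Processing/ { if (pending != "") print pending; pending = $0 }
  /^Completed/  { pending = "" }
  END           { if (pending != "") print pending }
' /tmp/requests.log
```

This only prints the abandoned search request, the one with the ampersand in its query string.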
It turned out to be when people did a search on Boxes and Arrows with a query string that included an ampersand (&) character. For some reason, that caused Ferret, our Lucene-like search engine, to start spinning indefinitely, gobbling up all memory in the process. Just to be sure, I tried running the query again with my hand on the kill -9 trigger, and indeed, the process started swelling quickly.
I’m so glad to have solved this, and I feel much more confident going into my next problem. For a long time I wasn’t sure if it might actually be a hardware or a kernel problem, but it feels good in the end to have pinpointed the exact problem, and to know that it’s fixed. I can now sleep more relaxed at night. I still keep my cell phone on and not on silent in case I get an SMS notification in the middle of the night, but they just got a lot less likely.