Troubleshooting an out-of-memory problem

I’ve had a problem for a while with the server that runs both this site and Boxes and Arrows: it would suddenly race into an out-of-memory situation, and the so-called oom-killer would start shooting down processes seemingly at random, quickly taking the server down. Meanwhile the load would be so high that logging in to figure out what was going on was impossible. The only recourse was to reboot the server.



The problem was amplified by its irregularity. Sometimes it would happen twice in a row; sometimes 2-3 weeks would pass between occurrences. The first time it ever happened was while I was vacationing on Italy’s Amalfi coast with no reception on my cell phone. That took all my sites down for 24 hours. Ouch.



It took me quite some time and effort to figure out the root of the problem, so I’d like to share how I did it and hopefully save someone else some cycles.



My first confusion was with the Linux cache. I noticed that when updatedb, the command that updates the locate database, was running, memory consumption would shoot up. A kind soul on #debian explained to me how to use the free command:



$ free -m
             total       used       free     shared    buffers     cached
Mem:          2027       1810        217          0        132       1157
-/+ buffers/cache:        520       1506
Swap:         1027         55        972


What this means is that the box has 2 GB of RAM, of which 1.8 GB is used. But if you subtract the amount of RAM used for disposable caching (buffers and cache, which the kernel gives back the moment applications actually need the memory), only about 520 MB is used. That’s pretty healthy. So we’re not starved for RAM per se.
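The arithmetic behind that second line can be checked by hand: subtract the buffers and cached columns from the used column. A quick sketch against the sample output above (the one-megabyte discrepancy versus free’s own 520 is just rounding from the -m flag):

```shell
# Recompute the "-/+ buffers/cache" used figure from the Mem: row
# of the sample output above.
# Columns after "Mem:": total used free shared buffers cached
echo 'Mem: 2027 1810 217 0 132 1157' |
  awk '{ printf "really used: %d MB\n", $3 - $6 - $7 }'
# → really used: 521 MB
```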



Next, he suggested I set up a cron job to monitor RAM consumption, so I could go back afterwards and find the bad process. At first, however, I set it up to log only at 5-minute intervals. That turned out to be too coarse: it completely missed the incident (though I couldn’t know that for a fact). I then changed it to log every minute.



Here’s the crontab entry:



*/1 * * * *     root /usr/local/bin/memstat


And here’s the script:



#!/bin/sh
# Append a timestamped snapshot of memory usage to the log.
LOG=/var/lib/memstat/memstat.log

date >> $LOG
vmstat >> $LOG
# Top 10 processes, sorted by size (descending), with memory columns
ps ax --cols 200 -o size,pid,start,user,rss,resident,share,vsize,command O-s | head >> $LOG
free -m >> $LOG


I also logged a couple other things, taking a shotgun approach, but the above turned out to be the essential part.



Now the next time it happened, I could actually see the process swelling. One minute it was normal, the next it was huge, and the minute after that it was gone, killed by the oom-killer. That’s how fast it went. Amazing.



It turned out to be one of my application processes. Good: that meant it was likely something I had some control over. The next question, of course, was what caused it. My theory was that it was probably triggered by some particular request, given that there were no signs of a gradual leak, and that it happened so quickly and so irregularly.



What I did in the end was correlate with the server logs. That’s where the date command in my memstat script came in handy. Based on the timestamp I could find that same 2-minute window in my application log file and see what was going on, most notably which requests were coming in.
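Mechanically, that correlation is just a grep for the minutes in question. A sketch, with an invented log format and invented timestamps standing in for the real application log:

```shell
# Create a tiny fake application log; the timestamp format and
# request paths are made up for illustration.
cat > /tmp/app.log <<'EOF'
2007-06-14 03:41:58 Completed /articles/42
2007-06-14 03:42:10 Started /search?q=foo
2007-06-14 03:43:05 Completed /about
EOF

# Pull out the 2-minute window 03:42-03:43 that memstat flagged.
grep -E '^2007-06-14 03:4[23]' /tmp/app.log
```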



My first mistake was to look at completed requests. That was dumb, because the offending request was precisely the one likely never to finish. Second time around, I looked at requests that never finished. Bingo!
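One way to mechanize “requests that never finished”, assuming a single-process log where each request writes a Processing line and, if it succeeds, a Completed line (a Rails-style convention; the exact lines below are invented): any Processing line with no Completed line before the next request is a suspect.

```shell
# Fake single-process log; the second request never completed.
cat > /tmp/app.log <<'EOF'
Processing ArticlesController#show (for 1.2.3.4)
Completed in 0.12000 (8 reqs/sec)
Processing SearchController#index (for 5.6.7.8)
EOF

# Print every request that was started but never completed.
awk '/^Processing/ { if (open != "") print open; open = $0 }
     /^Completed/  { open = "" }
     END           { if (open != "") print open }' /tmp/app.log
```

Note this only works when requests are logged serially; with several server processes interleaving writes into one log, you’d need per-process log files or request identifiers.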



It turned out to be when people did a search on Boxes and Arrows with a query string that included an ampersand (&) character. For some reason, that caused Ferret, our Lucene-like search engine, to start spinning indefinitely, gobbling up all memory in the process. Just to be sure, I tried running the query again, with my hand on the “kill -9” trigger, and indeed, the process started swelling quickly.
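A sketch of a somewhat safer way to reproduce this sort of thing than keeping a hand on the “kill -9” trigger: cap the test process’s address space with ulimit -v so a runaway allocation fails with an out-of-memory error inside that one process instead of dragging the whole box into the oom-killer. Here a deliberately oversized Perl string allocation stands in for the runaway search process:

```shell
# Limit virtual memory to ~100 MB in a subshell, then attempt a
# ~500 MB allocation; the stand-in process dies instead of the box.
( ulimit -v 102400
  perl -e '$x = "x" x (500 * 1024 * 1024)' ) 2>/dev/null \
  && echo "allocation succeeded" || echo "allocation blocked"
```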



I’m so glad to have solved this, and I feel much more confident going into my next problem. For a long time I wasn’t sure if it might actually be a hardware or a kernel problem, but it feels good in the end to have pinpointed the exact problem, and to know that it’s fixed. I can now sleep more relaxed at night. I still keep my cell phone on and not on silent in case I get an SMS notification in the middle of the night, but they just got a lot less likely.

2 comments

Hey Lars, you will find a very good series of utilities called "sysstat" [ http://perso.orange.fr/sebastien.godard/index.html ] for exactly this type of debugging.
sysstat has a whole bunch of reporting about system usage with sar [ http://perso.orange.fr/sebastien.godard/man_sar.html ], and one mode in particular reports per-process stats if you know the PID, or for ALL processes (with some limits). Look for the "-x { pid | SELF | ALL }" line in the above man page. Of course, if these things happen in between updates from the sar data file, then you might not be able to use it as a proper diagnostic tool. Maybe for your specific problem it was easier to do some detective work like you did, but if you want to see what the CPU usage looked like every 10 minutes before a crash, or mem swapping or even more details, sysstat is very useful. A good Swiss-army knife for sysadmins :)
