This problem was certainly irritating! I came to work one day and found out that one of the main servers for a lab had crashed without explanation the day before, leading to temporary chaos. The only pointers were some kernel segfaults recorded in /var/log/messages. However, the system had crashed at around 14:40, while the last segfault was at 13:55 or thereabouts (I am writing this part from memory, and don’t feel like going back to the server logs right now).
I looked at the sar output, and figured out that the kernel segfaults were, in fact, quite possibly the culprit. Right around the time of the segfaults, memory usage started climbing rapidly, and eventually page faults shot up like crazy too. I would like to quote figures, but since I would have to bring up the stats again, and I am kind of a lazy guy, I prefer not to right now.
The kernel logs did show the program that was responsible: a.out. Fortunately, there was only one recent a.out on the server (it has more users than developers), and I located the author of the buggy program. His program had a memory leak; he kept running it, it would segfault, and he would be left wondering why the server crashed. I explained, and his supervisor came in, asking for a memory monitor that would kill processes hogging more than a certain amount of memory.
I preferred the other solution: having them write a wrapper around their test programs to make sure memory leaks were handled properly (on top of that, I hate having server-crashing work run on production servers). However, since I was pressured a bit too much, and I would hate to have the server crash again, I googled for scripts to do that. I couldn’t find any (and I am very sure there is a very good reason for that!!!).
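For what it's worth, here is a minimal sketch of what such a wrapper could look like. It is not what the developers ended up using; it assumes Linux and Python 3, and the function name `run_with_memory_limit` is made up for illustration. The idea is to cap the child's address space with `resource.setrlimit` so that a leaking program gets an allocation failure instead of dragging the whole server down.

```python
import resource
import subprocess
import sys


def run_with_memory_limit(argv, limit_bytes):
    """Run argv as a child process with its address space capped.

    Hypothetical helper: the soft RLIMIT_AS limit is lowered in the child
    right before exec, so only the test program is affected.
    """
    def apply_limit():
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        # Never ask for more than the existing hard limit allows.
        if hard == resource.RLIM_INFINITY:
            new_soft = limit_bytes
        else:
            new_soft = min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))

    completed = subprocess.run(argv, preexec_fn=apply_limit)
    return completed.returncode
```

Something like `run_with_memory_limit(["./a.out"], 1024**3)` would then run the test program with a 1 GiB cap; a leaking program should fail its allocations and exit on its own long before the machine starts swapping.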
Which led me to write my own Python script to do that. It might be a bit bloated for most purposes: it does logging and sends you email too… but I have not configured it to run in daemon mode yet (so you have to run it in the background using nohup). Here goes:
#!/usr/bin/python
#Prajjwal Devkota
#March 18, 2008
#Script to kill processes that hog memory usage

#Feature wishlist (TODO):
#Trapping kill signals before exiting, and appropriately logging them (perhaps leave the KILL signal untrapped, but at least the softer kills should be trapped, and the -k and -r options in the program should call the softer KILL options)
#Throttling for messages about processes above ALERT_THRESHOLD but below KILL_THRESHOLD
#Email functionality:
#    Message specific subject lines
#    Options to turn emailing on/off
#    Options to send email only for specified types of log messages
#Startup/Shutdown/Reconfiguration options
#Built in daemon functionality: currently, you have to start it in 'daemon mode' by typing 'nohup mem_usage_monitor.py &'
#Options for killing existing processes, and possibly, even doing a 'soft restart'

import os
import commands
import time
import sys
import socket
import getopt

#programs and paths
SENDMAIL = "/usr/sbin/sendmail"

#Files
PID_FILE="/var/run/mem_usage_monitor.pid"
PROC_MEM_FILE="/proc/meminfo"
LOG_FILE="/var/log/mem_usage_monitor.log"

#No email functionality as of now, but keeping it for when I add email notifications of misbehaving and killed processes
MY_EMAIL="mem_monitor@"+socket.gethostname()
#Admin Email
ADMIN_EMAIL="your_email@your.domain.com"
MAIL_SUBJECT="Alert from Memory Monitor on "+socket.gethostname()

#Alert if memory usage above 15% -- just in case multiple processes run, they have low priorities at least-- might help things a bit (though I doubt they will help much!)
ALERT_PCNT_THRESHOLD=15
#Kill if memory usage above 90%
KILL_PCNT_THRESHOLD=90

minute=60 #seconds
hours=minute*60

#check every 5 seconds
SLEEP_TIME=5

#End of settings... on to code
#Modify below this line only if you know what you are doing

def read_memory_total():
    exists=os.path.exists(PROC_MEM_FILE)
    if exists:
        memfile=open(PROC_MEM_FILE)
        memtotalline=memfile.readline().strip()
        memfile.close()
        if (memtotalline.split()[0]=="MemTotal:"):
            return int(memtotalline.split()[1])
        else:
            print PROC_MEM_FILE+" format seems to be different than what I expected!\n...exiting"
            print "First line encountered:\n"+memtotalline+"\n"
            sys.exit(2)
    else:
        print "Could not find "+PROC_MEM_FILE+"\n...exiting"
        sys.exit(3)

def check_if_already_running():
    exists=os.path.exists(PID_FILE)
    if exists:
        pidhandle=open(PID_FILE)
        daemon_pid=pidhandle.readline().strip()
        pidhandle.close()
        ps_output=commands.getoutput('ps -p '+daemon_pid).splitlines()
        #Only the ps header line came back, so the old pid is gone
        if (len(ps_output)==1):
            exists=False
            print "pid file found, but pid "+daemon_pid+" not running"
        else:
            if (restart_existing):
                commands.getoutput('kill ' + daemon_pid)
                exists=False
            else:
                print "process already seems to be running with pid of "+daemon_pid
                print "Use '-r' switch to kill existing process and start new one"
    else:
        print "No pid file found"
    return exists

def opt_parse(arguments):
    global restart_existing
    try:
        #opts, args = getopt.getopt(sys.argv[1:], "ho:v", ["help", "output="])
        opts, args = getopt.getopt(arguments, "r", ["restart"])
    except getopt.GetoptError, err:
        # print help information and exit:
        print str(err) # will print something like "option -a not recognized"
        sys.exit(2)
    for option, argument in opts:
        #Accept the long form too, since getopt is told about "restart" above
        if option in ("-r", "--restart"):
            restart_existing = True
        else:
            assert False, "unhandled option"

def email_admins(line):
    global SENDMAIL,MY_EMAIL,ADMIN_EMAIL
    # open a pipe to the mail program and write the data to the pipe
    mailpipe = os.popen("%s -t" % SENDMAIL, 'w')
    header="From: "+MY_EMAIL+"\nTo: "+ADMIN_EMAIL+"\nReturn-Path: "+MY_EMAIL+"\nSubject: "+MAIL_SUBJECT+"\n\n"
    mailpipe.write(header+line)
    #print header+line
    exitcode = mailpipe.close()
    if exitcode:
        print "Mail sending was not successful: %s" % exitcode

def log_write(line,email=True):
    global logfile
    write_string=time.strftime("%h %d %y %X")+": "+line+"\n"
    logfile.write(write_string)
    logfile.flush()
    if email:
        email_admins(write_string)

TOTALSYSMEM=read_memory_total()
#GB in terms of KB
GB=1024*1024
#ALERT_THRESHOLD=2.5*GB
ALERT_THRESHOLD=(TOTALSYSMEM*ALERT_PCNT_THRESHOLD/100)
KILL_THRESHOLD=(TOTALSYSMEM*KILL_PCNT_THRESHOLD/100)

#Kill and restart already running process if present?
restart_existing=False

arguments=sys.argv[1:]
opt_parse(arguments)

already_running=check_if_already_running()
if already_running:
    print "Exiting!"
    sys.exit(1)
else:
    print "Starting new process with pid "+str(os.getpid())
    pidhandle=open(PID_FILE,'w')
    pidhandle.write(str(os.getpid())+"\n")
    pidhandle.write("SLEEP_TIME: "+str(SLEEP_TIME)+"\tALERT_THRESHOLD: "+str(ALERT_THRESHOLD)+"\tKILL_THRESHOLD: "+str(KILL_THRESHOLD)+"\n")
    pidhandle.close()

logfile=open(LOG_FILE,'a')
print "SLEEP_TIME: "+str(SLEEP_TIME)+"s\tALERT_THRESHOLD: "+str(ALERT_THRESHOLD)+"kb\tKILL_THRESHOLD: "+str(KILL_THRESHOLD)+"kb"

#Enable this for default startup so that you know if someone has been messing around!
log_write("Memory Monitor started with:\n\tSLEEP_TIME: "+str(SLEEP_TIME)+"s\tALERT_THRESHOLD: "+str(ALERT_THRESHOLD)+"kb\tKILL_THRESHOLD: "+str(KILL_THRESHOLD)+"kb")
#Enable this line instead of the previous one if you don't want to receive emails about the program starting up
#while tuning parameters, you probably don't want to get as many emails!
#log_write("Memory Monitor started with:\n\tSLEEP_TIME: "+str(SLEEP_TIME)+"s\tALERT_THRESHOLD: "+str(ALERT_THRESHOLD)+"kb\tKILL_THRESHOLD: "+str(KILL_THRESHOLD)+"kb",False)

#List of pids that have already been noted for showing attitude problems in the previous iteration
bad_pids=[]
print len(bad_pids)

while 1:
    #List of new bad pids that will eventually replace the existing bad pids
    new_bad_pids=[]
    ps_output=commands.getoutput('ps -eo pid,size,vsize,suser,cmd --sort -vsize').splitlines()
    #Skip the header line and look at the ten largest processes only
    for process_line_raw in ps_output[1:11]:
        process_line=process_line_raw.strip().split()
        pid=int(process_line[0])
        size=process_line[1]
        vsize=int(process_line[2])
        user=process_line[3]
        cmd=process_line[4]
        pinfo="Running user: "+user+"\tprocess name: "+cmd+"\tpid:"+str(pid)+"\tmemory used: "+str(vsize)
        if (vsize >= KILL_THRESHOLD):
            log_write("Alert: Process stepped over memory usage limit, going to kill it: "+pinfo)
            print "silence "+str(pid)+": I kill you!"
            #Forceful kill if we reach this stage!
            commands.getoutput('kill -9 ' + str(pid))
        elif (vsize >= ALERT_THRESHOLD):
            ALREADY_CONVICTED=False
            #print "bad_pids:"
            #print bad_pids
            badpid_index=0
            while ((not ALREADY_CONVICTED) and (badpid_index < len(bad_pids))):
                #if (pid < bad_pids[badpid_index]):
                #    break
                #elif (pid == bad_pids[badpid_index]):
                if (pid == bad_pids[badpid_index]):
                    ALREADY_CONVICTED=True
                badpid_index=badpid_index+1
            if (not ALREADY_CONVICTED):
                commands.getoutput("ionice -c3 -p"+str(pid))
                commands.getoutput("renice +20 "+str(pid))
                log_write("Alert: Process showing bad attitude problems, reducing i/o and cpu priority!: "+pinfo,False)
            #Add the bad pid to the new list of bad pids in any case
            new_bad_pids.append(pid)
            #new_bad_pids.sort()
    bad_pids=new_bad_pids
    time.sleep(SLEEP_TIME)
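The decision logic in the script is easier to see (and to test without killing anything) when it is pulled out into small pure functions. The following is a rough Python 3 sketch of that idea, not a replacement for the script above. All the function names are made up for illustration, and, unlike the script, `parse_ps_line` keeps the whole command line rather than only its first word.

```python
def parse_mem_total_kb(meminfo_text):
    """Return the MemTotal value (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "MemTotal:":
            return int(fields[1])
    raise ValueError("MemTotal line not found")


def thresholds_kb(total_kb, alert_pcnt=15, kill_pcnt=90):
    """Turn percentage settings into absolute (alert, kill) thresholds in kB."""
    return total_kb * alert_pcnt // 100, total_kb * kill_pcnt // 100


def parse_ps_line(line):
    """Split one data line of `ps -eo pid,size,vsize,suser,cmd` output."""
    fields = line.split(None, 4)  # at most 5 fields; cmd keeps its spaces
    return {
        "pid": int(fields[0]),
        "vsize_kb": int(fields[2]),
        "user": fields[3],
        "cmd": fields[4],
    }


def classify(vsize_kb, alert_kb, kill_kb):
    """Return 'kill', 'alert' or 'ok' for a process of the given size."""
    if vsize_kb >= kill_kb:
        return "kill"
    if vsize_kb >= alert_kb:
        return "alert"
    return "ok"


def newly_flagged(current_pids, previous_pids):
    """Pids over the alert threshold now that were not flagged last round."""
    previous = set(previous_pids)
    return [pid for pid in current_pids if pid not in previous]
```

The `newly_flagged` helper does the same job as the `ALREADY_CONVICTED` loop: only processes that were not already flagged in the previous iteration get reniced and logged. Using a set just makes the membership check a one-liner.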