Memory Usage Monitor

This problem was certainly irritating! I came to work one day and found out that one of the main servers for a lab had crashed without explanation the day before, leading to temporary chaos.  The only pointers were some kernel segfaults recorded in /var/log/messages.  However, the system had crashed at around 14:40, while the last segfault was at 13:55 or similar (I am writing this part from memory, and don’t feel like going back to the server logs right now).

I looked at the sar output, and figured out that the kernel segfaults were, in fact, quite possibly the culprit.  Right around the time of the segfaults, memory usage started climbing rapidly, and eventually page faults shot up too.  I would like to quote figures, but since I would have to bring up the stats again, and I am kind of a lazy guy, I prefer not to right now.

The kernel logs did show the program that was responsible: a.out :( .  Fortunately, there was only one recent a.out on the server (more users than developers), and I located the writer of the buggy program.  He had a memory leak, and was running his programs, which would segfault, and he would be wondering why the server crashed.  I explained, and his supervisor came in, asking for a memory monitor that would kill processes hogging up more than a certain amount of memory.

I preferred the other solution: them writing a wrapper around their test programs to make sure memory leaks were handled properly (on top of that, I hate doing work that risks crashing production servers).  However, since I was being pressured a fair bit, and I would hate to have the server crash again, I googled for scripts to do that.  Couldn’t find any (and I am very sure there is a very good reason for that!!!).
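For what it is worth, the wrapper idea could look roughly like the sketch below. This is not what the developers ended up using; it is a minimal Python 3 illustration (the monitor script further down is Python 2), and the function name run_with_memory_cap is made up for this example. It relies on the standard resource module and on Linux enforcing RLIMIT_AS, so a leaking program fails its own allocations instead of dragging the server down.

```python
import resource
import subprocess

def run_with_memory_cap(argv, max_bytes):
    """Run argv as a child process with its address space capped at max_bytes.

    Returns the child's exit status (negative if it was killed by a signal).
    """
    def apply_limit():
        # Runs in the child between fork and exec, so only the child is limited.
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        # An unprivileged process may not exceed the existing hard limit.
        if hard == resource.RLIM_INFINITY:
            cap = max_bytes
        else:
            cap = min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (cap, hard))

    return subprocess.call(argv, preexec_fn=apply_limit)
```

Something like run_with_memory_cap(["./a.out"], 1024 ** 3) would then give the test program at most 1 GiB of address space; once it leaks past that, malloc starts failing inside the program rather than the whole machine starting to swap.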

Which led me to write my own Python script to do that.  It might be a bit bloated for most purposes: it does logging and sends you email too… but I have not configured it to run in daemon mode yet (so you have to run it using nohup in the background).  Here goes:

#!/usr/bin/python
#Prajjwal Devkota
#March 18, 2008
#Script to kill processes that hog memory usage

#Feature wishlist (TODO):
        #Trapping kill signals before exiting, and appropriately logging them (perhaps leave the KILL signal untrapped, but at least the softer kills should be trapped, and the -k and -r options in the program should call the softer KILL options
        #Throttling for messages about processes above ALERT_THRESHOLD but below KILL_THRESHOLD
        #Email functionality:
                #Message specific subject lines
                #Options to turn emailing on/off
                #Options to send email only for specified types of log messages
        #Startup/Shutdown/Reconfiguration options
                #Built in daemon functionality: currently, you have to start it in 'daemon mode' by typing 'nohup mem_usage_monitor.py &'
                #Options for killing existing processes, and possibly, even doing a 'soft restart' :) 

import os
import commands
import time
import sys
import socket
import getopt

#programs and paths
SENDMAIL = "/usr/sbin/sendmail"

#Files
PID_FILE="/var/run/mem_usage_monitor.pid"
PROC_MEM_FILE="/proc/meminfo"
LOG_FILE="/var/log/mem_usage_monitor.log"

#Sender address for the email notifications about misbehaving and killed processes
MY_EMAIL="mem_monitor@"+socket.gethostname()
#Admin Email
ADMIN_EMAIL="your_email@your.domain.com"
MAIL_SUBJECT="Alert from Memory Monitor on "+socket.gethostname()

#Alert if memory usage above 15% -- just in case multiple processes run, they have low priorities at least-- might help things a bit (though I doubt they will help much!)
ALERT_PCNT_THRESHOLD=15
#Kill if memory usage above 90%
KILL_PCNT_THRESHOLD=90

minute=60 #seconds
hours=minute*60
#check every 5 seconds :( 
SLEEP_TIME=5

#End of settings... on to code
#Modify below this line only if you know what you are doing 
def read_memory_total():
        exists=os.path.exists(PROC_MEM_FILE)
        if exists:
                memfile=open(PROC_MEM_FILE)
                memtotalline=memfile.readline().strip()
                memfile.close()
                if (memtotalline.split()[0]=="MemTotal:"):
                        return int(memtotalline.split()[1])
                else:
                        print PROC_MEM_FILE+" format seems to be different than what I expected!\n...exiting"
                        print "First line encountered:\n"+memtotalline+"\n"
                        sys.exit(2)
        else:
                print "Could not find "+PROC_MEM_FILE+"\n...exiting"
                sys.exit(3)

def check_if_already_running():
        exists=os.path.exists(PID_FILE)
        if exists:
                 pidhandle=open(PID_FILE)
                 daemon_pid=pidhandle.readline().strip()
                 pidhandle.close()
                 ps_output=commands.getoutput('ps -p '+daemon_pid).splitlines()
                 if (len(ps_output)==1):
                        exists=False
                        print "pid file found, but pid "+daemon_pid+" not running"
                 else:
                        if (restart_existing):
                                commands.getoutput('kill ' + daemon_pid)
                                exists=False
                        else:
                                print "process already seems to be running with pid of "+daemon_pid
                                print "Use '-r' switch to kill existing process and start new one"
        else:
                print "No pid file found"
        return exists

def opt_parse(arguments):
    global restart_existing
    try:
        #opts, args = getopt.getopt(sys.argv[1:], "ho:v", ["help", "output="])
        opts, args = getopt.getopt(arguments, "r", ["restart"])
    except getopt.GetoptError, err:
        # print help information and exit:
        print str(err) # will print something like "option -a not recognized"
        sys.exit(2)
    for option, argument in opts:
        if option in ("-r", "--restart"):
            restart_existing = True
        else:
            assert False, "unhandled option"

def email_admins(line):
        global SENDMAIL,MY_EMAIL,ADMIN_EMAIL
        # open a pipe to the mail program and write the data to the pipe
        mailpipe = os.popen("%s -t" % SENDMAIL, 'w')
        header="From: "+MY_EMAIL+"\nTo: "+ADMIN_EMAIL+"\nReturn-Path: "+MY_EMAIL+"\nSubject: "+MAIL_SUBJECT+"\n\n"
        mailpipe.write(header+line)
        #print header+line
        exitcode = mailpipe.close()
        if exitcode:
            print "Mail sending was not successful: %s" % exitcode

def log_write(line,email=True):
        global logfile
        write_string=time.strftime("%h %d %y %X")+": "+line+"\n"
        logfile.write(write_string)
        logfile.flush()
        if email:
                email_admins(write_string)

TOTALSYSMEM=read_memory_total()

#GB in terms of KB
GB=1024*1024
#ALERT_THRESHOLD=2.5*GB
ALERT_THRESHOLD=(TOTALSYSMEM*ALERT_PCNT_THRESHOLD/100)
KILL_THRESHOLD=(TOTALSYSMEM*KILL_PCNT_THRESHOLD/100)
#Kill and restart already running process if present?
restart_existing=False

arguments=sys.argv[1:]
opt_parse(arguments)

already_running=check_if_already_running()
if already_running:
        print "Exiting!"
        sys.exit(1)
else:
        print "Starting new process with pid "+str(os.getpid())
        pidhandle=open(PID_FILE,'w')
        pidhandle.write(str(os.getpid())+"\n")
        pidhandle.write("SLEEP_TIME: "+str(SLEEP_TIME)+"\tALERT_THRESHOLD: "+str(ALERT_THRESHOLD)+"\tKILL_THRESHOLD: "+str(KILL_THRESHOLD)+"\n")
        pidhandle.close()

logfile=open(LOG_FILE,'a')

print "SLEEP_TIME: "+str(SLEEP_TIME)+"s\tALERT_THRESHOLD: "+str(ALERT_THRESHOLD)+"kb\tKILL_THRESHOLD: "+str(KILL_THRESHOLD)+"kb"

#Enable this for default startup so that you know if someone has been messing around!
log_write("Memory Monitor started with:\n\tSLEEP_TIME: "+str(SLEEP_TIME)+"s\tALERT_THRESHOLD: "+str(ALERT_THRESHOLD)+"kb\tKILL_THRESHOLD: "+str(KILL_THRESHOLD)+"kb")

#Enable this line instead of the previous one if you don't want to receive emails about the program starting up
#while tuning parameters, you probably don't want to get as many emails!
#log_write("Memory Monitor started with:\n\tSLEEP_TIME: "+str(SLEEP_TIME)+"s\tALERT_THRESHOLD: "+str(ALERT_THRESHOLD)+"kb\tKILL_THRESHOLD: "+str(KILL_THRESHOLD)+"kb",False)

#List of pids that have already been noted for showing attitude problems in the previous iteration
bad_pids=[]
print len(bad_pids)

while 1:
        #List of new bad pids that will eventually replace the existing bad pids
        new_bad_pids=[]
        ps_output=commands.getoutput('ps -eo pid,size,vsize,suser,cmd --sort -vsize').splitlines()
        for process_line_raw in ps_output[1:11]:
                process_line=process_line_raw.strip().split()
                pid=int(process_line[0])
                size=process_line[1]
                vsize=int(process_line[2])
                user=process_line[3]
                cmd=process_line[4]
                pinfo="Running user: "+user+"\tprocess name: "+cmd+"\tpid:"+str(pid)+"\tmemory used: "+str(vsize)
                if (vsize >= KILL_THRESHOLD):
                        log_write("Alert: Process stepped over memory usage limit, going to kill it: "+pinfo)
                        print "silence "+str(pid)+": I kill you!"
                        #Forceful kill if we reach this stage!
                        commands.getoutput('kill -9 ' + str(pid))
                elif (vsize >= ALERT_THRESHOLD):
                        #Was this pid already flagged in the previous iteration?
                        ALREADY_CONVICTED=(pid in bad_pids)
                        if (not ALREADY_CONVICTED):
                                commands.getoutput("ionice -c3 -p"+str(pid))
                                commands.getoutput("renice +20 "+str(pid))
                                log_write("Alert: Process showing bad attitude problems, reducing i/o and cpu priority!: "+pinfo,False)
                        #Add the bad pid to the new list of bad pids in any case
                        new_bad_pids.append(pid)
        #new_bad_pids.sort()
        bad_pids=new_bad_pids
        time.sleep(SLEEP_TIME)
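
The decision logic buried in that loop is easier to see (and to test) when pulled out into small functions. The following is only a Python 3 sketch of the same comparisons, not a replacement for the script above; the names parse_mem_total_kb, parse_ps_line and classify are invented for this illustration. One deliberate difference: it looks for the MemTotal: line wherever it appears instead of assuming it is the first line, and it keeps the whole command line rather than just its first word.

```python
def parse_mem_total_kb(meminfo_text):
    """Return the MemTotal value (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "MemTotal:":
            return int(fields[1])
    raise ValueError("MemTotal: line not found")

def parse_ps_line(line):
    """Split one data line of `ps -eo pid,size,vsize,suser,cmd` into fields."""
    pid, _size, vsize, user, cmd = line.split(None, 4)
    return {"pid": int(pid), "vsize_kb": int(vsize), "user": user, "cmd": cmd}

def classify(vsize_kb, total_kb, alert_pct=15, kill_pct=90):
    """Return 'kill', 'alert' or 'ok', using the script's percentage thresholds."""
    if vsize_kb >= total_kb * kill_pct // 100:
        return "kill"
    if vsize_kb >= total_kb * alert_pct // 100:
        return "alert"
    return "ok"
```

With these, the main loop boils down to: read the total once, parse the top few ps lines, and act on whatever classify returns for each process.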

Getting started with gdb

I found these steps helpful a few weeks ago:

1. Compile your program with the -g switch so that gcc/g++ includes debugging symbols in the binary.

e.g. g++ -g myprog.cc -o myprog

2. Run gdb with the binary as an argument

gdb myprog

3. Check the contents of the program

list

– press Enter until you reach the end of the program.

4. Put breakpoints, if required

e.g. break 4 (puts a breakpoint on line 4). I could elaborate more on that, but will leave it as it is for now.

5. Run your program (with arguments, if necessary).

run (put arguments after run if required by your program).

6. examine values:

listing a few I remember off the top of my head:

x/s variable_name – display the contents of a string

x/32xw $esp – show 32 words in hex, starting at the address held in esp and continuing towards higher addresses

and so on… help will inform you better than I will (at least for now!).

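7. Move on through the program:

step (runs the next source line, stepping into function calls; use continue to run on to the next breakpoint)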

8. Repeat steps 6 and 7 till either the program gracefully exits, you get bored, or it segfaults (or does other nasty stuff).

9. Check the various registers (eip, etc.) with info registers if the program segfaulted or did nasty stuff.
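
If you end up repeating the same session over and over, the steps above can also be replayed without typing them in. The sketch below is just one way to do it, assuming a gdb that supports the --batch, -ex and --args options; the helper name gdb_batch_argv is made up here, and it only builds the command line rather than running anything.

```python
def gdb_batch_argv(program, breakpoints=(), program_args=()):
    """Build a gdb command line that replays the steps non-interactively."""
    argv = ["gdb", "--batch"]
    for location in breakpoints:
        argv += ["-ex", "break %s" % location]   # step 4: set breakpoints
    argv += ["-ex", "run"]                       # step 5: start the program
    argv += ["-ex", "info registers"]            # step 9: dump the registers
    argv += ["-ex", "backtrace"]                 # show where execution stopped
    # --args must come last: everything after the program goes to the program.
    argv += ["--args", program] + list(program_args)
    return argv
```

Passing the result to subprocess.call, for example subprocess.call(gdb_batch_argv("./myprog", breakpoints=[4])), should print the registers and a backtrace at the first stop (breakpoint or crash) and then exit.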

If you want to debug the core dump of a program, do:

gdb program_name core

(make sure the core file is there first!)

I just recall one command for now (limited experience with gdb :) ):

where

this should give you an idea of where exactly the program segfaulted (if it did).
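
The core dump check can be scripted the same way. This is a rough Python 3 sketch under the same assumption (a gdb with --batch and -ex); core_backtrace is a hypothetical helper name, and the default core file name "core" is only a guess, since where core files land depends on how the system is configured.

```python
import os
import subprocess

def core_backtrace(program, core="core", gdb="gdb"):
    """Return the output of gdb's `where` command for a core file, as text."""
    # Make sure both the binary and the core file are there first.
    for path in (program, core):
        if not os.path.isfile(path):
            raise FileNotFoundError(path)
    result = subprocess.run(
        [gdb, "--batch", "-ex", "where", program, core],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

Calling print(core_backtrace("./myprog")) after a crash should show the same stack trace you would get by typing where by hand.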

Check this page for more information

GDB seems cool :)

For those who have had to program in Linux like I have (that is, because of coursework) and are used to a development environment like Visual C++ (I was used to something much more basic, but found it very easy nonetheless: Turbo Pascal!), debugging in Linux might seem a lot more difficult, especially once you start getting segfaults.

Something I learned very recently: use gdb.. it helps a lot!  Using it, I just managed to discover a very subtle error in my own code that was causing my program to crash…  google around a bit, get hold of a good gdb tutorial, and debug away.. don’t let the black screen scare you.. believe me.. if you have to program in Linux (even if it’s just coursework), gdb might be well worth the effort (30-60 mins max?) if you have to write more than a few hundred lines of code and debug it too!

Is your computer freezing up on you?

The Windows XP machine I had been using had been freezing up on me more and more frequently for the past few days.  I booted over to Linux and checked the SMART statistics of the hard disk, which seemed to be fine.  I was thinking of doing memory tests too.. but never got around to it (since I was working on Linux anyways).  I was not sure what was causing the freezes– I suspected a corrupted driver or (God forbid!) a virus/malware program.

I was using ntfs-3g on Linux to mount the NTFS partition I had my data on, and when I ran an updatedb command, I suddenly got all these ‘io error’ messages for some Symantec files.  I had assumed that Windows would automatically notice a corrupted partition and run chkdsk.. maybe I had skipped disk checking sometime after an unclean shutdown (power failure).

Suspecting a corrupted filesystem, I booted to Windows XP safe mode, ran chkdsk /f c:, scheduled the check to be run at boot, and voila.. after the error-fixing, my computer seems to have started liking me once again.. no more freezes every 10 minutes!
