is anyone experiencing gaps in their graphs after upgrading to 2.2.3?
* natepm (n=nmarks@206.83.88.66.ptr.us.xo.net) has joined #zenoss
jb, Losing data in graphs is generally caused by system bottlenecks.
jb, I've been tuning zenoss quite a bit and could help you with this if you are interested
well the system is not under any stress at all
no iowait
no heavy cpu load
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          0.50   0.00   0.50     0.00     0.00    99.00
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          2.50   0.00   6.00     0.00     0.00    91.50
With 1 second polling, I was seeing low io wait, <= 5% with an 8 disk raid 10
hmmm that looks pretty nice
So could it be the network perhaps?
Also, depending on how aggressively you are polling or how busy the devices are, some devices will drop snmp data
s/data/requests/
hrm, well it worked fine with 2.2.0
no additional devices added with 2.2.3
We noticed that foundry switches would drop snmp requests when they were overloaded
hm, when you run zenperfsnmp from the CLI, does it exit normally?
i notice that it hangs
Perhaps you should strace it and see what it is blocking on
Just do something like this: strace -f -o strace.log zenperfsnmp run -v1
It will run a lot slower under strace just so you know
ok
yeah that's the problem
it doesn't stop like that on my old box.
Exactly why you should strace it
Because you can tail the log and see where it is hanging
yep, doing that now.
Probably on a poll, or select
My guess is a select waiting for io. Maybe network?
i doubt it, but we will see :)
Alright
If you are ok with doing it, pastebin the strace log
sure it's running now..
yeah it hangs periodically
http://pastebin.ca/1094894
there's an example..
<__adytum-bot__> Title: general pastebin - Mine - post number 1094894 (at pastebin.ca)
it was stuck there for about a minute
The select...
exactly. Waiting on a socket probably.
Let's look
yep, every time it stalls, it's on a SELECT
That's common
how would you further diagnose it
Ok so can you pastebin the whole thing?
not done yet :)
it's gonna be quite large
Line 206: 21708 select(5, [3 4], [], [], {86, 492990}
Ok how about this
Ok break out of that and rm strace.log
K
Then try this
strace -t -f -o strace.log zenperfsnmp run -v1
Then let it go to the first time it hangs and then break out
pastebin that log
k
In the previous one I couldn't see what file descriptor it was hanging on
the select said fd 5, but you didn't include enough of the log to see what fd 5 was. Make sense?
sure
Just "man select" to see how it is being called. All of those functions have man pages
Just grep the strace: grep -E 'open.*= 5$'
BMDan, I think it is a socket, not a file. We'll find out in a minute
Actually FD 5 is the notify FD there. You're right
You probably want 3 and 4.
I've never seen the [3 4] syntax in strace before; is that how it represents arrays?
jb, You can also find out what those fds are with lsof while the process is running
well, not technically "arrays"...
BMDan, Yup, struct fd_set *readfds
* BMDan slinks back into his corner.
ok
hm
heh
You could also freeze the process while it is running and beat on it with lsof to see what it is doing
http://pastebin.ca/1094899
With that + strace you should be able to mostly reverse engineer what is happening
it was stuck here for a bit..
<__adytum-bot__> Title: general pastebin - Unnamed - post number 1094899 (at pastebin.ca)
Is it still running?
yes
If so CTRL Z
k
That freezes the process
"ps -efH" to show a treeview of the process table.
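(For reference, the strace loop described up to this point boils down to roughly the following; the log path and the grep pattern are just illustrative, not anything Zenoss-specific:)

  strace -t -f -o strace.log zenperfsnmp run -v1   # -t adds timestamps, -f follows forked children
  tail -f strace.log                               # in a second terminal, watch for the stall
  # when it stalls, the last line is typically the blocking call, e.g.
  #   22880 12:27:24 select(5, [3 4], [], [], {234, 888653})
  grep 'select(' strace.log | tail -n 5            # pull out the most recent select() calls
  # Ctrl-Z in the strace terminal then freezes the process in that state for inspection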
Then find the strace and the python instance with zenperfsnmp
ok
hmmmm
can you paste the entire log please?
it's 19MB
sec
Ok nevermind :)
i can if you would like? :)
Ok here is what I'd like you to do
Or ls -l /proc/$(pidof )/fd
* xpot has quit ()
lsof hides that all from you though :)
Easier than lsof, and gives you the fd numbers (not sure if lsof does that, does it?)
BMDan, yes it does
Shameless plug to a blog I wrote on lsof: http://digitalprognosis.com/blog/2008/02/14/troubleshooting-running-systems-with-lsof/
<__adytum-bot__> Title: Troubleshooting running systems with lsof - Open Source Awareness (at digitalprognosis.com)
Look towards the bottom of the lsof of init
init 1 root 10u FIFO 0,13 1108 /dev/initctl
It has /dev/initctl open at fd 10
Bah, but I can't budge around in /proc if I use a fancy-schmancy tool like lsof, now can I? ;)
BMDan, And notice the very last trick, which is a poor man's lsof, in that blog posting
Back in mah day, we loaded a kernel module if we wanted to see open descriptors. And we liked it!
awk '/\//{print $NF}' /proc/1/maps | sort -u
And wrote binary straight to stdin, we know :)
jb, So anyways, here is the gameplan...
Run strace until it hangs. Since it is hanging on select, it is waiting on io somewhere
Ok
As soon as it hangs, CTRL Z to freeze the process
ok did that
22880 12:27:24 select(5, [3 4], [], [], {234, 888653}
Then you can find the pid and ls -l /proc//fd or lsof -p
the pid of zenperfsnmp?
We are trying to find out what fd 3 and fd 4 are.
BMDan is just showing the old people way to do it :)
jb, the pid of what is being strace'd
Is it bad that I can type this off the top of my head? /bin/echo -e '\033[1;1H\033[2J' # poor man's clear
I don't have 2.2.3 up right now, but something I recall in the changelog was switching from python snmp to net-snmp
That's why we did strace -f; I think it forks off snmpget or snmpwalk and we need to see
lol @ BMDan. That is impressive
jb, So after zenperfsnmp is frozen, run "ps -efH" to get a tree view process listing
Last one, I promise: strace -fF might give you even more info, if you're in the mood for pain. Now, lemme see if I can craft some Perl for you...
Find zenperfsnmp
BMDan, Yeah, but that is pretty nuts
jb, find zenperfsnmp and then see if there are any child processes. Do you see anything? If so, can you paste the output of ps?
22128
ok
zenoss 22127 21687 3 12:16 pts/1 00:00:24 strace -t -f -o strace.log zenperfsnmp run -v1
zenoss 22128 22127 1 12:16 pts/1 00:00:13 /usr/local/zenoss/python/bin/.python.bin /usr/local/zenoss/zenoss/Products/ZenRRD/zenperfsnmp.py --configfile /usr/local/zenoss/zenoss/etc/zenperfsnmp.conf -v1
zenoss 22879 21687 11 12:26 pts/1 00:00:28 strace -t -f -o strace.log zenperfsnmp run -v1
zenoss 22880 22879 6 12:26 pts/1 00:00:17 /usr/local/zenoss/python/bin/.python.bin /usr/local/zenoss/zenoss/Products/ZenRRD/zenperfsnmp.py --configfile /usr/local/zenoss/zenoss/etc/zenperfsnmp.conf -v1
jb, You compiled this yourself?
nope stack installer
uses /usr/local/zenoss
uggg. Debian?
nope cent5
Why not use the rpms? Much easier to manage that way
the stack is the preferred method?
i did use RPMs, but switched to the Stack installer
I don't work for zenoss. From a management perspective rpms are much easier. It doesn't matter though
yeah :(
Just a sec
Can you paste that in pastebin or something
Sorry it is hard to look at in irc with messed up linebreaks
sure
http://pastebin.ca/1094908
<__adytum-bot__> Title: general pastebin - Anonymous - post number 1094908 (at pastebin.ca)
hrm..
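(As a concrete sketch of the fd-mapping step laid out in the gameplan above: 22880 is the frozen zenperfsnmp PID from the ps output, and the awk on the lsof output assumes the FD value sits in the fourth column, which may vary slightly between lsof versions:)

  ls -l /proc/22880/fd                               # each fd is a symlink to the file/socket/pipe it points at
  lsof -p 22880 | awk 'NR==1 || $4 ~ /^[34][urw]/'   # keep the header plus fds 3 and 4 from the select() call
  awk '/\//{print $NF}' /proc/22880/maps | sort -u   # the "poor man's lsof": files mapped by the process (no sockets/pipes)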
two straces
Ok so you have two
Type jobs
You probably have two frozen ones, no biggy
k
Kill the top one
As the second one likely overwrote the log from the first
Ok then run lsof -p 22880 | less
k
In the FD column, you are looking for the fds in the second argument in the select statement from the strace
Do you understand why we are doing this?
to match the FDs and see what that FD was doing at the time that it hung?
ok good
Well you froze it at the exact time it hung, so it should be in the same state
22880 12:27:24 select(5, [3 4], [], [], {234, 888653}
Yes
So fd 3 and fd 4
.python.b 22880 zenoss 3u IPv4 55106070 TCP localhost.localdomain:36961->localhost.localdomain:8789 (ESTABLISHED)
.python.b 22880 zenoss 4r FIFO 0,6 55106060 pipe
i think that's it?
Ok that is 3, what about 4
Oh wait, yeah that is it
hm
so it's TCP related
that one anyways
Yup
Likely something zenoss internal
but it's connecting to itself?
yeah
Might be something like zenperfsnmp talking to zenhub or something like that
ok
now look through $ZENHOME/logs and see if you can find anything crazy
will do
This might take some time to track down
Can you do: ps -efH | egrep '[2]2880'
It will probably be zenperfsnmp, but it just might not be
zenoss@fc-zenoss01:~$ ps -efH | egrep '[2]2880'
zenoss 22880 22879 1 12:26 pts/1 00:00:17 /usr/local/zenoss/python/bin/.python.bin /usr/local/zenoss/zenoss/Products/ZenRRD/zenperfsnmp.py --configfile /usr/local/zenoss/zenoss/etc/zenperfsnmp.conf -v1
yeah
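(A rough sketch of the follow-up checks suggested here: confirming which daemon owns port 8789, the port fd 3 was connected to, rather than assuming it is zenhub, and then scanning the logs. The log directory and file names assume a default Zenoss layout and may differ on this install:)

  lsof -i TCP:8789                        # which processes hold each end of the 8789 connection
  netstat -tlnp | grep ':8789'            # alternative: the daemon listening on 8789 (run as root to see process names)
  grep -iE 'error|timeout|traceback' $ZENHOME/log/zenhub.log $ZENHOME/log/zenperfsnmp.log | tail -n 50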