arghedy argh argh
Apr. 28th, 2005 10:16 am
Something you never ever want to see on a production server:
# sar -f sa27 -g
SunOS <hostname> 5.9 Generic_117171-08 sun4u 04/27/2005
00:00:01  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
01:00:00     0.94     1.98     1.98     0.00     0.05
02:00:00     0.76     1.06     1.05     0.00     0.96
<snippety>
16:00:00    11.38    25.32    25.31     0.00     0.00
16:20:00    10.85    26.22    26.21     0.00     0.00
16:40:00    38.04   270.21   273.24    60.48     0.00
17:00:00    57.50   590.54   592.73   151.67     0.00
17:20:00    44.11   569.10   571.56   161.28     0.00
17:40:01    79.33   795.55   812.00   667.62     0.00
18:00:01    59.29   703.71   723.77  1465.26     0.00

Average      9.51    65.60    66.46    51.43     0.71
The explanation for the non-Unix geeks: page-ins and page-outs are chunks (pages) of memory being moved from or to disk. Page scans are the page daemon checking pages to see whether they can be freed, so pgscan/s climbs when free memory runs short. 200 is the upper limit you'd ever expect to see on a fully loaded server. This poor beast seemed to have some root-owned process doing:
while (true) {
malloc();
}
and we wondered why it became completely unresponsive at about 18:30 ... We hadn't noticed before because we were too busy playing Counter-Strike. It would have been much worse if we'd done the normal 17:00 thing and gone home then (and not noticed until 08:30 the next morning).
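For anyone who'd rather let a script squint at those numbers, here is a rough Python sketch. It assumes the `sar -g` output has been saved as plain text in the column layout shown above, and it uses the 200 pgscan/s rule of thumb from this post as its limit. The names `parse_sar_g` and `flag_scan_rate` are made up for illustration; they are not part of sar or any library.

```python
import re

# Column names as printed in the sar -g header line shown in the post.
COLUMNS = ("pgout/s", "ppgout/s", "pgfree/s", "pgscan/s", "%ufs_ipf")

# A data row is a HH:MM:SS timestamp followed by five decimal numbers.
# The header line and the "Average" line do not match and are skipped.
ROW = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\s+" + r"\s+".join([r"(\d+\.\d+)"] * 5) + r"\s*$"
)

def parse_sar_g(text):
    """Return a list of (timestamp, {column: value}) for each data row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            values = [float(v) for v in m.groups()[1:]]
            rows.append((m.group(1), dict(zip(COLUMNS, values))))
    return rows

def flag_scan_rate(rows, limit=200.0):
    """Return timestamps of intervals whose pgscan/s exceeds `limit`.

    The default of 200 is the post's rule of thumb, not an official figure.
    """
    return [ts for ts, cols in rows if cols["pgscan/s"] > limit]

# The last six intervals from the output above.
SAMPLE = """\
00:00:01  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
16:20:00    10.85    26.22    26.21     0.00     0.00
16:40:00    38.04   270.21   273.24    60.48     0.00
17:00:00    57.50   590.54   592.73   151.67     0.00
17:20:00    44.11   569.10   571.56   161.28     0.00
17:40:01    79.33   795.55   812.00   667.62     0.00
18:00:01    59.29   703.71   723.77  1465.26     0.00
Average      9.51    65.60    66.46    51.43     0.71
"""

print(flag_scan_rate(parse_sar_g(SAMPLE)))  # ['17:40:01', '18:00:01']
```

By that yardstick the box was over the line from the 17:40 sample onward, although pgscan/s had already been climbing since 16:40.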
no subject
Date: 2005-04-28 06:58 am (UTC)
Even though this is a workstation rather than a server, I'm running like 3 Adobe apps, e-mail, IRC, and a VNC client, and I get:
23:56 up 22:56, 2 users, load averages: 0.54 0.64 1.01
I'd have better uptime, but had to reboot for a kernel update.
no subject
Date: 2005-04-28 07:37 am (UTC)
No idea what the load average was. That info disappeared with the reset. It was probably ... high though.
no subject
Date: 2005-04-28 12:35 pm (UTC)
Not something you want to see on a production box. Not at all. Your alert system must have been screaming at you!
no subject
Date: 2005-04-28 12:53 pm (UTC)
Admins claimed nothing was wrong.
It usually took about 5 minutes before something you typed came back to your terminal.
These are the same admins that claimed eggdrop bots crashed their server...
no subject
Date: 2005-04-28 06:25 pm (UTC)
Heh, just to humor you though:
Processes: 73 total, 3 running, 70 sleeping... 241 threads 11:24:32
Load Avg: 0.52, 0.83, 0.70 CPU usage: 57.4% user, 30.7% sys, 11.9% idle
SharedLibs: num = 121, resident = 26.5M code, 2.37M data, 8.27M LinkEdit
MemRegions: num = 11463, resident = 161M + 11.3M private, 161M shared
PhysMem: 77.4M wired, 285M active, 143M inactive, 506M used, 5.84M free
VM: 7.12G + 83.5M 125951(104) pageins, 88560(27) pageouts
no subject
Date: 2005-04-29 05:14 am (UTC)
It was the (HTTP) proxy being inaccessible that woke us up!
no subject
Date: 2005-04-29 01:15 pm (UTC)
We use Nagios, so the agent keeps track of mem usage, swap usage, HDD space, ping responses, telnet/ftp, etc. I love it ^_^
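A minimal sketch of what a Nagios-style check for this kind of thing might look like, in Python. The only part taken from Nagios itself is the plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN) plus a one-line status message. The function name `check_threshold`, the metric name, and the warn/crit values are all assumptions for illustration, not anything from the setup described here.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

def check_threshold(name, value, warn, crit):
    """Map a measurement to an (exit_code, status_line) pair.

    `value` is None when the measurement could not be taken.
    """
    if value is None:
        code = UNKNOWN
    elif value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"{LABELS[code]} - {name}={value}"

# Hypothetical reading: the 18:00 pgscan/s value from the post, checked
# against made-up thresholds (warn at 100, critical at the post's 200).
code, line = check_threshold("pgscan_per_s", 1465.26, warn=100.0, crit=200.0)
print(line)  # CRITICAL - pgscan_per_s=1465.26

# A real plugin would finish with sys.exit(code) so Nagios sees the status.
```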