techrecovery | arghedy argh argh

something you never ever want to see on a production server:

# sar -f sa27 -g

SunOS <hostname> 5.9 Generic_117171-08 sun4u 04/27/2005

00:00:01  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
01:00:00     0.94     1.98     1.98     0.00     0.05
02:00:00     0.76     1.06     1.05     0.00     0.96

<snippety>

16:00:00    11.38    25.32    25.31     0.00     0.00
16:20:00    10.85    26.22    26.21     0.00     0.00
16:40:00    38.04   270.21   273.24    60.48     0.00
17:00:00    57.50   590.54   592.73   151.67     0.00
17:20:00    44.11   569.10   571.56   161.28     0.00
17:40:01    79.33   795.55   812.00   667.62     0.00
18:00:01    59.29   703.71   723.77  1465.26     0.00

Average      9.51    65.60    66.46    51.43     0.71

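the explanation for the non-Unix geeks: page-ins and page-outs are chunks (pages) of memory being moved from or to disk (this report only shows the out side). Page scans (pgscan/s) are the kernel's page daemon hunting through memory for pages it can free, which it only does when free memory is running short. A sustained rate of about 200 a second is the upper limit you'd ever expect to see on a fully loaded server. This poor beast seemed to have some root-owned process doing: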

while (1) {
    malloc(4096);  /* grab memory, never free it */
}

and we wondered why it became completely unresponsive at about 18:30 ... We hadn't noticed before because we were too busy playing Counter-Strike. It would have been much worse if we'd done the normal 17:00 thing and gone home then (and not noticed until 08:30 the next morning).
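
For anyone who'd rather not eyeball the table, here is a rough sketch in Python that picks out the intervals where pgscan/s goes past the 200 rule of thumb. It is not something we ran at the time. The names `busy_intervals`, `SCAN_LIMIT` and `SAMPLE` are made up for illustration, and it assumes the six-column `sar -g` text layout shown above.

```python
SCAN_LIMIT = 200.0  # pages/s; the rule-of-thumb ceiling quoted in the post (an assumption, tune to taste)


def busy_intervals(sar_text, limit=SCAN_LIMIT):
    """Return (time, pgscan/s) pairs from `sar -g` text whose scan rate exceeds limit."""
    flagged = []
    for line in sar_text.splitlines():
        fields = line.split()
        # Data rows look like: HH:MM:SS pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
        if len(fields) != 6 or fields[0].count(":") != 2:
            continue  # skips blank lines, the banner and the Average row
        try:
            pgscan = float(fields[4])
        except ValueError:
            continue  # header row: its fields are column names, not numbers
        if pgscan > limit:
            flagged.append((fields[0], pgscan))
    return flagged


# A few rows copied from the report above (hypothetical variable, for the demo only).
SAMPLE = """\
00:00:01  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
16:20:00    10.85    26.22    26.21     0.00     0.00
16:40:00    38.04   270.21   273.24    60.48     0.00
17:40:01    79.33   795.55   812.00   667.62     0.00
18:00:01    59.29   703.71   723.77  1465.26     0.00
Average      9.51    65.60    66.46    51.43     0.71
"""

if __name__ == "__main__":
    for when, rate in busy_intervals(SAMPLE):
        print(f"{when}  pgscan/s={rate:.2f}")
```

On the sample rows this flags only 17:40:01 and 18:00:01, which is roughly when the box was on its way to falling over.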

sean-langley.livejournal.com

That's...very bad.

Even though this isn't a server and is a workstation, I'm running like 3 Adobe apps, e-mail, irc, and a VNC client, and I get:

23:56 up 22:56, 2 users, load averages: 0.54 0.64 1.01

I'd have better uptime, but had to reboot for a kernel update.

japester.livejournal.com

and .... you are aware that it's not uptime/load average that this was measuring?

no idea what the load average was. that info disappeared with the reset. it was probably ... high though.

taleya.livejournal.com

oh christ

Not something you want to see on a production box. not at all. Your alert system must have been screaming at you!

compwizrd.livejournal.com

i can remember working on an SGI Challenge back in the mid-90s that REGULARLY hit 900 load average.

Admins claimed nothing was wrong.

It usually took about 5 minutes before something you typed came back to your terminal.

These are the same admins that claimed eggdrop bots crashed their server...

Yeah, I realised that after posting.

Heh, just to humor you though:

Processes: 73 total, 3 running, 70 sleeping... 241 threads 11:24:32
Load Avg: 0.52, 0.83, 0.70 CPU usage: 57.4% user, 30.7% sys, 11.9% idle
SharedLibs: num = 121, resident = 26.5M code, 2.37M data, 8.27M LinkEdit
MemRegions: num = 11463, resident = 161M + 11.3M private, 161M shared
PhysMem: 77.4M wired, 285M active, 143M inactive, 506M used, 5.84M free
VM: 7.12G + 83.5M 125951(104) pageins, 88560(27) pageouts

well, no actually. We use intermapper, which is roughly equivalent to telnetting to the ports and seeing if you get a response. The box kept answering those, as all those routines are kernel level.
It was the (http) proxy being inaccessible that woke us up!

aii..

We use Nagios, so the agent keeps track of mem usage, swap usage, HDD space, ping responses, telnet/ftp etc...I love it ^_^
