Network outage
Aug. 7th, 2010 04:40 pm
Background: 24-hour Network Operations support for probably the largest, best-known global imagery sourcing company.
Last night I got to call London around 3am Pacific.
London: "London picture desk."
kizayaen: "Hey. This is $KIZ from the NOC in Seattle calling. We're seeing no-poll-response alerts for $LONDON_SWITCH_04 in our Operations Manager console. 05's failing polling too, but hasn't hit the alert threshold yet. How's your connectivity looking over there?"
London: "It's okay as far as I know. Let me ask around and see if anyone's having problems."
kizayaen: "Sure."
London: "... It looks like we're all fine her... no, wait, one of my colleagues seems to be unable to connect... nope, looks like we're all down. Intranet is working but we've got no external access."
kizayaen: "Right. 05's now failed and we're getting poll failures on 01, so going by nomenclature, you've got at least three of five switches down."
London: "Let me check around on some things. We've got workmen here, I'll try and find out if this is related to anything they're doing."
Alarm bells: "AAAAAAAAHHHHHHHHHHHHHHH"
Sure enough, the workmen onsite had accidentally pulled some of the wrong cables in the course of whatever the hell they were doing. Whoopsie.
But man, I loved the subtext of that phone call. It was possibly one of the most fun phone calls I've ever made in my life.
"Hi, London. This is the opposite side of the world calling. Your network has just gone into screaming fits of uncontrollable shits. We know it before you do. You'll be able to verify it independently in... just about... any second now."
no subject
Date: 2010-08-08 12:35 am (UTC)
I used to look after a college network in a building lined up to receive a minor rebuild on the top floor, to allow more admin staff in. We already had 50+ desktops and IP telephony in there, all running to a modular HP switch and patch panels out in a room off the hall, then fibres from there down to our main switch. Many of those ports had custom configurations for key machines in the area. While the first bit of building work went ahead, I got a panic call from the people still working on that floor to say that everything had stopped working for everyone.. yup, one of *those* calls. Upon walking through the door I saw the switch cabinet in the hall, with all 120 or so cables attached... which had somehow found its way out of the switch room despite the cables being tracked there through a port in the wall.
The builders had basically unplugged them all, dragged the cab out then plugged them back in.. somehow. No idea which cables went where, or which ones were powered for the phones. The reason nobody was getting anything was because they'd broken one of the main fibre pair by presumably trying to unplug it by grabbing a handful and pulling. Probably for the best under the circumstances.
Funny thing is, when I asked them WTF made them think that would be a good idea, they all denied touching anything.
no subject
Date: 2010-08-08 01:19 am (UTC)
I've used that on workmen dozens of times. Awesome phone call, though.
no subject
Date: 2010-08-08 04:33 am (UTC)
So there were workmen in the patch racks yanking stuff out indiscriminately? :p
no subject
Date: 2010-08-08 04:57 am (UTC)
no subject
Date: 2010-08-08 05:10 am (UTC)
"It's just a flesh wound."
no subject
Date: 2010-08-08 06:49 am (UTC)
We had a contractor in a couple of weeks ago to do some tidying in our data centre during the daytime. At about the second "Server has lost its heartbeat to the monitoring server" alert, one of the other network admins went in there and told them to NOT UNPLUG SHIT.
no subject
Date: 2010-08-08 08:28 am (UTC)
no subject
Date: 2010-08-08 08:29 am (UTC)
no subject
Date: 2010-08-09 08:13 pm (UTC)
The local electricity co had severed the armoured power cable into the data centre that housed THE mainframe for the whole company. No policy information, anyone calling in would not be able to find out *anything*. Data centre was down for 3 working days.
This was my first freelance consulting job. I made *bank* that week :)
no subject
Date: 2010-09-26 06:16 pm (UTC)
I know what you mean!
I work in the same business (located in Germany but looking after the worldwide network, so it's simply the other way around ^^)
It's fun when you call the person mainly responsible for an area (who is listed as the local contact in your database -.-) and he is like "No, there is no problem at site x. *ringing in the background* Oh wait *he puts you on hold* *nice hold music* *he gets back* Erm, there is a power outage on site, the local company is working on it..."
Also nice:
A risk class 1 component goes down (only 0 is higher, and that is reserved for core switches), you start all the bells and whistles -> escalation, informing who knows how many "not-really-responsible-but-important-enough-that-they-have-to-be-informed" people, getting hold of someone on site after the 100th try... and they tell you "Oh, we're doing maintenance on site. A ticket for this? But it is our LAN, we can do what we want with our LAN"
Not when WE monitor it *grrrrrrrrrr*