[identity profile] grrltechie.livejournal.com posting in [community profile] techrecovery

Work conversation from earlier today:

Network Guy:  We are about to do something disruptive*.
Me:  Wha?
NG:  Did boss not tell you about it?
Me:  No
NG:  .....*laughs*.  Well, we are going to do something disruptive, but it shouldn't cause any problems.

And of course I got about 20 phone calls from users who were completely disrupted, i.e. knocked off the network, lost work, and unable to get back into our primary (HIS) software.

Both pieces, not being told anything and the NG screwing things up after saying that what he was doing wouldn't screw anything up, are completely typical.  And they are large pieces of why I am utterly burned out on this job.

*In this case the something disruptive was rebooting the core switch.  Ya know, the one that everybody goes through to authenticate to the network?  On the one hand, we have 2 and they are supposed to be set up to provide redundancy.  On the other hand, I don't think that redundancy has ever actually been tested.  Before today.  And, wow.  Redundancy FAIL.

So I'm thinking it might have been a good idea to treat this like a short down time, by, oh I don't know, warning people they might lose network connectivity and therefore our HIS software.  Because that's what it turned out to be, only it was UNPLANNED down time and oh did the users get upset.

Date: 2010-04-07 08:28 am (UTC)
From: [identity profile] japester.livejournal.com
oh, pairing 6509s into what is nominally a redundant config, testing it, and then rebooting one is still fraught with danger.
That was a previous job's hell. We just treated anything serious on any of our core kit as a 'potential outage', so it got done outside business hours.

Date: 2010-04-07 02:22 pm (UTC)
From: [identity profile] kuang.livejournal.com
I used to work with an HP 9308 that would throw a hissy fit if you even looked at it funny, but unfortunately it was presided over by a networking team who saw preventative maintenance as something that stole valuable Unreal Tournament time. Any concerns I raised were met with indifference, even when supported by proof.

It was therefore quite interesting when an old 24 port HP on my site started to glitch and tripped the entire thing, much to the surprise of... well, just that team really. Unfortunately it broke at two minutes to lunch, so.. you know. Was funny to come back an hour later and see two of the team running around in circles after travelling across the city and realising they didn't know where any of 'their' kit was hidden.

Mind you, this is the same team that called in a consultant to set up some extra routing rules for lesser used protocols, didn't ask for documentation, didn't take a backup, and then were forced to reset the whole thing a month later after a bad thunderstorm, losing the lot in the process.

Date: 2010-04-08 12:26 pm (UTC)
From: [identity profile] wolfhound668.livejournal.com
Oh, if I had a dollar for every time I heard, "you'll never even notice what we're doing"...

My favorite was when I was working at a company that was on the outer edge of coverage from the local AT&T central office.

We were on a third-party phone system, and one day they discovered that all of our circuits were running across AT&T's hardware, not what they had co-located at the CO. "We're going to cut over to our hardware in the middle of the night. You'll never notice an outage".

The next morning none of our analog lines worked. It turns out that their hardware lacked the ability to push as far as our building, whereas AT&T's was fine. It took *five weeks* for them to get our faxes back (they couldn't just reverse what they did without a work order, f-ing union).

The punchline? Every year after that some tech from the third-party phone company would call and say, "hey, for some reason you were never moved over to our hardware at the CO. I'm going to cut you over tonight while I'm out there".

My standard response was, "leave it alone. If you move it I will personally hunt you down and break your legs".

Date: 2010-04-08 08:44 pm (UTC)
From: [identity profile] lihan161051.livejournal.com
Ugh.

Never assume any kind of poke at a core IT system is going to go unnoticed. Schedule it for minimal-impact periods and build post-action testing and a contingency revert into the schedule, with the effort proportional to the number of people using the system.

Because sooner or later, even the best IT guy is going to pooch something unexpectedly in the middle of what would seem to be a transparent activity. And you want that to happen when the only person you knock offline is the security guy surfing porn at the receptionist's desk, not when an angry mob of screaming marketing guys is seconds away from converging on your rack room.

Date: 2010-04-19 07:45 pm (UTC)
From: [identity profile] the-s-guy.livejournal.com
"That's fine, I just need to know who authorised it."

*redirects incoming lines to that person's phone*

"OK, we're set, go ahead."
