Okay, so my customers are clueless from time to time.....well okay, they're just clueless. But what gets me royally pissed off is the laziness of fellow colleagues in other departments that land mine in hot water, mostly because they can't be arsed simply saving changes to live kit.
I wasted two hours of my life this afternoon, slowly getting the humidity sucked out of my body by the co-lo's controlled environment, fucking about with a customer's cabinet and wondering why their server suddenly seems to have disappeared into a routing black hole.
Customer runs a bunch of 1U rack-mounted Windows Server 2003 boxes, all connected into a shitty little Netscreen 5XP firewall via an intelligent 24-port Cisco 2950 switch. One of the Windows servers seemed to have "disappeared": nobody could ping/RDP/FTP into it from the outside world. One of my jobs in our company's NOC, naturally, is to visit the co-lo to connect a console kit (keyboard, monitor, mouse) to the offending, non-communicative server to find out why and to fix it (which, in the case of most Windoze machines, just needs a reboot).
Only when I log into it locally, it can't see anything outside of itself, even within its own internal network. It can't even ping the gateway (the Netscreen firewall) that's on the same subnet. Oddly enough, every other rack-mounted server in the cab using the same switch and firewall is visible to the outside world, so it must be something local to the box. Only, after 90min of faffing about, I can't find out what. I tried different IPs in the same subnet, resetting the TCP/IP stack, double-checking Network Services, switching NICs, switching ports on the Cisco, replacing cables, Event Viewer surfing, NIC driver upgrades; and all of the above in both Windows' normal and Safe Mode....everything I could think of. Physically it's fine: there's link between the switch and the server, and the OS reports duplex and speed settings fine. No restrictive policies on the firewall.....and, oddly enough, no ARP entries for it in the ARP cache (aha, finally a clue). It's like it really has slipped into an IT black hole.
WTF!?
Oh, did I mention this customer is a rather important one? That my manager has been chasing me for the last 45min, only to get an engaged signal every time because I've been on the mobile the entire time to the customer on the other end who's troubleshooting the same issue [10 missed calls]? That Account Management are getting, to put it politely, rather pushy and demanding resolution? That the customer's own manager on his side of the phone is whining as well? The pressure, as they say, is on.
Then we discover the Cisco has a config. And that it was unexpectedly rebooted sometime before I arrived (how, we're still not really too sure of, there's no APC in the cab, and none of the other machines went down from power issues). The config also makes no mention of the port the server is connected to, although we can see the status of that port fine. A little further digging shows the entire switch is in a VLAN of its own, and that by default, this server isn't a part of it. Why?
Because someone, along the way, added this new machine into the cab, and altered the Cisco's config to add it into the VLAN to allow routing....but never saved the fucking config. So naturally, when the Cisco rebooted, it defaulted to the previous saved version, without the new changes. Hence: no routing between Mystery Server and Ignorant Cisco.
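For anyone who hasn't been bitten by this yet: IOS keeps the live settings in the running config and only loads the startup config at boot, so anything not saved with `copy running-config startup-config` (or the older `write memory`) vanishes on reboot. Below is a rough sketch of how you could spot that kind of drift from two config dumps you've already pulled off the switch. The function names, interface numbers and VLAN ID are all made up for illustration, and the parsing is deliberately naive (it only groups indented lines under the last unindented one), so treat it as a starting point rather than a proper audit tool.

```python
def parse_sections(config_text):
    """Group indented lines under the most recent unindented line."""
    sections = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.rstrip()
        if not line or line.lstrip().startswith("!"):
            continue  # skip blank lines and IOS "!" separators
        if not line.startswith(" "):
            current = line
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line.strip())
    return sections


def unsaved_changes(running_text, startup_text):
    """Return {section: [lines]} found in the running config but not the startup config."""
    running = parse_sections(running_text)
    startup = parse_sections(startup_text)
    diff = {}
    for section, lines in running.items():
        saved = startup.get(section)
        if saved is None:
            diff[section] = list(lines)  # whole section was never saved
            continue
        missing = [entry for entry in lines if entry not in saved]
        if missing:
            diff[section] = missing
    return diff


# Hypothetical dumps: port 0/8 was added to VLAN 10 live but never saved.
RUNNING = """\
interface FastEthernet0/7
 switchport access vlan 10
 switchport mode access
!
interface FastEthernet0/8
 switchport access vlan 10
 switchport mode access
"""

STARTUP = """\
interface FastEthernet0/7
 switchport access vlan 10
 switchport mode access
!
interface FastEthernet0/8
"""

for section, lines in unsaved_changes(RUNNING, STARTUP).items():
    print(section)
    for entry in lines:
        print("  unsaved:", entry)
```

Anything this prints is a change that would be lost at the next reboot, which is exactly what happened to Mystery Server's port.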
You can imagine how pissed off I was when I found that out. "I've just wasted a whole two hours troubleshooting an issue that's down to the laziness of another engineer who couldn't be bothered saving changes?"
"Oops."
I'd like you to meet my special little coil of ethernet here, the one that's been pre-stressed. Don't mind me as I wind it around your larynx and slowly squeeze.