Ack. The internet broke this evening; it wasn't very pretty - albeit rather entertaining at times - watching the helpdesk cope. Since I'm not officially part of the helpdesk anymore, I was spared most of the anguish of dealing with the customers, but as tier 2 Network Support I was instead the sole contact between our upstream provider and reliant downstreamers on one side, and our supervisors controlling the floor on the other.
It went from routing problems in Western Australia to mail issues, both of which were wrapped up fairly quickly, but they were only symptoms of the problems to come.
Our ISP - and many like ours - runs on a double-tier authentication system: RADIUS. One machine looks after the actual auth requests (username/password checks, previous online sessions) while the other looks after accounting information (once a user is actually logged in, accounting packets - "start", "stop", "alive" - are generated to keep track of the customer's account for billing and cross-references). If you can't authenticate, you won't get on, and if accounting packets aren't generated correctly, or at all, your session drops off and is automatically terminated some time later. It's possible to bypass the machines in times of great need: an auth-all bypass on our auth RADIUS automatically accepts any user/pwd combo, regardless of whether it's right, while bypassing the accounting RADIUS usually results in free internet, since we can't track when or how they got online.
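For anyone who hasn't had to babysit one of these, here's a toy sketch of that split in plain Python - all names, flags and numbers are made up and have nothing to do with our actual boxes - just to show the shape of it: one machine answers the yes/no auth question (with the auth-all escape hatch), the other just timestamps start/alive/stop packets and reaps sessions that go quiet.

```python
# Toy model of the two-tier split described above. Purely illustrative:
# class names, the auth_all flag and the timeout value are all hypothetical.
import time
from dataclasses import dataclass, field

@dataclass
class AuthServer:
    users: dict                       # username -> password
    auth_all: bool = False            # the "big red handle": accept everyone

    def authenticate(self, username: str, password: str) -> bool:
        if self.auth_all:             # bypass mode: any user/pwd combo gets on
            return True
        return self.users.get(username) == password

@dataclass
class AccountingServer:
    sessions: dict = field(default_factory=dict)  # session_id -> last packet time
    timeout: float = 600.0            # seconds of silence before a session is dropped

    def record(self, session_id: str, packet_type: str) -> None:
        # packet_type is one of "start", "alive", "stop"
        if packet_type == "stop":
            self.sessions.pop(session_id, None)
        else:
            self.sessions[session_id] = time.time()

    def reap_stale(self) -> None:
        # Sessions whose accounting packets stop arriving get terminated later.
        now = time.time()
        for sid, last_seen in list(self.sessions.items()):
            if now - last_seen > self.timeout:
                del self.sessions[sid]
```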
At 1700hrs, our major upstream provider's RADIUS simply disappeared. Like magic. You could ping it, port into it and everything, but interrogating it for RADIUS packets yielded absolutely nothing. Thankfully, being multi-provided, customers using our other, minor upstream provider were unaffected, and our DSL customers had no issues either. But a good 60% of all of our business is run through our major provider's dial-up service.....and all of it disappeared when our auth systems did. So anyone using our dial-up services through that provider was immediately returned auth errors (the infamous Windows 691 error).
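That's the difference between "the host is up" and "the service is answering". If you wanted to check it yourself, a probe along these lines shows how a box can pass a ping yet time out on an actual Access-Request - sketched with the third-party pyrad library; the server name, shared secret and test credentials are placeholders, and the exact calls are from memory, so treat it as a rough guide rather than gospel.

```python
# Rough RADIUS liveness probe: the host answers ping, but does it answer RADIUS?
# Server, secret and test account are placeholders; pyrad also needs a standard
# RADIUS dictionary file on disk (assumed here to be "./dictionary").
import pyrad.packet
from pyrad.client import Client, Timeout
from pyrad.dictionary import Dictionary

client = Client(server="radius.upstream.example.net",
                secret=b"shared-secret",
                dict=Dictionary("dictionary"))
client.timeout = 5        # seconds to wait per attempt
client.retries = 1

req = client.CreateAuthPacket(code=pyrad.packet.AccessRequest)
req["User-Name"] = "probe@your.isp.net"
req["User-Password"] = req.PwCrypt("probe-password")

try:
    reply = client.SendPacket(req)
    print("RADIUS answered with code", reply.code)   # Accept or Reject = alive
except Timeout:
    print("Pingable but silent: no RADIUS response at all")
```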
Oh and how the heavens did drown in the fury of our technicians, and verily the fiery chasms of hell did open to vomit up our customers in an evil mood. :)
An issue of this magnitude usually results in us network reps reaching for the big red handle labelled Auth All - but how can you automatically authenticate packets you don't even receive? No go: we didn't even have that to fall back on.
Now, having been on that same helpdesk myself until recently, I could sympathise with them, having to immediately deal with waves of mad customers demanding to be put back online, regardless of the IVR message management had placed on the incoming call queue (do customers even bother listening to those messages???). It eventually got to the point where many of the technicians automatically trotted out lines much like this as soon as they picked up each call:
Welcome to your.isp.net, and thank you for your patience. If you are calling in regard to authentication issues, or 691 errors, please be advised there is a national outage at the moment with no current ETA. If you still wish to speak to a technician, please remain on the line.
Meanwhile, as the only network rep attempting to liaise with our upstreams, development crew, and network engineers (all at physically separate locations some suburbs away), getting information out to the supervisors on the floor as soon as possible became my priority. But you can't really instil confidence in a floor looking up at you for information when the upstreamer's engineers themselves simply don't know what broke and are still searching for the problem.....for the next 6 hours.
*head-desk head-desk head-desk*
Of course, we might have made it through the night.....if the sheer number of incoming customer calls hadn't completely shat our PABX and led to a very messy telephone system crash. Suddenly over 200 pods on the floor were effectively cut off from the outside world.
*sharpens razor* *poises over wrist*
Ever administered a helpdesk in the peak evening period when it's damn near silent? As much as the techs and salespeople liked the break, it was damn near spooky, I'll tell you. Eventually we got our backup phone system up, but that had no intelligent queue-sort system operative, and it was literally picking up calls from a blind stack, not knowing what department the customer wanted, and copping a lot of customer angst in the meantime. Later attempts to restart, kick-start, curse-start, threaten, cajole, hack, bypass and eventually switch modules couldn't stand up to the traffic, and the phone system kept crashing.....another seven times in total, apparently.
I guess it could have been worse: our customer database could have crashed along with everything else, and then we could have gone completely blind! *sarcastic cheer*
And now I'm home: it's still going on but I can't summon the energy to care, to be honest. Need aspirin. And preferably something very alcoholic.