Cock-upus Tremendumous
Dec. 27th, 2007 11:50 am
Hi [Head of IT in Central-West African country],
Allow me to share with you my analysis of the factors leading to the loss of data from your Exchange server in [country's head office]. As noted previously [and loudly, by you] all e-mails for users in [country's head office] were lost, resulting in significant impact to business in [country].
**Double disk loss**
Last Friday the Exchange server at [country's head office] site experienced a single disk failure. Under normal circumstances, the array would remain online in the event of such a failure.
Almost two months ago [your team] received a [name of monitoring system] alert indicating that the server's array controller had reported one of the disks in the array as missing. This alert was acknowledged by [your team] at [timestamp] and closed a day later at [timestamp], apparently without first rectifying the issue. The missing disk combined with last Friday's failure resulted in the loss of two disks from the array, which is fatal.
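To spell out why the second loss was fatal: RAID 5 keeps one disk's worth of parity, so the array survives exactly one missing disk and no more. A minimal Python sketch of that rule follows. The function name, the status strings and the four-disk count are assumptions made for illustration; they are not taken from the array controller or from the incident itself.

```python
def raid5_status(total_disks: int, missing_disks: int) -> str:
    """Classify a RAID 5 array by how many member disks are missing.

    RAID 5 keeps a single disk's worth of parity, so it tolerates
    exactly one missing disk. Names and strings here are illustrative.
    """
    if total_disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    if not 0 <= missing_disks <= total_disks:
        raise ValueError("missing_disks out of range")
    if missing_disks == 0:
        return "healthy"
    if missing_disks == 1:
        return "degraded"  # still online, but with no redundancy left
    return "failed"        # two or more losses: the array cannot be rebuilt


# The disk count is an assumption; the e-mail does not state it.
print(raid5_status(4, 1))  # the two months after the ignored alert
print(raid5_status(4, 2))  # after last Friday's second loss
```

The point of the sketch is the middle state: a degraded array looks perfectly fine to its users, which is why the closed alert mattered so much.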
**Incorrect drive configuration**
According to [IT department standards document], [Exchange] servers should be built using a RAID5 array for mailbox stores and a separate mirror set for log files. Under this scenario, if a total array failure affects only the logical drive containing the mailbox store, complete recovery of all messages is possible using a recent backup. If the lost array contains only the logs, the mailbox store can usually be recovered with less than 24 hours of lost e-mail.
Unfortunately in this case, the server was built [by you] using a single RAID5 array containing both the mailbox store and log files. [My boss] informs me this was done contrary to his advice at the time.
**Backup media issues**
Due to ongoing issues with the server's tape drive, the last successful backup of this server occurred on December 12th. On the 14th [your team] received a [monitoring system] alert. The corresponding ticket remains open and has not been updated [by your team] at the time of writing.
According to [IT department] guidelines, backup tapes should be removed from the rotation weekly (on Fridays) for archival purposes. Had the guidelines been followed in this case, there would have been a useable backup from the night of the 7th of December.
Because they were not followed, all five tapes in your rotation were overwritten with partial backups and are therefore unusable.
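The effect of pulling the Friday tape can be checked with a small simulation. This is a rough Python sketch under simplifying assumptions that are not in the e-mail: one backup per night, written round-robin across the five tapes, every run after December 12th counted as partial, and the year 2007 taken from the date of this post. The function name and parameters are hypothetical.

```python
from datetime import date, timedelta


def newest_good_backup(start, end, last_good, tapes=5, archive_fridays=False):
    """Return the date of the newest usable backup on hand after `end`, or None.

    Simplified model (assumptions, not site facts): one backup per night,
    written round-robin across `tapes` tapes; every run after `last_good`
    is partial and therefore unusable.
    """
    rotation = [None] * tapes  # each slot holds (backup_date, usable)
    archive = []               # Friday tapes pulled out of the rotation
    slot = 0
    day = start
    while day <= end:
        entry = (day, day <= last_good)
        if archive_fridays and day.weekday() == 4:  # Monday is 0, Friday is 4
            archive.append(entry)   # written once, then never overwritten
        else:
            rotation[slot] = entry  # overwrites whatever the tape held
            slot = (slot + 1) % tapes
        day += timedelta(days=1)
    on_hand = archive + [e for e in rotation if e is not None]
    usable = [d for d, ok in on_hand if ok]
    return max(usable, default=None)


first, last = date(2007, 12, 1), date(2007, 12, 20)  # year assumed from this post
last_ok = date(2007, 12, 12)
print(newest_good_backup(first, last, last_ok))                        # None
print(newest_good_backup(first, last, last_ok, archive_fridays=True))  # 2007-12-07
```

Without archiving, the five tapes only ever hold the five most recent runs, so a week of partial backups is enough to wipe out every good copy. With the Friday tape set aside, the backup from the 7th survives no matter what the rotation does afterwards.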
**Information store backup permissions**
It appears that the information store was not configured properly for backups as per [IT department standards document]. Because the appropriate permissions were not granted to the [backup software] service account on the mailbox store, backups were failing with an access denied error. While backups of the other components of the job were completing successfully, backup tasks were finishing with Failed status.
This issue had been occurring since the server was commissioned [by you] in 2005. Before a server goes into production, [country] should ensure that backups are functioning correctly as per [IT department standards document]. Additionally, I would expect the status of backups to be checked **every day** by [your team's] staff.
Even if the backup media issues were not present at the time of last Friday's failure, we would still not have had a backup of the information store as a result of this problem.
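A daily check does not need to be elaborate. The Python sketch below shows the idea: a job counts as Failed if any one component failed, and the check names the component responsible. The dictionary layout, component names and result strings are hypothetical; real backup software reports results in its own format.

```python
def failed_components(job):
    """Return the names of components in `job` that did not succeed.

    `job` maps a component name to its result string. This layout is
    hypothetical; real backup software reports results in its own format.
    """
    return sorted(name for name, result in job.items() if result != "success")


def job_status(job):
    """A job counts as Failed if any single component failed."""
    return "Failed" if failed_components(job) else "Completed"


# Illustrative data only, shaped like the incident described above.
last_night = {
    "system state": "success",
    "file system": "success",
    "information store": "access denied",
}
print(job_status(last_night))         # Failed
print(failed_components(last_night))  # ['information store']
```

Reporting the failing component matters here: a job that has shown Failed every night since 2005 is easy to tune out unless someone sees that the failing part is the mailbox store itself.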
--
In summary, multiple factors at the [country] level contributed to the unrecoverable loss of data on [server name]. I would recommend that [your team] undertake an audit of all its critical servers to ensure they comply with [IT department standards document] and [another IT department standards document] ASAP.
[I'd further recommend that your entire team is sacked, banned from ever touching a server again, and that they seriously reconsider any intentions they may have of breeding.]
Kind regards,
[Bewildered sysadmin]
Date: 2007-12-27 01:15 am (UTC)
As for backups, those are just a few of the reasons we are moving away from tape backups as fast as we can go.
Date: 2007-12-27 02:20 am (UTC)
because they might just try and blame you for their fuck up, regardless of the fact that you've pinned them down like a fly on flypaper.
Date: 2007-12-27 02:49 am (UTC)
Who doesn't check this stuff every day?! Seriously! I used to do IT for a very small company (140 employees total, across all US sites). We had to hire a 2nd IT person to be onsite because my shift began at 6am, and until around 12pm, I was checking every single backup, switching tapes, and logging everything into a gigantic Excel document.
Date: 2007-12-27 02:59 am (UTC)
The powers that be were attempting to pin this on us (me) somehow; this is my way of recommending an alternate punching bag :)
Date: 2007-12-27 04:41 am (UTC)
*Gets coat, heads for car.*
Date: 2007-12-27 05:50 am (UTC)
Win.
Date: 2007-12-27 06:40 am (UTC)
i can feel the force growing inside me...
... maybe that's too much holiday food
Date: 2007-12-27 10:41 am (UTC)
As for the failed disk for AGES, and the NO BACKUPS EVAH, words fail me. Good to see they were evidently regularly testing their recovery procedures as well. NOT.
Date: 2007-12-27 11:42 am (UTC)
However, in the case of the particular company I work for, using separate arrays doesn't just improve recoverability; it is actually a part of our recovery strategy. It works because the log arrays are the same size as the EDB array, which is more than enough to fit both. This way, if the EDB array dies we restore it to the logs drive, and if we lose the logs we can recreate them on the EDB drive.
If it sounds silly that's because it is. For one, it doesn't help things if your system drive or array controller or, for that matter, anything else dies. But it does cover 2 common points of failure and is a quick way to get back online in places where there are often no IT staff in the same country, or in countries where hardware vendors don't guarantee parts in 4 hours*, or even 4 days, and where deliveries have been known to take upwards of 4 weeks.
(Besides, I wasn't really trying to suggest that a single array diminished recoverability -- I just thought it added impact to list all the ways they fucked up :)
Date: 2007-12-27 03:34 pm (UTC)
(also when reading this, i went omgwtfbbq at the new trainee here at our helpdesk who is shadowing me today. I believe you can imagine the sound my head made mating with the desk when she asked "what are backups?")
Date: 2007-12-27 06:20 pm (UTC)
Your trainee has given me new hope for my abilities, though. Maybe I do have what it takes for an IT position. :-)
(Still going for an EE design job, though. I just like to wonder whether I could hack it as an IT admin.)
Date: 2007-12-27 06:21 pm (UTC)
It gets worse. She also asked me what a hard drive is.
Date: 2007-12-27 06:26 pm (UTC)
And if she's reading this, don't fret, sister! 'Puters are easy to learn, and when you become friends with one, it can bring you so many wondrous pleasures!
Date: 2007-12-27 07:58 pm (UTC)
Backups have been failing since who-knows-when, up until the day after the person in charge of them got canned. Myself and someone else (he took the reins; once the mess is cleared it's my baby) handle them now.
Date: 2007-12-28 05:24 am (UTC)
Tape still has the best price per GB ratio, and running costs (negligible when idle).
For the small business market, external USB/firewire drives Just Work (tm).