techrecovery | Cock-upus Tremendumous

Hi [Head of IT in Central-West African country],

Allow me to share with you my analysis of the factors leading to the loss of data from your Exchange server in [country's head office]. As noted previously [and loudly, by you] all e-mails for users in [country's head office] were lost, resulting in significant impact to business in [country].

**Double disk loss**

Last Friday the Exchange server at [country's head office] site experienced a single disk failure. Under normal circumstances, the array would remain online in the event of such a failure.

Almost two months ago [your team] received a [name of monitoring system] alert indicating that the server's array controller had reported one of the disks in the array as missing. This alert was acknowledged by [your team] at [timestamp] and closed a day later at [timestamp], apparently without first rectifying the issue. The missing disk combined with last Friday's failure resulted in the loss of two disks from the array, which a RAID5 array cannot survive.
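To illustrate why losing a second disk is fatal, here is a minimal Python sketch of single-parity (RAID5-style) recovery. It is an illustration only, not how any particular array controller works; the helper names `xor_blocks` and `rebuild` and the sample data are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe):
    """Rebuild one stripe made of data blocks plus a single parity block.

    Missing blocks are represented as None. One missing block can be
    recomputed by XORing the survivors; two or more cannot.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("two or more blocks missing: stripe is unrecoverable")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired


# Hypothetical stripe: three data blocks and their parity block.
data = [b"mail", b"box!", b"logs"]
stripe = data + [xor_blocks(data)]
```

With one block set to `None`, `rebuild` returns the original stripe. With two set to `None`, it raises, which is the situation the server was in last Friday.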

**Incorrect drive configuration**

According to [IT department standards document], [Exchange] servers should be built using a RAID5 array for mailbox stores and a separate mirror set for log files. Under this scenario, if a total array failure affects only the logical drive containing the mailbox store, complete recovery of all messages is possible using a recent backup. If the lost array contains only the logs, the mailbox store can usually be recovered with less than 24 hours of lost e-mail.

Unfortunately, in this case the server was built [by you] using a single RAID5 array containing both the mailbox store and log files. [My boss] informs me this was done contrary to his advice at the time.

**Backup media issues**

Due to ongoing issues with the server's tape drive, the last successful backup of this server occurred on December 12th. On the 14th [your team] received a [monitoring system] alert. The corresponding ticket remains open and has not been updated [by your team] at the time of writing.

According to [IT department] guidelines, backup tapes should be removed from the rotation weekly (on Fridays) for archival purposes. Had the guidelines been followed in this case, there would have been a useable backup from the night of the 7th of December.

Because they were not followed, every tape in your rotation of five was overwritten with a partial backup and is therefore unusable.

**Information store backup permissions**

It appears that the information store was not configured properly for backups as per [IT department standards document]. Because the appropriate permissions were not granted to the [backup software] service account on the mailbox store, backups were failing with an access denied error. While backups of the other components of the job were completing successfully, backup tasks were finishing with Failed status.

This issue had been occurring since the server was commissioned [by you] in 2005. Before a server goes into production, [country] should ensure that its backups are functioning correctly as per [IT department standards document]. Additionally, I would expect the status of backups to be checked **every day** by [your team's] staff.

Even if the backup media issues were not present at the time of last Friday's failure, we would still not have had a backup of the information store as a result of this problem.
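A daily check along the following lines would have caught the problem. This is only a sketch: the job records, the field names (`name`, `status`, `finished`) and the `backups_needing_attention` helper are hypothetical and do not correspond to [backup software]'s actual reporting interface.

```python
from datetime import datetime, timedelta


def backups_needing_attention(jobs, now, max_age=timedelta(hours=24)):
    """Return (job name, reason) pairs for jobs that failed or are stale.

    Each job is a dict with hypothetical keys:
      name     - label for the backup job
      status   - final status of the most recent run, e.g. "Success"
      finished - datetime at which the most recent run finished
    """
    problems = []
    for job in jobs:
        if job["status"] != "Success":
            # A job that finishes with any other status is treated as failed,
            # even if some of its components completed.
            problems.append((job["name"], "last run finished with status " + job["status"]))
        elif now - job["finished"] > max_age:
            # Successful, but too long ago to count as a current backup.
            problems.append((job["name"], "last successful run is older than " + str(max_age)))
    return problems
```

Anything this returns should become a ticket the same day, rather than an alert that is acknowledged and closed.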

--

In summary, multiple factors at the [country] level contributed to the unrecoverable loss of data on [server name]. I would recommend that [your team] undertake an audit of all its critical servers to ensure they comply with [IT department standards document] and [another IT department standards document] ASAP.

[I'd further recommend that your entire team is sacked, banned from ever touching a server again, and that they seriously reconsider any intentions they may have of breeding.]

Kind regards,

[Bewildered sysadmin]


From:

yanni85.livejournal.com

Ouch.

As for backups, those are just a few of the reasons we are moving away from tape backups as fast as we can go.

From:

jecook

You *did* cc your boss, their boss, and the Big Big IT boss on this, right?

Because they might just try to blame you for their fuck-up, regardless of the fact that you've pinned them down like a fly on flypaper.

From:

valiskeogh.livejournal.com

i suspect the [your team] consisted of the nephew of the VP of marketing who "knows all that computer stuff"

From:

valiskeogh.livejournal.com

on the other hand, i'm crossing my fingers that this was Nigeria and the "significant impact to business" was all replies from american idiots trying to cash in on some deposed dictator's money.

From:

azzy23.livejournal.com

BAH!!! Greatness!

Who doesn't check this stuff every day?! Seriously! I used to do IT for a very small company (140 employees total, across all US sites). We had to hire a 2nd IT person to be onsite because my shift began at 6am, and until around 12pm, I was checking every single backup, switching tapes, and logging everything into a gigantic Excel document.

From:

argonel.livejournal.com

Being evil I would also recommend copying this to the local paper along with an estimate of the money wasted, including the salaries of the entire IT staff since 2005.

From:

rorted.livejournal.com

The closest thing I can compare it with... I once witnessed a terrible accident and gave a statement to the police afterward.

From:

rorted.livejournal.com

Haha, yes of course!

The powers that be were attempting to pin this on us (me) somehow, this is my way of recommending an alternate punching bag :)

From:

rorted.livejournal.com

It was Nigeria, actually.

From:

tecie.livejournal.com

I've heard that argument before -- but what can handle long term storage like tapes in a controlled environment?

From:

kalium.livejournal.com

And alternate employment for some, too.

From:

yanni85.livejournal.com

Fair. Tapes are good for long-term storage. However, on a day-to-day basis they suck. A bit of qualification on this is that I work in the Small Business environment. Most of our networks are single server and those that have more are almost all application servers. If we run a backup in the middle of the day their server slows way down and the client complains. Backup jobs take days to troubleshoot and they keep breaking down. Much better, we've found, is a solution which takes snapshots of the server for immediate backup needs and sends critical data offsite (where I am sure it is backed up by tapes).

From:

tecie.livejournal.com

that makes sense. I've really only worked for midsized to large companies, so tapes tend to win in the economies of scale, as do redundant servers that we can just clone.

From:

thecrazyfinn.livejournal.com

Tape for archival, separate disk array (preferably offsite) for incremental.

From:

dubhain.livejournal.com

Thank you for reminding me. I'm covering for my co-worker who's on vacation and because it isn't normally part of my job, I forgot to set up the backup tapes.

*Gets coat, heads for car.*

From:

ihateemo.livejournal.com

HAHAHAHAHAHAHAHAHA.

Win.

From:

mouser.livejournal.com

"redundant" is a word I'm not allowed to hear. I wrangled a spare network harddrive that gets a copy which goes to tape. Seems to work okay (not great, I admit...)

From:

valiskeogh.livejournal.com

SWISH!

i can feel the force growing inside me...

... maybe that's too much holiday food

From:

trixtah.livejournal.com

Eh, regarding having logs and dbs on a RAID5 array, I don't actually think you lose much in terms of recoverability, unless you have separate RAID controllers. Really, a separate mirror set is more for performance reasons. If there's only a couple of hundred mailboxes with moderate traffic, no biggie. The configuration wouldn't have made much difference with a two-disk failure.

As for the failed disk for AGES, and the NO BACKUPS EVAH, words fail me. Good to see they were evidently regularly testing their recovery procedures as well. NOT.

From:

rorted.livejournal.com

I agree that using different RAID levels for each doesn't add recoverability and is purely to improve the performance of logs vs the EDB.

However, in the case of the particular company I work for, separate arrays don't just improve recoverability; they are actually a part of our recovery strategy. It works because log arrays are the same size as the EDB array, which is more than enough to fit both. This way if the EDB array dies we restore it to the logs drive, and if we lose logs we can recreate them on the EDB drive.

If it sounds silly that's because it is. For one, it doesn't help things if your system drive or array controller or, for that matter, anything else dies. But it does cover 2 common points of failure and is a quick way to get back online in places where there are often no IT staff in the same country, or in countries where hardware vendors don't guarantee parts in 4 hours*, or even 4 days, and where deliveries have been known to take upwards of 4 weeks.

(Besides, I wasn't really trying to suggest that a single array diminished recoverability -- I just thought it added impact to list all the ways they fucked up :)

From:

ptstech.livejournal.com

Man, THAT is a seriously EPIC FAIL. We're talking Death Star-level fail here.

From:

squigit.livejournal.com

Nothing, but that's why you want a d2d2t setup. Keep your most recent week or so's data on spinning disk, and archive off to tape on a routine (but not daily) basis.
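A rough sketch of that split, assuming a seven-day disk window (the window length and the `tier_for` helper are made up for illustration, not taken from any backup product):

```python
from datetime import date, timedelta


def tier_for(backup_date, today, disk_window=timedelta(days=7)):
    """Decide where a backup set should live in a disk-to-disk-to-tape setup.

    Recent sets stay on disk for fast restores; anything older than the
    (assumed) disk window is a candidate for the routine tape archive run.
    """
    return "disk" if today - backup_date <= disk_window else "tape"
```

So a set from two days ago stays on disk, and one from two weeks ago gets queued for the next tape run.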

From:

arabwel.livejournal.com

... woah.

(also when reading this, i went omgwtfbbq at the new trainee here at our helpdesk who is shadowing me today. I believe you can imagine the sound my head made mating with the desk when she asked "what are backups?")

From:

marco262.livejournal.com

I read this community occasionally, mostly to get some chuckles, but I don't read it often because some of this stuff tends to go over my head, lowly EE undergrad that I am.

Your trainee has given me new hope for my abilities, though. Maybe I do have what it takes for an IT position. :-)

(Still going for an EE design job, though. I just like to wonder whether I could hack it as an IT admin.)

From:

arabwel.livejournal.com

Dude, you have more education than I do, I don't even have anything equivalent to a HS diploma :P

It gets worse. she also asked me what's a hard drive.

From:

marco262.livejournal.com

*headdesk* And you let her NEAR a computer?? Quick, quarantine THEN educate. Idiocy can be catastrophic when in large doses near sensitive electronics!

From:

arabwel.livejournal.com

What can I say? I blame the HR! I am not allowed to keep her away from the computer *sigh* the best i can do is hold her hand through everything she does.

From:

marco262.livejournal.com

With any luck, she's not one of the curious types. Make sure she doesn't go near Windows Explorer and the precious system files may yet stay unmolested.

And if she's reading this, don't fret, sister! 'Puters are easy to learn, and when you become friends with one, it can bring you so many wondrous pleasures!

From:

arabwel.livejournal.com

Until she gets a computer of her own, I am watching every move she makes. Unless someone else gets saddled with her. but thankfully anything beyond facebook seems to go beyond her ken...

From:

loosechanj.livejournal.com

And to think, it could have been prevented if they'd just nailed plywood over that damn vent.

From:

toxico.livejournal.com

My company.

Backups have been failing since who-knows-when, up until the day after the person in charge of them got canned. Myself and someone else (he took the reins; once the mess is cleared it's my baby) handle them now.

From:

japester.livejournal.com

If you have to touch your tapes, you will have problems. If you can keep them all in a controlled, robot-managed box then all the problems magically go away.
Tape still has the best price per GB ratio, and the lowest running costs (negligible when idle).

For the small business market, external USB/firewire drives Just Work (tm).