[identity profile] rorted.livejournal.com posting in [community profile] techrecovery
Hi [Head of IT in Central-West African country],

Allow me to share with you my analysis of the factors leading to the loss of data from your Exchange server in [country's head office]. As noted previously [and loudly, by you] all e-mails for users in [country's head office] were lost, resulting in significant impact to business in [country].


**Double disk loss**

Last Friday the Exchange server at [country's head office] site experienced a single disk failure. Under normal circumstances, the array would remain online in the event of such a failure.

Almost two months ago [your team] received a [name of monitoring system] alert indicating that the server's array controller had reported one of the disks in the array as missing. This alert was acknowledged by [your team] at [timestamp] and closed a day later at [timestamp], apparently without first rectifying the issue. The missing disk combined with last Friday's failure resulted in the loss of two disks from the array, which is fatal.
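
To make the arithmetic explicit, here is a minimal sketch of the fault-tolerance rule at work. The table and the function name `array_survives` are illustrative assumptions, not part of any controller's tooling; the only facts relied on are that RAID5 tolerates the loss of one member disk and RAID6 tolerates two.

```python
# Illustrative model only: member disks each RAID level can lose
# before the array goes offline.
FAULT_TOLERANCE = {"RAID5": 1, "RAID6": 2}


def array_survives(level: str, disks_lost: int) -> bool:
    """Return True if the array stays online after losing `disks_lost` disks."""
    return disks_lost <= FAULT_TOLERANCE[level]


# Last Friday's failure alone would have been survivable.
print(array_survives("RAID5", 1))
# The disk reported missing two months ago plus Friday's failure is not.
print(array_survives("RAID5", 2))
```

An unresolved "disk missing" alert therefore means the array is already running with no redundancy left.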


**Incorrect drive configuration**

According to [IT department standards document], [Exchange] servers should be built using a RAID5 array for mailbox stores and a separate mirror set for log files. Under this scenario, if a total array failure affects only the logical drive containing the mailbox store, complete recovery of all messages is possible using a recent backup. If the lost array contains only the logs, the mailbox store can usually be recovered with less than 24 hours of lost e-mail.

Unfortunately in this case, the server was built [by you] using a single RAID5 array containing both the mailbox store and log files. [My boss] informs me this was done contrary to his advice at the time.


**Backup media issues**

Due to ongoing issues with the server's tape drive, the last successful backup of this server occurred on December 12th. On the 14th [your team] received a [monitoring system] alert. The corresponding ticket remains open and has not been updated [by your team] at the time of writing.

According to [IT department] guidelines, backup tapes should be removed from the rotation weekly (on Fridays) for archival purposes. Had the guidelines been followed in this case, there would have been a useable backup from the night of the 7th of December.

Because they were not followed, every one of the five tapes in your rotation was overwritten with a partial backup and is therefore unusable.
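
The effect of the weekly pull can be shown with a rough simulation. Several things here are assumptions made for illustration: the year (2007, taken from the date of this post), jobs running on weeknights only, a simple round-robin over five tapes, and the Friday job being written to a tape that is then archived and replaced. The function name `surviving_good_backups` is hypothetical.

```python
from datetime import date, timedelta


def surviving_good_backups(start, end, last_good, tapes=5, pull_fridays=False):
    """Return the dates of complete backups still on tape after the run on `end`."""
    in_rotation = {}  # tape slot -> (backup date, completed?)
    archived = []     # tapes pulled out of the rotation
    slot = 0
    day = start
    while day <= end:
        if day.weekday() < 5:  # assumption: jobs run Monday to Friday nights
            tape = (day, day <= last_good)  # runs after last_good are partial
            if pull_fridays and day.weekday() == 4:
                archived.append(tape)  # Friday's tape is archived, never reused
            else:
                in_rotation[slot] = tape  # overwrites whatever was on this tape
                slot = (slot + 1) % tapes
        day += timedelta(days=1)
    everything = list(in_rotation.values()) + archived
    return sorted(d for d, completed in everything if completed)


first_run = date(2007, 12, 3)
last_run = date(2007, 12, 20)   # the night before the failure
last_good = date(2007, 12, 12)  # last successful backup

print(surviving_good_backups(first_run, last_run, last_good))
print(surviving_good_backups(first_run, last_run, last_good, pull_fridays=True))
```

Under these assumptions the first call finds no complete backup left on any tape, while the second still has the tape from the 7th.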


**Information store backup permissions**

It appears that the information store was not configured properly for backups as per [IT department standards document]. Because the appropriate permissions were not granted to the [backup software] service account on the mailbox store, backups were failing with an access denied error. While backups of the other components of the job were completing successfully, backup tasks were finishing with Failed status.

This issue had been occurring since the server was commissioned [by you] in 2005. Before going into production, the [country] should ensure that backups are functioning correctly as per [IT department standards document]. Additionally, I would expect the status of backups to be checked **every day** by [your team's] staff.

Even if the backup media issues were not present at the time of last Friday's failure, we would still not have had a backup of the information store as a result of this problem.
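
As a sketch of what a daily status check amounts to: a job counts as completed only if every component in it succeeded, so one component failing with access denied fails the whole job even when everything else is fine. The component names, result strings, and function names below are hypothetical and do not come from [backup software].

```python
# Hypothetical component names and result strings, for illustration only.
def job_status(component_results: dict) -> str:
    """A job is Completed only when every component succeeded."""
    if all(result == "ok" for result in component_results.values()):
        return "Completed"
    return "Failed"


def failed_components(component_results: dict) -> list:
    """List the components that need attention, in alphabetical order."""
    return sorted(name for name, result in component_results.items() if result != "ok")


nightly = {
    "system state": "ok",
    "file system": "ok",
    "mailbox store": "access denied",
}

print(job_status(nightly))
print(failed_components(nightly))
```

A check like this only helps if someone reads the result each morning and acts on it.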

--

In summary, multiple factors at the [country] level contributed to the unrecoverable loss of data on [server name]. I would recommend that [your team] undertake an audit of all its critical servers to ensure they comply with [IT department standards document] and [another IT department standards document] ASAP.

[I'd further recommend that your entire team is sacked, banned from ever touching a server again, and that they seriously reconsider any intentions they may have of breeding.]

Kind regards,

[Bewildered sysadmin]

Date: 2007-12-27 01:15 am (UTC)
From: [identity profile] yanni85.livejournal.com
Ouch.

As for backups, those are just a few of the reasons we are moving away from tape backups as fast as we can go.

Date: 2007-12-27 02:20 am (UTC)
From: [personal profile] jecook
You *did* cc your boss, their boss, and the Big Big IT boss on this, right?

because they might just try and blame you for their fuck up, regardless of the fact that you've pinned them down like a fly on flypaper.

Date: 2007-12-27 02:46 am (UTC)
From: [identity profile] valiskeogh.livejournal.com
i suspect the [your team] consisted of the nephew of the VP of marketing who "knows all that computer stuff"

Date: 2007-12-27 02:48 am (UTC)
From: [identity profile] valiskeogh.livejournal.com
on the other hand, i'm crossing my fingers that this was Nigeria and the "significant impact to business" was all replies from american idiots trying to cash in on some deposed dictator's money.

Date: 2007-12-27 02:49 am (UTC)
From: [identity profile] azzy23.livejournal.com
BAH!!! Greatness!

Who doesn't check this stuff every day?! Seriously! I used to do IT for a very small company (140 employees total, across all US sites). We had to hire a 2nd IT person to be onsite because my shift began at 6am, and until around 12pm, I was checking every single backup, switching tapes, and logging everything into a gigantic Excel document.

Date: 2007-12-27 02:54 am (UTC)
From: [identity profile] argonel.livejournal.com
Being evil I would also recommend copying this to the local paper along with an estimate of the money wasted, including the salaries of the entire IT staff since 2005.

Date: 2007-12-27 03:46 am (UTC)
From: [identity profile] tecie.livejournal.com
I've heard that argument before -- but what can handle long term storage like tapes in a controlled environment?

Date: 2007-12-27 03:49 am (UTC)
From: [identity profile] kalium.livejournal.com
And alternate employment for some, too.

Date: 2007-12-27 03:51 am (UTC)
From: [identity profile] yanni85.livejournal.com
Fair. Tapes are good for long-term storage. However, on a day-to-day basis they suck. A bit of qualification on this is that I work in the Small Business environment. Most of our networks are single server and those that have more are almost all application servers. If we run a backup in the middle of the day their server slows way down and the client complains. Backup jobs take days to troubleshoot and they keep breaking down. Much better, we've found, is a solution which takes snapshots of the server for immediate backup needs and sends critical data offsite (where I am sure it is backed up by tapes).

Date: 2007-12-27 04:35 am (UTC)
From: [identity profile] tecie.livejournal.com
that makes sense. I've really only worked for midsized to large companies, so tapes tend to win in the economies of scale, as do redundant servers that we can just clone.

Date: 2007-12-27 04:35 am (UTC)
From: [identity profile] thecrazyfinn.livejournal.com
Tape for archival, separate disk array (preferably offsite) for incremental.

Date: 2007-12-27 04:41 am (UTC)
From: [identity profile] dubhain.livejournal.com
Thank you for reminding me. I'm covering for my co-worker who's on vacation and because it isn't normally part of my job, I forgot to set-up the backup tapes.

*Gets coat, heads for car.*

Date: 2007-12-27 05:50 am (UTC)
From: [identity profile] ihateemo.livejournal.com
HAHAHAHAHAHAHAHAHA.

Win.

Date: 2007-12-27 06:19 am (UTC)
From: [identity profile] mouser.livejournal.com
"redundant" is a word I'm not allowed to hear. I wrangled a spare network harddrive that gets a copy which goes to tape. Seems to work okay (not great, I admit...)

Date: 2007-12-27 06:40 am (UTC)
From: [identity profile] valiskeogh.livejournal.com
SWISH!

i can feel the force growing inside me...

... maybe that's too much holiday food

Date: 2007-12-27 10:41 am (UTC)
From: [identity profile] trixtah.livejournal.com
Eh, regarding having logs and dbs on a RAID5 array, I don't actually think you lose much in terms of recoverability, unless you have separate RAID controllers. Really, a separate mirror set is more for performance reasons. If there's only a couple of hundred mailboxes with moderate traffic, no biggie. The configuration wouldn't have made much difference with a two-disk failure.

As for the failed disk for AGES, and the NO BACKUPS EVAH, words fail me. Good to see they were evidently regularly testing their recovery procedures as well. NOT.

Date: 2007-12-27 02:18 pm (UTC)
From: [identity profile] ptstech.livejournal.com
Man, THAT is a seriously EPIC FAIL. We're talking Death Star-level fail here.

Date: 2007-12-27 02:43 pm (UTC)
From: [identity profile] squigit.livejournal.com
Nothing, but that's why you want a d2d2t (disk-to-disk-to-tape) setup. Keep your most recent week or so's data on spinning disk, and archive off to tape on a routine (but not daily) basis.

Date: 2007-12-27 03:34 pm (UTC)
From: [identity profile] arabwel.livejournal.com
... woah.

(also when reading this, i went omgwtfbbq at the new trainee here at our helpdesk who is shadowing me today. I believe you can imagine the sound my head made mating with the desk when she asked "what are backups?")

Date: 2007-12-27 06:20 pm (UTC)
From: [identity profile] marco262.livejournal.com
I read this community occasionally, mostly to get some chuckles, but I don't read it often because some of this stuff tends to go over my head, lowly EE undergrad that I am.

Your trainee has given me new hope for my abilities, though. Maybe I do have what it takes for an IT position. :-)

(Still going for an EE design job, though. I just like to wonder whether I could hack it as an IT admin.)

Date: 2007-12-27 06:21 pm (UTC)
From: [identity profile] arabwel.livejournal.com
Dude, you have more education than I do, I don't even have anything equivalent to a HS diploma :P

It gets worse. she also asked me what's a hard drive.

Date: 2007-12-27 06:23 pm (UTC)
From: [identity profile] marco262.livejournal.com
*headdesk* And you let her NEAR a computer?? Quick, quarantine THEN educate. Idiocy can be catastrophic when in large doses near sensitive electronics!

Date: 2007-12-27 06:24 pm (UTC)
From: [identity profile] arabwel.livejournal.com
What can I say? I blame the HR! I am not allowed to keep her away from the computer *sigh* the best i can do is hold her hand through everything she does.

Date: 2007-12-27 06:26 pm (UTC)
From: [identity profile] marco262.livejournal.com
With any luck, she's not one of the curious types. Make sure she doesn't go near Windows Explorer and the precious system files may yet stay unmolested.

And if she's reading this, don't fret, sister! 'Puters are easy to learn, and when you become friends with one, it can bring you so many wondrous pleasures!

Date: 2007-12-27 06:27 pm (UTC)
From: [identity profile] arabwel.livejournal.com
Until she gets a computer of her own, I am watching every move she makes. Unless someone else gets saddled with her. but thankfully anything beyond facebook seems to go beyond her ken...

Date: 2007-12-27 06:53 pm (UTC)
From: [identity profile] loosechanj.livejournal.com
And to think, it could have been prevented if they'd just nailed plywood over that damn vent.

Date: 2007-12-27 07:58 pm (UTC)
From: [identity profile] toxico.livejournal.com
My company.

Backups have been failing since who-knows-when, up until the day after the person in charge of them got canned. Myself and someone else (he took the reins; once the mess is cleared it's my baby) handle them now.

Date: 2007-12-28 05:24 am (UTC)
From: [identity profile] japester.livejournal.com
If you have to touch your tapes, you will have problems. If you can keep them all in a controlled, robot-managed box then all the problems magically go away.
Tape still has the best price per GB, and the lowest running costs (negligible when idle).

For the small business market, external USB/firewire drives Just Work (tm).
