[identity profile] lions-tambua.livejournal.com posting in [community profile] techrecovery
Problem1: DDS4 tape drive rips tapes apart. The tape is ejected correctly, but when you pull it out, you see that the tape is ripped in two pieces. (The cartridge is not damaged, only the tape.)


Problem2: Red Hat EL Samba server with dual onboard Gbit Broadcom NICs and 4x78G RAID 10 always crashes under HEAVY disk access over the network.


Problem3: Windows 2003 SBS. Black screen. Hardware diagnostic LEDs show "Suspend to RAM" after about 2h of uptime (1.5h to 2.5h).
If you stay in front of the system, you see a message that the system is shutting down before "Suspend to RAM" is shown and the monitor goes black.


Problem4: Access to hard drive (Seagate Cheetah 36LP) sporadically not possible. Hardware diagnostic shows no errors. The HD is sometimes 'offline'. After a reboot of the system, the HD is available again.


Problem5: Server's integrated motherboard hardware log shows 25 single-bit errors on a 1GB ECC RAM module during the last month. Sporadically, not all of them at once.

What are the solutions? *G*

EDIT: OK, I should have said for Problem3 that power management in the BIOS AND the OS is totally deactivated.

Date: 2005-11-30 09:59 am (UTC)
From: [identity profile] canthlian.livejournal.com
1) Blame the user and send them to the manufacturer.
2) Blame the internet provider and transfer to another department.
3) Blame the system installer and transfer to another department.
4) Blame the software and transfer to another department.
5) Blame the hardware and transfer to another department.

... what? >.>

Date: 2005-11-30 10:25 am (UTC)
From: [identity profile] canthlian.livejournal.com
Well... assuming that, here's a couple of guesses.

1) Most likely a problem with the ejection mechanism. Return the drive to the manufacturer for checking and repair, and either get a tape repair kit, or look for a professional tape repair company.

2) Most likely out-of-date firmware. Update firmware on all areas, check if system still has problem. If so, temporarily replace onboard NICs with carded NICs to attempt to isolate problem. If problem persists, it's a problem with the RAID. Unfortunately, I have almost no experience with that, so there my guess ends. :P

The remaining three, I don't have much idea on yet. I shall Ponder.

Date: 2005-11-30 11:32 am (UTC)
From: [identity profile] spitefulcrow.livejournal.com
For #2:
Give me the kernel version of that system. A lot of people have been experiencing crashes under heavy system load recently, and everyone over at the Gentoo support forums is obsessed with the idea that it's caused by X.Org, which is BS because I've had my desktop go kaput in the middle of a Portage sync (think 200k itty-bitty files being updated over rsync if you're not familiar with Gentoo) in console mode, and my friend runs a Linux (Debian?) webserver that crashes like this too.
I have noticed that it tends to correlate with patterns of heavy network AND disk usage on my system.
Nobody has a fix yet, though. Revert back to something around 2.6.8, if at all possible.
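
For anyone comparing notes on kernel versions, here is a minimal sketch (Python, not something from this thread) that pulls the numeric part out of a release string so it can be compared against 2.6.8. The function name `parse_kernel_release` is made up for illustration, and it assumes the string starts with the usual `major.minor[.patch]` numbers.

```python
import re

def parse_kernel_release(release):
    """Return the leading numeric version of a release string as a tuple of ints.

    Hypothetical helper: assumes the string begins with major.minor[.patch],
    followed by any distro-specific suffix.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised release string: %r" % release)
    # Drop the patch group if it was absent, convert the rest to integers.
    return tuple(int(part) for part in match.groups() if part is not None)
```

On a live box the input would be whatever `uname -r` (or Python's `platform.release()`) reports; tuples compare element by element, so `parse_kernel_release(...) > (2, 6, 8)` tells you whether the running kernel is newer than the one suggested above.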

Date: 2005-11-30 12:29 pm (UTC)
From: [identity profile] the-s-guy.livejournal.com
1. Swap out the drive. Confirm new drive does not shred tapes. Gut original drive for use as fake component corpses next time management demands to see a physical result for a software fix.

2. Swap drive, observe result. Try equivalent NICs, observe result. Degrade OS, observe result. If solution is still not clear, Google for more information. Yes, it could be firmware.

3. Turn off powersave options in BIOS and/or OS.

4. Replace drive and observe result. Problem may lie with drive, dodgy bus cable or interface, power to motherboard or drive, OS, odd driver software, the cleaner unplugging the drive rack power, you name it.

5. Swap module with known working one from another server. Observe logs from both servers to see where problem appears to originate. Take appropriate action accordingly.

Of course, most of these assume you have access to a pool of equivalent hardware.

I'm a little rusty, but...

Date: 2005-11-30 01:48 pm (UTC)
From: [identity profile] coyoteden.livejournal.com
1) You say "rips tapes apart" implying more than one. So it's not a bad tape. There could be grease or some kind of gunk in the tape path. If cleaning doesn't fix it, just replace the drive. Mechanical problems that destroy expensive media aren't worth dicking around with.

2) Busmastering or PCI INT conflicts. Look up the PCI IRQ routing table and see if a NIC isn't sharing with the RAID controller. Also make sure options like "Improve PCI performance" or "PCI wait state" are all at the BIOS defaults: You're not going to overclock the damn server!

3) Something is triggering an OS shutdown if you actually see the "system is shutting down now" before it shuts down Windows and then goes into hardware sleep. The STR is an anomaly, but some systems indicate STR when they have been soft powered-off. Still, a crash in a critical system process would cause a reboot, and a kernel failure would BSOD and reboot... not shut down. Is there a task or something that is actually triggering a shutdown? This could be a thermal shutdown if ACPI monitoring is enabled... it could even be a shorting soft power switch!

4) SCSI termination, connector, or cable. A drive diagnostic might not reveal that kind of thing, but the fact that it comes back after a SCSI init gives me a clue.

5) Marginal stick of RAM, although this is exactly the kind of thing ECC is designed to not even break a sweat over. Reseat the RAM or move it to a different slot if possible. If it continues, swap the RAM. If it still continues, turn off the nearby particle accelerator.
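
For point 2, on a Linux box one quick way to look for IRQ sharing is `/proc/interrupts`, where devices that share a line show up comma-separated at the end of the row. What follows is only a rough sketch in Python: the function name `shared_irqs` and the sample text are invented for illustration (the device names are not taken from the poster's server), and it assumes device names contain no commas or spaces.

```python
def shared_irqs(text):
    """Map IRQ label -> list of device names for rows listing more than one device."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: one column per CPU
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        # Split off the per-CPU counters; whatever is left is the description.
        fields = rest.split(None, ncpu)
        if len(fields) <= ncpu:
            continue                      # counter-only row, nothing to inspect
        desc = fields[ncpu]
        if "," not in desc:
            continue                      # only one device on this line
        parts = [p.strip() for p in desc.split(",")]
        # The first chunk still carries the controller type; keep the last word.
        parts[0] = parts[0].split()[-1]
        shared[label.strip()] = parts
    return shared

# Hypothetical sample in the 2.6-era layout, made up for this example.
SAMPLE = """\
           CPU0       CPU1
  0:   12345678          0    IO-APIC-edge  timer
 16:     123456          0   IO-APIC-level  eth0, aacraid
 17:       5000          0   IO-APIC-level  eth1
NMI:          0          0
"""

if __name__ == "__main__":
    for irq, devices in shared_irqs(SAMPLE).items():
        print("IRQ", irq, "is shared by", ", ".join(devices))
```

On the real machine you would pass in `open("/proc/interrupts").read()` instead of `SAMPLE`. A NIC and the RAID controller turning up on the same line would be the thing to chase.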

Date: 2005-11-30 02:00 pm (UTC)
From: [identity profile] spitefulcrow.livejournal.com
I wish the solution were as easy as your mysteriously deleted comment made it sound.
Unfortunately, my system has a Linksys LNE100TX (Davicom Tulip-based) NIC, so we're looking at an issue somewhere else in Linux.

Date: 2005-11-30 03:25 pm (UTC)
From: [identity profile] the-paco.livejournal.com
1: yank it and replace it. Depending on workload/boredom/warranty, take it apart and find what's wrong with it/destroy it.

2: No real experience with El Samba. Update all drivers if possible, google error messages if possible, try with another set of NICs.

3: Is it SUPPOSED to be suspending to RAM? Why suspend at all? Just set the drives to spin down after 2 hours of inactivity, and let it run its little heart out if it wants to. Disable auto shutdown/suspend in OS and in BIOS. I suppose if you want to insist upon it doing so, update everything, and disable only the BIOS suspend/shutdown options, as it's likely messing with Windows. Are there any apps running that would force it down like that? Start with bare bones to see, no apps. (no hands-on experience with win2k3 either)

4. Hard drive spinning down during inactivity? Disable all Windows power saving options. Update BIOS, drivers, OS, make sure the drive is in correct priority on the SCSI chain and nothing else is on the chain messing it up. If possible isolate it to its own port. Is access from net, terminal, or system itself? Does issue occur with any other devices on the chain or array or on the system? Does replacing the drive fix the issue?

5. Replace the RAM. Does the issue recur? Do the other ram modules encounter any issue? If not it's either the slot or the RAM. If yes, it may be an OS issue, either the RAM size is miscounted by OS or being assigned incorrectly by OS or programs.

Date: 2005-11-30 03:30 pm (UTC)
From: [identity profile] the-paco.livejournal.com
Post more of these! Others, too! If I want deeper in this industry, I need to be given access to this kind of stuff, and nobody I work with uses it. Being able to go into an interview and start rattling off about Win2k3 after a few months of brain stretching in here would be nice.

Date: 2005-11-30 03:43 pm (UTC)
jecook: (Default)
From: [personal profile] jecook
These appear to all be hardware issues except #3.

1: RNR tape drive. (RNR = Remove 'n Replace)
2: PCI conflicts, solution was posted earlier about fiddling with the BIOS.
3: For dog's sake, turn off all power management in the BIOS and the OS! Not even laptops still do power management right in these enlightened times! (well, except macs, but there's a reason for that...)
4: sounds like the drive might be flaking out. swap it with a known good one and see if it continues. if it does, it's the controller.
5: ECC ram is correcting those errors. solutions listed in above comments.

Date: 2005-11-30 03:43 pm (UTC)
jecook: (Moderator)
From: [personal profile] jecook
This is a type of post that I approve of. Let's see more of these!

Date: 2005-11-30 05:18 pm (UTC)
From: [identity profile] tertiumquid.livejournal.com
If all from one site, call a Priest.
Actually, I have recommended that to at least one customer. :)

Date: 2005-11-30 05:32 pm (UTC)
From: [identity profile] geekgrrl-ca.livejournal.com
All from the same place? replace the user.

Date: 2005-11-30 06:45 pm (UTC)
From: [identity profile] archatos.livejournal.com
Problem 1:
Would probably end up with replacing the drive. Or maybe see if they use some kind of el-cheapo tapes imported from a child labour factory in Western Mongolia.

Problem 2:
Haven't done that stuff enough to know straight off the bat.

Problem 3:
Disable all power management.

Problem 4:
Could be a lot of stuff. Would probably suspect controller issues, could also be quite a bit of software issues.

Problem 5:
Try re-seating RAM chip. Try another known good chip. Could also be the slot or OS.

Date: 2005-11-30 06:51 pm (UTC)
From: [identity profile] redqueenmeg.livejournal.com
I have too. To the customers who insist that their portable computers will not turn off, NO MATTER WHAT.

"And you have removed the power cord?"
"Yes! But there's a battery! So it will never turn off!"
"And what happens if you remove the battery and the power cord at the same time?"

Many people don't listen to this and insist that even if the cord & battery are removed, the computer will stay on.

Hence, a priest.

Date: 2005-11-30 08:28 pm (UTC)
From: [identity profile] major-error.livejournal.com
Problem 1) Definite hardware fault. warranty replacement if possible.

Problem 2) not sure. Other comments suggest some sort of kernel-level optimizing issue? *nix experience lacking... :(

Problem 3) Windows sucks at suspend/hibernation. Check BIOS--look for Suspend in the APM config (my old Epox workstation board has S1 & S3/STR settings--set to S1. STR= suspend to ram) Check windows power settings. Make power saving impossible ;)

Problem 4) (*lots* of homebrew SCSI exp...) cable-length? Termination? ID collision? (if a windows system, bet there's a ton of errors in the system log saying error in driver servicing that chain)

Problem 5) high system temp? faulty connection?
jjjiii: It's pug! (Default)
From: [personal profile] jjjiii
We always flame people when they ask for help with problems, but you fooled everyone by making it sound like it was a contest. You're cleverer than all of us, except me, who isn't falling for it! Nice goin!
From: [identity profile] ace-brickman.livejournal.com
yes they're technical support questions, but they're not about how to switch applications when the mouse isn't working.. They're pertinent to "the industry" and seem to be basic enterprise/corporate solutions.

I'm sure the intent was to see if anyone had a different angle on a solution i.e. "choose your adventure" books. I was kinda shocked at the response given prior (http://www.livejournal.com/community/techsupport/883616.html) posts (http://www.livejournal.com/community/techsupport/874019.html) in this community. These didn't insult my intelligence, but I don't know if it follows the FM. The mods got involved, so I guess it was legally within the written rules (it was posted as a game rather than outright questions) although possibly against the assumed rules (they were still technical problems that needed to [and probably got] fixed)..

IANAM, I'm rambling, and I apologize.

Date: 2005-12-01 05:28 am (UTC)
From: [identity profile] japester.livejournal.com

  1. DDS4 drive:
    I hope that drive is still under warranty :) it sounds like it needs replacing
  2. RHLinux crashing:
    get some real network ports off that motherboard - I've never found on board NICs reliable
  3. Sleeping 2k3 server:
    disable the sleep options in the windows control panel. ie say 'i'm a server, not a workstation or a laptop'
  4. intermittent hard drive:
    weird? dodgy controller board on the hard drive?
    dodgy IDE controller?
    update the windows motherboard drivers?
  5. ECC ram errors:
    dodgy RAM, replace it.


Date: 2005-12-02 03:15 am (UTC)
From: [identity profile] theogrin.livejournal.com
I'm not a guru, by any means, so these are most likely wrong -- but I figure I might as well give it a try.

1. Kill the equipment and replace it. If it's tearing tapes apart, there /might/ be a buildup of dust or odds and ends in the machine; if you don't want to replace it and have the know-how to take it apart (and put it back together afterwards), see if you can clean the interior -- then fill a blank with garbage of whatever sort to make sure it's not bad.

2. A crash on heavy disk access might indicate a number of issues; if the RAID array is intended to speed access, one of the drives most likely is corrupt. The long way...run a sanity test on each individual drive; if one fails, then check for media errors. Replace it ASAP. This assumes that heavy use merely triggers the law of averages, though; check the hardware to ensure that all of the hard drives are connected properly, and if they're not, check the RAID array with four different drives and hammer it.

3. Keyboard short, power supply going 'click', BIOS shutdown from overheating...there are more possibilities than can reasonably be accounted for with the information given. Although the power management is disabled, check for other factors, including viruses in the system.

4. Once again, check to make sure that the hard drive is connected properly on both power and motherboard. After it shuts itself off, check the Device Manager to see if the system actually sees it; if not, check power management settings for it. (This might only be the case if the hard drive is idle for several minutes or hours.) Check with a separate hard drive, again, to make sure that it's not just the hardware itself.

5. Run memtest86, swap the slot it's in, run memtest again, swap the stick, run it /again/...repeat until you isolate the problem. Likely the stick rather than the board, but you can never tell until you test.
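
To go with the swap-and-retest loop in point 5: if the server happens to run Linux with an EDAC driver loaded (an assumption, the original post doesn't say what OS the box is on), the kernel exposes corrected-error counters under `/sys/devices/system/edac/mc/`, one `ce_count` file per memory controller. This is a minimal Python sketch, with a made-up function name `read_ce_counts`, that reads them so you can note the numbers before and after each swap.

```python
import os

def read_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: corrected_error_count} from EDAC sysfs.

    Returns an empty dict when the directory is missing, e.g. no EDAC
    driver loaded or not a Linux system.
    """
    counts = {}
    if not os.path.isdir(root):
        return counts
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry, "ce_count")
        # Controllers appear as mc0, mc1, ...; skip anything else in the directory.
        if entry.startswith("mc") and os.path.isfile(path):
            with open(path) as handle:
                counts[entry] = int(handle.read().strip())
    return counts

if __name__ == "__main__":
    print(read_ce_counts())
```

If the count keeps climbing on the same controller after the stick has moved to another slot, that points more at the stick than the board, which matches the guess above.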
From: [identity profile] ihateemo.livejournal.com
This guy is a regular and we know he's not a dumbass asking how to make the internet work. We approve of this post.
