Tech-Support Test
Nov. 30th, 2005 10:09 am
Problem 1: A DDS4 tape drive rips tapes apart. The tape is ejected correctly, but when you pull it out you see that the tape is ripped in two. (The cartridge is not damaged, only the tape.)
Problem 2: A Red Hat EL Samba server with dual onboard Gbit Broadcom NICs and a 4x78G RAID 10 always crashes under HEAVY disk access over the network.
Problem 3: Windows 2003 SBS. Black screen. The hardware diagnostic LEDs show "Suspend to RAM" after about 2h of uptime (1.5h to 2.5h).
If you stay in front of the system, you see a message that the system is shutting down before "Suspend to RAM" is shown and the monitor goes black.
Problem 4: Access to a hard drive (Seagate Cheetah 36LP) is sporadically not possible. Hardware diagnostics show no errors. The HD is sometimes 'offline'. After a reboot of the system, the HD is available again.
Problem 5: The server's integrated motherboard hardware log shows 25 single-bit errors on a 1GB ECC RAM module during the last month. They occurred sporadically, not all at once.
What are the solutions? *G*
EDIT: OK, I should have said for Problem 3 that power management is totally deactivated in both the BIOS AND the OS.
no subject
Date: 2005-11-30 09:30 am (UTC)
I'll post the answers next Monday ;)
no subject
Date: 2005-11-30 09:59 am (UTC)
2) Blame the internet provider and transfer to another department.
3) Blame the system installer and transfer to another department.
4) Blame the software and transfer to another department.
5) Blame the hardware and transfer to another department.
... what? >.>
no subject
Date: 2005-11-30 10:00 am (UTC)
Still, the question assumed that YOU are the one responsible for this issue *G*
no subject
Date: 2005-11-30 10:25 am (UTC)
1) Most likely a problem with the ejection mechanism. Return the drive to the manufacturer for checking and repair, and either get a tape repair kit, or look for a professional tape repair company.
2) Most likely out-of-date firmware. Update firmware on all areas, check if system still has problem. If so, temporarily replace onboard NICs with carded NICs to attempt to isolate problem. If problem persists, it's a problem with the RAID. Unfortunately, I have almost no experience with that, so there my guess ends. :P
The remaining three, I don't have much idea on yet. I shall Ponder.
no subject
Date: 2005-11-30 11:32 am (UTC)
Give me the kernel version of that system. A lot of people have been experiencing crashes under heavy system load recently, and everyone over at the Gentoo support forums is obsessed with the idea that it's caused by X.Org, which is BS because I've had my desktop go kaput in the middle of a Portage sync (think 200k itty-bitty files being updated over rsync if you're not familiar with Gentoo) in console mode, and my friend runs a Linux (Debian?) webserver that crashes like this too.
I have noticed that it tends to correlate with patterns of heavy network AND disk usage on my system.
Nobody has a fix yet, though. Revert back to something around 2.6.8, if at all possible.
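For comparing a box against the 2.6.8 mark mentioned above, here is a minimal Python sketch that turns a kernel release string (the kind `uname -r` prints) into a comparable tuple. The function names and the sample release strings are made up for illustration, and it only looks at the leading numeric part, so distro suffixes are ignored.

```python
import re


def parse_kernel_release(release):
    """Return the leading numeric part of a release string as a 3-tuple of ints."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # Missing components (e.g. "6.1") are treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def is_newer_than(release, baseline=(2, 6, 8)):
    """True if the release is strictly newer than the baseline version."""
    return parse_kernel_release(release) > baseline
```

On a live system the string could come from `platform.release()` in the standard library; anything reported as newer than the baseline would be a candidate for the downgrade test.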
no subject
Date: 2005-11-30 12:29 pm (UTC)
2. Swap drive, observe result. Try equivalent NICs, observe result. Downgrade OS, observe result. If solution is still not clear, Google for more information. Yes, it could be firmware.
3. Turn off powersave options in BIOS and/or OS.
4. Replace drive and observe result. Problem may lie with drive, dodgy bus cable or interface, power to motherboard or drive, OS, odd driver software, the cleaner unplugging the drive rack power, you name it.
5. Swap module with known working one from another server. Observe logs from both servers to see where problem appears to originate. Take appropriate action accordingly.
Of course, most of these assume you have access to a pool of equivalent hardware.
I'm a little rusty, but...
Date: 2005-11-30 01:48 pm (UTC)
2) Busmastering or PCI INT conflicts. Look up the PCI IRQ routing table and see whether a NIC is sharing an IRQ with the RAID controller. Also make sure options like "Improve PCI performance" or "PCI wait state" are all at the BIOS defaults: You're not going to overclock the damn server!
3) Something is triggering an OS shutdown if you actually see the "system is shutting down now" before it shuts down Windows and then goes into hardware sleep. The STR is an anomaly, but some systems indicate STR when they have been soft powered-off. Still, a crash in a critical system process would cause a reboot, and a kernel failure would BSOD and reboot... not shut down. Is there a task or something that is actually triggering a shutdown? This could be a thermal shutdown if ACPI monitoring is enabled... it could even be a shorting soft power switch!
4) SCSI termination, connector, or cable. A drive diagnostic might not reveal that kind of thing, but the fact that it comes back after a SCSI init gives me a clue.
5) Marginal stick of RAM, although this is exactly the kind of thing ECC is designed to not even break a sweat over. Reseat the RAM or move it to a different slot if possible. If it continues, swap the RAM. If it still continues, turn off the nearby particle accelerator.
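For the IRQ sharing check in 2), on a Linux box the quickest look is /proc/interrupts. Below is a rough Python sketch that lists the IRQ lines with more than one device on them. It assumes the usual layout (IRQ number, per-CPU counters, controller type, then a comma-separated device list) and device names without spaces; the function name is made up for illustration.

```python
def shared_irqs(interrupts_text):
    """Map IRQ number -> device names for every IRQ line shared by 2+ devices."""
    shared = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # CPU header line, or named rows such as NMI/ERR
        if "," not in rest:
            continue  # only one device on this line
        parts = [part.strip() for part in rest.split(",")]
        # The first part still carries the counters and controller type;
        # the device name is assumed to be its last word.
        words = parts[0].split()
        if not words:
            continue
        shared[int(label)] = [words[-1]] + parts[1:]
    return shared
```

Feed it the contents of /proc/interrupts, e.g. `shared_irqs(open("/proc/interrupts").read())`. A NIC and the RAID driver showing up in the same list would be the thing to chase first.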
no subject
Date: 2005-11-30 02:00 pm (UTC)
Unfortunately, my system has a Linksys LNE100TX (Davicom Tulip-based) NIC, so we're looking at an issue somewhere else in Linux.
no subject
Date: 2005-11-30 02:09 pm (UTC)
lsmod should show you :)
no subject
Date: 2005-11-30 03:25 pm (UTC)
2: No real experience with EL Samba. Update all drivers if possible, google error messages if possible, try with another set of NICs.
3: Is it SUPPOSED to be suspending to RAM? Why suspend at all? Just set the drives to spin down after 2 hours inactivity, and let it run its little heart out if it wants to. Disable auto shutdown/suspend in OS and in BIOS. I suppose if you want to insist upon it doing so, update everything, and disable only the BIOS suspend/shutdown options, as it's likely messing with Windows. Are there any apps running that would force it down like that? Start with bare bones to see, no apps. (no hands on experience with win2k3 either)
4. Hard drive spinning down during inactivity? Disable all Windows power saving options. Update BIOS, drivers, OS, make sure the drive is in correct priority on the SCSI chain and nothing else is on the chain messing it up. If possible isolate it to its own port. Is access from net, terminal, or system itself? Does issue occur with any other devices on the chain or array or on the system? Does replacing the drive fix the issue?
5. Replace the RAM. Does the issue recur? Do the other RAM modules encounter any issue? If not it's either the slot or the RAM. If yes, it may be an OS issue, either the RAM size is miscounted by OS or being assigned incorrectly by OS or programs.
no subject
Date: 2005-11-30 03:30 pm (UTC)
no subject
Date: 2005-11-30 03:41 pm (UTC)
no subject
Date: 2005-11-30 03:43 pm (UTC)
1: RNR tape drive. (RNR = Remove 'n Replace)
2: PCI conflicts, solution was posted earlier about fiddling with the BIOS.
3: For dog's sake, turn off all power management in the BIOS and the OS! Not even laptops still do power management right in these enlightened times! (well, except Macs, but there's a reason for that...)
4: sounds like the drive might be flaking out. swap it with a known good one and see if it continues. if it does, it's the controller.
5: ECC ram is correcting those errors. solutions listed in above comments.
no subject
Date: 2005-11-30 03:43 pm (UTC)
no subject
Date: 2005-11-30 05:18 pm (UTC)
Actually, I have recommended that to at least one customer. :)
no subject
Date: 2005-11-30 05:47 pm (UTC)
no subject
Date: 2005-11-30 06:51 pm (UTC)
"And you have removed the power cord?"
"Yes! But there's a battery! So it will never turn off!"
"And what happens if you remove the battery and the power cord at the same time?"
Many people don't listen to this and insist that even if the cord & battery are removed, the computer will stay on.
Hence, a priest.
no subject
Date: 2005-11-30 05:32 pm (UTC)
no subject
Date: 2005-11-30 06:45 pm (UTC)
Would probably end up with replacing the drive. Or maybe see if they use some kind of el-cheapo tapes imported from a child labour factory in Western Mongolia.
Problem 2:
Haven't done that stuff enough to know straight off the bat.
Problem 3:
Disable all power management.
Problem 4:
Could be a lot of stuff. Would probably suspect controller issues, could also be quite a bit of software issues.
Problem 5:
Try re-seating RAM chip. Try another known good chip. Could also be the slot or OS.
no subject
Date: 2005-11-30 08:28 pm (UTC)
Problem 2) not sure. Other comments suggest some sort of kernel-level optimizing issue? *nix experience lacking... :(
Problem 3) Windows sucks at suspend/hibernation. Check BIOS--look for Suspend in the APM config (my old Epox workstation board has S1 & S3/STR settings--set to S1. STR= suspend to ram) Check windows power settings. Make power saving impossible ;)
Problem 4) (*lots* of homebrew SCSI exp...) cable-length? Termination? ID collision? (if a windows system, bet there's a ton of errors in the system log saying error in driver servicing that chain)
Problem 5) high system temp? faulty connection?
Congratulations on social engineering <lj user=techsupport>!
Date: 2005-12-01 12:06 am (UTC)
Re: Congratulations on social engineering <lj user=techsupport>!
Date: 2005-12-01 05:10 am (UTC)
I'm sure the intent was to see if anyone had a different angle on a solution, i.e. "choose your adventure" books. I was kinda shocked at the response given prior (http://www.livejournal.com/community/techsupport/883616.html) posts (http://www.livejournal.com/community/techsupport/874019.html) in this community. These didn't insult my intelligence, but I don't know if it follows the FM. The mods got involved, so I guess it was legally within the written rules (it was posted as a game rather than outright questions) although possibly against the assumed rules (they were still technical problems that needed to be [and probably got] fixed).
IANAM, I'm rambling, and I apologize.
Re: Congratulations on social engineering <lj user=techsupport>!
Date: 2005-12-02 06:31 pm (UTC)
no subject
Date: 2005-12-01 05:28 am (UTC)
I hope that drive is still under warranty :) it sounds like it needs replacing
get some real network ports off that motherboard - I've never found on board NICs reliable
disable the sleep options in the windows control panel. ie say 'i'm a server, not a workstation or a laptop'
weird? dodgy controller board on the hard drive?
dodgy IDE controller?
update the windows motherboard drivers?
dodgy RAM, replace it.
no subject
Date: 2005-12-02 03:15 am (UTC)
1. Kill the equipment and replace it. If it's tearing tapes apart, there /might/ be a buildup of dust or odds and ends in the machine; if you don't want to replace it and have the know-how to take it apart (and put it back together afterwards), see if you can clean the interior -- then fill a blank with garbage of whatever sort to make sure it's not bad.
2. A crash on heavy disk access might indicate a number of issues; if the RAID array is intended to speed access, one of the drives most likely is corrupt. The long way...run a sanity test on each individual drive; if one fails, then check for media errors. Replace it ASAP. This assumes that heavy use merely triggers the law of averages, though; check the hardware to ensure that all of the hard drives are connected properly, and if they're not, check the RAID array with four different drives and hammer it.
3. Keyboard shortage, power supply going 'click', BIOS shutdown from overheating...there are more possibilities than can reasonably be accounted for with the information given. Although the power management is disabled, check for other factors, including viruses in the system.
4. Once again, check to make sure that the hard drive is connected properly on both power and motherboard. After it shuts itself off, check the Device Manager to see if the system actually sees it; if not, check power management settings for it. (This might only be the case if the hard drive is idle for several minutes or hours.) Check with a separate hard drive, again, to make sure that it's not just the hardware itself.
5. Run memtest86, swap the slot it's in, run memtest again, swap the stick, run it /again/...repeat until you isolate the problem. Likely the stick rather than the board, but you can never tell until you test.
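The swap-and-retest loop in 5 boils down to a small decision table. Here is a hedged Python sketch of it; the function name and the two yes/no inputs are made up for illustration, and the inputs are whatever your own memtest86 runs (or the hardware log) showed after each swap.

```python
def isolate_fault(errors_after_moving_stick, errors_with_good_stick_in_old_slot):
    """Guess where the fault lives from the results of two swap tests.

    errors_after_moving_stick: the suspect stick still errors in a different slot.
    errors_with_good_stick_in_old_slot: a known-good stick errors in the original slot.
    """
    if errors_after_moving_stick and not errors_with_good_stick_in_old_slot:
        return "stick"  # the errors follow the module
    if errors_with_good_stick_in_old_slot and not errors_after_moving_stick:
        return "slot"  # the errors stay with the slot
    if errors_after_moving_stick and errors_with_good_stick_in_old_slot:
        # Both configurations fail: look beyond one stick or slot
        # (board, power, temperature).
        return "board or environment"
    # Nothing reproduced: reseating may have fixed a marginal contact,
    # or the test did not run long enough to catch a sporadic error.
    return "inconclusive"
```

With sporadic errors like the 25 per month in Problem 5, "inconclusive" is a likely first answer, so each configuration needs a long enough run before trusting a clean result.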