[Discuss] ATA Access Errors For Spinning Disk
    jbk 
    jbk at kjkelra.com
       
    Tue Dec 19 08:06:31 EST 2023
    
    
  
Thanks, Steve.
I may incorporate some of the recommendations in the future. 
For the present I'm going to look at the power connection to 
the disk. I'm also going to grab a newish spare and connect 
it up to see if it throws similar errors.
Jimk
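
A side note on the device renaming described in the original post quoted below (partitions coming back as sdd instead of sda): kernel names like sdX are handed out in detection order, so they can change when a disk drops off the bus and is detected again, while the filesystem UUIDs stay the same. The following is only a minimal Python sketch, not something anyone in this thread ran. It lists which device node each UUID currently resolves to. The helper name uuid_to_device is made up for illustration, and the /dev/disk/by-uuid directory is an assumption about a typical udev-based Linux system.

```python
import os


def uuid_to_device(by_uuid_dir="/dev/disk/by-uuid"):
    """Return {uuid: resolved device path} for every symlink in by_uuid_dir.

    uuid_to_device is a hypothetical helper name; the default directory
    is an assumption about a typical udev-based Linux system.
    """
    mapping = {}
    if not os.path.isdir(by_uuid_dir):
        return mapping  # directory missing: nothing to report
    for name in sorted(os.listdir(by_uuid_dir)):
        link = os.path.join(by_uuid_dir, name)
        if os.path.islink(link):
            # realpath follows the symlink to the current kernel name,
            # which may differ between boots or after a re-detect
            mapping[name] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for uuid, dev in uuid_to_device().items():
        print(f"{uuid} -> {dev}")
```

Running it before and after one of the unmount events would show whether the same UUIDs moved to a new sdX name. Mounting by UUID= in /etc/fstab rather than by /dev/sdX keeps the mount points stable whichever letter the kernel assigns.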
On 12/17/23 21:05, Steve Litt wrote:
> jbk said on Sun, 17 Dec 2023 10:13:36 -0500
>
>> I periodically get access errors for a specific spinning
>> disk that I have done these things to diagnose:
>> Changed SATA cable
>> Switched SATA bus on MB
>> Run e2fsck on the 3 formatted ext4 partitions w/ no errors found
>> Run smartctl -a: all results within norms
>> Run smartctl -t short: No errors found
>>
>> Disk operation age is about 7.5 years with around a couple
>> hundred starts. It has been in continuous operation for over
>> 8 years except during vacations. On occasion the disk
>> partitions will become unmounted and a mount -a will remount
>> the partitions as a different device from, let's say, sda to
>> sdd. I've not lost any data and I do regular backups to
>> another device that's rotated out of the system.
>>
>> I seem to have always had these errors present on this MB
>> that is maybe 4 or 5 years in operation. Any thoughts on the
>> cause of this issue? Do others see this behavior on occasion
>> on systems they manage?
>>
>> On this same system my Rocky OS on an SSD is showing no
>> issues at all. Same operation age as the spinner.
> I really like the troubleshooting strategy you've pursued in trying to
> find the root cause of this intermittent problem. As we all know,
> intermittents are much more difficult to diagnose than reproducible
> symptoms. If you look at the Universal Troubleshooting Process (UTP) on
> Troubleshooters.Com, you'll see that UTP step 5, Corrective
> Maintenance, is extremely powerful and necessary with intermittent
> problems. I have some suggestions for Corrective Maintenance and
> further diagnostic tests...
>
> * You get occasional disk errors, any of which could cause data
>    corruption. To prevent things from getting worse, boot a rescue
>    distro and ddrescue your current disk to a larger disk, and if you
>    ever mount that backup disk, mount it read-only.
>
> * Lubricate all electronic contacts for all cables, daughter cards, RAM
>    sticks, switches with associated cables, and jacks and plugs for all
>    peripherals. Apply the lubricant to conductive surfaces on both plug
>    (male) and jack (female), then insert and remove twenty times to bust
>    off all corrosion. Please take 10 minutes to read this 20-year-old
>    discussion of electronic lubrication:
>    http://troubleshooters.com/tpromag/200310/200310.htm
>    I've used transmission fluid, WD-40, Lube-Job electronics lubricant,
>    Breakthrough CLP, Deoxit Gold, Superslick Slick Stuff, and CRC
>    QD Contact Cleaner, and was very satisfied with all of them. I
>    currently use mostly Superslick Slick Stuff. The important thing is
>    that there's residual lubrication to prevent build-up of Fretting
>    Corrosion. Stabilant 22 and Deoxit Gold are the safest to prevent
>    damage to non-metals and prevent conduction between non-mating
>    surfaces, but they're pretty expensive. My experience has been that
>    the cheaper ones work fine as long as I carefully limit application
>    to the mating conductors.
>    Lubricating all mating electronic contacts takes 2 or 3 hours, but
>    doing so can save you weeks of frustration if an intermittent is
>    being caused by fretting corrosion between electronic contacts. I do
>    complete electronic contact lubrication during the initial build of all my
>    computers. Because you've observed this intermittent since you bought
>    the mobo several years ago, lubricating the RAM stick contacts is
>    especially important, as it's likely those sticks have been in place
>    since you bought the mobo.
>
> * Run a complete RAM test overnight by booting a memtest86 CD or thumb
>    drive. Get rid of any sticks with errors. Intermittents are too
>    expensive to try to limp along with RAM errors. Note that if you're
>    not using UEFI, you'll need an older version of memtest86.
>
> * Temporarily swap in a known good power supply, use for several days,
>    and see whether the problem has gone away. If so, use the known good
>    power supply or a known good newly purchased power supply. If the
>    problem persists, put back the original power supply at the
>    conclusion of troubleshooting.
>
> * Power switches and reboot switches can go intermittent and cause
>    hangs and spontaneous reboots. If I have suspicions of these things,
>    I disconnect the reboot switch (you can always unplug the computer
>    for an abrupt shutdown), and temporarily disconnect the power switch,
>    starting and stopping the computer by CAREFULLY shorting the power
>    switch pins with a screwdriver. I then run the machine for about 3
>    days to see if the problem really went away. If the problem appears
>    to be the power switch, I replace it with a cheap, wired, no light, 2
>    contact doorbell switch, available at home warehouse stores. If you
>    can't find it there (most doorbell switches are now lighted), I'm
>    pretty sure that this is what you need:
>    https://www.ebay.com/itm/155929670486  . You might need extra wire so
>    your front panel can be removed enough to service the front parts
>    without needing to disconnect the power button leads and fish them
>    around the motherboard and through the chassis.
>
> * If you're overclocked, roll it back to the non-overclocked
>    frequencies. Often simply telling the BIOS to reset to its factory
>    state is a great way to rule out a whole bunch of BIOS-caused
>    problems. As always, test for several days to make sure the
>    intermittent symptom really went away.
>
> * Use various sensor programs to check various CPU temperatures and
>    disk temperatures. If temperatures even begin to approach maximum
>    specs, take
>
> * Try to observe whether this intermittent symptom occurs significantly
>    more when running a specific set of software, and act accordingly.
>
> * Boot a radically different distro, use for several days, and see if
>    the intermittent symptom still occurs. If so, you've for the most
>    part ruled out your distro, software, and config settings. If not,
>    investigate your software and configs.
>
> * If none of the preceding works and you have a known good spinning
>    rust hard disk bigger than the current one, you could ddrescue the
>    current one onto the new, bigger one and test for a few days. If the
>    symptom doesn't recur, the hard disk had a problem not detected by
>    smartctl.
>
> * If none of the preceding works, you need to consider how much time,
>    money and energy you're willing to throw at this intermittent
>    problem. Personally, at this point, I'd bite the bullet and buy a new
>    motherboard, RAM, processor, and processor heat sink. Be sure to
>    use high quality thermal heat sink compound between processor and heat
>    sink, be sure to remove any labels the manufacturer stupidly put on
>    the processor where it should be mating with the heat sink, and clean
>    all label adhesive residue before applying heat sink compound. Don't
>    cheap out on the heat sink: A lot of times the heat sink packaged
>    with the processor is great for email and light web browsing, but
>    allows overheat in intense operations like compiling a kernel.
>    Remember, you want this new setup to last for many years.
>
> * If you're going to buy a new mobo, CPU and RAM anyway, it costs you
>    nothing to take the very risky step of updating your BIOS. Who knows,
>    it might work. Because of risks involved in BIOS updates, I don't
>    recommend them except in cases where your symptom is a well known
>    effect of your specific BIOS version, or else when you're about to
>    throw the mobo in the trash anyway. Be sure to run the computer on a
>    known good uninterruptible power supply when updating your BIOS so
>    your electric company's problems don't brick your computer.
>
> I'm very aware of the time and energy the preceding steps require. Your
> computer is now 8 years old and probably anemic by today's standards.
> If your current computer has enough capability for your needs, you
> could probably buy a whole new computer of equal capability for under
> $700. If you want to replace it with a modern computer with huge
> capacity, you can probably do it for between $1500 and $2300. Remember,
> the alternative is all the troubleshooting steps I listed (and probably
> other people can think of even more).
>
> HTH,
>
> SteveT
>
> Steve Litt
>
> Autumn 2023 featured book: Rapid Learning for the 21st Century
> http://www.troubleshooters.com/rl21
> _______________________________________________
> Discuss mailing list
> Discuss at lists.blu.org
> http://lists.blu.org/mailman/listinfo/discuss
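
On the temperature check in Steve's list above: Linux exposes sensor readings under /sys/class/hwmon as tempN_input files holding millidegrees Celsius. The sketch below is a rough illustration under that assumption, not a tested diagnostic. The function name read_hwmon_temps is hypothetical, and which chips show up depends on the drivers loaded (disk temperatures, for example, only appear if a driver such as drivetemp is available on the kernel in use).

```python
import glob
import os


def read_hwmon_temps(hwmon_root="/sys/class/hwmon"):
    """Return a list of (chip_name, sensor_file, degrees_celsius) tuples.

    read_hwmon_temps is a hypothetical helper; hwmon_root is assumed to
    follow the usual sysfs layout (hwmonN/name, hwmonN/tempM_input).
    """
    readings = []
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = os.path.basename(chip_dir)  # fall back to hwmonN
        try:
            with open(path) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric sensor: skip it
        readings.append((chip, os.path.basename(path), millideg / 1000.0))
    return readings


if __name__ == "__main__":
    for chip, sensor, temp_c in read_hwmon_temps():
        print(f"{chip:12s} {sensor:14s} {temp_c:6.1f} C")
```

Logging the output every few minutes and lining it up against the times of the ATA errors would show whether the errors cluster at high temperatures.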
-- 
Jim KR
    
    