BLU Discuss list archive

[Discuss] Help with I/O errors on RAID array?

From: bogstad at pobox.com (Bill Bogstad)
Date: Tue, 14 Oct 2025 18:34:24 -0400
In-reply-to: <26860.17140.637176.437320@gargle.gargle.HOWL>
References: <mailman.1.1749052801.15856.discuss@lists.blu.org> <26860.17140.637176.437320@gargle.gargle.HOWL>

On Sun, Oct 12, 2025 at 8:08 PM Daniel Barrett <dbarrett at blazemonger.com> wrote:
>
> Hi folks. A few files on my Ubuntu RAID-1 partition (two M.2 SSDs) are
> producing input/output errors when read, like this:
>
> $ cat myfile
> cat: myfile: Input/output error
>
> I can't seem to figure out what to fix. I've run fsck, badblocks,
> smartctl, and mdadm, as shown below, and none of them reports any
> errors. I'd appreciate any advice, especially (1) what to do next, and
> (2) how to figure out which (if any) of the two SSDs is faulty.

It's been quite a while since I dealt with this kind of thing, so I
don't remember specific commands, but these ideas might help.

1. There is a "try really hard" copy command. It copies a file in
large chunks until it hits an error, then uses seeks and smaller
block-size reads to recover as much of the file as possible. It
also retries the failed single-block reads multiple times in case the
problem is temporary (more likely to help on magnetic media). Since
this is an SSD, retrying is unlikely to cause any further damage to the
drive. Generally I would do this first, to recover as much data as
possible from the problematic file(s) before making changes.

2. Back up whatever is important on the drive. Since it passes fsck,
the metadata blocks are okay, so you can easily mv the problematic files
to a directory at the root of the filesystem, then skip that
directory while backing up the rest of the disk. You might even
consider stopping at this point, leaving the bad file in its
new location, and ignoring it. The file you show is under 300
Mbytes, so it's not much of a loss.

3. Most filesystems have a command that reports which
blocks on the disk are allocated to a specific file. You can then
overwrite the raw blocks allocated to the file,
which might clear the error. This can be done with dd and judicious
use of the "seek", "bs", and "count" options on writes to the whole
partition. There is probably a specialized tool to do this as well,
but I don't remember it. In step #1, you will have already copied all
the good data out. The same command from step #1 will probably give
you the block numbers within the file, so you might be able to
overwrite just the bad blocks.
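
The arithmetic behind step 3 can be sketched in Python. This is only an
illustration under stated assumptions: the extent list is taken to be
(logical block, physical block, length) triples in filesystem blocks, which
is roughly what "filefrag -v" reports on ext4; the function names, the 4096
byte block size, and the device name are all hypothetical. The code only
builds a dd argument list and never runs it.

```python
def file_offset_to_fs_block(extents, byte_offset, block_size=4096):
    """Map a byte offset inside a file to a filesystem block number.

    `extents` is a list of (logical_block, physical_block, length) tuples,
    all counted in filesystem blocks (an assumed input format).
    """
    logical = byte_offset // block_size
    for start, physical, length in extents:
        if start <= logical < start + length:
            return physical + (logical - start)
    raise ValueError("offset not covered by any extent (hole or past EOF)")


def dd_overwrite_args(device, fs_block, block_size=4096, count=1):
    """Build (but do not run) a dd command that zeroes `count` blocks.

    Uses only the bs, seek, and count options mentioned in step 3.
    """
    return [
        "dd",
        "if=/dev/zero",
        f"of={device}",
        f"bs={block_size}",
        f"seek={fs_block}",
        f"count={count}",
    ]


if __name__ == "__main__":
    # Hypothetical two-extent file; block numbers are made up.
    extents = [(0, 34816, 8), (8, 40000, 4)]
    block = file_offset_to_fs_block(extents, 9 * 4096 + 100)
    print(" ".join(dd_overwrite_args("/dev/EXAMPLE", block)))
```

The block numbers are relative to the start of the filesystem, so the dd
target would be the device that holds the filesystem. Writing to an
underlying RAID member instead would need that member's data offset added,
which this sketch does not attempt.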

Obviously MD (or LVM) complicates this, and as others have suggested, you
will probably need to break the mirror and handle each drive separately.
Booting from a rescue USB drive is a good idea in general, and is
probably required if we are talking about the root of the whole
filesystem tree.
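
To make the "try really hard" copy from step 1 concrete, here is a minimal
Python sketch of the idea: read in large chunks, fall back to small blocks
with retries when a chunk fails, and zero-fill whatever stays unreadable.
It is not the command referred to above (GNU ddrescue is one existing tool
of this kind), and the function names, chunk sizes, and retry count are
assumptions chosen for illustration.

```python
def read_at(src, pos, length, retries):
    """Read `length` bytes at `pos`; return None if every attempt fails."""
    for _ in range(retries):
        try:
            src.seek(pos)
            return src.read(length)
        except OSError:
            continue
    return None


def rescue_copy(src, dst, size, big=1 << 20, small=4096, retries=3):
    """Copy `size` bytes from src to dst, tolerating read errors.

    Returns the starting offsets of the small blocks that could not be
    read; those blocks are written to dst as zeros.
    """
    bad_offsets = []
    pos = 0
    while pos < size:
        chunk_len = min(big, size - pos)
        data = read_at(src, pos, chunk_len, retries=1)
        if data is not None:
            if not data:
                break  # unexpected end of file
            dst.seek(pos)
            dst.write(data)
            pos += len(data)
            continue
        # The large read failed: walk the same region in small blocks.
        end = pos + chunk_len
        while pos < end:
            n = min(small, end - pos)
            block = read_at(src, pos, n, retries)
            if block is None:
                bad_offsets.append(pos)
                block = b"\0" * n
            elif not block:
                return bad_offsets  # unexpected end of file
            dst.seek(pos)
            dst.write(block)
            pos += len(block)
    return bad_offsets
```

For a real file, the source would probably be opened unbuffered, for
example open(path, "rb", buffering=0), so that a failed read is not hidden
or repeated by Python's buffering. The returned offsets are byte offsets
within the file and could feed the block calculation in step 3.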

I've never wrapped my head around exactly how NVMe SSDs work, but they
seem to be way more complicated than the essentially magnetic drive
emulation that SATA SSDs use. You might get some ideas by digging
into the NVMe specs and specialized tools. There is something called
"nvme-cli" that might be helpful.

Good luck,
Bill Bogstad

Follow-Ups:
- [Discuss] Help with I/O errors on RAID array?
  - From: dsr at randomstring.org (Dan Ritter)
- [Discuss] Help with I/O errors on RAID array?
  - From: ron at bclug.ca (Ron)

References:
- [Discuss] Help with I/O errors on RAID array?
  - From: dbarrett at blazemonger.com (Daniel Barrett)
