BLU Discuss list archive

Subject: [Discuss] Help with I/O errors on RAID array?
From: markw at mohawksoft.com (markw at mohawksoft.com)
Date: Mon, 13 Oct 2025 08:29:08 -0400
In-reply-to: <26860.17140.637176.437320@gargle.gargle.HOWL>
References: <mailman.1.1749052801.15856.discuss@lists.blu.org> <26860.17140.637176.437320@gargle.gargle.HOWL>

Look at /var/log/messages (on Ubuntu, typically /var/log/syslog) and/or
run "dmesg" and look for I/O errors. The kernel log should show you
which drive is failing.
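
A minimal sketch of that search, assuming the kernel reports block-layer
failures in the usual "I/O error, dev <name>, sector <n>" form. The
function names and the pattern list are my own assumptions, not an
exhaustive set. Reproduce the error first (for example, cat the bad file
again) so that the newest messages are the relevant ones.

```shell
# Print kernel log lines that look like block-layer or md read errors.
# The patterns are assumptions; extend them for your kernel version.
io_error_lines() {
    grep -iE 'I/O error|medium error|md/raid1' "$@"
}

# Tally block-layer I/O errors per device, e.g. "2 nvme1n1".
# Relies on the "I/O error, dev <name>," wording in the message.
io_error_devices() {
    grep -E 'I/O error, dev ' "$@" |
        sed -E 's/.*I\/O error, dev ([^, ]+).*/\1/' |
        sort | uniq -c
}
```

Run it as "sudo dmesg | io_error_devices", or pipe in the output of
"journalctl -k --no-pager" instead. An empty result only means the kernel
logged nothing that matches these patterns.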

Also, if you are on Ubuntu, you should check out ZFS.

> Hi folks. A few files on my Ubuntu RAID-1 partition (two M.2 SSDs) are
> producing input/output errors when read, like this:
>
> $ cat myfile
> cat: myfile: Input/output error
>
> I can't seem to figure out what to fix. I've run fsck, badblocks,
> smartctl, and mdadm, as shown below, and none of them reports any
> errors. I'd appreciate any advice, especially (1) what to do next, and
> (2) how to figure out which (if any) of the two SSDs is faulty.
>
> Here's what I checked, in detail.
>
> File owner & mode (both normal):
>
>   $ ls -l myfile
>   -rw-r--r-- 1 dbarrett dbarrett 29235272 Apr 14  2006 myfile
>
> Identify the RAID partition containing the file. It's /dev/md1p1:
>
>   $ df .
>   Filesystem      Size  Used Avail Use% Mounted on
>   /dev/md1p1      1.8T  1.2T  539G  70% /
>
> Reboot and force fsck on /dev/md1p1:
>
>   $ sudo touch /forcefsck
>   $ sudo reboot
>   (no errors reported)
>
> Find the RAID configuration for device md1:
>
>   $ grep '^md1 ' /proc/mdstat
>   md1 : active raid1 nvme1n1p1[2] nvme0n1p1[3]
>
> Boot on an Ubuntu rescue disk and run badblocks on the two underlying
> disks.  No bad blocks found (empty output):
>
>   $ sudo badblocks /dev/nvme0n1
>   $ sudo badblocks /dev/nvme1n1
>
> Run a RAID check on /dev/md1. The operation completed successfully:
>
>   $ sudo mdadm --action=check /dev/md1
>   (Wait 2 hours for completion)
>   $ sudo mdadm --detail /dev/md1
>   /dev/md1:
>              Version : 1.2
>        Creation Time : Sat Feb 29 12:33:09 2020
>           Raid Level : raid1
>           Array Size : 1953381376 (1862.89 GiB 2000.26 GB)
>        Used Dev Size : 1953381376 (1862.89 GiB 2000.26 GB)
>         Raid Devices : 2
>        Total Devices : 2
>          Persistence : Superblock is persistent
>
>        Intent Bitmap : Internal
>
>          Update Time : Sun Oct 12 20:06:22 2025
>                State : active
>       Active Devices : 2
>      Working Devices : 2
>       Failed Devices : 0
>        Spare Devices : 0
>
>   Consistency Policy : bitmap
>
>                 Name : kubuntu:1
>                 UUID : 7c84adca:31e96bad:b1be03ae:d7d0349d
>               Events : 190769
>
>       Number   Major   Minor   RaidDevice State
>          3     259        2        0      active sync   /dev/nvme0n1p1
>          2     259        3        1      active sync   /dev/nvme1n1p1
>
> View SMART data for the two underlying SSDs. I see the following lines
> for both SSDs:
>
>   Media and Data Integrity Errors:    0
>   Error Information Log Entries:      [a number greater than zero]
>
> I am pretty sure the "Error Information Log Entries" are due to a bug
> that produces 2-3 harmless SMART errors (0x4004 = "Invalid Field in
> Command") on every reboot.  Full smartctl output follows:
>
>   $ sudo smartctl -a /dev/nvme0n1
>   smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-157-generic] (local build)
>   Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
>
>   === START OF INFORMATION SECTION ===
>   Model Number:                       Samsung SSD 970 EVO Plus 2TB
>   Serial Number:                      S59CNM0W632762B
>   Firmware Version:                   2B2QEXM7
>   PCI Vendor/Subsystem ID:            0x144d
>   IEEE OUI Identifier:                0x002538
>   Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
>   Unallocated NVM Capacity:           0
>   Controller ID:                      4
>   NVMe Version:                       1.3
>   Number of Namespaces:               1
>   Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
>   Namespace 1 Utilization:            1,421,133,746,176 [1.42 TB]
>   Namespace 1 Formatted LBA Size:     512
>   Namespace 1 IEEE EUI-64:            002538 5631916c9e
>   Local Time is:                      Sun Oct 12 13:18:22 2025 EDT
>   Firmware Updates (0x16):            3 Slots, no Reset required
>   Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
>   Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
>   Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
>   Maximum Data Transfer Size:         512 Pages
>   Warning  Comp. Temp. Threshold:     85 Celsius
>   Critical Comp. Temp. Threshold:     85 Celsius
>
>   Supported Power States
>   St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
>    0 +     7.50W       -        -    0  0  0  0        0       0
>    1 +     5.90W       -        -    1  1  1  1        0       0
>    2 +     3.60W       -        -    2  2  2  2        0       0
>    3 -   0.0700W       -        -    3  3  3  3      210    1200
>    4 -   0.0050W       -        -    4  4  4  4     2000    8000
>
>   Supported LBA Sizes (NSID 0x1)
>   Id Fmt  Data  Metadt  Rel_Perf
>    0 +     512       0         0
>
>   === START OF SMART DATA SECTION ===
>   SMART overall-health self-assessment test result: PASSED
>
>   SMART/Health Information (NVMe Log 0x02)
>   Critical Warning:                   0x00
>   Temperature:                        59 Celsius
>   Available Spare:                    100%
>   Available Spare Threshold:          10%
>   Percentage Used:                    0%
>   Data Units Read:                    255,077,405 [130 TB]
>   Data Units Written:                 108,605,632 [55.6 TB]
>   Host Read Commands:                 931,054,813
>   Host Write Commands:                2,127,365,573
>   Controller Busy Time:               12,608
>   Power Cycles:                       26
>   Power On Hours:                     7,311
>   Unsafe Shutdowns:                   17
>   Media and Data Integrity Errors:    0
>   Error Information Log Entries:      187
>   Warning  Comp. Temperature Time:    0
>   Critical Comp. Temperature Time:    0
>   Temperature Sensor 1:               59 Celsius
>   Temperature Sensor 2:               66 Celsius
>
>   Error Information (NVMe Log 0x01, 16 of 64 entries)
>   Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
>     0        187     0  0x800a  0x4004      -            0     0     -
>
>
>   $ sudo smartctl -a /dev/nvme1n1
>   smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-157-generic] (local build)
>   Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
>
>   === START OF INFORMATION SECTION ===
>   Model Number:                       Samsung SSD 970 EVO Plus 2TB
>   Serial Number:                      S59CNM0W432692Z
>   Firmware Version:                   2B2QEXM7
>   PCI Vendor/Subsystem ID:            0x144d
>   IEEE OUI Identifier:                0x002538
>   Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
>   Unallocated NVM Capacity:           0
>   Controller ID:                      4
>   NVMe Version:                       1.3
>   Number of Namespaces:               1
>   Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
>   Namespace 1 Utilization:            1,421,133,684,736 [1.42 TB]
>   Namespace 1 Formatted LBA Size:     512
>   Namespace 1 IEEE EUI-64:            002538 5431926d30
>   Local Time is:                      Sun Oct 12 13:18:27 2025 EDT
>   Firmware Updates (0x16):            3 Slots, no Reset required
>   Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
>   Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
>   Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
>   Maximum Data Transfer Size:         512 Pages
>   Warning  Comp. Temp. Threshold:     85 Celsius
>   Critical Comp. Temp. Threshold:     85 Celsius
>
>   Supported Power States
>   St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
>    0 +     7.50W       -        -    0  0  0  0        0       0
>    1 +     5.90W       -        -    1  1  1  1        0       0
>    2 +     3.60W       -        -    2  2  2  2        0       0
>    3 -   0.0700W       -        -    3  3  3  3      210    1200
>    4 -   0.0050W       -        -    4  4  4  4     2000    8000
>
>   Supported LBA Sizes (NSID 0x1)
>   Id Fmt  Data  Metadt  Rel_Perf
>    0 +     512       0         0
>
>   === START OF SMART DATA SECTION ===
>   SMART overall-health self-assessment test result: PASSED
>
>   SMART/Health Information (NVMe Log 0x02)
>   Critical Warning:                   0x00
>   Temperature:                        59 Celsius
>   Available Spare:                    100%
>   Available Spare Threshold:          10%
>   Percentage Used:                    0%
>   Data Units Read:                    194,438,360 [99.5 TB]
>   Data Units Written:                 104,493,528 [53.5 TB]
>   Host Read Commands:                 444,201,317
>   Host Write Commands:                2,118,258,242
>   Controller Busy Time:               12,148
>   Power Cycles:                       23
>   Power On Hours:                     7,148
>   Unsafe Shutdowns:                   16
>   Media and Data Integrity Errors:    0
>   Error Information Log Entries:      154
>   Warning  Comp. Temperature Time:    0
>   Critical Comp. Temperature Time:    0
>   Temperature Sensor 1:               59 Celsius
>   Temperature Sensor 2:               66 Celsius
>
>   Error Information (NVMe Log 0x01, 16 of 64 entries)
>   Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
>     0        154     0  0x7011  0x4004      -            0     0     -
>
> Thank you for any clues what to do next.
> Dan