Boston Linux & UNIX was originally founded in 1994 as part of The Boston Computer Society. We meet on the third Wednesday of each month, online, via Jitsi Meet.

BLU Discuss list archive



[Discuss] Help with I/O errors on RAID array?



Hi folks. A few files on my Ubuntu RAID-1 partition (two M.2 SSDs) are
producing input/output errors when read, like this:

$ cat myfile
cat: myfile: Input/output error

I can't seem to figure out what to fix. I've run fsck, badblocks,
smartctl, and mdadm, as shown below, and none of them reports any
errors. I'd appreciate any advice, especially (1) what to do next, and
(2) how to figure out which (if any) of the two SSDs is faulty.
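
One way to narrow the problem down is to find out which parts of the
file fail to read. The following is only a sketch, not something that
was run for this report: scan_file is a hypothetical helper name, and
the 4096-byte chunk size is an assumption. It reads the file one chunk
at a time with dd and prints the byte offset of every chunk that
returns an error.

```shell
# Hypothetical helper (not part of the original report): read FILE one
# chunk at a time and print the byte offset of every unreadable chunk.
# $1 = file to scan; $2 = chunk size in bytes (4096 is an assumption).
scan_file() (
    file=$1
    bs=${2:-4096}
    size=$(stat -c %s "$file") || exit 2      # GNU stat: size in bytes
    chunks=$(( (size + bs - 1) / bs ))        # round up to whole chunks
    i=0
    while [ "$i" -lt "$chunks" ]; do
        # Read exactly one chunk and discard it; dd exits non-zero
        # when the read fails.
        if ! dd if="$file" of=/dev/null bs="$bs" skip="$i" count=1 2>/dev/null; then
            echo "unreadable chunk at byte offset $(( i * bs ))"
        fi
        i=$(( i + 1 ))
    done
)
```

Running "scan_file myfile" prints nothing if every chunk is readable.
Any offsets it does print can be matched against messages in the
kernel log.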

Here's what I checked, in detail.

File owner & mode (both normal):

  $ ls -l myfile
  -rw-r--r-- 1 dbarrett dbarrett 29235272 Apr 14  2006 myfile

Identify the RAID partition containing the file. It's /dev/md1p1:

  $ df .
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/md1p1      1.8T  1.2T  539G  70% /

Reboot and force fsck on /dev/md1p1:

  $ sudo touch /forcefsck
  $ sudo reboot
  (no errors reported)

Find the RAID configuration for device md1:

  $ grep '^md1 ' /proc/mdstat
  md1 : active raid1 nvme1n1p1[2] nvme0n1p1[3]
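
The member list can also be pulled out of /proc/mdstat by a script. A
minimal sketch, assuming the usual "name : state level member[N] ..."
line layout shown above; md_members is a hypothetical helper name:

```shell
# Hypothetical helper: print the member devices of one md array.
# $1 = array name (e.g. md1); $2 = mdstat file (default /proc/mdstat).
md_members() {
    awk -v md="$1" '
        $1 == md && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\[[0-9]+\]/) {   # members look like nvme1n1p1[2]
                    sub(/\[.*/, "", $i)    # drop the [N] slot and any flags
                    print "/dev/" $i
                }
            }
        }' "${2:-/proc/mdstat}"
}
```

For the line above, "md_members md1" would print /dev/nvme1n1p1 and
/dev/nvme0n1p1, one per line.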

Boot from an Ubuntu rescue disk and run badblocks on the two underlying
disks.  No bad blocks found (empty output):

  $ sudo badblocks /dev/nvme0n1
  $ sudo badblocks /dev/nvme1n1

Run a RAID check on /dev/md1. The operation completed successfully:

  $ sudo mdadm --action=check /dev/md1
  (Wait 2 hours for completion)
  $ sudo mdadm --detail /dev/md1
  /dev/md1:
             Version : 1.2
       Creation Time : Sat Feb 29 12:33:09 2020
          Raid Level : raid1
          Array Size : 1953381376 (1862.89 GiB 2000.26 GB)
       Used Dev Size : 1953381376 (1862.89 GiB 2000.26 GB)
        Raid Devices : 2
       Total Devices : 2
         Persistence : Superblock is persistent
  
       Intent Bitmap : Internal
  
         Update Time : Sun Oct 12 20:06:22 2025
               State : active 
      Active Devices : 2
     Working Devices : 2
      Failed Devices : 0
       Spare Devices : 0
  
  Consistency Policy : bitmap
  
                Name : kubuntu:1
                UUID : 7c84adca:31e96bad:b1be03ae:d7d0349d
              Events : 190769
  
      Number   Major   Minor   RaidDevice State
         3     259        2        0      active sync   /dev/nvme0n1p1
         2     259        3        1      active sync   /dev/nvme1n1p1
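
Note that the --detail output above does not include the result of the
check itself. The md driver reports that through sysfs, in the
sync_action and mismatch_cnt files under /sys/block/md1/md/. A minimal
sketch for reading them (md_check_result is a hypothetical helper
name, and the second argument exists only so the function can be
pointed at a different directory):

```shell
# Hypothetical helper: report the state and result of an md check.
# $1 = array name (e.g. md1); $2 = block directory (default /sys/block).
md_check_result() (
    dir="${2:-/sys/block}/$1/md"
    action=$(cat "$dir/sync_action") || exit 1      # "idle" once finished
    mismatches=$(cat "$dir/mismatch_cnt") || exit 1 # count from last check
    echo "$1: sync_action=$action mismatch_cnt=$mismatches"
)
```

After a completed check, "md_check_result md1" should show
sync_action=idle. A non-zero mismatch_cnt would mean the two mirrors
disagreed somewhere.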

View SMART data for the two underlying SSDs. I see the following lines
for both SSDs:

  Media and Data Integrity Errors:    0
  Error Information Log Entries:      [a number greater than zero]

I am pretty sure the "Error Information Log Entries" are due to a bug
that produces 2-3 harmless SMART errors (0x4004 = "Invalid Field in
Command") on every reboot.  Full smartctl output follows:

  $ sudo smartctl -a /dev/nvme0n1
  smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-157-generic] (local build)
  Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
  
  === START OF INFORMATION SECTION ===
  Model Number:                       Samsung SSD 970 EVO Plus 2TB
  Serial Number:                      S59CNM0W632762B
  Firmware Version:                   2B2QEXM7
  PCI Vendor/Subsystem ID:            0x144d
  IEEE OUI Identifier:                0x002538
  Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
  Unallocated NVM Capacity:           0
  Controller ID:                      4
  NVMe Version:                       1.3
  Number of Namespaces:               1
  Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
  Namespace 1 Utilization:            1,421,133,746,176 [1.42 TB]
  Namespace 1 Formatted LBA Size:     512
  Namespace 1 IEEE EUI-64:            002538 5631916c9e
  Local Time is:                      Sun Oct 12 13:18:22 2025 EDT
  Firmware Updates (0x16):            3 Slots, no Reset required
  Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
  Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
  Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
  Maximum Data Transfer Size:         512 Pages
  Warning  Comp. Temp. Threshold:     85 Celsius
  Critical Comp. Temp. Threshold:     85 Celsius
  
  Supported Power States
  St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
   0 +     7.50W       -        -    0  0  0  0        0       0
   1 +     5.90W       -        -    1  1  1  1        0       0
   2 +     3.60W       -        -    2  2  2  2        0       0
   3 -   0.0700W       -        -    3  3  3  3      210    1200
   4 -   0.0050W       -        -    4  4  4  4     2000    8000
  
  Supported LBA Sizes (NSID 0x1)
  Id Fmt  Data  Metadt  Rel_Perf
   0 +     512       0         0
  
  === START OF SMART DATA SECTION ===
  SMART overall-health self-assessment test result: PASSED
  
  SMART/Health Information (NVMe Log 0x02)
  Critical Warning:                   0x00
  Temperature:                        59 Celsius
  Available Spare:                    100%
  Available Spare Threshold:          10%
  Percentage Used:                    0%
  Data Units Read:                    255,077,405 [130 TB]
  Data Units Written:                 108,605,632 [55.6 TB]
  Host Read Commands:                 931,054,813
  Host Write Commands:                2,127,365,573
  Controller Busy Time:               12,608
  Power Cycles:                       26
  Power On Hours:                     7,311
  Unsafe Shutdowns:                   17
  Media and Data Integrity Errors:    0
  Error Information Log Entries:      187
  Warning  Comp. Temperature Time:    0
  Critical Comp. Temperature Time:    0
  Temperature Sensor 1:               59 Celsius
  Temperature Sensor 2:               66 Celsius
  
  Error Information (NVMe Log 0x01, 16 of 64 entries)
  Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
    0        187     0  0x800a  0x4004      -            0     0     -
  
  
  $ sudo smartctl -a /dev/nvme1n1
  smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-157-generic] (local build)
  Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
  
  === START OF INFORMATION SECTION ===
  Model Number:                       Samsung SSD 970 EVO Plus 2TB
  Serial Number:                      S59CNM0W432692Z
  Firmware Version:                   2B2QEXM7
  PCI Vendor/Subsystem ID:            0x144d
  IEEE OUI Identifier:                0x002538
  Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
  Unallocated NVM Capacity:           0
  Controller ID:                      4
  NVMe Version:                       1.3
  Number of Namespaces:               1
  Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
  Namespace 1 Utilization:            1,421,133,684,736 [1.42 TB]
  Namespace 1 Formatted LBA Size:     512
  Namespace 1 IEEE EUI-64:            002538 5431926d30
  Local Time is:                      Sun Oct 12 13:18:27 2025 EDT
  Firmware Updates (0x16):            3 Slots, no Reset required
  Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
  Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
  Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
  Maximum Data Transfer Size:         512 Pages
  Warning  Comp. Temp. Threshold:     85 Celsius
  Critical Comp. Temp. Threshold:     85 Celsius
  
  Supported Power States
  St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
   0 +     7.50W       -        -    0  0  0  0        0       0
   1 +     5.90W       -        -    1  1  1  1        0       0
   2 +     3.60W       -        -    2  2  2  2        0       0
   3 -   0.0700W       -        -    3  3  3  3      210    1200
   4 -   0.0050W       -        -    4  4  4  4     2000    8000
  
  Supported LBA Sizes (NSID 0x1)
  Id Fmt  Data  Metadt  Rel_Perf
   0 +     512       0         0
  
  === START OF SMART DATA SECTION ===
  SMART overall-health self-assessment test result: PASSED
  
  SMART/Health Information (NVMe Log 0x02)
  Critical Warning:                   0x00
  Temperature:                        59 Celsius
  Available Spare:                    100%
  Available Spare Threshold:          10%
  Percentage Used:                    0%
  Data Units Read:                    194,438,360 [99.5 TB]
  Data Units Written:                 104,493,528 [53.5 TB]
  Host Read Commands:                 444,201,317
  Host Write Commands:                2,118,258,242
  Controller Busy Time:               12,148
  Power Cycles:                       23
  Power On Hours:                     7,148
  Unsafe Shutdowns:                   16
  Media and Data Integrity Errors:    0
  Error Information Log Entries:      154
  Warning  Comp. Temperature Time:    0
  Critical Comp. Temperature Time:    0
  Temperature Sensor 1:               59 Celsius
  Temperature Sensor 2:               66 Celsius
  
  Error Information (NVMe Log 0x01, 16 of 64 entries)
  Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
    0        154     0  0x7011  0x4004      -            0     0     -
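
For reference, here is how the 0x4004 in the Status column breaks
down. This is a sketch under one assumption: that smartctl prints the
raw 16-bit status word from the NVMe error log, with bit 0 the phase
tag, bits 1-8 the status code (SC), bits 9-11 the status code type
(SCT), bit 14 "More" and bit 15 "Do Not Retry". decode_nvme_status is
a hypothetical helper name.

```shell
# Hypothetical helper: split a raw 16-bit NVMe status word into fields.
# Assumed layout: bit 0 phase tag, bits 1-8 SC, bits 9-11 SCT,
# bit 14 More, bit 15 Do Not Retry.
decode_nvme_status() (
    s=$(( $1 ))                               # accepts hex such as 0x4004
    printf 'SCT=0x%x SC=0x%02x More=%d DNR=%d\n' \
        $(( (s >> 9) & 0x7 )) $(( (s >> 1) & 0xff )) \
        $(( (s >> 14) & 1 )) $(( (s >> 15) & 1 ))
)
```

"decode_nvme_status 0x4004" prints "SCT=0x0 SC=0x02 More=1 DNR=0".
Status code type 0 (generic) with status code 0x02 is "Invalid Field
in Command", which matches the reading given earlier.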

Thank you for any clues about what to do next.
Dan






Boston Linux & Unix / webmaster@blu.org