[Discuss] Help with I/O errors on RAID array?
markw at mohawksoft.com
Mon Oct 13 08:29:08 EDT 2025
Look at /var/log/messages (on Ubuntu, /var/log/syslog or
/var/log/kern.log) and/or run "dmesg" and look for I/O errors. The
kernel messages should name the device that is failing.
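As a rough sketch of that search, something like the following could
work. The function names filter_io_errors and count_by_device and the
pattern list are my own illustrative assumptions, not a complete
catalogue of kernel messages, so adjust them to what your kernel
actually logs:

```shell
# Hypothetical helper: keep only log lines that look like block-layer
# I/O errors. The pattern list is an assumption and is not exhaustive.
filter_io_errors() {
    grep -Ei 'i/o error|medium error|md/raid1'
}

# Hypothetical helper: tally the "dev <name>" tokens in the matching
# lines, most frequent first, to see which device shows up most often.
count_by_device() {
    grep -oE 'dev [[:alnum:]]+' | sort | uniq -c | sort -rn
}

# Usage (reads the kernel ring buffer; dmesg usually needs root on Ubuntu):
#   sudo dmesg | filter_io_errors
#   sudo dmesg | filter_io_errors | count_by_device
```

If one of nvme0n1 or nvme1n1 dominates the tally, that is the drive to
suspect first. If only md1 or md1p1 appears, the error is probably being
reported above the individual drives.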
Also, if you are on Ubuntu, you should check out ZFS: it checksums every
block, so a mirror can tell which copy is bad and repair it from the
good one.
> Hi folks. A few files on my Ubuntu RAID-1 partition (two M.2 SSDs) are
> producing input/output errors when read, like this:
>
> $ cat myfile
> cat: myfile: Input/output error
>
> I can't seem to figure out what to fix. I've run fsck, badblocks,
> smartctl, and mdadm, as shown below, and none of them reports any
> errors. I'd appreciate any advice, especially (1) what to do next, and
> (2) how to figure out which (if any) of the two SSDs is faulty.
>
> Here's what I checked, in detail.
>
> File owner & mode (both normal):
>
> $ ls -l myfile
> -rw-r--r-- 1 dbarrett dbarrett 29235272 Apr 14 2006 myfile
>
> Identify the RAID partition containing the file. It's /dev/md1p1:
>
> $ df .
> Filesystem Size Used Avail Use% Mounted on
> /dev/md1p1 1.8T 1.2T 539G 70% /
>
> Reboot and force fsck on /dev/md1p1:
>
> $ sudo touch /forcefsck
> $ sudo reboot
> (no errors reported)
>
> Find the RAID configuration for device md1:
>
> $ grep '^md1 ' /proc/mdstat
> md1 : active raid1 nvme1n1p1[2] nvme0n1p1[3]
>
> Boot on an Ubuntu rescue disk and run badblocks on the two underlying
> disks. No bad blocks found (empty output):
>
> $ sudo badblocks /dev/nvme0n1
> $ sudo badblocks /dev/nvme1n1
>
> Run a RAID check on /dev/md1. The operation completed successfully:
>
> $ sudo mdadm --action=check /dev/md1
> (Wait 2 hours for completion)
> $ sudo mdadm --detail /dev/md1
> /dev/md1:
> Version : 1.2
> Creation Time : Sat Feb 29 12:33:09 2020
> Raid Level : raid1
> Array Size : 1953381376 (1862.89 GiB 2000.26 GB)
> Used Dev Size : 1953381376 (1862.89 GiB 2000.26 GB)
> Raid Devices : 2
> Total Devices : 2
> Persistence : Superblock is persistent
>
> Intent Bitmap : Internal
>
> Update Time : Sun Oct 12 20:06:22 2025
> State : active
> Active Devices : 2
> Working Devices : 2
> Failed Devices : 0
> Spare Devices : 0
>
> Consistency Policy : bitmap
>
> Name : kubuntu:1
> UUID : 7c84adca:31e96bad:b1be03ae:d7d0349d
> Events : 190769
>
> Number Major Minor RaidDevice State
> 3 259 2 0 active sync /dev/nvme0n1p1
> 2 259 3 1 active sync /dev/nvme1n1p1
>
> View SMART data for the two underlying SSDs. I see the following lines
> for both SSDs:
>
> Media and Data Integrity Errors: 0
> Error Information Log Entries: [a number greater than zero]
>
> I am pretty sure the "Error Information Log Entries" are due to a bug
> that produces 2-3 harmless SMART errors (0x4004 = "Invalid Field in
> Command") on every reboot. Full smartctl output follows:
>
> $ sudo smartctl -a /dev/nvme0n1
> smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-157-generic] (local build)
> Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Number: Samsung SSD 970 EVO Plus 2TB
> Serial Number: S59CNM0W632762B
> Firmware Version: 2B2QEXM7
> PCI Vendor/Subsystem ID: 0x144d
> IEEE OUI Identifier: 0x002538
> Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
> Unallocated NVM Capacity: 0
> Controller ID: 4
> NVMe Version: 1.3
> Number of Namespaces: 1
> Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
> Namespace 1 Utilization: 1,421,133,746,176 [1.42 TB]
> Namespace 1 Formatted LBA Size: 512
> Namespace 1 IEEE EUI-64: 002538 5631916c9e
> Local Time is: Sun Oct 12 13:18:22 2025 EDT
> Firmware Updates (0x16): 3 Slots, no Reset required
> Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
> Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
> Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
> Maximum Data Transfer Size: 512 Pages
> Warning Comp. Temp. Threshold: 85 Celsius
> Critical Comp. Temp. Threshold: 85 Celsius
>
> Supported Power States
> St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
> 0 + 7.50W - - 0 0 0 0 0 0
> 1 + 5.90W - - 1 1 1 1 0 0
> 2 + 3.60W - - 2 2 2 2 0 0
> 3 - 0.0700W - - 3 3 3 3 210 1200
> 4 - 0.0050W - - 4 4 4 4 2000 8000
>
> Supported LBA Sizes (NSID 0x1)
> Id Fmt Data Metadt Rel_Perf
> 0 + 512 0 0
>
> === START OF SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> SMART/Health Information (NVMe Log 0x02)
> Critical Warning: 0x00
> Temperature: 59 Celsius
> Available Spare: 100%
> Available Spare Threshold: 10%
> Percentage Used: 0%
> Data Units Read: 255,077,405 [130 TB]
> Data Units Written: 108,605,632 [55.6 TB]
> Host Read Commands: 931,054,813
> Host Write Commands: 2,127,365,573
> Controller Busy Time: 12,608
> Power Cycles: 26
> Power On Hours: 7,311
> Unsafe Shutdowns: 17
> Media and Data Integrity Errors: 0
> Error Information Log Entries: 187
> Warning Comp. Temperature Time: 0
> Critical Comp. Temperature Time: 0
> Temperature Sensor 1: 59 Celsius
> Temperature Sensor 2: 66 Celsius
>
> Error Information (NVMe Log 0x01, 16 of 64 entries)
> Num ErrCount SQId CmdId Status PELoc LBA NSID VS
> 0 187 0 0x800a 0x4004 - 0 0 -
>
>
> $ sudo smartctl -a /dev/nvme1n1
> smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-157-generic] (local build)
> Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Number: Samsung SSD 970 EVO Plus 2TB
> Serial Number: S59CNM0W432692Z
> Firmware Version: 2B2QEXM7
> PCI Vendor/Subsystem ID: 0x144d
> IEEE OUI Identifier: 0x002538
> Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
> Unallocated NVM Capacity: 0
> Controller ID: 4
> NVMe Version: 1.3
> Number of Namespaces: 1
> Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
> Namespace 1 Utilization: 1,421,133,684,736 [1.42 TB]
> Namespace 1 Formatted LBA Size: 512
> Namespace 1 IEEE EUI-64: 002538 5431926d30
> Local Time is: Sun Oct 12 13:18:27 2025 EDT
> Firmware Updates (0x16): 3 Slots, no Reset required
> Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
> Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
> Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
> Maximum Data Transfer Size: 512 Pages
> Warning Comp. Temp. Threshold: 85 Celsius
> Critical Comp. Temp. Threshold: 85 Celsius
>
> Supported Power States
> St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
> 0 + 7.50W - - 0 0 0 0 0 0
> 1 + 5.90W - - 1 1 1 1 0 0
> 2 + 3.60W - - 2 2 2 2 0 0
> 3 - 0.0700W - - 3 3 3 3 210 1200
> 4 - 0.0050W - - 4 4 4 4 2000 8000
>
> Supported LBA Sizes (NSID 0x1)
> Id Fmt Data Metadt Rel_Perf
> 0 + 512 0 0
>
> === START OF SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> SMART/Health Information (NVMe Log 0x02)
> Critical Warning: 0x00
> Temperature: 59 Celsius
> Available Spare: 100%
> Available Spare Threshold: 10%
> Percentage Used: 0%
> Data Units Read: 194,438,360 [99.5 TB]
> Data Units Written: 104,493,528 [53.5 TB]
> Host Read Commands: 444,201,317
> Host Write Commands: 2,118,258,242
> Controller Busy Time: 12,148
> Power Cycles: 23
> Power On Hours: 7,148
> Unsafe Shutdowns: 16
> Media and Data Integrity Errors: 0
> Error Information Log Entries: 154
> Warning Comp. Temperature Time: 0
> Critical Comp. Temperature Time: 0
> Temperature Sensor 1: 59 Celsius
> Temperature Sensor 2: 66 Celsius
>
> Error Information (NVMe Log 0x01, 16 of 64 entries)
> Num ErrCount SQId CmdId Status PELoc LBA NSID VS
> 0 154 0 0x7011 0x4004 - 0 0 -
>
> Thank you for any clues about what to do next.
> Dan
> _______________________________________________
> Discuss mailing list
> Discuss at lists.blu.org
> https://lists.blu.org/mailman/listinfo/discuss
>