BLU Discuss list archive
[Discuss] Help with I/O errors on RAID array?
- Subject: [Discuss] Help with I/O errors on RAID array?
- From: dbarrett at blazemonger.com (Daniel Barrett)
- Date: Sun, 12 Oct 2025 20:08:20 -0400
- References: <mailman.1.1749052801.15856.discuss@lists.blu.org>
Hi folks. A few files on my Ubuntu RAID-1 partition (two M.2 SSDs) are
producing input/output errors when read, like this:
$ cat myfile
cat: myfile: Input/output error
I can't seem to figure out what to fix. I've run fsck, badblocks,
smartctl, and mdadm, as shown below, and none of them reports any
errors. I'd appreciate any advice, especially (1) what to do next, and
(2) how to figure out which (if any) of the two SSDs is faulty.
Here's what I checked, in detail.
File owner & mode (both normal):
$ ls -l myfile
-rw-r--r-- 1 dbarrett dbarrett 29235272 Apr 14 2006 myfile
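To narrow down where in the file the read fails, one option is to read it in fixed-size chunks and note which chunks return an error. The following is only a sketch: `bad_chunks` is a hypothetical helper name, the 1 MiB default chunk size is arbitrary, and `stat -c %s` assumes GNU coreutils (as on Ubuntu).

```shell
#!/bin/sh
# bad_chunks FILE [CHUNK_BYTES]
# Hypothetical helper: print the zero-based index of every chunk of FILE
# that dd fails to read. Prints nothing if the whole file is readable.
bad_chunks() {
    file=$1
    chunk=${2:-1048576}                        # default chunk size: 1 MiB
    size=$(stat -c %s "$file") || return 1     # file size in bytes (GNU stat)
    total=$(( (size + chunk - 1) / chunk ))    # number of chunks, rounded up
    i=0
    while [ "$i" -lt "$total" ]; do
        # Read exactly one chunk starting at chunk index i; discard the data.
        if ! dd if="$file" of=/dev/null bs="$chunk" skip="$i" count=1 \
                status=none 2>/dev/null; then
            echo "$i"
        fi
        i=$((i + 1))
    done
}
```

For example, `bad_chunks myfile` would list the 1 MiB chunks that cannot be read; multiplying an index by the chunk size gives the approximate byte offset of the failure.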
Identify the RAID partition containing the file. It's /dev/md1p1:
$ df .
Filesystem Size Used Avail Use% Mounted on
/dev/md1p1 1.8T 1.2T 539G 70% /
Reboot and force fsck on /dev/md1p1:
$ sudo touch /forcefsck
$ sudo reboot
(no errors reported)
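Since fsck reports nothing, the kernel log may be another place to look: a failed read is often logged along with the name of the underlying block device. The sketch below counts such lines per device. It assumes the log lines contain the text "I/O error, dev NAME" or "critical medium error, dev NAME", as block-layer messages on 5.15-era kernels do; `io_errors_by_dev` is a hypothetical helper name.

```shell
#!/bin/sh
# io_errors_by_dev
# Hypothetical helper: read kernel log text on stdin and print
# "DEVICE COUNT" for each device named in a block-layer error line.
# Assumes messages of the form "... I/O error, dev nvme0n1, sector ...".
io_errors_by_dev() {
    grep -Eo '(I/O|critical medium) error, dev [[:alnum:]]+' |
        awk '{ count[$NF]++ } END { for (d in count) print d, count[d] }' |
        sort
}
```

It can be fed from `sudo dmesg` or `journalctl -k`, for example `sudo dmesg | io_errors_by_dev`. Empty output only means that no line matched the assumed format.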
Find the RAID configuration for device md1:
$ grep '^md1 ' /proc/mdstat
md1 : active raid1 nvme1n1p1[2] nvme0n1p1[3]
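If the member list is needed in a script (for instance, to run the same check against each SSD), it can be pulled out of /proc/mdstat. This is a minimal sketch; `md_members` is a hypothetical helper name, and it assumes members appear as NAME[N] tokens on the array's line, as in the output above.

```shell
#!/bin/sh
# md_members ARRAY
# Hypothetical helper: read /proc/mdstat text on stdin and print the
# member devices of ARRAY (e.g. md1), one /dev path per line.
md_members() {
    awk -v md="$1" '$1 == md {
        for (i = 2; i <= NF; i++)
            if ($i ~ /\[[0-9]+\]/) {   # member tokens look like name[N]
                sub(/\[.*/, "", $i)    # strip "[N]" and any trailing flags
                print "/dev/" $i
            }
    }'
}
```

Usage would be `md_members md1 < /proc/mdstat`.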
Boot on an Ubuntu rescue disk and run badblocks on the two underlying
disks. No bad blocks found (empty output):
$ sudo badblocks /dev/nvme0n1
$ sudo badblocks /dev/nvme1n1
Run a RAID check on /dev/md1. The operation completed successfully:
$ sudo mdadm --action=check /dev/md1
(Wait 2 hours for completion)
$ sudo mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Sat Feb 29 12:33:09 2020
Raid Level : raid1
Array Size : 1953381376 (1862.89 GiB 2000.26 GB)
Used Dev Size : 1953381376 (1862.89 GiB 2000.26 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Oct 12 20:06:22 2025
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : kubuntu:1
UUID : 7c84adca:31e96bad:b1be03ae:d7d0349d
Events : 190769
Number Major Minor RaidDevice State
3 259 2 0 active sync /dev/nvme0n1p1
2 259 3 1 active sync /dev/nvme1n1p1
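As far as I can tell, the `--detail` output above does not include the result of the check itself. The md driver exposes a mismatch counter and the current sync action in sysfs, so a sketch like the following could be used to read them once the check finishes. `md_check_result` is a hypothetical helper name, and the default path assumes the array is md1.

```shell
#!/bin/sh
# md_check_result [MD_SYSFS_DIR]
# Hypothetical helper: print the current sync action and the mismatch
# count that the md driver reports for an array.
# MD_SYSFS_DIR defaults to /sys/block/md1/md (assumes the array is md1).
md_check_result() {
    md_dir=${1:-/sys/block/md1/md}
    printf 'sync_action=%s mismatch_cnt=%s\n' \
        "$(cat "$md_dir/sync_action")" \
        "$(cat "$md_dir/mismatch_cnt")"
}
```

A nonzero `mismatch_cnt` after a completed check would suggest that the two mirrors differ somewhere, although it does not say which copy is the correct one.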
View SMART data for the two underlying SSDs. I see the following lines
for both SSDs:
Media and Data Integrity Errors: 0
Error Information Log Entries: [a number greater than zero]
I am pretty sure the "Error Information Log Entries" are due to a bug
that produces 2-3 harmless SMART errors (0x4004 = "Invalid Field in
Command") on every reboot. Full smartctl output follows:
$ sudo smartctl -a /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-157-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 2TB
Serial Number: S59CNM0W632762B
Firmware Version: 2B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,421,133,746,176 [1.42 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5631916c9e
Local Time is: Sun Oct 12 13:18:22 2025 EDT
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.50W - - 0 0 0 0 0 0
1 + 5.90W - - 1 1 1 1 0 0
2 + 3.60W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 59 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 255,077,405 [130 TB]
Data Units Written: 108,605,632 [55.6 TB]
Host Read Commands: 931,054,813
Host Write Commands: 2,127,365,573
Controller Busy Time: 12,608
Power Cycles: 26
Power On Hours: 7,311
Unsafe Shutdowns: 17
Media and Data Integrity Errors: 0
Error Information Log Entries: 187
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 59 Celsius
Temperature Sensor 2: 66 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 187 0 0x800a 0x4004 - 0 0 -
$ sudo smartctl -a /dev/nvme1n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-157-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 2TB
Serial Number: S59CNM0W432692Z
Firmware Version: 2B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,421,133,684,736 [1.42 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5431926d30
Local Time is: Sun Oct 12 13:18:27 2025 EDT
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.50W - - 0 0 0 0 0 0
1 + 5.90W - - 1 1 1 1 0 0
2 + 3.60W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 59 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 194,438,360 [99.5 TB]
Data Units Written: 104,493,528 [53.5 TB]
Host Read Commands: 444,201,317
Host Write Commands: 2,118,258,242
Controller Busy Time: 12,148
Power Cycles: 23
Power On Hours: 7,148
Unsafe Shutdowns: 16
Media and Data Integrity Errors: 0
Error Information Log Entries: 154
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 59 Celsius
Temperature Sensor 2: 66 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 154 0 0x7011 0x4004 - 0 0 -
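To compare the two drives over time without reading the full reports, the two counters of interest can be extracted from the `smartctl -a` output. This is a rough sketch that assumes the field labels appear exactly as in the output above; `smart_error_counters` is a hypothetical helper name.

```shell
#!/bin/sh
# smart_error_counters
# Hypothetical helper: read `smartctl -a` output for an NVMe drive on
# stdin and print its two error counters as name=value lines.
smart_error_counters() {
    awk -F': *' '
        /^Media and Data Integrity Errors:/ { print "media_errors=" $2 }
        /^Error Information Log Entries:/   { print "error_log_entries=" $2 }
    '
}
```

Usage would be `sudo smartctl -a /dev/nvme0n1 | smart_error_counters`, repeated for /dev/nvme1n1. A `media_errors` value that starts to rise on only one drive would point at that drive.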
Thank you for any clues about what to do next.
Dan
