BLU Discuss list archive
[Discuss] Help with I/O errors on RAID array?
- Subject: [Discuss] Help with I/O errors on RAID array?
- From: markw at mohawksoft.com (markw at mohawksoft.com)
- Date: Mon, 13 Oct 2025 08:29:08 -0400
- In-reply-to: <26860.17140.637176.437320@gargle.gargle.HOWL>
- References: <mailman.1.1749052801.15856.discuss@lists.blu.org> <26860.17140.637176.437320@gargle.gargle.HOWL>
Look at /var/log/messages (/var/log/syslog on Ubuntu) and/or run "dmesg"
and look for I/O errors. It should show you which drive is failing.

Also, if you are on Ubuntu, you should check out ZFS.

> Hi folks. A few files on my Ubuntu RAID-1 partition (two M.2 SSDs) are
> producing input/output errors when read, like this:
>
>   $ cat myfile
>   cat: myfile: Input/output error
>
> I can't seem to figure out what to fix. I've run fsck, badblocks,
> smartctl, and mdadm, as shown below, and none of them reports any
> errors. I'd appreciate any advice, especially (1) what to do next, and
> (2) how to figure out which (if any) of the two SSDs is faulty.
>
> Here's what I checked, in detail.
>
> File owner & mode (both normal):
>
>   $ ls -l myfile
>   -rw-r--r-- 1 dbarrett dbarrett 29235272 Apr 14 2006 myfile
>
> Identify the RAID partition containing the file. It's /dev/md1p1:
>
>   $ df .
>   Filesystem      Size  Used Avail Use% Mounted on
>   /dev/md1p1      1.8T  1.2T  539G  70% /
>
> Reboot and force fsck on /dev/md1p1:
>
>   $ sudo touch /forcefsck
>   $ sudo reboot
>   (no errors reported)
>
> Find the RAID configuration for device md1:
>
>   $ grep '^md1 ' /proc/mdstat
>   md1 : active raid1 nvme1n1p1[2] nvme0n1p1[3]
>
> Boot on an Ubuntu rescue disk and run badblocks on the two underlying
> disks. No bad blocks found (empty output):
>
>   $ sudo badblocks /dev/nvme0n1
>   $ sudo badblocks /dev/nvme1n1
>
> Run a RAID check on /dev/md1. The operation completed successfully:
>
>   $ sudo mdadm --action=check /dev/md1
>   (Wait 2 hours for completion)
>   $ sudo mdadm --detail /dev/md1
>   /dev/md1:
>              Version : 1.2
>        Creation Time : Sat Feb 29 12:33:09 2020
>           Raid Level : raid1
>           Array Size : 1953381376 (1862.89 GiB 2000.26 GB)
>        Used Dev Size : 1953381376 (1862.89 GiB 2000.26 GB)
>         Raid Devices : 2
>        Total Devices : 2
>          Persistence : Superblock is persistent
>
>        Intent Bitmap : Internal
>
>          Update Time : Sun Oct 12 20:06:22 2025
>                State : active
>       Active Devices : 2
>      Working Devices : 2
>       Failed Devices : 0
>        Spare Devices : 0
>
>   Consistency Policy : bitmap
>
>                 Name : kubuntu:1
>                 UUID : 7c84adca:31e96bad:b1be03ae:d7d0349d
>               Events : 190769
>
>       Number   Major   Minor   RaidDevice State
>          3       259       2        0      active sync   /dev/nvme0n1p1
>          2       259       3        1      active sync   /dev/nvme1n1p1
>
> View SMART data for the two underlying SSDs. I see the following lines
> for both SSDs:
>
>   Media and Data Integrity Errors: 0
>   Error Information Log Entries: [a number greater than zero]
>
> I am pretty sure the "Error Information Log Entries" are due to a bug
> that produces 2-3 harmless SMART errors (0x4004 = "Invalid Field in
> Command") on every reboot. Full smartctl output follows:
>
> $ sudo smartctl -a /dev/nvme0n1
> smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-157-generic] (local build)
> Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Number:                       Samsung SSD 970 EVO Plus 2TB
> Serial Number:                      S59CNM0W632762B
> Firmware Version:                   2B2QEXM7
> PCI Vendor/Subsystem ID:            0x144d
> IEEE OUI Identifier:                0x002538
> Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
> Unallocated NVM Capacity:           0
> Controller ID:                      4
> NVMe Version:                       1.3
> Number of Namespaces:               1
> Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
> Namespace 1 Utilization:            1,421,133,746,176 [1.42 TB]
> Namespace 1 Formatted LBA Size:     512
> Namespace 1 IEEE EUI-64:            002538 5631916c9e
> Local Time is:                      Sun Oct 12 13:18:22 2025 EDT
> Firmware Updates (0x16):            3 Slots, no Reset required
> Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
> Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
> Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
> Maximum Data Transfer Size:         512 Pages
> Warning  Comp. Temp. Threshold:     85 Celsius
> Critical Comp. Temp. Threshold:     85 Celsius
>
> Supported Power States
> St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
>  0 +     7.50W       -        -    0  0  0  0        0       0
>  1 +     5.90W       -        -    1  1  1  1        0       0
>  2 +     3.60W       -        -    2  2  2  2        0       0
>  3 -   0.0700W       -        -    3  3  3  3      210    1200
>  4 -   0.0050W       -        -    4  4  4  4     2000    8000
>
> Supported LBA Sizes (NSID 0x1)
> Id Fmt  Data  Metadt  Rel_Perf
>  0 +     512       0         0
>
> === START OF SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> SMART/Health Information (NVMe Log 0x02)
> Critical Warning:                   0x00
> Temperature:                        59 Celsius
> Available Spare:                    100%
> Available Spare Threshold:          10%
> Percentage Used:                    0%
> Data Units Read:                    255,077,405 [130 TB]
> Data Units Written:                 108,605,632 [55.6 TB]
> Host Read Commands:                 931,054,813
> Host Write Commands:                2,127,365,573
> Controller Busy Time:               12,608
> Power Cycles:                       26
> Power On Hours:                     7,311
> Unsafe Shutdowns:                   17
> Media and Data Integrity Errors:    0
> Error Information Log Entries:      187
> Warning  Comp. Temperature Time:    0
> Critical Comp. Temperature Time:    0
> Temperature Sensor 1:               59 Celsius
> Temperature Sensor 2:               66 Celsius
>
> Error Information (NVMe Log 0x01, 16 of 64 entries)
> Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
>   0        187     0  0x800a  0x4004      -            0     0     -
>
>
> $ sudo smartctl -a /dev/nvme1n1
> smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-157-generic] (local build)
> Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Number:                       Samsung SSD 970 EVO Plus 2TB
> Serial Number:                      S59CNM0W432692Z
> Firmware Version:                   2B2QEXM7
> PCI Vendor/Subsystem ID:            0x144d
> IEEE OUI Identifier:                0x002538
> Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
> Unallocated NVM Capacity:           0
> Controller ID:                      4
> NVMe Version:                       1.3
> Number of Namespaces:               1
> Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
> Namespace 1 Utilization:            1,421,133,684,736 [1.42 TB]
> Namespace 1 Formatted LBA Size:     512
> Namespace 1 IEEE EUI-64:            002538 5431926d30
> Local Time is:                      Sun Oct 12 13:18:27 2025 EDT
> Firmware Updates (0x16):            3 Slots, no Reset required
> Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
> Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
> Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
> Maximum Data Transfer Size:         512 Pages
> Warning  Comp. Temp. Threshold:     85 Celsius
> Critical Comp. Temp. Threshold:     85 Celsius
>
> Supported Power States
> St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
>  0 +     7.50W       -        -    0  0  0  0        0       0
>  1 +     5.90W       -        -    1  1  1  1        0       0
>  2 +     3.60W       -        -    2  2  2  2        0       0
>  3 -   0.0700W       -        -    3  3  3  3      210    1200
>  4 -   0.0050W       -        -    4  4  4  4     2000    8000
>
> Supported LBA Sizes (NSID 0x1)
> Id Fmt  Data  Metadt  Rel_Perf
>  0 +     512       0         0
>
> === START OF SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> SMART/Health Information (NVMe Log 0x02)
> Critical Warning:                   0x00
> Temperature:                        59 Celsius
> Available Spare:                    100%
> Available Spare Threshold:          10%
> Percentage Used:                    0%
> Data Units Read:                    194,438,360 [99.5 TB]
> Data Units Written:                 104,493,528 [53.5 TB]
> Host Read Commands:                 444,201,317
> Host Write Commands:                2,118,258,242
> Controller Busy Time:               12,148
> Power Cycles:                       23
> Power On Hours:                     7,148
> Unsafe Shutdowns:                   16
> Media and Data Integrity Errors:    0
> Error Information Log Entries:      154
> Warning  Comp. Temperature Time:    0
> Critical Comp. Temperature Time:    0
> Temperature Sensor 1:               59 Celsius
> Temperature Sensor 2:               66 Celsius
>
> Error Information (NVMe Log 0x01, 16 of 64 entries)
> Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
>   0        154     0  0x7011  0x4004      -            0     0     -
>
> Thank you for any clues what to do next.
> Dan
> _______________________________________________
> Discuss mailing list
> Discuss at lists.blu.org
> https://lists.blu.org/mailman/listinfo/discuss
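
The suggestion to look through dmesg for I/O errors can be partly automated. What follows is only a rough sketch in Python, not anything either poster ran: it tallies kernel-log lines of the form "I/O error, dev X" or "Buffer I/O error on dev X" per device. The function name count_io_errors, the regular expression, and the sample log lines are all assumptions made up for illustration; the sample is NOT output from the machine discussed in this thread, and real kernel messages vary between kernel versions.

```python
import re
from collections import Counter

# Assumed message shapes (they differ between kernel versions):
#   "... I/O error, dev nvme0n1, sector 2048 ..."
#   "Buffer I/O error on dev md1p1, logical block 256, ..."
IO_ERROR_RE = re.compile(r"I/O error(?:,| on) dev (?P<dev>[\w-]+)")


def count_io_errors(log_text):
    """Return a Counter mapping device name -> number of I/O error lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IO_ERROR_RE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


# Hypothetical sample for demonstration only; not from the poster's system.
SAMPLE_LOG = """\
[  101.000001] blk_update_request: I/O error, dev nvme0n1, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  101.000002] Buffer I/O error on dev md1p1, logical block 256, async page read
[  101.000003] nvme nvme1: pci function 0000:02:00.0
[  102.000001] blk_update_request: I/O error, dev nvme0n1, sector 2056 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
"""

if __name__ == "__main__":
    # Print devices with the most error lines first.
    for dev, n in count_io_errors(SAMPLE_LOG).most_common():
        print(f"{dev}: {n} I/O error line(s)")
```

To try it on a real system, one could pass the saved output of "sudo dmesg" to count_io_errors() instead of SAMPLE_LOG. Errors reported against an underlying device such as nvme0n1, rather than only against md1p1, would point at a specific drive; an empty result would suggest the kernel is not logging block-layer errors in the assumed format.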
