CAM SCSI errors / Smart Errors with Spinning disks only?
9211-8i HBA IT mode
BPN-SAS2-216EL1 backplane/SAS expander.
I started a thread on the FreeNAS forums (https://forums.freenas.org/index.php?threads/cam-status-errors-with-only-spinning-disks-not-ssd.50146/), but it was recommended I open a bug ticket as well.
I have two pools of spinning media and one pool of SSD media. On the older spinning-disk pool I reliably hit CAM SCSI errors almost immediately in any ZFS configuration (stripe, mirrors, raidz, etc.). On the other spinning-disk pool I get SMART errors for Current_Pending_Sector and Offline_Uncorrectable within hours of rsyncing data to it. Disks in this pool constantly drop out and resilver. The SMART errors go away when the disks are zeroed out.
All spinning media has been tested with badblocks and the burn-in methods described on the FreeNAS forums. I can dd zeroes to all disks simultaneously and each disk shows 100% busy without issue. I can also dd from each disk to another disk in a pool pair without issue.
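A sketch of that kind of parallel burn-in pass (not the exact commands from the forum guide; DISKS is a placeholder list, and the commands are written to a script for review rather than executed, since they destroy all data):

```shell
# Generate a burn-in script: one destructive write-mode badblocks pass per
# disk, then zero every disk in parallel with dd. DISKS is a placeholder;
# review burnin.sh before running it (it DESTROYS all data on those disks).
DISKS="da0 da1 da2 da3"
{
  for d in $DISKS; do
    # destructive badblocks, 4 KiB blocks, backgrounded per disk
    echo "badblocks -ws -b 4096 /dev/$d &"
  done
  echo "wait"
  for d in $DISKS; do
    # zero each disk in parallel; every disk should show ~100% busy in gstat
    echo "dd if=/dev/zero of=/dev/$d bs=1M &"
  done
  echo "wait"
} > burnin.sh
```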
My SSD pool, however, is completely fine and shows no issues, so I find it unlikely this is controller related, though I have also tested another controller in JBOD mode (it was not a pure HBA card, so it is not an apples-to-apples comparison).
As noted in the thread, I have verified that power to the rails (though not each individual Molex connector) is in spec, and verified it under load. I have also replaced the backplane in my chassis.
For some reason, these issues only occur under ZFS usage.
One of the disks in the largepool did return an error showing "FPDMA QUEUED" read and write operations as the last commands when the error was caught, so I question whether this is related to queueing somehow?
#4 Updated by Terry Zink over 3 years ago
- Assignee changed from Chris Torek to Kris Moore
- Seen in changed from Unspecified to 9.10.2-U1
- Hardware Configuration updated (diff)
The following errors appear in the dmesg log from the raidz2 pool (Seagate 4 TB drives) upon syncing data from my old Linux NAS to the pool via rsync.
Again, these SMART errors appear and disks start dropping out/failing (my raidz2 pool has now failed).
Again, all disks can be fully tested with badblocks and dd'd end to end, and this clears the SMART errors / reports no bad sectors.
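For completeness, this is the sort of check used to confirm the attribute counts go back to zero after zeroing (da0 is a placeholder device name; `pending_counts` is a hypothetical helper function, not a smartctl feature):

```shell
# Filter the two SMART attributes of interest out of `smartctl -A` output.
pending_counts() {
  grep -E 'Current_Pending_Sector|Offline_Uncorrectable'
}
# usage on a real system: smartctl -A /dev/da0 | pending_counts
```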
#6 Updated by Chris Torek over 3 years ago
- Status changed from Unscreened to Closed: Cannot reproduce
It's almost certainly hardware (I never rule out software bugs even in software that's been working perfectly for 40 years :-) ). The FPDMA-queued thing is just because all these drives work best (throughput wise) if you queue the I/O as a tagged operation, so virtually all I/O is "FPDMA-queued".
Others have seen similar problems (on all kinds of OSes, not just FreeBSD/FreeNAS) due to drive failures, bad cables, or failing power supplies. That includes power supplies that test as good. I've seen some people identify these by oscilloscope readings, watching the voltage drop below some critical point for just a millisecond or so as the disk does something particularly power-hungry, but mostly people seem to diagnose failing power supplies by swapping out the power supply. (The old DEC service joke: "how does the field service guy diagnose a flat tire? by swapping out all four tires until the system is OK.")
Power supply, cables, and/or connectors (e.g., a tiny bit of corrosion leading to voltage issues) seem likely given that the error occurs only when the drive is doing "real work", and that you can make the bad spot go away by rewriting it.
#7 Updated by Terry Zink over 3 years ago
Just an update. Thanks for the info.
It's worth noting I've done some more investigation, and it appears this is related to NCQ.
If I disable NCQ on the WD 750 GB spinners (camcontrol tags daX -N 1), the issues go away (obviously with a performance hit for random I/O).
If I run four of the 750s in a pool with NCQ enabled (camcontrol tags daX -N 255), it's fine. If I run more than four, it starts erroring immediately.
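For anyone reproducing this, the per-disk queue-depth change can be scripted across a whole pool along these lines (DISKS is a placeholder list; the commands are written to a script for review before running, rather than executed directly):

```shell
# Generate a script that sets the NCQ queue depth on every member of the
# suspect pool. DISKS is a placeholder; DEPTH=1 disables queueing, and
# DEPTH=255 restores full NCQ.
DISKS="da4 da5 da6 da7"
DEPTH=1
for d in $DISKS; do
  echo "camcontrol tags $d -N $DEPTH"
done > settags.sh
```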
Strangely as well, so far the Seagates are functioning "OK" in a pool of two raidz1 vdevs of four disks each.
I'm wondering: could a hardware issue cause errors that only occur with queueing?