Bug #20784

CAM SCSI errors / Smart Errors with Spinning disks only?

Added by Terry Zink over 3 years ago. Updated about 3 years ago.

Status:
Closed: Cannot reproduce
Priority:
No priority
Assignee:
Chris Torek
Category:
OS
Target version:
Seen in:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

9211-8i HBA IT mode
BPN-SAS2-216EL1 backplane/SAS expander.

ChangeLog Required:
No

Description

I started a thread in the FreeNAS forums (https://forums.freenas.org/index.php?threads/cam-status-errors-with-only-spinning-disks-not-ssd.50146/), but it was recommended I open a bug ticket as well.

I have two pools of spinning media and one pool of SSD media. On an older spinning-disk pool I am guaranteed to receive CAM SCSI errors quickly, no matter what ZFS configuration I use (stripe, mirrors, raidz, etc.). On the other spinning-disk pool I receive SMART errors relating to Current_Pending_Sector and Offline_Uncorrectable within hours of rsyncing data to it. Disks in that pool constantly drop out and resilver. These SMART errors go away when the disks are zeroed out.

All spinning media has been tested with badblocks and the burn-in methods described on the FreeNAS forums. I can dd zeroes to all disks simultaneously and each disk shows 100% busy without issue. I can also dd from one disk to another in a mirrored pair without issue.
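
(For reference, the burn-in steps described above amount to something like the following; the daX/daY device names are placeholders for the actual pool members, badblocks comes from the e2fsprogs port, and the write tests are destructive:)

    # destructive write/read surface test of one disk
    badblocks -b 4096 -ws /dev/daX

    # zero the whole disk; it should sit near 100% busy throughout (as seen in, e.g., gstat)
    dd if=/dev/zero of=/dev/daX bs=1M

    # raw disk-to-disk copy between two members of a pair
    dd if=/dev/daX of=/dev/daY bs=1M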

My SSD pool, however, is completely fine and shows no issues, so I find it unlikely this is controller related, though I have also tested another controller in JBOD mode (it was not a pure HBA card, so it is not an apples-to-apples comparison).

As noted in the thread, I have verified that power on the rails (though not at each individual Molex connector) is in spec, and verified it under load. I have also replaced the backplane in my chassis.

For some reason, these issues only occur under ZFS usage.

One of the disks in the largepool did return an error showing read and write "FPDMA QUEUED" commands as the last commands when the error was caught, so I question whether this is related to queueing somehow.
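
(For reference, the last commands the drive logged when the error was caught can be pulled from smartctl's extended output; daX is a placeholder, and -d sat may or may not be needed depending on how the HBA presents the drive:)

    # the device error log entries list the last commands, e.g. READ FPDMA QUEUED / WRITE FPDMA QUEUED
    smartctl -x /dev/daX
    # same, explicitly forcing the SCSI-to-ATA translation layer
    smartctl -x -d sat /dev/daX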

History

#1 Updated by Terry Zink over 3 years ago

  • File debug-freenas-20170201163241.txz added

#2 Updated by Bonnie Follweiler over 3 years ago

  • Assignee set to Kris Moore

#3 Updated by Kris Moore over 3 years ago

  • Assignee changed from Kris Moore to Chris Torek

While I do suspect some sort of hardware issue here, I'll send it over to Torek for a second opinion.

#4 Updated by Terry Zink over 3 years ago

  • Assignee changed from Chris Torek to Kris Moore
  • Seen in changed from Unspecified to 9.10.2-U1
  • Hardware Configuration updated (diff)

Further updates:

The following errors appear in the dmesg log from the raidz2 pool (Seagate 4 TB drives) upon syncing data from my old Linux NAS to the pool via rsync:
http://pastebin.com/WYUEwtNJ

Again, these SMART errors appear and I start seeing disks drop out/fail (my raidz2 pool has now failed).

Again, all disks can be fully run through badblocks and dd, which clears the SMART errors and reports no bad sectors.
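
(The attributes in question can be checked before and after the wipe with something like this; daX is a placeholder:)

    smartctl -A /dev/daX | egrep 'Current_Pending_Sector|Offline_Uncorrectable'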

#5 Updated by Terry Zink over 3 years ago

  • Assignee changed from Kris Moore to Chris Torek

#6 Updated by Chris Torek over 3 years ago

  • Status changed from Unscreened to Closed: Cannot reproduce

It's almost certainly hardware (I never rule out software bugs, even in software that's been working perfectly for 40 years :-) ). The FPDMA-queued thing is just because all these drives work best (throughput-wise) if you queue the I/O as a tagged operation, so virtually all I/O is "FPDMA queued".
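
(For reference, the tagged/NCQ queue depth CAM is using for a drive can be inspected with camcontrol; da0 here is just an example device:)

    # show the current number of tag openings (and the device's min/max) for one drive
    camcontrol tags da0 -v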

Others have seen similar problems (on all kinds of OSes, not just FreeBSD/FreeNAS) due to drive failures, bad cables, or failing power supplies. That includes power supplies that test as good. I've seen some people identify these by oscilloscope readings, watching the voltage drop below some critical point for just a millisecond or so as the disk does something particularly power-hungry, but mostly people seem to diagnose failing power supplies by swapping out the power supply. (The old DEC service joke: "how does the field service guy diagnose a flat tire? by swapping out all four tires until the system is OK.")

Power supply, cables, and/or connectors (e.g., a tiny bit of corrosion leading to voltage issues) seem likely given that the error occurs only when the drive is doing "real work", and that you can make the bad spot go away by rewriting it.

#7 Updated by Terry Zink over 3 years ago

Hi Chris,

Just an update; thanks for the info.

It's worth noting I've done some more investigation, and it appears this is related to NCQ.

If I disable NCQ on the WD 750 GB spinners (camcontrol tags daX -N 1), the issues go away (obviously with a performance hit for random I/O).

If I run 4 of the 750s in a pool with NCQ enabled (camcontrol tags daX -N 255), it's fine. If I run more than 4, it starts erroring immediately.
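
(A rough sketch of applying that across a set of pool members and confirming the setting, assuming the disks are da0 through da7:)

    #!/bin/sh
    # drop every listed disk to a single outstanding command (effectively no NCQ), then verify
    for d in da0 da1 da2 da3 da4 da5 da6 da7; do
        camcontrol tags "$d" -N 1
        camcontrol tags "$d" -v | grep -i openings
    done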

Strangely as well, so far the Seagates are functioning "ok" in a pool with two raidz1 vdevs of 4 disks each.

I'm wondering, could a hardware issue cause problems that only show up with queueing?

#8 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-freenas-20170201163241.txz)

#9 Updated by Dru Lavigne about 3 years ago

  • Target version set to N/A
  • Private changed from Yes to No
