Feature #38978

Add support for NVMe device replacement

Added by Alexander Motin 9 months ago. Updated about 2 months ago.

Status: Ready for Testing
Priority: No priority
Assignee: Alexander Motin
Category: Hardware
Target version:
Estimated time:
Severity: Med High
Reason for Closing:
Reason for Blocked:
Needs QA: Yes
Needs Doc: No
Needs Merging: No
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:

Description

NVMe needs support for both hot-plug and hot-unplug.

For hot-plug there are two potential issues: a) getting PCI to report that something happened, and b) hoping that enough resources are reserved to allocate from (which may be difficult).

For unplug, a clean teardown is obviously needed; one of the problems is that the device no longer responds to accesses, since it is no longer there.

History

#1 Updated by Alexander Motin 9 months ago

  • Description updated (diff)
  • Status changed from Unscreened to Screened

#2 Updated by Alexander Motin 2 months ago

  • Status changed from Screened to In Progress
  • Target version changed from Backlog to 11.2-U3
  • Parent task deleted (#31596)

I've made a few fixes there, including r343447, which is already in the 11.3-stable branch. Unfortunately, the resource allocation problem on plug-in is complicated and may still require a reboot, but a number of the other issues should be handled now.

#3 Updated by Alexander Motin 2 months ago

Just for the record, FreeBSD head has finally enabled PCI BAR reallocation (https://svnweb.freebsd.org/changeset/base/344022), which, if it works well, may be a step towards PCI resource reservation.

#4 Updated by Alexander Motin about 2 months ago

  • Status changed from In Progress to Ready for Testing

I've merged into 11.2-stable a change that should allow hot NVMe device replacement, even under load, at least when the device is disabled with `devctl disable nvmeX` before removal.

QE: The minimal test, doable on any NVMe hardware, is `devctl disable nvmeX`/`devctl enable nvmeX` under load (take care not to upset ZFS by removing a critical or only vdev). The maximal test would be a real hot NVMe device replacement on the M50 platform (with an explicit `devctl disable` first). I haven't tested that since these changes, but there is a chance it works now, since the resources freed by the removed device should be enough for the inserted one.
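
For the minimal test, the disable/enable cycle might be scripted roughly as below. This is only a sketch, not a verified test procedure: the controller name `nvme0`, the `DRY_RUN` and `PAUSE` variables, and the `run`/`cycle_nvme` helpers are all illustrative assumptions, and the I/O load is expected to be started separately by the tester. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: cycle an NVMe controller with devctl(8) while I/O load runs elsewhere.
# Assumptions (not from the ticket): CTRL, DRY_RUN, PAUSE and both helper
# functions are hypothetical names chosen for this example.

CTRL="${CTRL:-nvme0}"     # controller to cycle; substitute the real nvmeX
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only (safe default), 0 = execute
PAUSE="${PAUSE:-5}"       # seconds to leave the controller disabled

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Disable the controller, wait, then enable it again.
cycle_nvme() {
    run devctl disable "$1" || return 1
    run sleep "$PAUSE"
    run devctl enable "$1"
}

cycle_nvme "$CTRL"
```

Running it with `DRY_RUN=0 CTRL=nvmeX` as root would actually detach and reattach the controller, so make sure the pool can tolerate losing that device first.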
