Bug #80319

Merge from FreeBSD fix for deadlock in ZFS IO pipeline

Added by Alexander Motin over 2 years ago. Updated over 2 years ago.

Status: Done
Priority: No priority
Assignee: Alexander Motin
Category: OS
Target version:
Seen in:
Severity: Medium
Reason for Closing:
Reason for Blocked:
Needs QA: No
Needs Doc: No
Needs Merging: No
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

The performance team has triggered a ZFS deadlock at least twice while running a heavy NFS benchmark. Investigation showed that this problem was already fixed in FreeBSD at r339299. We need to merge that fix in.

Associated revisions

Revision 404319ed (diff)
Added by Alexander Motin over 2 years ago

MFC r339299: Pull in a follow-on commit to resolve a deadlock in ZFS sequential resilver (r334844)

MFV/ZoL: Fix deadlock in IO pipeline

commit a76f3d0437e5e974f0f748f8735af3539443b388
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Fri Mar 16 16:46:06 2018 -0700

Fix deadlock in IO pipeline

In vdev_queue_aggregate() the zio_execute() bypass should not be called under the vdev queue lock. This can result in a deadlock as shown in the stack traces below.

Drop the vdev queue lock then walk the parents of the aggregate IO to determine the list of component IOs to be bypassed. This can be done safely without holding the io_lock since the new aggregate IO has not yet been returned and its parents cannot change.

--- THREAD 1 ---
arc_read()
zio_nowait()
zio_vdev_io_start()
vdev_queue_io() <--- mutex_enter(vq->vq_lock)
vdev_queue_io_to_issue()
vdev_queue_aggregate()
zio_execute()
vdev_queue_io_to_issue()
vdev_queue_aggregate()
zio_execute()
zio_vdev_io_assess()
zio_wait_for_children() <- mutex_enter(zio->io_lock)

--- THREAD 2 --- (inverse order)
arc_read()
zio_change_priority() <- mutex_enter(zio->zio_lock)
vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ticket: #80319
(cherry picked from commit 4c0a72f14452f163891b74a31d221bceb4672ed7)
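The lock-ordering problem described in the commit message can be illustrated with a small user-space sketch. This is not the ZFS code: `struct io`, `struct queue`, `io_execute()`, `io_change_priority()` and `queue_flush()` are hypothetical stand-ins built on POSIX mutexes, and the real fix walks the parents of the aggregate IO instead of copying pointers into an array. The sketch only shows the general pattern of the fix: take a snapshot of the work while holding the queue lock, drop the queue lock, and only then call code that acquires the per-I/O lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_BATCH 8

/* Hypothetical stand-ins for a zio and a vdev queue; not the ZFS types. */
struct io {
    pthread_mutex_t io_lock;
    int done;
};

struct queue {
    pthread_mutex_t q_lock;
    struct io *pending[MAX_BATCH];
    size_t npending;
};

void io_init(struct io *io)
{
    pthread_mutex_init(&io->io_lock, NULL);
    io->done = 0;
}

void queue_init(struct queue *q)
{
    pthread_mutex_init(&q->q_lock, NULL);
    q->npending = 0;
}

void queue_add(struct queue *q, struct io *io)
{
    pthread_mutex_lock(&q->q_lock);
    assert(q->npending < MAX_BATCH);
    q->pending[q->npending++] = io;
    pthread_mutex_unlock(&q->q_lock);
}

/* Execution path: acquires the per-I/O lock. */
void io_execute(struct io *io)
{
    pthread_mutex_lock(&io->io_lock);
    io->done = 1;
    pthread_mutex_unlock(&io->io_lock);
}

/* Priority-change path: per-I/O lock first, then the queue lock. */
void io_change_priority(struct queue *q, struct io *io)
{
    pthread_mutex_lock(&io->io_lock);
    pthread_mutex_lock(&q->q_lock);
    /* A real implementation would requeue the I/O here. */
    pthread_mutex_unlock(&q->q_lock);
    pthread_mutex_unlock(&io->io_lock);
}

/*
 * Snapshot the pending list under q_lock, drop q_lock, then execute.
 * Calling io_execute() while still holding q_lock would acquire the
 * locks in the order q_lock -> io_lock, the inverse of the order used
 * by io_change_priority(), and two threads could deadlock.
 */
size_t queue_flush(struct queue *q)
{
    struct io *batch[MAX_BATCH];
    size_t n, i;

    pthread_mutex_lock(&q->q_lock);
    n = q->npending;
    for (i = 0; i < n; i++)
        batch[i] = q->pending[i];
    q->npending = 0;
    pthread_mutex_unlock(&q->q_lock);

    for (i = 0; i < n; i++)
        io_execute(batch[i]);   /* q_lock is no longer held here */
    return n;
}
```

With this ordering, `queue_flush()` never holds both locks at once, so it cannot form a cycle with `io_change_priority()` regardless of how the two threads interleave.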

History

#1 Updated by Dru Lavigne over 2 years ago

  • Target version changed from 11.2-U4 to 11.2-U3

#2 Updated by Alexander Motin over 2 years ago

QE: There is no particular test for this, since the issue is quite difficult to reproduce. The performance team can stress-test it as part of their normal work.

#3 Updated by Alexander Motin over 2 years ago

  • Status changed from In Progress to Passed Testing
  • Needs QA changed from Yes to No
  • Needs Merging changed from Yes to No

11-stable got it as part of a regular merge.
11.2-stable commit: https://github.com/freenas/os/commit/404319ed4a332b9481caa13f5a847b36a4f2f9f0

#4 Updated by Dru Lavigne over 2 years ago

  • Status changed from Passed Testing to Done
  • Needs Doc changed from Yes to No
