Bug #18150

QLogic NetXtreme II BCM57800 - Causes total crash when binding jail

Added by Bryon Brinkmann almost 5 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Expected
Assignee:
Chris Torek
Category:
OS
Target version:
Seen in:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

Build FreeNAS-9.10.1-U2 (f045a8b)
Platform Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
Memory 392965MB DDR 4 2133
Dell R630
2x 120gb SATADOM's
4x 1.9TB Samsung SSD's
6x 960gb Samsung SSD's
QLogic NetXtreme II BCM57800 2x1gb 2x10gb
Dell PERC H330 - Configured as NON-RAID (HBA)
2x 8gb Radian cache cards RMS-200

ChangeLog Required:
No

Description

I believe there is a bug with the QLogic NetXtreme II BCM57800. This is a bare-metal Dell R630 server. When a jail is created manually or via a plugin, the system hard-crashes while attempting to bind the QLogic NetXtreme II BCM57800 to the jail. I tried both LACP and a simple network interface; it still crashes. After a reboot, when FreeNAS attempts to start the jail, the system crashes again and keeps crashing in a loop. Crash dumps are attached, and the forum discussion is linked below for review.

https://forums.freenas.org/index.php?threads/dell-r630-freenas-v9-10-crashing-on-many-fronts.46648/

History

#1 Updated by Bryon Brinkmann almost 5 years ago

  • Category changed from 38 to 129

Not sure what category this falls under: networking, jails, drivers, etc.

#2 Updated by Josh Paetzel almost 5 years ago

  • Status changed from Unscreened to Screened
  • Assignee set to Chris Torek
  • Priority changed from No priority to Nice to have

It appears the driver is indeed misbehaving when interacting with the bridging code. Please attach a full system -> advanced -> save debug to this ticket when you get a chance.

#3 Updated by Bryon Brinkmann almost 5 years ago

  • File debug-uber-20161011081225.tar added

Josh Paetzel wrote:

It appears the driver is indeed misbehaving when interacting with the bridging code. Please attach a full system -> advanced -> save debug to this ticket when you get a chance.

Josh,
As requested, the debug file is attached. Do you know if this is addressed in the 10 release?

#4 Updated by Josh Paetzel almost 5 years ago

Well, there are no jails in 10, so I guess maybe. I think it uses NAT instead of bridging too. But if 10 tried to bridge this interface, it would panic there as well (the underlying OS is the same).

#5 Updated by Kris Moore over 4 years ago

  • Status changed from Screened to Closed: Not To Be Fixed

I'm not sure we'll have the resources to fix this; however, future updates to the base OS may bring fixes down the road.

#6 Updated by Chris Torek over 4 years ago

BTW this is almost certainly a bug in the BCM57800 driver (dev/bxe/*), failing to unlock a mutex somewhere. Without a full crash dump it's really hard to see where though. I'm suspicious of:

#define ECORE_SPIN_LOCK_BH(_spin)   mtx_lock(_spin) /* bh = bottom-half */
#define ECORE_SPIN_UNLOCK_BH(_spin) mtx_unlock(_spin) /* bh = bottom-half */

simply on the basis that it claims (by name) to be doing spinlocks but is obviously using a regular sleep mutex.

(Presumably there's some reason just changing this to a spin mutex, and using mtx_lock_spin on it, has never been done.)

#7 Updated by Chris Torek over 4 years ago

One final note: I see that the driver was basically rewritten in FreeBSD 10 (which FreeNAS 9.10 uses) and this got added:

/*
 * For the main interface up/down code paths, a not-so-fine-grained CORE
 * mutex lock is used. Inside this code are various calls to kernel routines
 * that can cause a sleep to occur. Namely memory allocations and taskqueue
 * handling. If using an MTX lock we are *not* allowed to sleep but we can
 * with an SX lock. This define forces the CORE lock to use and SX lock.
 * Undefine this and an MTX lock will be used instead. Note that the IOCTL
 * path can cause problems since it's called by a non-sleepable thread. To
 * alleviate a potential sleep, any IOCTL processing that results in the
 * chip/interface being started/stopped/reinitialized, the actual work is
 * offloaded to a taskqueue.
 */
#define BXE_CORE_LOCK_SX

and that should be the only lock held here, which should not cause problems. But... obviously not.
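
To make the rule in the quoted comment concrete, here is a minimal userland sketch of it. This is a toy model, not kernel code: every name in it (toy_thread, toy_lock, toy_may_sleep, toy_core_path_ok) is hypothetical and none of them is a FreeBSD API. It only encodes the idea that a path holding a mutex-style lock must not sleep, while a path holding an sx-style lock may.

```c
#include <stdbool.h>

/* Toy userland model; none of these names are FreeBSD kernel APIs. */
enum lock_kind { LOCK_MTX, LOCK_SX };

struct toy_thread {
    int mtx_held;  /* count of non-sleepable (mutex-like) locks held */
    int sx_held;   /* count of sleepable (sx-like) locks held */
};

static void toy_lock(struct toy_thread *td, enum lock_kind kind)
{
    if (kind == LOCK_MTX)
        td->mtx_held++;
    else
        td->sx_held++;
}

static void toy_unlock(struct toy_thread *td, enum lock_kind kind)
{
    if (kind == LOCK_MTX)
        td->mtx_held--;
    else
        td->sx_held--;
}

/* In this model, sleeping is allowed only when no mutex-like lock is held. */
static bool toy_may_sleep(const struct toy_thread *td)
{
    return td->mtx_held == 0;
}

/*
 * Models an interface up/down path: take the core lock, then reach a
 * point that may sleep (for example a blocking memory allocation).
 * Returns true if sleeping at that point is permitted by the model.
 */
static bool toy_core_path_ok(enum lock_kind core_lock_kind)
{
    struct toy_thread td = {0, 0};
    bool ok;

    toy_lock(&td, core_lock_kind);
    ok = toy_may_sleep(&td);
    toy_unlock(&td, core_lock_kind);
    return ok;
}
```

In this model, toy_core_path_ok(LOCK_SX) returns true and toy_core_path_ok(LOCK_MTX) returns false, which mirrors why the driver comment says the CORE lock is forced to be an SX lock.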

#8 Updated by Chris Torek over 4 years ago

  • Status changed from Closed: Not To Be Fixed to Investigation

I couldn't quite give it up, and found something. I've sent mail off to the freebsd-net list. I'm going to re-open this, and see what happens if and when anyone replies...

#9 Updated by Chris Torek over 4 years ago

  • Category changed from 129 to 137
  • Status changed from Investigation to Fix In Progress
  • Priority changed from Nice to have to Expected

It's a general OS bug. Since we use bridges pretty heavily we'll want to fix this.

See https://lists.freebsd.org/pipermail/freebsd-net/2016-December/046569.html for details.

#10 Updated by Kris Moore over 4 years ago

Ping! Is this relevant since we are on 11/stable now?

#11 Updated by Chris Torek over 4 years ago

I saw some commit go in that looked fix-ey, but am not sure if it's in 11 or if it was for this particular issue. I'll take a closer look later (today I hope).

#12 Updated by Chris Torek over 4 years ago

The commit to fix this was 86695e45f6c1f03347498616e8c81d70ac56fb58 (aka -r312782).

It was MFC'ed to stable/11 in 6fb6e78d7db7ea7b185cfb56f02ccc51ed1a1ec8.

So, we should be good. What's the right state for this bug now?

#13 Updated by Kris Moore over 4 years ago

  • Status changed from Fix In Progress to Resolved
  • Target version set to 9.10.3

There you go! Thanks!

#14 Updated by Bryon Brinkmann over 4 years ago

Kris Moore wrote:

There you go! Thanks!

Thanks for all the work. Once the update is released, I'll try to get it tested.

#15 Updated by Kris Moore over 4 years ago

  • Target version changed from 9.10.3 to 11.0

#16 Updated by Vaibhav Chauhan about 4 years ago

  • Target version changed from 11.0 to 11.0-RC

#17 Updated by Dru Lavigne over 3 years ago

  • File deleted (textdump.tar.0.gz)

#18 Updated by Dru Lavigne over 3 years ago

  • File deleted (textdump.tar.last.gz)

#19 Updated by Dru Lavigne over 3 years ago

  • File deleted (debug-uber-20161011081225.tar)
