Bug #35988

mpr(4) not enumerating some disks

Added by Lee Clements about 1 year ago. Updated 6 months ago.

Status:
Closed
Priority:
No priority
Assignee:
Alexander Motin
Category:
Hardware
Target version:
Seen in:
Severity:
Low Medium
Reason for Closing:
Cannot Reproduce
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

LSI 9300-8e Controller
8 x 800GB DP SAS 12G
48 x 6TB DP SAS 12G
5 JBOD Chassis

ChangeLog Required:
No

Description

Doing a new FreeNAS build with an LSI 9300-8e HBA in IT mode. Firmware on the HBA is flashed to the latest available from Broadcom, 16.00.01.00. The FreeNAS server has 5 JBOD arrays hanging off the HBA, with a total of 56 drives and what should be a total of 112 paths (two per drive). With mpr driver version 18.03.00.00-fbsd, approximately 10-15 drives don't show up in camcontrol at all. After upgrading to the latest mpr driver from Broadcom, version 20.00.00.00, I can see all 56 drives, but although the mpr driver reports 112 paths, FreeNAS isn't picking this up and displays only 79 paths in total.

I sent a message to Stephen McConnell, the original maintainer of the mpr(4) driver for FreeBSD, and he suggested that this is a middleware problem: since the driver itself can see all 112 paths just fine, something must be going wrong between the driver and FreeNAS.

I've attached the latest dmesg output as a tar. Happy to provide any other diagnostic information as necessary.


Related issues

Related to FreeNAS - Bug #36618: mpr(4) drive enumeration issue (Closed)

History

#1 Updated by Lee Clements about 1 year ago

  • File gmultipath.txt added

Adding the output of gmultipath list -a.

#2 Updated by Dru Lavigne about 1 year ago

  • Category changed from Middleware to OS
  • Assignee changed from Release Council to Alexander Motin
  • Private changed from No to Yes

#3 Updated by Alexander Motin about 1 year ago

  • Status changed from Unscreened to Blocked
  • Reason for Blocked set to Need additional information from Author

In the dmesg provided I see only 80 da devices, while there are indeed 112 lines of "SAS Address from SAS device page0". So unless this is an output artifact, I would not call it a middleware bug, or at least not only a middleware bug: something also appears to be wrong either within the driver or in the CAM layer of FreeBSD.

I'd like to ask for more information, for example a full FreeNAS debug archive (System -> Advanced -> Save Debug), possibly captured after enabling verbose messages on boot. It may also be useful to set dev.mpr.0.debug_level=0x23 or even dev.mpr.0.debug_level=0x223 to see what is going on with device mapping inside the driver.

Actually, not so long ago I fixed one nasty bug in the device mapping of the mpr driver in the nightly builds, which led to devices not being reported: https://github.com/freenas/os/commit/313e837c8c3ce7ae36ea16f1e7bdbde8541c15ba , so you may want to try a recent nightly build to see whether it changes anything.

#4 Updated by Lee Clements about 1 year ago

  • File dmesg-6-28-00-02.txt added

I'm also working with Ken Merry from FreeBSD on this, in parallel with this bug. I've enabled debugging on the mpr driver, after having to revert from the latest Broadcom driver back to the included mpr driver, as hw.mpr.0.debug_level seemed to have no discernible effect on dmesg output with driver version 20.00.00.00. Attached is that dmesg output. Let me know if this gives you what you need or whether we need further debug info.

#5 Updated by Lee Clements about 1 year ago

Unfortunately, booting into the latest nightly build, 201806270414, still shows only 80 da devices.

#6 Updated by Alexander Motin about 1 year ago

  • Subject changed from LSI 9300-8e Controller JBOD Multipath not functioning to LSI 9300-8e not enumerating some disks
  • Status changed from Blocked to Screened
  • Severity changed from New to Low Medium
  • Reason for Blocked deleted (Need additional information from Author)

Looking at the provided dmesg I see 122 lines of "Target ID for added device", which would make sense for 112 disks and 10 enclosure devices, but only 89 of them have unique IDs, and there are 33 lines of "Attempting to reuse target id ", which accounts for the difference. Based on that I am pretty sure the problem is somewhere inside the driver, firmware, or hardware. I'd recommend working on this with LSI/Broadcom or other FreeBSD developers. I may take a look at it at some point, but cannot say when, since I am pretty busy and we haven't seen this issue ourselves.
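
A tally like the one above can be reproduced from a saved dmesg. The following is a minimal C sketch, not part of the driver. It assumes log lines of the shape "mpr0: Attempting to reuse target id N handle 0xHHHH", as printed by the driver's mapping debug output. The names parse_reuse_line, summarize_reuse, reuse_summary, and MAX_TARGET_ID are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TARGET_ID 1024	/* assumed bound for this sketch, not a driver constant */

struct reuse_summary {
	int total;	/* lines that matched the "reuse" message */
	int unique_ids;	/* distinct target IDs among those lines */
};

/*
 * Parse one log line. Returns 1 and fills id/handle if the line contains
 * "Attempting to reuse target id <id> handle 0x<handle>", 0 otherwise.
 */
static int
parse_reuse_line(const char *line, int *id, unsigned int *handle)
{
	static const char marker[] = "Attempting to reuse target id ";
	const char *p = strstr(line, marker);

	if (p == NULL)
		return (0);
	return (sscanf(p + strlen(marker), "%d handle 0x%x", id, handle) == 2);
}

/* Count matching lines and how many distinct target IDs they mention. */
static struct reuse_summary
summarize_reuse(const char *const *lines, size_t nlines)
{
	unsigned char seen[MAX_TARGET_ID] = {0};
	struct reuse_summary s = {0, 0};
	size_t i;

	assert(lines != NULL || nlines == 0);
	for (i = 0; i < nlines; i++) {
		int id;
		unsigned int handle;

		if (!parse_reuse_line(lines[i], &id, &handle))
			continue;
		s.total++;
		if (id >= 0 && id < MAX_TARGET_ID && !seen[id]) {
			seen[id] = 1;
			s.unique_ids++;
		}
	}
	return (s);
}
```

Lines that do not contain the message are skipped, so the whole dmesg can be passed in unfiltered. A target ID that shows up more than once in the "reuse" lines was rejected for more than one device.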

#7 Updated by Lee Clements about 1 year ago

Alex,

With permission from Ken, I am reposting his findings here so you don't unnecessarily burn cycles.

From Ken:

"So, the mapping debugging shows this:

mpr0: Attempting to reuse target id 63 handle 0x000b
mpr0: Attempting to reuse target id 64 handle 0x000c
mpr0: Attempting to reuse target id 65 handle 0x000d
mpr0: Attempting to reuse target id 66 handle 0x000e
mpr0: Attempting to reuse target id 67 handle 0x000f
mpr0: Attempting to reuse target id 68 handle 0x0010
mpr0: Attempting to reuse target id 69 handle 0x0011
mpr0: Attempting to reuse target id 70 handle 0x0012
mpr0: Attempting to reuse target id 66 handle 0x000e
mpr0: Attempting to reuse target id 67 handle 0x000f
mpr0: Attempting to reuse target id 68 handle 0x0010
mpr0: Attempting to reuse target id 69 handle 0x0011
mpr0: Attempting to reuse target id 70 handle 0x0012
mpr0: Attempting to reuse target id 73 handle 0x0046
mpr0: Attempting to reuse target id 74 handle 0x004f
mpr0: Attempting to reuse target id 75 handle 0x0050
mpr0: Attempting to reuse target id 76 handle 0x0051
mpr0: Attempting to reuse target id 77 handle 0x0052
mpr0: Attempting to reuse target id 80 handle 0x0062
mpr0: Attempting to reuse target id 81 handle 0x006b
mpr0: Attempting to reuse target id 82 handle 0x006c
mpr0: Attempting to reuse target id 83 handle 0x006d
mpr0: Attempting to reuse target id 84 handle 0x006e
mpr0: Attempting to reuse target id 164 handle 0x0013
mpr0: Attempting to reuse target id 165 handle 0x0014
mpr0: Attempting to reuse target id 166 handle 0x0015
mpr0: Attempting to reuse target id 167 handle 0x0016
mpr0: Attempting to reuse target id 168 handle 0x0017
mpr0: Attempting to reuse target id 157 handle 0x002b
mpr0: Attempting to reuse target id 158 handle 0x002c
mpr0: Attempting to reuse target id 159 handle 0x002d
mpr0: Attempting to reuse target id 160 handle 0x002e
mpr0: Attempting to reuse target id 161 handle 0x002f

The code from FreeBSD/head in mprsas_add_device() in mpr_sas_lsi.c is:

/*
 * Only do the ID check and reuse check if the target is not from a
 * RAID Component. For Physical Disks of a Volume, the ID will be reused
 * when a volume is deleted because the mapping entry for the PD will
 * still be in the mapping table. The ID check should not be done here
 * either since this PD is already being used.
 */
targ = &sassc->targets[id];
if (!(targ->flags & MPR_TARGET_FLAGS_RAID_COMPONENT)) {
	if (mprsas_check_id(sassc, id) != 0) {
		mpr_dprint(sc, MPR_MAPPING|MPR_INFO,
		    "Excluding target id %d\n", id);
		error = ENXIO;
		goto out;
	}
	if (targ->handle != 0x0) {
		mpr_dprint(sc, MPR_MAPPING, "Attempting to reuse "
		    "target id %d handle 0x%04x\n", id, targ->handle);
		error = ENXIO;
		goto out;
	}
}

What has happened just above here is that it has called mpr_mapping_get_tid() (in mpr_mapping.c) to get the target ID for the given SAS address and handle.

The problem is that there are duplicate target IDs in the mapping table. So the driver logically bails out in the code above, and doesn’t add a second device at the same target ID.

I think the mapping code should do something different, perhaps rewriting the entry with a different target ID.

The band aid approach would probably be to clear the mapping table. Is there a way to clear it with the standard LSI utilities or does Lee need lsiutil?"
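
To illustrate the rejection Ken describes, here is a toy C model of the handle check only. It is a sketch under stated assumptions, not the driver's code: toy_target, toy_add_device, toy_count_rejected, and TOY_MAX_TARGETS are hypothetical names. Storing the handle on success is an assumption of this model, and the RAID-component and excluded-ID checks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TARGETS 256	/* assumed table size for this sketch */

struct toy_target {
	uint16_t handle;	/* 0 means the slot is free, as in the quoted check */
};

/*
 * Model of the quoted "targ->handle != 0x0" test: a device whose mapped
 * target ID is already occupied is refused with ENXIO.
 */
static int
toy_add_device(struct toy_target *targets, unsigned int id, uint16_t handle)
{
	assert(targets != NULL);
	if (id >= TOY_MAX_TARGETS)
		return (EINVAL);
	if (targets[id].handle != 0)
		return (ENXIO);	/* the "Attempting to reuse target id" case */
	targets[id].handle = handle;	/* assumption: slot is claimed on success */
	return (0);
}

/* Replay a list of (target ID, handle) mappings and count the rejections. */
static size_t
toy_count_rejected(const unsigned int *ids, const uint16_t *handles, size_t n)
{
	struct toy_target targets[TOY_MAX_TARGETS] = {{0}};
	size_t i, rejected = 0;

	for (i = 0; i < n; i++) {
		if (toy_add_device(targets, ids[i], handles[i]) != 0)
			rejected++;
	}
	return (rejected);
}
```

In this model, every duplicate target ID handed out by the mapping table costs exactly one device: the first one to claim the ID is added and each later one is dropped. That matches the symptom of fewer da devices than SAS addresses.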

#8 Updated by Alexander Motin about 1 year ago

  • Related to Bug #36618: mpr(4) drive enumeration issue added

#9 Updated by Alexander Motin about 1 year ago

  • Subject changed from LSI 9300-8e not enumerating some disks to mpr(4) not enumerating some disks

#10 Updated by Alexander Motin about 1 year ago

  • Category changed from OS to Hardware

#11 Updated by Alexander Motin 6 months ago

  • Target version changed from Backlog to N/A

I am closing this, since for the last 7 months we haven't seen any other reports like it.

#12 Updated by Dru Lavigne 6 months ago

  • File deleted (dmesg.tar)

#13 Updated by Dru Lavigne 6 months ago

  • File deleted (gmultipath.txt)

#14 Updated by Dru Lavigne 6 months ago

  • File deleted (dmesg-6-28-00-02.txt)

#15 Updated by Dru Lavigne 6 months ago

  • Status changed from Screened to Closed
  • Private changed from Yes to No
  • Reason for Closing set to Cannot Reproduce
