Bug #36618

mpr(4) drive enumeration issue

Added by Gary Wolfe 10 months ago. Updated 2 months ago.

Status:
Closed
Priority:
No priority
Assignee:
Alexander Motin
Category:
Hardware
Target version:
Seen in:
Severity:
Low Medium
Reason for Closing:
Cannot Reproduce
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

cpu: intel i7 3930k
ram: g.skill 64GB kit (F3-12800CL10Q2-64GBZL)
motherboard: Asus Rampage IV Extreme
NIC: Intel x540-T2
HBAs: 1x Avago 9305-24i, 1x Avago 9305-16e, 1x Avago LSI 9300-8e
Storage:
0) 1 Norco RPC-4224 populated w/above components and 24 Seagate ST4000VN000 4TB 'NAS' drives.
1) 3x Norco DS24E populated w/48 (24 each) IBM Deskstar 750GB SATA disks and the other w/24 Seagate ST4000VN000 4TB 'NAS' drives.
2) 2x Quantum Superloader 3 SAS LTO6 tape autoloaders.

ChangeLog Required:
No

Description

I posted this in the forums and was advised to file it as a bug. Note that the 'Seen in' field says 11.1-u5, but I originally saw this in 11.1-RELEASE and rolled back to 11.0-u4 to 'wait it out'. I only tried 11.1-u5 because I had some time and thought it had been long enough that this issue would surely have been discovered and addressed. It seems not.

From the forum post:

I have a very weird issue. I'm not even certain what to check to root-cause the thing. It's not strictly an 11.1-u5 issue, as it also happened in 11.1-RELEASE; I rolled back to 11.0-u4 and figured I'd wait it out. I have now tried 11.1-u5 and the same thing is happening. So I figured I'd break down and ask, wtfo!?

Topology:
4 fully populated 24-bay SAS JBOD boxes w/Areca SAS expanders: 2 w/750GB drives and 2 w/4TB drives.

When I update to 11.1-(RELEASE|u5) I get a hole in the set of drives the system sees. In 11.0-u4, /dev/da47 is the last drive in the 2nd of the 4TB-populated JBODs. In 11.1-u5 that drive/slot isn't enumerated by the OS at all; /dev/da47 is instead the first drive of the first of the 750GB-populated JBODs. When doing a sequential access 'test':

for d in {0..95}; do echo "--- /dev/da${d} ---"; dd if=/dev/da${d} of=/dev/null bs=512m count=1; done

and watching the LEDs on the drive carriers, it gets to da46 (slot 22, 0-indexed, of the 2nd 4TB JBOD), skips over slot 23 (0-indexed) of that same JBOD, and then accesses slot 0 of the first 750GB JBOD. In 11.0-u4 there are no issues; all 96 drives are present and accounted for.

I have a diskmap from both versions for further verification:

root@freenas11: for d in {45..50}; do grep da${d} 11-0-u4/disk_map.txt; done
da45: Serial Z3051GXE ; GPTID=gptid/1b227d4d-022b-11e7-a647-a0369f3c3d84
da46: Serial Z3051H2T ; GPTID=gptid/1c056812-022b-11e7-a647-a0369f3c3d84
da47: Serial Z3051PC5 ; GPTID=gptid/1ced0d90-022b-11e7-a647-a0369f3c3d84
da48: Serial GTD200P8G4MY6D ; GPTID=gptid/6a81b334-0224-11e7-92db-a0369f3c3d84
da49: Serial GTA200P8G4UTZA ; GPTID=gptid/6ca5f360-0224-11e7-92db-a0369f3c3d84
da50: Serial GTD200P8G4N6JD ; GPTID=gptid/6ed038c6-0224-11e7-92db-a0369f3c3d84

root@freenas11: for d in {45..50}; do grep da${d} 11-1-u5/disk_map.txt; done
da45: Serial Z3051GXE ; GPTID=gptid/1b227d4d-022b-11e7-a647-a0369f3c3d84
da46: Serial Z3051H2T ; GPTID=gptid/1c056812-022b-11e7-a647-a0369f3c3d84
da47: Serial GTD200P8G4MY6D ; GPTID=gptid/6a81b334-0224-11e7-92db-a0369f3c3d84
da48: Serial GTA200P8G4UTZA ; GPTID=gptid/6ca5f360-0224-11e7-92db-a0369f3c3d84
da49: Serial GTD200P8G4N6JD ; GPTID=gptid/6ed038c6-0224-11e7-92db-a0369f3c3d84
da50: Serial GTF200P8G5094F ; GPTID=gptid/70ed41e5-0224-11e7-92db-a0369f3c3d84
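The two disk maps above can also be compared by serial number rather than by eye. The following is only a minimal sketch, assuming the `daN: Serial X ; GPTID=Y` line format shown in the excerpts; the function names `parse_disk_map` and `compare_disk_maps` are hypothetical helpers, not part of FreeNAS.

```python
import re

# Matches disk map lines such as:
#   da47: Serial Z3051PC5 ; GPTID=gptid/1ced0d90-022b-11e7-a647-a0369f3c3d84
LINE_RE = re.compile(r"^(da\d+): Serial (\S+) ; GPTID=(\S+)$")

# Excerpts copied from the two disk maps quoted in this ticket.
OLD_MAP = """\
da45: Serial Z3051GXE ; GPTID=gptid/1b227d4d-022b-11e7-a647-a0369f3c3d84
da46: Serial Z3051H2T ; GPTID=gptid/1c056812-022b-11e7-a647-a0369f3c3d84
da47: Serial Z3051PC5 ; GPTID=gptid/1ced0d90-022b-11e7-a647-a0369f3c3d84
da48: Serial GTD200P8G4MY6D ; GPTID=gptid/6a81b334-0224-11e7-92db-a0369f3c3d84
da49: Serial GTA200P8G4UTZA ; GPTID=gptid/6ca5f360-0224-11e7-92db-a0369f3c3d84
da50: Serial GTD200P8G4N6JD ; GPTID=gptid/6ed038c6-0224-11e7-92db-a0369f3c3d84
"""

NEW_MAP = """\
da45: Serial Z3051GXE ; GPTID=gptid/1b227d4d-022b-11e7-a647-a0369f3c3d84
da46: Serial Z3051H2T ; GPTID=gptid/1c056812-022b-11e7-a647-a0369f3c3d84
da47: Serial GTD200P8G4MY6D ; GPTID=gptid/6a81b334-0224-11e7-92db-a0369f3c3d84
da48: Serial GTA200P8G4UTZA ; GPTID=gptid/6ca5f360-0224-11e7-92db-a0369f3c3d84
da49: Serial GTD200P8G4N6JD ; GPTID=gptid/6ed038c6-0224-11e7-92db-a0369f3c3d84
da50: Serial GTF200P8G5094F ; GPTID=gptid/70ed41e5-0224-11e7-92db-a0369f3c3d84
"""


def parse_disk_map(text):
    """Return {serial: device_name} for every well-formed line in text."""
    by_serial = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            device, serial, _gptid = match.groups()
            by_serial[serial] = device
    return by_serial


def compare_disk_maps(old_text, new_text):
    """Return (missing_serials, renamed) going from the old map to the new one.

    missing_serials: serials present in the old map but absent from the new.
    renamed: {serial: (old_device, new_device)} for serials whose daN changed.
    """
    old = parse_disk_map(old_text)
    new = parse_disk_map(new_text)
    missing = sorted(serial for serial in old if serial not in new)
    renamed = {
        serial: (old[serial], new[serial])
        for serial in old
        if serial in new and old[serial] != new[serial]
    }
    return missing, renamed


if __name__ == "__main__":
    missing, renamed = compare_disk_maps(OLD_MAP, NEW_MAP)
    print("missing serials:", missing)
    for serial, (old_dev, new_dev) in sorted(renamed.items()):
        print(f"{serial}: {old_dev} -> {new_dev}")
```

On these excerpts it reports serial Z3051PC5 as missing and the three Hitachi drives as shifted down by one device number. Run against the full disk_map.txt files, it would list every drive that dropped out or moved.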

root@freenas11: diskinfo -v /dev/da47
/dev/da47
        512             # sectorsize
        4000787030016   # mediasize in bytes (3.6T)
        7814037168      # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        486401          # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        Z3051PC5        # Disk ident.
        id1,enc@n5001b4d50dbba03d/type@0/slot@18/elmdesc@SLOT_24        # Physical path
        Not_Zoned       # Zone Mode

root@freenas11: grep -A12 'da47' 11-1-u5/diskinfo.txt
--- /dev/da47 ---
/dev/da47
        512             # sectorsize
        750156374016    # mediasize in bytes (699G)
        1465149168      # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        91201           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        ATA Hitachi HDS72107    # Disk descr.
        GTD200P8G4MY6D  # Disk ident.
        id1,enc@n5001b4d5123ed03d/type@0/slot@1/elmdesc@SLOT_01 # Physical path
        Not_Zoned       # Zone Mode

Controllers:

mpr0: <Avago Technologies (LSI) SAS3216> port 0xd000-0xd0ff mem 0xfb400000-0xfb40ffff irq 26 at device 0.0 on pci1
mpr0: Firmware: 14.00.00.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

mpr1: <Avago Technologies (LSI) SAS3224> port 0xe000-0xe0ff mem 0xfb600000-0xfb60ffff irq 32 at device 0.0 on pci2
mpr1: Firmware: 14.00.00.00, Driver: 18.03.00.00-fbsd
mpr1: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

mpr2: <Avago Technologies (LSI) SAS3008> port 0xc000-0xc0ff mem 0xfb240000-0xfb24ffff,0xfb200000-0xfb23ffff irq 42 at device 0.0 on pci4
mpr2: Firmware: 12.00.00.00, Driver: 18.03.00.00-fbsd
mpr2: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

There also appears to be some odd enumeration in the 11.0-u4 case (where all drives show up):

root@freenas11: grep da47 11-0-u4/dmesg.txt
da47 at mpr0 bus 0 scbus0 target 122 lun 0
da47: <ATA Hitachi HDS72107 A70M> Fixed Direct Access SPC-4 SCSI device
da47: Serial Number GTD200P8G4MY6D
da47: 300.000MB/s transfers
da47: Command Queueing enabled
da47: 715404MB (1465149168 512 byte sectors)
ses2: da47,pass50: Element descriptor: 'SLOT 01'
ses2: da47,pass50: SAS Device Slot Element: 1 Phys at Slot 0
da47 at mpr0 bus 0 scbus0 target 122 lun 0
da47: <ATA Hitachi HDS72107 A70M> Fixed Direct Access SPC-4 SCSI device
da47: Serial Number GTD200P8G4MY6D
da47: 300.000MB/s transfers
da47: Command Queueing enabled
da47: 715404MB (1465149168 512 byte sectors)
ses2: da47,pass50: Element descriptor: 'SLOT 01'
ses2: da47,pass50: SAS Device Slot Element: 1 Phys at Slot 0
da47 at mpr0 bus 0 scbus0 target 121 lun 0
da47: <ATA ST4000VN000-1H41 SC46> Fixed Direct Access SPC-4 SCSI device
da47: Serial Number Z3051PC5
da47: 600.000MB/s transfers
da47: Command Queueing enabled
da47: 3815447MB (7814037168 512 byte sectors)
ses1: da47: Element descriptor: 'SLOT 24'
ses1: da47: SAS Device Slot Element: 1 Phys at Slot 23
GEOM_ELI: Device da47p1.eli created.

root@freenas11: grep da47 11-1-u5/dmesg.txt
da47 at mpr0 bus 0 scbus0 target 122 lun 0
da47: <ATA Hitachi HDS72107 A70M> Fixed Direct Access SPC-4 SCSI device
da47: Serial Number GTD200P8G4MY6D
da47: 300.000MB/s transfers
da47: Command Queueing enabled
da47: 715404MB (1465149168 512 byte sectors)
ses2: da47,pass50: Element descriptor: 'SLOT 01'
ses2: da47,pass50: SAS Device Slot Element: 1 Phys at Slot 0
da47 at mpr0 bus 0 scbus0 target 122 lun 0
da47: <ATA Hitachi HDS72107 A70M> Fixed Direct Access SPC-4 SCSI device
da47: Serial Number GTD200P8G4MY6D
da47: 300.000MB/s transfers
da47: Command Queueing enabled
da47: 715404MB (1465149168 512 byte sectors)
ses2: da47,pass50: Element descriptor: 'SLOT 01'
ses2: da47,pass50: SAS Device Slot Element: 1 Phys at Slot 0

Why does it enumerate da47 twice as the 750GB drive and once as the 4TB drive, bumping the 750GB drive to da48, in the 11.0-u4 case, yet not enumerate the 4TB drive at all in the 11.1-u5 case?
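The same check can be applied to a whole dmesg capture instead of grepping one device at a time. This is a rough sketch, assuming attach lines of the form `da47 at mpr0 bus 0 scbus0 target 122 lun 0` as seen in the output above; `targets_by_device` and `ambiguous_devices` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Matches CAM attach lines such as:
#   da47 at mpr0 bus 0 scbus0 target 122 lun 0
ATTACH_RE = re.compile(r"^(da\d+) at (\w+) bus \d+ scbus\d+ target (\d+) lun \d+")

# Attach lines copied from the two dmesg excerpts in this ticket.
DMESG_11_0 = """\
da47 at mpr0 bus 0 scbus0 target 122 lun 0
da47: Serial Number GTD200P8G4MY6D
da47 at mpr0 bus 0 scbus0 target 122 lun 0
da47: Serial Number GTD200P8G4MY6D
da47 at mpr0 bus 0 scbus0 target 121 lun 0
da47: Serial Number Z3051PC5
"""

DMESG_11_1 = """\
da47 at mpr0 bus 0 scbus0 target 122 lun 0
da47: Serial Number GTD200P8G4MY6D
da47 at mpr0 bus 0 scbus0 target 122 lun 0
da47: Serial Number GTD200P8G4MY6D
"""


def targets_by_device(dmesg_text):
    """Return {device: [(controller, target), ...]} in first-seen order."""
    seen = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = ATTACH_RE.match(line.strip())
        if match:
            device, controller, target = match.groups()
            key = (controller, int(target))
            if key not in seen[device]:
                seen[device].append(key)
    return dict(seen)


def ambiguous_devices(dmesg_text):
    """Return only the devices whose name was attached at more than one target."""
    return {
        device: targets
        for device, targets in targets_by_device(dmesg_text).items()
        if len(targets) > 1
    }


if __name__ == "__main__":
    print("11.0-u4:", ambiguous_devices(DMESG_11_0))
    print("11.1-u5:", ambiguous_devices(DMESG_11_1))
```

For the excerpts above, the 11.0-u4 log shows da47 at targets 122 and 121 on mpr0, while the 11.1-u5 log only ever shows target 122. It does not explain why the name was reused, only where to look.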

Any/all assistance would be grand!

Thanks!


Related issues

Related to FreeNAS - Bug #35988: mpr(4) not enumerating some disks (Closed)

History

#1 Updated by Gary Wolfe 10 months ago

  • Hardware Configuration updated (diff)

#2 Updated by Dru Lavigne 10 months ago

  • Private changed from No to Yes
  • Reason for Blocked set to Need additional information from Author

Gary: please attach a debug (System -> Advanced -> Save Debug) to this ticket.

#3 Updated by Gary Wolfe 10 months ago

  • File debug-freenas11-20180705012303_11.0-u4.tgz added
  • File debug-freenas11-20180705004504_11.1-u5.tgz added

I've attached debug logs for both 11.0-u4 and 11.1-u5. The issue is only present in the 11.1-u5 case.

#4 Updated by Dru Lavigne 10 months ago

  • Assignee changed from Release Council to Alexander Motin

#5 Updated by Alexander Motin 10 months ago

  • Category changed from OS to Hardware
  • Status changed from Unscreened to Blocked

It seems the problem also exists in 11.0, but shows itself differently. In 11.0 I see this device:

<ATA ST4000VN000-1H41 SC46>        at scbus0 target 121 lun 0 (pass49,da47)

while in 11.1, in the same place, I see:

<Areca ARC-8026-.01.14. 0114>      at scbus0 target 121 lun 0 (ses2,pass49)

which was not visible before. It seems to be a problem with enclosure enumeration somewhere between the JBODs, the HBA, and the HBA driver. I'd recommend updating all the firmware you can: the HBAs and, if possible, the JBODs too.

If that doesn't help, you can enable additional mpr driver debugging by setting the loader tunable hw.mpr.debug_level=0x223 and rebooting, to analyze what actually goes wrong with device enumeration. That data could also be forwarded to Broadcom support or FreeBSD committers for review.
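For reference, the debug level is a bit mask, so 0x223 can be broken down into the individual print classes it enables. The sketch below is an assumption-laden illustration: the bit meanings are as I recall them from the mpr(4) manual page and should be checked against the man page for the release in use; `decode_debug_level` is a hypothetical helper.

```python
# Bit meanings as recalled from the mpr(4) manual page (an assumption here;
# verify against the man page shipped with your release).
MPR_DEBUG_BITS = {
    0x0001: "informational prints",
    0x0002: "driver faults",
    0x0004: "controller events",
    0x0008: "controller logging",
    0x0010: "recovery operations",
    0x0020: "parameter errors and programming bugs",
    0x0040: "system initialization",
    0x0080: "more detailed information",
    0x0100: "user-generated commands (IOCTL)",
    0x0200: "device mapping",
    0x0400: "tracing through driver functions",
}


def decode_debug_level(level):
    """Return the names of the print classes enabled by a debug_level mask."""
    return [name for bit, name in sorted(MPR_DEBUG_BITS.items()) if level & bit]


if __name__ == "__main__":
    for name in decode_debug_level(0x223):
        print(name)
```

Under that assumption, 0x223 adds parameter-error and device-mapping prints on top of the two classes that are on by default, which fits the goal of tracing enumeration.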

Also, I see you have 3 JBODs connected to the first HBA and only one to the second. Have you tried to balance those? That is definitely not a fix, but it may be a workaround.

#6 Updated by Gary Wolfe 10 months ago

  • File debug-freenas11-20180706031731_11.1-u5_post_hba_expander_updates_and_driver_debug_enable.tgz added

Added a new debug for 11.1-u5 w/HBA/expander fw updates applied as well as the driver debug tunable set to 0x0223.

That doesn't appear to have made any impact, but maybe the issue immediately jumps out at you?

#7 Updated by Alexander Motin 10 months ago

  • Status changed from Blocked to Screened
  • Reason for Blocked deleted (Need additional information from Author)

Here are two interesting log chunks:

mpr0: SAS Address from SAS device page0 = 5001b4d5123ed03d
mpr0: mprsas_add_device: Target ID for added device is 121.
mpr0: SAS Address from SAS device page0 = 5001b4d5123ed03d
mpr0: Found device <4451<SmpInit,SspInit,SspTarg,SepDev>,End Device> <6.0Gbps> handle<0x0045> enclosureHandle<0x0003> slot 0
mpr0: At enclosure level 0 and connector name (    )
mpr0: Target id 0x79 added

mpr0: SAS Address from SAS device page0 = 5001b4d50dbba01f
mpr0: mprsas_get_sas_address_for_sata_disk: got SATA identify successfully for handle = 0x59 with try_count = 1
mpr0: SAS Address from SATA device = 3b2c3b356a8d4833
mpr0: mprsas_add_device: Target ID for added device is 121.
mpr0: Attempting to reuse target id 121 handle 0x0045
mpr0: mprsas_fw_work: failed to add device with handle 0x59
mpr0: mprsas_prepare_remove : invalid handle 0x59.

It seems the target mapping table in the HBA flash has two records with the same target ID. I don't know how that happened; it could be either a firmware or a driver issue. Unfortunately I don't know how to erase that table. If you have identical HBAs, I'd try swapping them, so that the duplicate records do not match anything and hopefully get expunged. Broadcom promised an updated driver version "soon". Maybe it will improve the situation.
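The pattern in the second chunk (a reuse attempt on a target ID followed by a failed add) can be searched for across a full debug log. A minimal sketch, assuming only the message wording visible in the two chunks above; `find_target_collisions` is a hypothetical helper and the real driver may emit other variants.

```python
import re

REUSE_RE = re.compile(r"Attempting to reuse target id (\d+) handle (0x[0-9a-fA-F]+)")
FAIL_RE = re.compile(r"failed to add device with handle (0x[0-9a-fA-F]+)")

# Second log chunk from this comment, copied verbatim.
SAMPLE_LOG = """\
mpr0: SAS Address from SAS device page0 = 5001b4d50dbba01f
mpr0: mprsas_get_sas_address_for_sata_disk: got SATA identify successfully for handle = 0x59 with try_count = 1
mpr0: SAS Address from SATA device = 3b2c3b356a8d4833
mpr0: mprsas_add_device: Target ID for added device is 121.
mpr0: Attempting to reuse target id 121 handle 0x0045
mpr0: mprsas_fw_work: failed to add device with handle 0x59
mpr0: mprsas_prepare_remove : invalid handle 0x59.
"""


def find_target_collisions(log_text):
    """Return [(target_id, existing_handle, rejected_handle), ...].

    A collision is recorded when a "reuse target id" line is followed by a
    "failed to add device" line before the next reuse attempt.
    """
    collisions = []
    pending = None  # (target_id, existing_handle) from the last reuse line
    for line in log_text.splitlines():
        reuse = REUSE_RE.search(line)
        if reuse:
            pending = (int(reuse.group(1)), int(reuse.group(2), 16))
            continue
        failed = FAIL_RE.search(line)
        if failed and pending is not None:
            collisions.append((pending[0], pending[1], int(failed.group(1), 16)))
            pending = None
    return collisions


if __name__ == "__main__":
    for target, existing, rejected in find_target_collisions(SAMPLE_LOG):
        print(f"target {target} (0x{target:x}): held by handle 0x{existing:04x}, "
              f"rejected handle 0x{rejected:04x}")
```

For the chunk above this yields target 121 (0x79, matching the "Target id 0x79 added" line in the first chunk), held by handle 0x0045, with handle 0x59 rejected.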

In case it shows anything more interesting (like the enclosure misreporting its slots, since the slot in question is the last one), please post the output of `mprutil show all`.

#8 Updated by Alexander Motin 10 months ago

  • Related to Bug #35988: mpr(4) not enumerating some disks added

#9 Updated by Alexander Motin 10 months ago

This ticket looks quite similar to #35988.

#10 Updated by Gary Wolfe 10 months ago

  • File mprutil_show_all_11.0-u4.txt added
  • File mprutil_show_all_11.1-u5.txt added

I've attached the mprutil output from w/in either version. diff claims them to be == but I've attached them both for completeness.

#11 Updated by Gary Wolfe 10 months ago

Alexander Motin wrote:

This ticket looks quite similar to #35988.

I can't find that ticket in 'all issues' (even if I remove all filters), nor can I get to it by URL (no permission to view the page).

#12 Updated by Alexander Motin 10 months ago

Gary Wolfe wrote:

Alexander Motin wrote:

This ticket looks quite similar to #35988.

I can't find that ticket in 'all issues' (even if I remove all filters), nor can I get to it by URL (no permission to view the page).

The ticket is still open and contains user data, so we are closing such tickets (yours included) off from third parties until the problem is solved and the valuable data are purged. In that case there are also missing drives, just more than one. That reporter contacted Broadcom/LSI and FreeBSD developers, so something may come of it.

#13 Updated by Alexander Motin 10 months ago

Gary Wolfe wrote:

I've attached the mprutil output from w/in either version.

Maybe it is nothing, and I don't remember all the specifications off the top of my head, but I see at least one irregularity: in the list of enclosures, one of your enclosures reports 26 slots, while two of the others report 24. Looking at the list of devices for the first one, I see devices in slots 0 to 25 (of which 0 and 25 are virtual), which roughly matches 26. For the other two enclosures, on the other hand, I see devices in slots 0 to 24 (0 is virtual), which is 25 slots and so seems to be off by one, unless my interpretation of the data is wrong. That could explain the overlap between the mapping of the last device in the second enclosure and the first virtual (SES) device of the third enclosure.
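The arithmetic behind that off-by-one suspicion can be written down as a small check. This is only a sketch of the interpretation described above: it assumes slot numbers are zero-based and that the reported slot count includes virtual slots, neither of which is confirmed here; `slots_out_of_range` is a hypothetical helper and the inputs are the slot ranges described in this comment, not parsed mprutil output.

```python
def slots_out_of_range(declared_slots, observed_slots):
    """Return observed slot numbers that do not fit in 0..declared_slots-1."""
    return sorted(s for s in observed_slots if not 0 <= s < declared_slots)


if __name__ == "__main__":
    # Enclosure reporting 26 slots, with devices seen in slots 0..25.
    print(slots_out_of_range(26, range(0, 26)))
    # Enclosure reporting 24 slots, with devices seen in slots 0..24.
    print(slots_out_of_range(24, range(0, 25)))
```

Under those assumptions the 26-slot enclosure is consistent, while a 24-slot enclosure with devices in slots 0 to 24 has one slot (24) that does not fit in its declared range.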

#14 Updated by Gary Wolfe 10 months ago

Alexander Motin wrote:

Gary Wolfe wrote:

I've attached the mprutil output from w/in either version.

Maybe it is nothing, and I don't remember all the specifications off the top of my head, but I see at least one irregularity: in the list of enclosures, one of your enclosures reports 26 slots, while two of the others report 24. Looking at the list of devices for the first one, I see devices in slots 0 to 25 (of which 0 and 25 are virtual), which roughly matches 26. For the other two enclosures, on the other hand, I see devices in slots 0 to 24 (0 is virtual), which is 25 slots and so seems to be off by one, unless my interpretation of the data is wrong. That could explain the overlap between the mapping of the last device in the second enclosure and the first virtual (SES) device of the third enclosure.

Odd. All of the expanders are 24-drive and all but one are serviced by an ARC8026. The other one, the 12Gb SAS3 one, I upgraded a few years ago to the newer ARC8028. The wiring is set up w/the SAS9300-8e connecting to the two Superloader 3 tape units. For whatever reason tape devices must be direct attached; no daisy chaining. I had tried this in the past and was informed that was a thing. The 9305-16e card connects to the ARC8028 expander, the out from there goes to one of the ARC8026 expanders, and its out goes to the other one. I don't have another 8644 to 8088 cable to go directly from the HBA to the other ARC8026 expander, or I'd try that or put a tape drive off it. Anyway, moving the daisy-chained ARC8026 expander to the SAS9300-8e card skirts the issue altogether.

As you say, it's not a fix but it is a work-around. Daisy chaining w/SAS is supposed to be a well-supported/'select is not broken' kind of topology. Areca and LSI make good hw and are by no means 'cheap' (cost or quality). So I'm kind of surprised/saddened this is an issue.

I tried to call Broadcom and Areca to see who they want to push the blame onto. But they were busy and 302'd me to their support pages. Hard to pin down who's really at fault. Given that this works pre-11.1, it would make some sense that it's a driver thing. Not necessarily a bug; it could be that Areca are doing something naughty, and that it worked at all was a bug that was fixed in the driver for 11.1? Just a theory.

I'll try again here shortly.

#15 Updated by Alexander Motin 9 months ago

  • Severity changed from New to Low Medium

#16 Updated by Alexander Motin 9 months ago

  • Subject changed from SAS/SATA drive enumeration issue in 11.0-u4 -> 11.1-(RELEASE|u5). to mpr(4) drive enumeration issue

#17 Updated by Alexander Motin 2 months ago

  • Status changed from Screened to Closed
  • Target version changed from Backlog to N/A
  • Reason for Closing set to Cannot Reproduce

I am closing this, since I haven't seen any other reports like it in the last 7 months.

#18 Updated by Dru Lavigne 2 months ago

  • File deleted (debug-freenas11-20180705012303_11.0-u4.tgz)

#19 Updated by Dru Lavigne 2 months ago

  • File deleted (debug-freenas11-20180705004504_11.1-u5.tgz)

#20 Updated by Dru Lavigne 2 months ago

  • File deleted (debug-freenas11-20180706031731_11.1-u5_post_hba_expander_updates_and_driver_debug_enable.tgz)

#21 Updated by Dru Lavigne 2 months ago

  • File deleted (mprutil_show_all_11.0-u4.txt)

#22 Updated by Dru Lavigne 2 months ago

  • File deleted (mprutil_show_all_11.1-u5.txt)

#23 Updated by Dru Lavigne 2 months ago

  • Private changed from Yes to No
