Bug #26508

Intel Optane 900p will not work in ESX passthrough

Added by Thomas Rottig about 2 years ago. Updated 8 months ago.

Status:
Closed: Third party to resolve
Priority:
Important
Assignee:
Alexander Motin
Category:
OS
Target version:
-
Seen in:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

Host: Xeon E3-1270 v5, 64 GB RAM
VM: 4 cores, 32 GB RAM

ChangeLog Required:
No

Description

Hi,
just obtained an Intel 900p 280 GB which I am trying to pass through to a FreeNAS 11 U4 VM.

ESX is
vmware -vl
VMware ESXi 6.5.0 build-6765664
VMware ESXi 6.5.0 Update 1

Bootup fails with the error indicated in the attached screenshot.

The same drive in a fresh physical installation works without error.
The same VM without this drive works without error.
The same drive is fine in ESX.

freenas_900p_error.PNG (122 KB) Thomas Rottig, 11/05/2017 02:08 PM
freenas_900p_error3.PNG (143 KB) Thomas Rottig, 11/06/2017 12:12 PM
freenas_900p_error2.PNG (212 KB) Thomas Rottig, 11/06/2017 12:12 PM
freenas_900p_error5.PNG (128 KB) Thomas Rottig, 11/08/2017 01:19 PM
freenas_900p_error4.PNG (183 KB) Thomas Rottig, 11/08/2017 01:19 PM
freenas_900p_error6.PNG (274 KB) Thomas Rottig, 11/09/2017 01:01 PM
freenas_900p_error7.PNG (181 KB) Thomas Rottig, 11/18/2017 09:41 AM
freenas_900p_error8.PNG (173 KB) Intel nvme 1.3.2.4-1OEM.650.0.0.4598673 Thomas Rottig, 11/21/2017 01:10 PM
FreBSD12.png (87.8 KB) FreeBSD Sisyphe -, 07/06/2018 01:44 PM
optane900-freenas.png (80.6 KB) Sisyphe -, 08/11/2018 03:01 PM
Screenshot 2019-04-02 at 22.23.58.png (94.4 KB) Cy Borg, 04/02/2019 02:29 PM

Related issues

Has duplicate FreeNAS - Bug #54240: Kernel Panic in nvme_qpair_reset() (Closed)

History

#1 Updated by Kris Moore about 2 years ago

  • Assignee changed from Release Council to Alexander Motin
  • Priority changed from No priority to Nice to have
  • Target version set to 11.2-BETA1

#2 Updated by Alexander Motin about 2 years ago

  • Status changed from Unscreened to 15

Could you try it with FreeNAS 11.1-RC1? It got some updates to the NVMe driver, so it may be fixed already. If that doesn't help, please show more of the logs printed before that, and maybe the output of the `bt` command typed after that.

#3 Updated by Thomas Rottig about 2 years ago

Alexander Motin wrote:

Could you try it with FreeNAS 11.1-RC1? It got some updates to the NVMe driver, so it may be fixed already. If that doesn't help, please show more of the logs printed before that, and maybe the output of the `bt` command typed after that.

No difference, see additional info as requested attached.

#4 Updated by Alexander Motin about 2 years ago

  • Status changed from 15 to Investigation
  • Priority changed from Nice to have to Important

#5 Updated by Alexander Motin about 2 years ago

  • Status changed from Investigation to 15

Could you try to boot the latest FreeBSD 12-CURRENT snapshot in that VM: https://download.freebsd.org/ftp/snapshots/ISO-IMAGES/12.0/FreeBSD-12.0-CURRENT-amd64-20171030-r325156-bootonly.iso.xz ? One of the driver developers thinks there may be relevant fixes not merged to the stable/11 branch yet.
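
For anyone following along, fetching and unpacking that snapshot from a FreeBSD shell goes roughly like this (a sketch; fetch and xz are in the base system):

# download the compressed boot-only image and decompress it in place
fetch https://download.freebsd.org/ftp/snapshots/ISO-IMAGES/12.0/FreeBSD-12.0-CURRENT-amd64-20171030-r325156-bootonly.iso.xz
xz -d FreeBSD-12.0-CURRENT-amd64-20171030-r325156-bootonly.iso.xz
# upload the resulting .iso to an ESXi datastore and attach it to the VM's CD drive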

#6 Updated by Thomas Rottig about 2 years ago

Still no luck - see pics.

#7 Updated by Alexander Motin about 2 years ago

  • Status changed from 15 to Investigation

#8 Updated by Alexander Motin about 2 years ago

  • Status changed from Investigation to 15

Thomas, could you again reproduce it on FreeBSD HEAD, type in the `show threads` command, and look for the nvme_ctrlr_identify() or nvme_ctrlr_start() function names in the call stacks in the list?
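
For reference, `bt` (asked for earlier) and `show threads` are FreeBSD kernel debugger (ddb) commands typed at the db> prompt that the panic drops into; a rough sketch of the session:

db> bt
db> show threads

`bt` prints the backtrace of the panicking thread; `show threads` lists the call stacks of every thread, which is where any nvme_* frames would show up.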

#9 Updated by Thomas Rottig about 2 years ago

Hi,
no nvme-related calls visible, mostly fork_trampoline and sched_switch.
Three lines of cpustop_handler were the most exciting thing to be found - see screenshot.
Ran it thrice to see if I had missed it :/

#10 Updated by Thomas Rottig about 2 years ago

Any updates or next steps? Happy to run some further tests if you let me know what you need...

#11 Updated by Alexander Motin about 2 years ago

A developer from Intel proposed trying this patch:

diff --git a/sys/dev/nvme/nvme_ctrlr.c b/sys/dev/nvme/nvme_ctrlr.c
index b036eb6..47d9488 100644
--- a/sys/dev/nvme/nvme_ctrlr.c
+++ b/sys/dev/nvme/nvme_ctrlr.c
@@ -348,6 +348,7 @@ nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
     DELAY(100*1000);

     nvme_ctrlr_disable(ctrlr);
+    DELAY(5000);
     return (nvme_ctrlr_enable(ctrlr));
 }


but I see no clear logic in it; it looks more like a workaround.

Will you be able to build a FreeBSD kernel with that patch, or try a patched kernel if I send it to you?
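
For reference, applying the diff above to a stock FreeBSD source tree and rebuilding goes roughly like this (a sketch; nvme-delay.diff is a made-up name for a file containing the patch, and GENERIC is the default kernel config):

cd /usr/src
patch -p1 < ~/nvme-delay.diff          # hypothetical filename for the diff above
make -j4 buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now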

#12 Updated by Thomas Rottig about 2 years ago

Haven't dabbled in kernel building in a long while, so ready-made might be easier.
Ideally with some instructions to use, but I can Google those if need be...

#13 Updated by Alexander Motin about 2 years ago

Here is FreeBSD HEAD kernel with the nvme patch: https://www.dropbox.com/s/829p50uu92f67ap/kernel.tgz?dl=0

Please try to drop it into /boot/ of the installed FreeBSD VM and then try adding the NVMe device there.
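
Roughly, from the VM's shell, that would be (a sketch; it assumes the tarball unpacks to a kernel/ directory, and keeps the old kernel around in case the new one does not boot):

mv /boot/kernel /boot/kernel.old   # preserve the stock kernel
tar -xzf kernel.tgz -C /boot/      # unpack the patched kernel into /boot/kernel
shutdown -r now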

#14 Updated by Thomas Rottig about 2 years ago

Hope I have done it correctly, but the kernel and most files in the /boot/kernel directory show updated timestamps, so I assume so.

Result: negative, more or less same error.

Have tried a Windows VM - no problem.
Also tried an OmniOS 151024ce VM - no go either, so maybe I need to take this to Intel instead...

#15 Updated by Dru Lavigne about 2 years ago

  • Status changed from 15 to Unscreened

#16 Updated by Thomas Rottig about 2 years ago

Hi Dru,
what's the effect of moving this to Unscreened?
Thanks

#17 Updated by Dru Lavigne about 2 years ago

Thomas: it lets the developer know that the requested feedback was received.

#18 Updated by Thomas Rottig about 2 years ago

Ah, good to know, thanks :)

#19 Updated by Sisyphe - about 2 years ago

Have you tried after installing the Intel NVMe driver in ESXi?

v1.3.2.4 was published on the 1st of November:
https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI65-INTEL-INTEL-NVME-1324&productId=614
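
For reference, installing such an offline bundle on an ESXi host usually looks like this (a sketch; the zip filename is a placeholder for whatever the download is actually called):

esxcli software vib install -d /vmfs/volumes/datastore1/intel-nvme-1.3.2.4-offline_bundle.zip
reboot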

#20 Updated by Thomas Rottig about 2 years ago

No, I had not; the card had been identified properly as an Optane (it was on the May driver).
Have tried now - no change.

esxcli system version get
Product: VMware ESXi
Version: 6.5.0
Build: Releasebuild-6765664
Update: 1
Patch: 29

esxcli software vib list |grep nvme
intel-nvme 1.3.2.4-1OEM.650.0.0.4598673 INT VMwareCertified 2017-11-21
nvme 1.2.0.32-4vmw.650.1.26.5969303 VMW VMwareCertified 2017-07-29
vmware-esx-esxcli-nvme-plugin 1.2.0.10-1.26.5969303 VMware VMwareCertified 2017-07-29

#21 Updated by Alexander Motin about 2 years ago

  • Status changed from Unscreened to Screened

I've looked through the code and cannot guess how that can happen. I need to see it myself and be able to debug it directly. Can you provide me some remote access to that VM for experiments? Otherwise it may take me a while to reproduce this issue. Or you may try to contact Intel or FreeBSD developers.

#22 Updated by Thomas Rottig about 2 years ago

Hi,
it should be possible to give you access to a (FreeNAS) VM with the Optane passed through. What do you need? GUI or SSH?

Regards

#23 Updated by Alexander Motin about 2 years ago

Considering we are talking about kernel panics, I'd prefer to have VM console for debugging.

#24 Updated by Thomas Rottig about 2 years ago

You mean access to the ESX host? Sorry, not entirely clear what you need.

#25 Updated by Alexander Motin about 2 years ago

Thomas Rottig wrote:

You mean access to the ESX host? Sorry, not entirely clear what you need.

Yes, I was thinking about ESX VM console.

#26 Updated by Thomas Rottig about 2 years ago

OK, the web interface of the ESX host then. I will need some time to set it up. Can you provide me an email for the access details, please?

#27 Updated by Alexander Motin about 2 years ago

#28 Updated by Sisyphe - almost 2 years ago

Were you able to make progress on this issue? Thank you

#29 Updated by Thomas Rottig almost 2 years ago

Still working on supplying the test environment, sorry.

#30 Updated by Sisyphe - almost 2 years ago

Thank you for the update.

#31 Updated by Thomas Rottig almost 2 years ago

My preparations are done; I've sent the details to Alexander.

#32 Updated by Sisyphe - almost 2 years ago

I've updated to FreeNAS 11.1 stable and I'm still seeing this issue.

Alexander, did you have the chance to look into this? Thanks.

#33 Updated by Alexander Motin almost 2 years ago

I'm sorry, not yet. I've been very busy recently, but I remember this and hope to look at it in the coming days.

#34 Updated by Kris Moore almost 2 years ago

  • Target version changed from 11.2-BETA1 to 11.3

#35 Updated by Alexander Motin almost 2 years ago

  • Status changed from Screened to Closed: Third party to resolve
  • Target version deleted (11.3)

Hello Thomas,

Thank you for the provided access. After some experiments I found that the kernel panic in this case is caused by a FreeBSD NVMe driver bug: it cannot correctly handle NVMe command timeouts during the early controller initialization phase, in particular in nvme_ctrlr_cmd_set_num_queues(), which is the real problem here. As far as I can see, that command is the first one sent to the controller, so it may not be a problem with that specific command, but with command submission/handling in general. I tried disabling MSI-X support in case it is an interrupt delivery problem, but it didn't help. Unfortunately the problem seems to be nontrivial, and I have no time to dive in deeper right now.
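
For anyone who wants to repeat the MSI-X experiment: FreeBSD's nvme(4) driver has a hw.nvme.force_intx loader tunable that forces legacy INTx interrupts instead of MSI-X (whether that is the exact knob used here is my assumption). It goes in the VM's /boot/loader.conf:

# /boot/loader.conf - force nvme(4) to use legacy INTx instead of MSI-X (assumed knob)
hw.nvme.force_intx="1"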

Have you tried to repeat the experiment on real hardware, not inside a VM? It would give us an idea of whether it is a problem with the driver in general, or some issue related to VMware's NVMe pass-through.

In either case you should probably contact the FreeBSD developers who work more actively on this driver. It was actually written by people from Intel, so they may be familiar with both the driver and the hardware.

#36 Updated by Thomas Rottig almost 2 years ago

Hi Alexander,

thanks a lot for your effort.

Very early in the analysis I ran the same setup bare-metal and it worked fine.

Given that a similar issue is present on other *nix systems (Solaris & variants) when used in ESX, it seems to be related to the virtualization part rather than the actual OS part. Unfortunately the card is not on the VMware HCL and I also have no support contract.
I think I will try to open a ticket with Intel; they are likely in the best position to work on this.

Thanks a lot,
happy holidays,
Thomas

#37 Updated by Thomas Rottig almost 2 years ago

Hi Alexander,

Opened a bug at Intel, not sure what will come out of it.
Can you provide me with the name/contact details of the aforementioned Intel driver developer?
The support team would like to reach out to him.

Thanks, regards, Thomas

#38 Updated by Alexander Motin almost 2 years ago

I've contacted Jim Harris <> and Warner Losh <>.

#39 Updated by Thomas Rottig almost 2 years ago

Intel told me they don't support *nix for this drive, so they will not assist with this issue.

Not sure what else can be done.

#40 Updated by Alexander Motin almost 2 years ago

Have you written to some FreeBSD mailing list or the mentioned people?

#41 Updated by Thomas Rottig almost 2 years ago

No, not yet. The issue is not limited to FreeBSD; it also hits OpenSolaris and variants. But it might be worth a shot nevertheless, I'll give it a try.

#42 Updated by Sisyphe - almost 2 years ago

Hi Thomas,

Were you able to raise this issue to OpenSolaris or FreeBSD developers?

Thank you

#43 Updated by Thomas Rottig almost 2 years ago

Yes, Warner replied to my emails.

His latest suggestion was
"There is a small chance https://reviews.freebsd.org/D14053 fixes this. "
But I have not had the time to investigate that - feel free to chime in if you can :)

#44 Updated by Sisyphe - almost 2 years ago

I can run some tests. I would, however, need some help understanding how to compile/get the updated nvme driver and install it on my system.

Thanks!

#45 Updated by Alexander Motin almost 2 years ago

Sisyphe - wrote:

I can run some tests. I would, however, need some help understanding how to compile/get the updated nvme driver and install it on my system.

FreeNAS has the NVMe driver statically linked into the kernel. For that reason you'd have to build at least a whole new kernel with modules. I personally do that on a FreeBSD system according to the regular FreeBSD guides, just taking the FreeNAS kernel sources from https://github.com/freenas/os/tree/freenas/11.1-stable and the configuration from https://github.com/freenas/build/blob/master/build/profiles/freenas/kernel/FREENAS.amd64 . Building a whole FreeNAS image is much more time-consuming, though that process should also be documented somewhere. By default FreeNAS has no means for self-building, but there is a special SDK train which includes some parts such as compilers, etc.
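
On a FreeBSD build machine that amounts to roughly the following (a sketch based on the repos linked above; the raw-file URL for the kernel config is my guess at the direct download path):

# fetch the FreeNAS kernel sources and the FREENAS kernel config
git clone -b freenas/11.1-stable https://github.com/freenas/os.git /usr/src
fetch -o /usr/src/sys/amd64/conf/FREENAS https://raw.githubusercontent.com/freenas/build/master/build/profiles/freenas/kernel/FREENAS.amd64
# build the kernel and modules with the FreeNAS configuration
cd /usr/src && make -j4 buildkernel KERNCONF=FREENAS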

#46 Updated by Wessel van Norel almost 2 years ago

Thomas Rottig wrote:

Yes, Warner replied to my emails.

His latest suggestion was
"There is a small chance https://reviews.freebsd.org/D14053 fixes this. "
But I have not had the time to investigate that - feel free to chime in if you can :)

Unfortunately I didn't find this issue before getting myself an Intel Optane 900P to pass through via ESXi... Guess we should open an issue @FreeBSD, since it's a kernel issue and not a FreeNAS issue. I've tested the latest nightly ISO, FreeBSD-12.0-CURRENT-amd64-20180215-r329338-disc1.iso, and if the revision number in this nightly build is indeed the FreeBSD source revision number, then D14053 unfortunately doesn't fix things. When I only pass through my Samsung 960 PRO, the ISO boots without errors. When I add the Optane 900P, it fails with the page fault.

#47 Updated by Wessel van Norel almost 2 years ago

I've created a bug @FreeBSD about this issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226086 , since I guess that's the proper channel for it.

#48 Updated by Dru Lavigne almost 2 years ago

Thanks!

#49 Updated by Wessel van Norel almost 2 years ago

I've got good and bad news.
The good news: I just updated the ESXi driver to the latest version, 1.3.2.8 (https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESX65-INTEL-INTEL-NVME-1328&productId=614), and with this version installed, the FreeBSD 12 ISO I tried before works. I also tried FreeBSD-12.0-CURRENT-amd64-20180125-r328383-disc1.iso, which AFAIK should not contain the https://reviews.freebsd.org/D14053 fix, and that one works too.

The bad news: my FreeNAS 11.1 ISO still does not work. So I guess the remark from 4 months ago ("Could you try to boot the latest FreeBSD 12-CURRENT snapshot in that VM: https://download.freebsd.org/ftp/snapshots/ISO-IMAGES/12.0/FreeBSD-12.0-CURRENT-amd64-20171030-r325156-bootonly.iso.xz ? One of the driver developers thinks there may be relevant fixes not merged to the stable/11 branch yet.") is correct. Unfortunately (and understandably) the older FreeBSD 12 snapshot builds are no longer available, so I can't find the exact version that contains the fix.

So I guess I have to install FreeNAS 11.1 without the device passthrough, update the kernel, and then add the device to see whether that fixes the problem for FreeNAS.

#50 Updated by Alexander Motin almost 2 years ago

I've merged many NVMe-related commits into FreeBSD stable/11 and FreeNAS 11.1-U2, so you may try the latter to see whether it is enough. IIRC it still does not include https://reviews.freebsd.org/D14053 since it was very new at that point, but I'll try to merge it soon as well.

#51 Updated by Wessel van Norel almost 2 years ago

Alexander Motin wrote:

I've merged many NVMe-related commits into FreeBSD stable/11 and FreeNAS 11.1-U2, so you may try the latter to see whether it is enough. IIRC it still does not include https://reviews.freebsd.org/D14053 since it was very new at that point, but I'll try to merge it soon as well.

Thanks for the quick reply. Unfortunately, FreeNAS 11.1-U2 doesn't work either. Working ISOs for me are at least:

FreeBSD-12.0-CURRENT-amd64-20180125-r328383-disc1.iso
FreeBSD-12.0-CURRENT-amd64-20180215-r329338-disc1.iso

The FreeBSD-12.0-CURRENT-amd64-20180125-r328383-disc1.iso is the oldest FreeBSD 12 ISO that I could download. Should I try a FreeBSD 11 ISO as well, or is testing FreeNAS 11.1-U2 good enough for that?

Edit: see my remark below. It seems that FreeNAS 11.1-U2 (and even FreeNAS 11.1-U1) works or fails depending on the virtual hardware configuration. I'll continue to investigate this when I have time.

#52 Updated by Wessel van Norel almost 2 years ago

I'm afraid the issue will be tougher to debug/resolve. After I said that I couldn't find older ISO images, I did find older VM-IMAGES, and here is where I'm getting worried about this issue. I downloaded the oldest VMDK that was available. Because of the format I'm not able to use the VMDK on the ESXi machine right away; I need to use VMware Fusion to create a VM and then move it to the ESXi machine. First I booted it without the passthrough: OK, it works. Then I booted it with the passthrough: no, it didn't work. OK, so perhaps the fix is not in that version. Next, try to get one working version. So I downloaded the VMDK of the same revision as the ISO that is working. But it didn't work either.

Then I retried the VM with the ISO. That VM still works. Then I tried the ISO on the VM that I created from the VMDK: it doesn't work. So I checked the VMware hardware configuration between the two and, step by step, made them the same. The end result is that I have 2 VMs that look exactly the same in the ESXi web UI, except for the hard disk (the FreeBSD-built hard disk is 21 GB in size and the one I made is 8 GB), and still the new VM doesn't want to boot with the passed-through Optane 900p.

So I had no clue what could be wrong. I diffed the .vmx files and, as far as I was able to determine, nothing there is obviously wrong. The diff follows below; < is the non-working VM, > is the working VM. The only things that I could imagine doing something were virtualHW.productCompatibility = "hosted" and acpi.smbiosVersion2.7 = "FALSE", so I removed those settings and tried again, without success.

2c2,3
< acpi.smbiosVersion2.7 = "FALSE" 
---
> RemoteDisplay.maxConnections = "-1" 
> bios.bootRetry.delay = "10" 
5,6c6,7
< displayName = "FreeBSDK-disks-test" 
< ehci.pciSlotNumber = "36" 
---
> displayName = "FreeBSD BugTesting" 
> ehci.pciSlotNumber = "34" 
9c10
< ethernet0.generatedAddress = "00:0c:29:23:26:30" 
---
> ethernet0.generatedAddress = "00:0c:29:62:5d:a9" 
12c13
< ethernet0.pciSlotNumber = "32" 
---
> ethernet0.pciSlotNumber = "33" 
14a16
> ethernet0.wakeOnPcktRcv = "FALSE" 
19c21
< migrate.hostLog = "./FreeBSDK-disks-test-d3c8fced.hlog" 
---
> migrate.hostLog = "./FreeBSD BugTesting-87d49086.hlog" 
24c26
< nvram = "FreeBSDK-disks-test.nvram" 
---
> nvram = "FreeBSD BugTesting.nvram" 
62c64
< sata0.pciSlotNumber = "34" 
---
> sata0.pciSlotNumber = "36" 
78c80
< sched.swap.derivedName = "/vmfs/volumes/5a8c8dc2-66208b14-2858-ac1f6b17235e/FreeBSDK-disks-test/FreeBSDK-disks-test-d3c8fced.vswp" 
---
> sched.swap.derivedName = "/vmfs/volumes/5a8c8dc2-66208b14-2858-ac1f6b17235e/FreeBSD BugTesting/FreeBSD BugTesting-87d49086.vswp" 
83c85
< scsi0:0.fileName = "FreeBSDK-disks-test.vmdk" 
---
> scsi0:0.fileName = "FreeBSD BugTesting.vmdk" 
93,95c95,97
< tools.syncTime = "TRUE" 
< tools.upgrade.policy = "upgradeAtPowerCycle" 
< usb.pciSlotNumber = "35" 
---
> tools.syncTime = "FALSE" 
> tools.upgrade.policy = "manual" 
> usb.pciSlotNumber = "32" 
106,109c108,110
< uuid.bios = "56 4d cc 7b c6 70 63 53-df e2 94 e6 06 23 26 30" 
< uuid.location = "56 4d cc 7b c6 70 63 53-df e2 94 e6 06 23 26 30" 
< vc.uuid = "52 14 4b cd 05 81 e4 a9-82 ad 8d 77 42 41 11 68" 
< virtualHW.productCompatibility = "hosted" 
---
> uuid.bios = "56 4d 68 ef dd 9b 1e 0d-40 7d 8f 7f e9 62 5d a9" 
> uuid.location = "56 4d 68 ef dd 9b 1e 0d-40 7d 8f 7f e9 62 5d a9" 
> vc.uuid = "52 4b 0c 79 f2 93 e9 66-2c 17 27 0b c6 67 f7 69" 
111,112c112,113
< vmci0.id = "102966832" 
< vmci0.pciSlotNumber = "33" 
---
> vmci0.id = "-379429463" 
> vmci0.pciSlotNumber = "35" 

The non-working system hangs at the phase where the working system initialises the NVMe devices, with no output other than the page fault. I've tried creating another VM on the ESXi server and booting that with the ISO: that works. So the working VM is not some accidental fluke. But I'm confused by the VM that doesn't work. Should I reopen the FreeBSD 12 bug?

-- Edit

It seems I've figured out which difference is causing the issue: the order of the PCI slots. If I change the non-working .vmx to contain the same pciSlotNumbers as the working .vmx, the system boots from the .iso. Unfortunately I updated all the devices at the same time, so I guess I should figure out which device needs to be in which order to break things again. The other unfortunate part is that even though it now boots from the .iso, booting from the disk image itself still fails; note that this time it does not fail with a kernel panic, the system just hangs (I waited a bit more than 5 minutes) on the NVMe initialisation.

I've also retested the FreeNAS-11.1 ISOs on the VM where the FreeBSD-12 ISOs worked, and the FreeNAS-11.1 ISOs also work. So my previous tests were not as good as I thought: a VM with the same hardware selected does not mean it's actually the same. My apologies for that :(

So, it's possible that the Intel Optane 900P works if you have the latest ESXi drivers installed and have your PCI devices in "the correct order". Now I 'only' have to determine what the correct order is. The other question will be how stable this is, since I find it strange that the PCI order breaks things at all.
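
For anyone wanting to try the same reordering: the slot assignments live in the VM's .vmx file and can be edited from the ESXi shell while the VM is powered off. These are the values from my working VM above (device names and numbers will differ per VM):

# in /vmfs/volumes/<datastore>/<vm>/<vm>.vmx, with the VM powered off
usb.pciSlotNumber = "32"
ethernet0.pciSlotNumber = "33"
ehci.pciSlotNumber = "34"
vmci0.pciSlotNumber = "35"
sata0.pciSlotNumber = "36"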

#53 Updated by Sisyphe - over 1 year ago

Hi,

Do you have an update on this issue? I did several tests but was not able to figure out a working PCI slot order in the VM configuration.

Should the FreeBSD bug be reopened?

#54 Updated by Wessel van Norel over 1 year ago

Unfortunately no update yet :( I have to make some time to get the system up and running; it's been idling for about a month now, a waste of resources... And I'm not sure about reopening it @FreeBSD, since I do not understand how it can work (in my case) depending on the PCI slot ordering... It's really weird.

#55 Updated by Ignacio Rocha over 1 year ago

Wessel van Norel wrote:

Unfortunately no update yet :( I have to make some time to get the system up and running; it's been idling for about a month now, a waste of resources... And I'm not sure about reopening it @FreeBSD, since I do not understand how it can work (in my case) depending on the PCI slot ordering... It's really weird.

I'm also having the same problem with my Intel Optane 900p when I pass it through ESXi. How did you manage to change the PCI slot ordering in the ESXi VM?
The problem is still present in 11.1-U4.
I have also, per your recommendations, updated the Intel NVMe driver in ESXi.

If you need some screenshots or whatever, let me know

#56 Updated by Jan Eagleman over 1 year ago

I tried to switch from "real hardware" to virtual today, and I too hit this bug. First off, ESXi didn't detect the device correctly when using passthrough; it didn't show its real name the way it does for other devices (e.g. "I210 Gigabit Network Connection"). I then just forwarded the device, since I knew it was the NVMe device. Then I tried to boot a VM with and without the NVMe device, and only with the Intel 900P attached did FreeNAS crash.

I had the Intel 900P working for half a year on "real hardware". Only when using ESXi passthrough does it crash FreeNAS.

#57 Updated by Sisyphe - over 1 year ago

  • Severity set to New

I've updated to ESXi 6.7 and the issue persists.

@Wessel, should the FreeBSD bug be reopened? Should we open a bug with VMware? Thanks.

#58 Updated by Wessel van Norel over 1 year ago

Sisyphe - wrote:

I've updated to ESXi 6.7 and the issue persists.

@Wessel, should the FreeBSD bug be reopened? Should we open a bug with VMware? Thanks.

I'm not sure. I've been swamped with work, but I've tried to get in touch with Paul Braren about this issue. I wrote a comment on https://tinkertry.com/intel-optane-900p-should-be-great-for-home-lab-enthusiasts but it got flagged as spam (it seems it finally came through, but no response yet :( ). He didn't respond on Twitter either. I think the strangeness of the way I was able to solve it means it's more related to VMware than to FreeBSD. And since his post was quite positive about the 900p at first, perhaps he is able to put me in touch with the correct people @VMware to get into the internals of what is going wrong.

#59 Updated by Sisyphe - over 1 year ago

Installed the newly released Intel NVMe driver v1.4.0.1016; same issue.

@Norel, did you get feedback from Paul Braren?

I would reconsider re-opening the FreeBSD ticket. Even if VMware can support debugging the issue, I suppose the fix will need to be implemented in the FreeBSD kernel, as it is also impacting OpenSolaris variants.

#60 Updated by Wessel van Norel over 1 year ago

Sisyphe - wrote:

Installed the newly released Intel NVMe driver v1.4.0.1016; same issue.

@Norel, did you get feedback from Paul Braren?

@Sisyphe: unfortunately, no, I did not. And since the card is officially not supported by VMware, I'm afraid we will not get any help from them here. I hoped that the strangeness of the workaround I found would be enough of a trigger for them to be willing to look at it. Perhaps someone else can try to ping him with the question, someone with a more public profile.

I would reconsider re-opening the FreeBSD ticket. Even if VMware can support debugging the issue, I suppose the fix will need to be implemented in the FreeBSD kernel, as it is also impacting OpenSolaris variants.

My main question is: who's to blame for the issue, FreeBSD or VMware? Since it can be "fixed" in my situation just by a different PCI slot ordering in the VM configuration, perhaps the problem is within the VMware kernel and not in the FreeBSD kernel.

What I should have tried is a different OS and see if that works. Unfortunately I need the machine for other stuff right now and I'm not able to properly test things because of that.

#61 Updated by Sisyphe - over 1 year ago

Retested with FreeBSD 12 snapshot (FreeBSD-12.0-CURRENT-amd64-20180628-r335760-bootonly)

The system is not crashing and I get "nvme0: Missing interrupt" errors, so it appears an improvement was made to the driver. I would suggest having the FreeBSD team take a look again...

#62 Updated by Sisyphe - over 1 year ago

I found a simple fix for this issue by adding the Optane 900P device ID to passthru.map :)

- SSH to the ESXi host
- edit /etc/vmware/passthru.map
- add the following lines at the end of the file:
# Intel Optane 900P
8086 2700 d3d0 false

- restart the hypervisor

I can now pass the 900P through to FreeNAS 11.1-U5 without issue:

Enjoy!
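
For context, my reading of the passthru.map columns (based on the stock entries in that file) is vendor ID, device ID, reset method, and an fptShareable flag, so the added entry breaks down as:

# vendor-id device-id reset-method fptShareable
8086 2700 d3d0 false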

#63 Updated by James McCoy over 1 year ago

Hi all. I was also about to buy and attempt this a few weeks ago before finding this thread at the last minute.
Is this 'fix' stable? Has anyone else managed to verify it on their system?
Thanks.

#64 Updated by Steve Higton about 1 year ago

James McCoy wrote:

Hi all. I was also about to buy and attempt this a few weeks ago before finding this thread at the last minute.
Is this 'fix' stable? Has anyone else managed to verify it on their system?
Thanks.

I have just migrated a two-way mirrored pool from two Samsung 950Pro drives to two Optane 900p drives. All seems fine so far...

The 950Pros were passed through to a FreeNAS 11.1-U6 VM with no issues, running on ESXi 6.0. I split the pool, powered down, swapped a 950Pro for a 900p, powered the ESXi host back up, and the FreeNAS VM crashed on boot in a very similar manner to the crash in the OP. I then tried adding the lines to passthru.map from post #62, but this didn't help; the same crash occurred. I then upgraded ESXi to 6.5U1 build 5969303 and the problem disappeared. I passed through the Optane 900p, added it to the existing 950Pro, resilvered in a few minutes, and all was fine. Another zpool split, power down, swap the 950Pro for a 900p, etc., and I now have a two-way mirror of two 900p drives.

I only completed the migration about twenty minutes ago, but fingers crossed all will remain OK.

#65 Updated by James McCoy about 1 year ago

Steve Higton wrote:

James McCoy wrote:

Hi all. I was also about to buy and attempt this a few weeks ago before finding this thread at the last minute.
Is this 'fix' stable? Has anyone else managed to verify it on their system?
Thanks.

I have just migrated a two-way mirrored pool from two Samsung 950Pro drives to two Optane 900p drives. All seems fine so far...

The 950Pros were passed through to a FreeNAS 11.1-U6 VM with no issues, running on ESXi 6.0. I split the pool, powered down, swapped a 950Pro for a 900p, powered the ESXi host back up, and the FreeNAS VM crashed on boot in a very similar manner to the crash in the OP. I then tried adding the lines to passthru.map from post #62, but this didn't help; the same crash occurred. I then upgraded ESXi to 6.5U1 build 5969303 and the problem disappeared. I passed through the Optane 900p, added it to the existing 950Pro, resilvered in a few minutes, and all was fine. Another zpool split, power down, swap the 950Pro for a 900p, etc., and I now have a two-way mirror of two 900p drives.

I only completed the migration about twenty minutes ago, but fingers crossed all will remain OK.

I took the plunge and can confirm this has been running perfectly for over a month now on ESXi 6.7 and FreeNAS 11.1-U6.

#66 Updated by Alexander Motin about 1 year ago

  • Has duplicate Bug #54240: Kernel Panic in nvme_qpair_reset() added

#67 Updated by Richard May 8 months ago

This issue is still present in FreeNAS 11.2-U3 and VMware ESXi 6.7.0 build 13004448 with the Optane 800P series. Editing passthru.map didn't help, and the Intel NVMe driver update for ESXi has nothing to do with this particular Optane model (its PnP ID (8086/2522) is not listed in the VIB's XML files).

#68 Updated by Cy Borg 8 months ago

Richard May wrote:

This issue is still present in FreeNAS 11.2-U3 and VMware ESXi 6.7.0 build 13004448 with the Optane 800P series. Editing passthru.map didn't help, and the Intel NVMe driver update for ESXi has nothing to do with this particular Optane model (its PnP ID (8086/2522) is not listed in the VIB's XML files).

Log in to the shell console of your ESXi host and edit the .vmx config file for your FreeNAS VM. Assuming that the Optane is the 1st passthru device that you have added [0], add

pciPassthru0.msiEnabled = "FALSE"

That solves the problem with physical IRQ sharing that seems to be the cause here.
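
Spelled out as a sketch (the datastore path is a placeholder, the VM must be powered off, and ESXi has to re-read the edited .vmx):

vi /vmfs/volumes/datastore1/freenas/freenas.vmx   # add: pciPassthru0.msiEnabled = "FALSE"
vim-cmd vmsvc/getallvms                           # look up the VM id
vim-cmd vmsvc/reload <vmid>                       # reload the config, then power the VM on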

Enjoy,
CyBorg
