Bug #66130

11.2+ fatal trap 12 (and 9) a few minutes after boot, 11.1 no issues for weeks

Added by James Gaul 7 months ago. Updated 4 months ago.

Status: Closed
Priority: No priority
Assignee: Alexander Motin
Category: OS
Target version:
Severity: New
Reason for Closing: Cannot Reproduce
Reason for Blocked:
Needs QA: Yes
Needs Doc: Yes
Needs Merging: Yes
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

Hi,
On 11.2+ only (including the 11.3 betas) I keep getting either freezes/reboots or sometimes fatal trap 12 or 9, always a few minutes after a successful boot (or sometimes about 15 seconds before FN boot-up would normally complete). This was a completely fresh system with fresh disks and no existing ZFS volumes (but it does have several disks attached).

Initially I did try removing all devices/PCIe cards, and the result was the same.

I've tried a few fresh installs on different types of boot devices, and also tried the 11.3 beta of Dec 22, 2018 (same issue).

At most, FN 11.2 / 11.3 stays up for ~5 minutes, but usually it hits a fatal trap (or just reboots) right after FN boots (or a few seconds before boot completes).

However, on 11.1 stable I've been stressing FN, hard, for over 17 days non-stop with not one issue of any kind.

My hardware is:

X9DR3-LN4F+ (latest BIOS 3.3), 128 GB ECC RAM (4x 32 GB sticks), 2x E5-2620v2 CPUs.
LSI SAS 9207-8i (no expander in use yet, just direct attach to 8x bays)
+ using 6x of the onboard Intel C606-based SATA ports
Most disks are 3 TB and 4 TB HGST 7K4's (some HUS SAS, some HUA SATA)
A 4U 24-bay chassis (direct-attached backplane, 846-TQ)
Chelsio T520-CR (2x 10G SFP+) (+ the 4x onboard 1 GbE ports)

(All memory has been tested, both individually and together, with memtest86 Pro v8.0 for over 8 days on this MB, as well as on different motherboards. Even running 32 or 64 GB of different memory types, the issue with 11.2 is the same. All memory is on the Supermicro QVL for my MB.) I have also tested swapping power supplies and CPUs (just 1x E5-2640v1), with no change to the issue.

In the "FN BUG REPORT.zip" I've attached, I have the output from several logs/commands (lspci, dmidecode, sas2flash, etc.). These were of course all run under 11.1.

Please let me know what I can do to assist or what I need to provide. I can provide remote access to the system; this is still a test/play system for the next several weeks as I'm still learning FN (I'm loving FN, btw!).

I have a bit more detail in this forum thread:
https://forums.freenas.org/index.php?threads/sm-x9-4u-new-build-4u-disk-shelf-2u-disk-shelf-48-disks-s.71624/#post-497926

thank you very much!

Ca35345435pture (1).JPG (52 KB), James Gaul, 12/27/2018 01:00 PM
11.3 fault 12.JPG (55.6 KB), James Gaul, 12/27/2018 01:00 PM
11.3 fault 12 (1).JPG (55.6 KB), James Gaul, 12/27/2018 01:00 PM
IMG_4648 (1).jpg (155 KB), James Gaul, 12/27/2018 01:00 PM
panic on hdd boot up.JPG (79.5 KB), James Gaul, 01/04/2019 11:17 PM
bt output.JPG (125 KB), James Gaul, 01/04/2019 11:17 PM
11.2 panic w 1x v2 CPU when importing a 1 disk pool.JPG (77.7 KB), "as soon as i imported a 1x disk pool, with 1x v2 CPU, panic", James Gaul, 01/05/2019 03:48 PM
changelogCapture.JPG (99.8 KB), James Gaul, 01/07/2019 02:19 PM
Capture.JPG (70.6 KB), James Gaul, 01/09/2019 05:12 PM
2nd crash Capture.JPG (65.8 KB), James Gaul, 01/09/2019 05:17 PM

History

#1 Updated by James Gaul 7 months ago

  • File camcontrol.txt added
  • File dmesg.txt added
  • File dmidecode.txt added
  • File lspci.txt added
  • File messages.txt added
  • File sas2flash.txt added

I realized afterwards that you all may prefer direct files over a zip, so here are the contents of the "FN BUG REPORT.zip" file I attached above.

Also, I've added the output of freenas-debug -A and -h respectively, all under 11.1 (not possible to get under 11.2).

Thanks

#2 Updated by James Gaul 7 months ago

  • File fnDEBUG_all.txt added
  • File fnDEBUG_HWonly.txt added

add files

#3 Updated by William Grzybowski 7 months ago

  • Assignee changed from Release Council to Alexander Motin

#4 Updated by James Gaul 6 months ago

  • File debug-freenas-20190104204447.tgz added

Not sure if this will help (or if it duplicates the other files I've posted), but attached is a System -> Advanced -> Save Debug .tgz from this exact system running FN 11.1. (I of course can't extract this from the problem 11.2 installs.)

The only change (that I have not retested with 11.2) is that I recently added a 9207-8e to the system.
Thanks

#5 Updated by James Gaul 6 months ago

I tried 11.2 again today (new ISO download). This time, in the BIOS, I disabled both the onboard SATA ports and the Intel SCU (an 8-port SAS controller, exposed via 2x onboard 8087 ports, that is part of the C606 chipset). So the only disk controllers for this test were the 2x 9207 PCIe cards (a -8i and a -8e).

I installed to a USB stick (from a USB installer), same result: towards the end of boot-up, a fatal trap 12 panic (supervisor read data, page not present).

I made a video (vKVM recording) of the boot-up, in case there is any relevant info in the boot output that scrolls by. This was the first boot, right after a successful install.

https://youtu.be/A9t07g4gKn0

(It is an unlisted YouTube video; the URL above is required to access it.) Thank you!

#6 Updated by James Gaul 6 months ago

I tried a 2nd fresh install of 11.2 today, this time onto a SATA HDD (instead of a USB stick). The result was the exact same panic (during the first boot after install). This time I did get a db> prompt, so I was able to run a backtrace. Not sure if this data helps, but here it is.

Here is a different video of it occurring (and I'll post a screenshot).

https://youtu.be/7mPGIipy9ME

Thanks

#7 Updated by Sean Fagan 6 months ago

This seems to be similar to some other panics -- initially reported against NFS, but it looks like ZFS. The NULL pointer and offset of 0x328 are the same.

#8 Updated by Sean Fagan 6 months ago

        mutex_enter(&zio->io_lock);
        while (zio->io_executor != NULL)
                cv_wait(&zio->io_cv, &zio->io_lock);
        mutex_exit(&zio->io_lock);

We can assume zio is not NULL (or we'd get a panic at the mutex_enter). I think this also means that zio->io_lock is ok, or we'd get a panic inside of mutex_enter. So that makes the zio->io_cv the likely source of trouble.

#9 Updated by James Gaul 6 months ago

Great, thanks for the info/reply.

It's very likely to be ZFS related on this setup (in my humble opinion).

Some new info: if I remove one of the 2x CPUs from this problem system (up to now it has been running 2x E5-2620v2 CPUs), I am able to boot 11.2+, but as soon as I try to do anything with disks/pools (i.e. try to import a test 1-disk pool), it panics (see attached image).
Also, with no disks/pools running/connected, when I went to shut 11.2 down (after about 10 minutes of just clicking around the GUI to see if it would panic while running without disks), it panicked during the shutdown process (fatal trap 9, but I didn't get a chance to snap a picture of that one).

I do have a 2nd of this exact MB, a 2nd X9DR3-LN4F+, in an entirely separate build/HW. On that setup I have a single v1 CPU (a single E5-2620v1, vs. v2 on the problem setup), and that system has had 11.2 running for about 12 hours without ANY ISSUES, with random stress-test disk I/O to a 3-disk RAIDZ1 pool for about 11 hours. No problems or panics. So maybe it's connected to Xeon v2 CPUs?

(The above is new info I have only discovered in the past 12-24 hours.)

(I have a new forum post with a bit more info and some replies:
https://forums.freenas.org/index.php?threads/anyone-running-11-2-on-x9dr3-ln4f-board-or-similar-x9-supermicro.72608/)

thanks

#10 Updated by James Gaul 6 months ago

So I have narrowed it down to EXACTLY one thing:

Using a v2 CPU will cause the 11.2 crash.

(I only have a few E5-2620v2s, so I haven't tested any other v2 CPU models, like an E5-2640v2 for example.)

I currently have 2x E5-2620 v1 CPUs in the same system that was panicking with 11.2 (with all hardware added back), and it has been running fine for over 3 hours with constant random dd load to 2x pools.

If I use 1x E5-2620 v2 CPU in the same system, it will crash as soon as I do any pool disk I/O (the same goes for my 2nd MB).

If I use 2x E5-2620 v2 CPUs in the same system, it will crash during boot-up (towards the end, as I outlined in the posts above).

So the issue follows the v2 CPUs around (and is isolated to them).
Thanks

#11 Updated by Alexander Motin 6 months ago

  • Status changed from Unscreened to Blocked
  • Reason for Blocked set to Need additional information from Author

James, all this looks like some hardware issue, especially the part about swapping v1 vs v2 CPUs. Both of those CPUs are old enough that I would not expect surprises. I see you have a surprisingly recent BIOS flashed on your motherboard. It makes me think it could include some Spectre/Meltdown workaround firmware that may be activated by 11.2, but not by 11.1. But IIRC there were microcode revisions that Intel recalled after the announcement, so I am not sure whether the version you have is trustworthy. If you have access to different BIOS versions (I tried to check the Supermicro site, but don't see any at all), I'd try them. Alternatively, FreeNAS includes its own microcode images for a number of CPUs, loading of which may be activated by setting the microcode_update_enable="YES" rc.conf variable (that requires booting first), which may or may not help.

Alternatively, if we assume it really is a software issue, I see that many of your panics end with the kernel writing text dumps. If you are able to boot 11.2 after that, saving and attaching debug data here should include them for us to inspect.

#12 Updated by James Gaul 6 months ago

I'm more than happy to provide the kdump / ktrace.out file (however, it has only panicked 2 times out of many where I got that db> prompt; most times it just reboots right after showing the panic).

The problem is, I don't know where this ktrace.out (or kdump) file is saved. I assumed that it was not possible to get it on FN, as it's a RAM disk / non-persistent type setup, so after a reboot it's gone.

I have searched for days on this (location of the FreeBSD/FN kdump file) and can't find any info at all, so can you please tell me where this file would be located, or how to get it so I can send it? (Thanks)

root@freenas:/tmp # find / | grep ktrace.out
(nothing, on the same 11.2 boot USB that has crashed before, but is now running fine on my v1 CPUs)

root@freenas:/tmp # find / | grep kdump
(nothing relevant)

I do have the older BIOS files for this MB, and will also try those and update. Perhaps that will fix it, as there was a microcode update and something related to Spectre in the release notes for the current BIOS 3.3 (image attached because the SM website is down).

thanks

#13 Updated by Sean Fagan 6 months ago

The crash log files will be in the save-debug file.

#14 Updated by James Gaul 6 months ago

So I went back to BIOS 3.2 (which is from mid 2015), and am still seeing the same panic (same general location/time) when using v2 CPUs. (Note that in the videos, during boot, it does load my pools well before the crash.)

I understand that it could be hardware (or HW compatibility), but the issue is isolated to just FN 11.2.

I have specifically stress-tested this system for nearly 2 months now (only with the v2 CPUs), under many different OSs, including Ubuntu 16 LTS and 18 LTS, Win 2012 R2, ESXi 6.5u2 (with VMs doing the stressing), and of course FN 11.1 (as 11.2 was crashing, I went with 11.1 for a few weeks until filing this bug report). My entire focus has been stress-testing this hardware, as there is no rush to production: it's for my own personal lab setup (so no deadline). I have (no lie) 50+ pages of notes from my stress-testing results under the different OSs.

(When I say stress-testing, I'm referring mostly to memory, but also to concurrent disk I/O, 10G network, and CPU testing/stressing.) I have also tested 2x sets of PSUs.

For 6-7 weeks of the past 2 months, in all this testing, I have not seen anything unusual, nor any panics/crashes/freezes, except for some very early-on memory issues (at first I was testing 4x different sets of QVL memory I had, and 2x sticks were faulty). (The 11.2 issue persists across my different sets of QVL memory when using v2 CPUs.)

I normally would just run the v1 CPUs and be done, but I spent a decent amount of extra money to get the v2s and the related MB for them (and would like the lower power usage and small performance boost).

I'm not complaining, just providing info. Here are 2x recordings of the BIOS 3.2 crashes (ran it twice), and images of the panics (see private video links below). I also then put the v1 CPUs back in, booted the same 11.2 USB stick, and saved the debug log (so it should have the k/textdumps from the crash, I think).

Thanks for your help/time, pls lmk if there is anything further I can do or help with on this.

(I.e. should I maybe try some recent versions of FreeBSD + ZFS pools to see if the issue repeats? I'm not good with FreeBSD, but will learn it, as I'm good with Linux.)
Thanks

Video of 1st crash w 2x v2s and 11.2 (bios 3.2 now)
https://youtu.be/6scbqf9xol0

Video of 2nd run, 2nd crash w 2x v2s and 11.2 (bios 3.2 now)
https://youtu.be/TIIKnUDMyuE

#15 Updated by Alexander Motin 6 months ago

  • Status changed from Blocked to Unscreened
  • Reason for Blocked deleted (Need additional information from Author)

#16 Updated by Alexander Motin 6 months ago

  • Status changed from Unscreened to Closed
  • Target version changed from Backlog to N/A
  • Reason for Closing set to Cannot Reproduce

I've looked through the panics in the last debug, and they all look different to me. The only thing they have in common is that they look like memory corruption rather than trivial software bugs crashing immediately. So while it still may be a software issue, its correlation with CPU generation does not make sense to me. We are still actively using both E5-2620v2 CPUs and Supermicro X9 motherboards, and we do not see problems like that. Another thought occurred to me just now: the problem may be not in the CPUs themselves, but for example in the PCIe buses, which also reside in the CPUs. Maybe some hardware connected there does not like the change. You could try to remove the Chelsio NIC, NVMe SSD and whatever else you have in the system that you can remove while still letting it run long enough to trigger the issue.

I am sorry, but I don't see what we can do here on the software side. I am closing this ticket, at least until some useful input is found.

#17 Updated by Dru Lavigne 6 months ago

  • File deleted (post 2x v2 crashes _ debug-freenas-20190109191450.tgz)

#18 Updated by Dru Lavigne 6 months ago

  • File deleted (FN BUG REPORT.zip)

#19 Updated by Dru Lavigne 6 months ago

  • File deleted (camcontrol.txt)

#20 Updated by Dru Lavigne 6 months ago

  • File deleted (dmesg.txt)

#21 Updated by Dru Lavigne 6 months ago

  • File deleted (dmidecode.txt)

#22 Updated by Dru Lavigne 6 months ago

  • File deleted (lspci.txt)

#23 Updated by Dru Lavigne 6 months ago

  • File deleted (messages.txt)

#24 Updated by Dru Lavigne 6 months ago

  • File deleted (sas2flash.txt)

#25 Updated by Dru Lavigne 6 months ago

  • File deleted (fnDEBUG_HWonly.txt)

#26 Updated by Dru Lavigne 6 months ago

  • File deleted (fnDEBUG_all.txt)

#27 Updated by Dru Lavigne 6 months ago

  • File deleted (debug-freenas-20190104204447.tgz)

#28 Updated by James Gaul 6 months ago

Understood, thanks for looking into this. I was coming back here to update with some more info.
1- I have tested quite a few times with every single device removed/pulled from the system (and also with all SATA/SCU + other devices disabled in the BIOS). Also, the 2nd X9DR3 setup I have sits on a test chassis (my test bench), so it has never had any extra PCIe or other devices attached in all my testing (both boards/setups reproduce the issue). So it seems this is specific to the SM X9DR3-LN4F+ rev 1.20A board and 11.2. (I have also tested 2x BIOS versions prior to 3.3.)

2- I grabbed an almost identical setup: an X9DRi-LN4F+ system. (The only difference I could find between these two boards is that the DRi uses a C602 chipset, vs. a C606 chipset on the DR3.) 11.2 works on the X9DRi setup, and I stress-tested it with 10x disks, no problems. I also tried with the exact v2 CPUs from the failing X9DR3 setup, no problems.

So 2x X9DR3-LN4F+ fail when all else (possible) is controlled for, and 1x X9DRi-LN4F+ works.

(Why did y'all delete all the text files / data I uploaded? No big deal, but I'd think that would be helpful in future maybe.)

Thanks

#29 Updated by Alexander Motin 6 months ago

James Gaul wrote:

Why did y'all delete all the text files / data I uploaded? No big deal, but I'd think that would be helpful in future maybe.

While they indeed could possibly be useful sometimes, we are trying to delete all potentially private information, so that we could make closed tickets visible for other people. Supposedly for your good. ;-)

#30 Updated by James Gaul 4 months ago

I have a pretty interesting / important update!
(I also have info in my post here: https://www.ixsystems.com/community/threads/k-panic-running-11-2-w-2620v2-x9dr3-ln4f-or-x9dri-board-supermicro.72608/)

(To recap: ONLY with 11.2, not 11.1, I kept getting kernel panics during boot-up, or about 30 s after boot-up, as soon as any ZFS disk I/O occurred.) This was on an X9DR3 board with 2x 2620v2 CPUs (and RAM from the board's SM QVL). I also had an X9DRi system (from a different source) with 2x 2640v2 CPUs (and with different RAM, also on that board's SM QVL) that has been running 11.2 for months without a single issue.

Today I wanted to move some parts around as I'm getting closer to my final/"production" FreeNAS system.

So I put the 2x 2620v2 into the X9DRi board (which has been running 11.2 for months, but with 2x 2640v2 CPUs), and installed the latest 11.2-U2.1 to a USB stick. MUCH to my surprise, the exact same panic started happening again!! So this means the 11.2 issue is actually an issue with both of these X9 boards AND specifically the E5-2620v2 CPUs (but does not occur with 2640v2 CPUs!).

Twice, I tried just 1x 2620v2 (swapping which of the two CPUs was installed), as well as my other set of memory (both my sets of memory are ECC, and on the SM QVL for both boards). In all cases the panics occur! (Still no problems with 11.1, and still no problems with the 2640v2s.)

Given all this, I would maybe think there is something wrong with the 2x 2620v2 CPUs I have, but I doubt it, as I have stress-tested many different OSs on those specific CPUs and did not see a single issue (this was months ago, before I even first installed FN). Additionally, the same CPUs have run 11.1 for months with no problems.

Any ideas or info? (Or do you guys know of any 11.2 FN systems running specifically on 2620v2 CPUs?)

Thanks
