Kernel crash with reboot when writing to pool
I have an external USB disk containing a ZFS pool formerly written to with FreeNAS 11.2.
My plan was to restore a dataset plus snapshots from this temporarily connected pool "rm-bu2".
The pool was imported via:
zpool import rm-bu2
The command I issued for the restore was:
zfs send -R email@example.com | pv -r | zfs recv -e data/backups
After some minutes, the pv output stopped and the machine rebooted.
This is not the first time there have been instabilities, but I attributed those to an old Avago controller. Currently I am using the onboard SATA connectors of the
Supermicro X11SSH-CTF motherboard; the 64 GB of RAM is ECC and passed several days of burn-in with memtest86.
There have been similar crashes while copying data to the pool via NFSv3.
Debug info is attached; please let me know if more info is needed.
- Status changed from Unscreened to Blocked
- Reason for Blocked set to Need additional information from Author
There is obviously something wrong with your system. In the provided debug I see 5 core dumps within the last month, all different. Plus I see two more resets by the watchdog timer, which may mean the OS was not responsive there. Unfortunately the symptoms are all so different that they don't point in any specific direction for me. The latest panic, on December 11th, looked like this:
panic: Memory modified after free 0xfffff80a0ee65000(4096) val=deadc09e @ 0xfffff80a0ee65a38
cpuid = 2
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe14a76e5650
vpanic() at vpanic+0x177/frame 0xfffffe14a76e56b0
panic() at panic+0x43/frame 0xfffffe14a76e5710
trash_ctor() at trash_ctor+0x4c/frame 0xfffffe14a76e5720
uma_zalloc_arg() at uma_zalloc_arg+0x553/frame 0xfffffe14a76e5780
abd_alloc() at abd_alloc+0x142/frame 0xfffffe14a76e57d0
arc_hdr_alloc_pabd() at arc_hdr_alloc_pabd+0x157/frame 0xfffffe14a76e5800
arc_write_ready() at arc_write_ready+0x463/frame 0xfffffe14a76e5850
zio_ready() at zio_ready+0xdf/frame 0xfffffe14a76e5890
zio_execute() at zio_execute+0x18c/frame 0xfffffe14a76e58e0
taskqueue_run_locked() at taskqueue_run_locked+0x14c/frame 0xfffffe14a76e5940
taskqueue_thread_loop() at taskqueue_thread_loop+0x88/frame 0xfffffe14a76e5970
fork_exit() at fork_exit+0x84/frame 0xfffffe14a76e59b0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe14a76e59b0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
The change of "deadc0de" to "deadc09e" there may be caused by either a hardware issue (a single-bit flip) or a software bug (incorrect memory modification by some software). If you were not running a debug kernel, this could even have passed unnoticed.
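That the corruption is exactly one bit is easy to verify; a small Python sketch just illustrating the point:

```python
# UMA "trash" poison value written into freed memory vs. the value
# actually observed in the panic message above.
expected = 0xDEADC0DE
observed = 0xDEADC09E

diff = expected ^ observed       # XOR isolates the differing bits
print(hex(diff))                 # 0x40: a single bit position
print(bin(diff).count("1"))      # 1 -> exactly one flipped bit
```

A single-bit difference like this is the classic signature of flaky RAM, which is why the hardware explanation is plausible here.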
As I understand it, the last crash happened while you were doing the send/receive. Is that reproducible? Is it reproducible when the system is otherwise completely idle, preferably freshly rebooted beforehand?
It is not critical, but instead of `pv` I usually use the -v argument to the `zfs send`/`zfs receive` commands to report status. Or, IIRC, Ctrl-T also makes them print status. The additional pipe may slightly increase CPU usage.
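A sketch of what I mean; the dataset and snapshot names below are placeholders, not the ones from your system:

```shell
# Replicate a snapshot with built-in verbose progress reporting,
# avoiding the extra pv pipe stage. Names are placeholders.
zfs send -R -v rm-bu2/mydata@backup-2019-12-11 | zfs receive -e -v data/backups

# On FreeBSD, pressing Ctrl-T in the terminal sends SIGINFO,
# which also makes the running send/receive print its status.
```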
many thanks for looking into this!!
- Actually there was a spontaneous reboot last night while the system was mainly idling.
- I had pulled the watchdog jumper on the motherboard to disable the hard resets asserted by the watchdog (a tip found on the forum)
BTW I see this in the dmesg output:
pmc: Unknown Intel CPU.
Not sure whether this is related to my problems or completely irrelevant.
If there are no known compatibility problems with the system components and FreeNAS/FreeBSD, there might be a HW problem.
However, all components are new (from Supermicro, except for the processor and disks), so I would not expect the power supply to be flaky.
RAM problems I would expect to show up at least partly as ECC errors (which I do not see).
Currently another zfs send/recv is running (from another external disk) with pv omitted. It is running fine so far: 50 GB out of ~300 GB overall.
I'll report how that turns out and will include further debug info from crashes as they occur.
Again many thanks for your support,
We could trace the problem to RAM errors: the wrong DIMM modules had been put into the machine. Not only were they NOT ECC-protected, they were obviously not fully compatible with the motherboard :O
Not really fully on-topic, but still interesting:
- the errors were not detected by Memtest86, only by the Supermicro diagnosis software
- using only 1 memory bank (2 DIMM slots) instead of the available 2 banks also got rid of the errors
Since we put in ECC RAM (the Samsung type approved by Supermicro), FreeNAS has been as rock solid as I know it from earlier deployments.
[And I am very aware of the pool corruption you can easily get without ECC.]
So this is another case of underlying HW problems instead of a SW bug.
From my PoV this issue can be closed.
Thanks for your excellent support and sorry for wasting your time.