Project

General

Profile

Bug #63549

Kernel crash with reboot when writing to pool

Added by Uli Menzebach 3 months ago. Updated 2 months ago.

Status:
Closed
Priority:
No priority
Assignee:
Alexander Motin
Category:
OS
Target version:
Seen in:
Severity:
New
Reason for Closing:
Not Applicable
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

I have an external USB disk containing a ZFS pool formerly written to with FreeNAS 11.2
My plan was to restore a dataset plus snapshots from this temporarily connected pool "rm-bu2"
The pool import via
zpool import rm-bu2
worked.
The command I issued for restore was :
zfs send -R |pv -r| zfs recv -e data/backups
After some minutes, the pv output stopped and the machine rebooted.
This is not the first time there were instabilities, but I attributed those to an old Avago controller. Currently I am using the onboard SATA connectors of the
Supermicro X11SSH-CTF Motherboard, the 64GB RAM are ECC RAM that passed several days of burn-in via memtest86.
There have been similar crashes while copying data to the pool via NFSv3.

Debug info is attached, please let me know if more info is needed,

Many thanks,

Uli

History

#1 Updated by Uli Menzebach 3 months ago

  • File debug-rmnas2-20181211215638.txz added
  • Private changed from No to Yes

#2 Updated by Dru Lavigne 3 months ago

  • Assignee changed from Release Council to Alexander Motin

#3 Updated by Alexander Motin 3 months ago

  • Status changed from Unscreened to Blocked
  • Reason for Blocked set to Need additional information from Author

There is obviously something wrong with your system. In provided debug I see 5 core dumps within the last month, all different. Plus I see two more resets by watchdog timer, that may mean OS not being responsive there. Unfortunately all symptoms are so different that they don't point any specific direction for me. The latest panic on December 11th looked like this:

panic: Memory modified after free 0xfffff80a0ee65000(4096) val=deadc09e @ 0xfffff80a0ee65a38                                            

cpuid = 2                                                                                                                               
KDB: stack backtrace:                                                                                                                   
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe14a76e5650                                                          
vpanic() at vpanic+0x177/frame 0xfffffe14a76e56b0                                                                                       
panic() at panic+0x43/frame 0xfffffe14a76e5710                                                                                          
trash_ctor() at trash_ctor+0x4c/frame 0xfffffe14a76e5720                                                                                
uma_zalloc_arg() at uma_zalloc_arg+0x553/frame 0xfffffe14a76e5780                                                                       
abd_alloc() at abd_alloc+0x142/frame 0xfffffe14a76e57d0                                                                                 
arc_hdr_alloc_pabd() at arc_hdr_alloc_pabd+0x157/frame 0xfffffe14a76e5800                                                               
arc_write_ready() at arc_write_ready+0x463/frame 0xfffffe14a76e5850                                                                     
zio_ready() at zio_ready+0xdf/frame 0xfffffe14a76e5890                                                                                  
zio_execute() at zio_execute+0x18c/frame 0xfffffe14a76e58e0                                                                             
taskqueue_run_locked() at taskqueue_run_locked+0x14c/frame 0xfffffe14a76e5940                                                           
taskqueue_thread_loop() at taskqueue_thread_loop+0x88/frame 0xfffffe14a76e5970                                                          
fork_exit() at fork_exit+0x84/frame 0xfffffe14a76e59b0                                                                                  
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe14a76e59b0                                                                       
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---                                                                                               

The change of "deadc0de" to "deadc09e" there may be caused by either hardware issue (one bit loss) or a software bug (incorrect memory modification by some software). If you would not run debug kernel this could even pass unnoticed.

As I understand, the last crash happened when you was doing the send/receive. Is that reproducible? Is it reproducible when the system is completely idle otherwise and preferably freshly rebooted before that?

It is not critical, but instead of `pv` I usually use -v argument to `zfs send/receive` commands to report status. Or Ctrl-T IIRC also makes them to print status. Use of additional pipe may slightly increase CPU usage.

#4 Updated by Uli Menzebach 3 months ago

Alexander,

many thanks for looking into this!!

- Actually there was a spontaneous reboot last night while the system was mainly idling.
- I had pulled the watchdog jumper on the motherboard to switch off the hard resets asserted by the watchdog (Tip found on the forum)

BTW I see this in the dmesg output:

pmc: Unknown Intel CPU.
hwpmc: SOFT/16/64/0x67<INT,USR,SYS,REA,WRI>

Not sure this is related to my problems or completely irrelevant.

If there are no known compatibility problems with the system components and FreeNAS/FreeBSD, there might be a HW problem.
However all components are new (from Supermicro except processor and disks) so I would not expect the power supply to be flakey.
RAM problems I would expect to show up as ECC errors at least partly (which I do not see).

Currently another zfs send/recv is running (from another external disk) with pv omitted. Runs so far, 50GB from ~300GB overall.
I'll report how that turns out and will include further debug info from crashes as they occur.

Again many thanks for your support,

Uli

#5 Updated by Alexander Motin 3 months ago

Uli Menzebach wrote:

BTW I see this in the dmesg output:

pmc: Unknown Intel CPU.
hwpmc: SOFT/16/64/0x67<INT,USR,SYS,REA,WRI>

Not sure this is related to my problems or completely irrelevant.

It is irrelevant.

#6 Updated by Uli Menzebach 2 months ago

Alexander,

we could trace the problem to RAM errors -- the wrong DIMM modules had been put into the machine, not only were they NOT ECC-protected but obviously not completely compatible with the MB :O
Not really full on-topic, but still interesting:
- the errors would not be detected by Memtest86, but only the Supermicro diagnosis software
- using only 1 mem bank (2 DIMM slots) instead of the available 2 banks got rid of the errors as well.

Since we put ECC RAM in ( the Samsung type approved by Supermicro ) FreeNAS has been as rock solid as I know it from earlier deployments.
[ and I am very aware of the pool corruption you easily get w/o ECC ]

So this is another case of underlying HW problems instead of a SW bug.
From my PoV this issue can be closed.

Thanks for your excellent support and sorry for wasting your time.

Uli

#7 Updated by Dru Lavigne 2 months ago

  • File deleted (debug-rmnas2-20181211215638.txz)

#8 Updated by Dru Lavigne 2 months ago

  • Status changed from Blocked to Closed
  • Target version changed from Backlog to N/A
  • Private changed from Yes to No
  • Reason for Closing set to Not Applicable
  • Reason for Blocked deleted (Need additional information from Author)

Thanks for the followup Uli.

Also available in: Atom PDF