Bug #20898

Consistent Crashing

Added by Erik Riffel over 3 years ago. Updated about 3 years ago.

Status: Closed: Cannot reproduce
Priority: No priority
Assignee: Chris Torek
Category: OS
Target version:
Seen in:
Severity: New
Reason for Closing:
Reason for Blocked:
Needs QA: Yes
Needs Doc: Yes
Needs Merging: Yes
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

Was happily running FreeNAS-ca82ba222c0be179a6983636c50732c3 and loving it. On a whim, upgraded to 9.10.2. The VLANs went all crazy: mappings were gone and devices didn't exist. Ended up trashing them all and recreating them, and everything seemed fine. Then the random reboots started. Upgraded to 9.10.2-U1 hoping that would fix whatever was going on. No such luck. Started replacing hardware: new PSU, all new RAM, gigantic CPU heatsink. Everything seems fine hardware-wise. The pool is in an external SAS tray, so the PSU can't be overdrawn.
Not sure where to go from here. Hoping the crashdumps can provide some insight.

Attached is an error I see constantly in the console, but to be honest, I don't know if it was always there: this thing had been so bulletproof that I rarely logged in.

Thanks

History

#1 Updated by Erik Riffel over 3 years ago

  • File debug-st0r-20170206230844.txz added

#2 Updated by Erik Riffel over 3 years ago

  • File freenas_zil_lwb_write_entry.error.txt added

#3 Updated by Bonnie Follweiler over 3 years ago

  • Assignee set to Sean Fagan

#4 Updated by Sean Fagan over 3 years ago

  • Category changed from 1 to 137
  • Assignee changed from Sean Fagan to Chris Torek

Not sure why dtrace would end up logged by snmpd; that's for Suraj, I think. (It's an issue I thought was fixed already.)

The reboots sound like a watchdog timer issue. Chris?

#5 Updated by Chris Torek over 3 years ago

  • Status changed from Unscreened to Investigation
  • Seen in changed from Unspecified to 9.10.2-U2

It's not watchdogs...

Most of the crashes are here:

(kgdb) x/i 0xffffffff80975ba5
0xffffffff80975ba5 <softclock_call_cc+373>:    callq  *-0x90(%rbp)
(kgdb) l *0xffffffff80975ba5
0xffffffff80975ba5 is in softclock_call_cc (/freenas-9.10-releng/_BE/os/sys/kern/kern_timeout.c:689).
684    #if defined(DIAGNOSTIC) || defined(CALLOUT_PROFILING)
685        sbt1 = sbinuptime();
686    #endif
687        THREAD_NO_SLEEPING();
688        SDT_PROBE1(callout_execute, , , callout__start, c);
689        c_func(c_arg);
690        SDT_PROBE1(callout_execute, , , callout__end, c);
691        THREAD_SLEEPING_OK();
692    #if defined(DIAGNOSTIC) || defined(CALLOUT_PROFILING)
693        sbt2 = sbinuptime();

which suggests a bad pointer got into the callouts.
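
To illustrate why a bad pointer in a callout would fault at exactly that line, here is a minimal userspace sketch. It is not the kernel's code: `struct demo_callout`, `demo_dispatch`, and the `known_handlers` table are hypothetical names made up for this example, and the allow-list check is only one possible debugging aid, assumed here for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a callout entry:
 * a function pointer plus the argument to pass to it. */
struct demo_callout {
    void (*c_func)(void *);
    void *c_arg;
};

static int ticks_seen;

/* Example handler: adds the pointed-to int to a counter. */
static void demo_tick(void *arg)
{
    ticks_seen += *(int *)arg;
}

/* Allow-list of handlers this demo considers legitimate (assumption). */
static void (*const known_handlers[])(void *) = { demo_tick };

static bool demo_handler_is_known(void (*fn)(void *))
{
    size_t n = sizeof(known_handlers) / sizeof(known_handlers[0]);
    for (size_t i = 0; i < n; i++) {
        if (known_handlers[i] == fn)
            return true;
    }
    return false;
}

/* Same shape as the faulting statement, c_func(c_arg): an indirect
 * call through a pointer loaded from memory. If that memory has been
 * overwritten, the call jumps to whatever value is there. This demo
 * refuses to call a pointer it does not recognise and returns false. */
static bool demo_dispatch(const struct demo_callout *c)
{
    assert(c != NULL);
    void (*c_func)(void *) = c->c_func;
    void *c_arg = c->c_arg;

    if (!demo_handler_is_known(c_func))
        return false;
    c_func(c_arg);
    return true;
}
```

In a real kernel the set of valid handlers is not known in advance, so a check like this would only be useful as a temporary diagnostic, if at all.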

One crash is here:

(kgdb) x/i 0xffffffff807e5ddf
0xffffffff807e5ddf <usb_proc_msignal+207>:    mov    %r14,(%rax)
(kgdb) l *0xffffffff807e5ddf
0xffffffff807e5ddf is in usb_proc_msignal (/freenas-9.10-releng/_BE/os/sys/dev/usb/usb_process.c:351).
346        DPRINTF(" t=%u, num=%u\n", t, up->up_msg_num);
347    
348        /* Put message last on queue */
349    
350        pm2->pm_num = up->up_msg_num;
351        TAILQ_INSERT_TAIL(&up->up_qhead, pm2, pm_qentry);
352    
353        /* Check if we need to wakeup the USB process. */
354    
355        if (up->up_msleep) {

which suggests we have bizarrely bad values in memory somewhere, which could be due to whatever is apparently trashing callouts, if that is indeed what is happening.
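
As a rough illustration of why a tail insert is sensitive to trashed memory: `TAILQ_INSERT_TAIL` stores the new element through the head's `tqh_last` pointer, so a garbage `tqh_last` turns the insert into a store to an arbitrary address. The userspace sketch below uses the standard `<sys/queue.h>` macros; `struct demo_msg`, `demo_tail_is_sane`, and `demo_enqueue` are hypothetical names for this example, and the sanity check is an assumed diagnostic, not what `usb_proc_msignal` does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical message type, loosely modelled on the fields seen
 * in the listing (pm_num, pm_qentry). */
struct demo_msg {
    int pm_num;
    TAILQ_ENTRY(demo_msg) pm_qentry;
};

TAILQ_HEAD(demo_qhead, demo_msg);

/* In a consistent tail queue, tqh_last points at the "next" slot of
 * the last element (or at tqh_first when the queue is empty), and
 * that slot holds NULL. Anything else means the head was damaged. */
static bool demo_tail_is_sane(const struct demo_qhead *head)
{
    return head->tqh_last != NULL && *head->tqh_last == NULL;
}

/* Append a message only if the tail pointer looks consistent;
 * otherwise report failure instead of writing through it. */
static bool demo_enqueue(struct demo_qhead *head, struct demo_msg *m, int num)
{
    if (!demo_tail_is_sane(head))
        return false;
    m->pm_num = num;
    TAILQ_INSERT_TAIL(head, m, pm_qentry);
    return true;
}
```

A check like this catches only some kinds of damage (a `tqh_last` that points at unmapped memory would still fault when dereferenced), which is one reason an earlier or different crash from a debug kernel would be more informative.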

But it's not at all obvious who is trashing what, when.

Can you boot the debug kernel and see if you get an earlier / different crash?

#6 Updated by Chris Torek over 3 years ago

  • Status changed from Investigation to Closed: Cannot reproduce

#7 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-st0r-20170206230844.txz)

#8 Updated by Dru Lavigne about 3 years ago

  • File deleted (freenas_zil_lwb_write_entry.error.txt)

#9 Updated by Dru Lavigne about 3 years ago

  • Target version set to N/A
  • Private changed from Yes to No
