Bug #20898

Consistent Crashing

Added by Erik Riffel over 3 years ago. Updated about 3 years ago.

Status: Closed: Cannot reproduce
Priority: No priority
Assignee: Chris Torek


Was happily running FreeNAS-ca82ba222c0be179a6983636c50732c3 and loving it. On a whim, upgraded to 9.10.2. The VLANs went all crazy: mappings were gone and devices didn't exist. Ended up trashing them all and recreating them, and everything seemed fine. Then the random reboots started. Upgraded to 9.10.2-U1 hoping that would fix whatever was going on. No such luck. Started replacing hardware: new PSU, all new RAM, gigantic CPU heatsink. Everything seems fine hardware-wise. The pool is in an external SAS tray, so the PSU can't be overdrawn.
Not sure where to go from here. Hoping the crashdumps can provide some insight.

Attached is an error I see constantly in the console, but to be honest, I don't know if it was always there: this thing had been so bulletproof that I rarely logged in.



#1 Updated by Erik Riffel over 3 years ago

  • File debug-st0r-20170206230844.txz added

#2 Updated by Erik Riffel over 3 years ago

  • File freenas_zil_lwb_write_entry.error.txt added

#3 Updated by Bonnie Follweiler over 3 years ago

  • Assignee set to Sean Fagan

#4 Updated by Sean Fagan over 3 years ago

  • Category changed from 1 to 137
  • Assignee changed from Sean Fagan to Chris Torek

Not sure why dtrace would end up logged by snmpd; that's for Suraj, I think. (It's an issue I thought was fixed already.)

The reboots sound like a watchdog timer issue. Chris?

#5 Updated by Chris Torek over 3 years ago

  • Status changed from Unscreened to Investigation
  • Seen in changed from Unspecified to 9.10.2-U2

It's not watchdogs...

Most of the crashes are here:

(kgdb) x/i 0xffffffff80975ba5
0xffffffff80975ba5 <softclock_call_cc+373>:    callq  *-0x90(%rbp)
(kgdb) l *0xffffffff80975ba5
0xffffffff80975ba5 is in softclock_call_cc (/freenas-9.10-releng/_BE/os/sys/kern/kern_timeout.c:689).
684    #if defined(DIAGNOSTIC) || defined(CALLOUT_PROFILING)
685        sbt1 = sbinuptime();
686    #endif
687        THREAD_NO_SLEEPING();
688        SDT_PROBE1(callout_execute, , , callout__start, c);
689        c_func(c_arg);
690        SDT_PROBE1(callout_execute, , , callout__end, c);
691        THREAD_SLEEPING_OK();
692    #if defined(DIAGNOSTIC) || defined(CALLOUT_PROFILING)
693        sbt2 = sbinuptime();

which suggests a bad pointer got into the callouts.
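To illustrate why a crash at the `c_func(c_arg)` call points at a corrupted callout rather than at the timer code itself, here is a minimal user-space sketch. It is not the kernel's code: `toy_callout`, `toy_softclock_call`, `toy_trash`, and the handler whitelist are all hypothetical names made up for the illustration, and the real `softclock_call_cc` performs no such validation before calling through the pointer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the kernel's struct callout. */
struct toy_callout {
    void (*c_func)(void *);   /* handler to run when the timer fires */
    void *c_arg;              /* argument passed to the handler */
};

static int fired;             /* accumulates the values passed to toy_handler */

static void toy_handler(void *arg) { fired += *(int *)arg; }

/* Hypothetical whitelist of the handlers this toy program ever registers. */
static void (*const known_handlers[])(void *) = { toy_handler };

/* Returns 1 if func is one of the registered handlers, else 0. */
static int handler_is_known(void (*func)(void *))
{
    size_t n = sizeof(known_handlers) / sizeof(known_handlers[0]);

    for (size_t i = 0; i < n; i++)
        if (known_handlers[i] == func)
            return 1;
    return 0;
}

/*
 * Same shape as the crash site: copy the handler and argument out of the
 * callout, then call through the pointer. The check exists only so this
 * demo can report corruption instead of jumping to a garbage address.
 */
static int toy_softclock_call(struct toy_callout *c)
{
    void (*c_func)(void *) = c->c_func;
    void *c_arg = c->c_arg;

    if (!handler_is_known(c_func))
        return -1;            /* an unchecked call would fault here */
    c_func(c_arg);
    return 0;
}

/* Simulates a stray write (for example a use-after-free) over the callout. */
static void toy_trash(struct toy_callout *c)
{
    memset(c, 0xde, sizeof(*c));
}
```

Calling `toy_softclock_call` on an intact callout runs the handler; after `toy_trash` the same call reports -1. The caller is unchanged in both cases, which is why the faulting address alone does not identify whoever overwrote the structure.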

One crash is here:

(kgdb) x/i 0xffffffff807e5ddf
0xffffffff807e5ddf <usb_proc_msignal+207>:    mov    %r14,(%rax)
(kgdb) l *0xffffffff807e5ddf
0xffffffff807e5ddf is in usb_proc_msignal (/freenas-9.10-releng/_BE/os/sys/dev/usb/usb_process.c:351).
346        DPRINTF(" t=%u, num=%u\n", t, up->up_msg_num);
348        /* Put message last on queue */
350        pm2->pm_num = up->up_msg_num;
351        TAILQ_INSERT_TAIL(&up->up_qhead, pm2, pm_qentry);
353        /* Check if we need to wakeup the USB process. */
355        if (up->up_msleep) {

which suggests we have bizarrely bad values in memory somewhere, which could be due to whatever is apparently trashing callouts, if that is indeed what is happening.
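For context on that faulting store: `TAILQ_INSERT_TAIL` writes the new element through the pointer saved in the queue head's `tqh_last` field, so the `mov %r14,(%rax)` above is most likely that write hitting a bad `tqh_last`. The sketch below hand-expands the macro using the standard `<sys/queue.h>` definitions; `toy_msg`, `toy_head`, `toy_insert_tail`, and `toy_queue_consistent` are hypothetical names for the illustration, not anything from the USB stack.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical message type standing in for the USB process message. */
struct toy_msg {
    int num;
    TAILQ_ENTRY(toy_msg) entry;
};

TAILQ_HEAD(toy_head, toy_msg);

/*
 * Hand-expanded equivalent of TAILQ_INSERT_TAIL(head, m, entry).
 * The store through head->tqh_last is the step that faults if the
 * queue head (or the previous tail element) has been overwritten.
 */
static void toy_insert_tail(struct toy_head *head, struct toy_msg *m)
{
    m->entry.tqe_next = NULL;
    m->entry.tqe_prev = head->tqh_last;
    *head->tqh_last = m;                 /* write through a stored pointer */
    head->tqh_last = &m->entry.tqe_next;
}

/*
 * Walks the queue and checks that every back-pointer, and finally
 * tqh_last, refers to the slot that actually links to the next element.
 * Returns 1 if the links agree, 0 otherwise.
 */
static int toy_queue_consistent(struct toy_head *head)
{
    struct toy_msg **slot = &head->tqh_first;
    struct toy_msg *m;

    TAILQ_FOREACH(m, head, entry) {
        if (m->entry.tqe_prev != slot)
            return 0;
        slot = &m->entry.tqe_next;
    }
    return head->tqh_last == slot;
}
```

A check like `toy_queue_consistent` only detects damage after the fact. It says nothing about which code did the overwrite.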

But it's not at all obvious who is trashing what, when.

Can you boot the debug kernel and see if you get an earlier / different crash?

#6 Updated by Chris Torek over 3 years ago

  • Status changed from Investigation to Closed: Cannot reproduce

#7 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-st0r-20170206230844.txz)

#8 Updated by Dru Lavigne about 3 years ago

  • File deleted (freenas_zil_lwb_write_entry.error.txt)

#9 Updated by Dru Lavigne about 3 years ago

  • Target version set to N/A
  • Private changed from Yes to No
