Bug #46344

System ran stable for years until version 11. Unscheduled system reboots.

Added by Robert Townley 8 months ago. Updated 7 months ago.

Status:
Closed
Priority:
No priority
Assignee:
Alexander Motin
Category:
OS
Target version:
Seen in:
Severity:
New
Reason for Closing:
Cannot Reproduce
Reason for Blocked:
Needs QA:
No
Needs Doc:
No
Needs Merging:
No
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

After upgrading to 11, the system has been rebooting on its own, and I am not sure why. It is currently running in debug mode.

I receive the following email:
SUBJECT="freenas8.corp.eceo.us: Unscheduled system reboot"
BODY="freenas8.corp.eceo.us had an unscheduled system reboot.
The operating system successfully came back online at Sun Sep 9 05:33:12 2018."

Two other identical machines have not rebooted at all, but those are still on 9.

History

#1 Updated by Robert Townley 8 months ago

  • File debug-freenas8-20180913125826.txz added
  • Private changed from No to Yes

#2 Updated by Dru Lavigne 8 months ago

  • Assignee changed from Release Council to Alexander Motin

#3 Updated by Alexander Motin 8 months ago

  • Status changed from Unscreened to Blocked
  • Reason for Blocked set to Waiting for feedback

Hi Robert. In the IPMI event logs in the provided debug I see that your system was periodically rebooted by the BMC watchdog timer. That may mean either that your system periodically locks up or otherwise becomes unresponsive, or that something is wrong with the IPMI watchdog driver. You can try to identify which it is by stopping the watchdog daemon with `service watchdogd onestop`. If your system hangs at some point after that, then it is a real problem. If the problem disappears, then it is some kind of false positive, and we would need to know whether your system periodically experiences any kind of hiccups that could cause the watchdog to fire.

#4 Updated by Alexander Motin 8 months ago

BTW, running a debug kernel may help with diagnosing kernel panics or hangs. If that is not the problem, as is possibly the case here, keeping it enabled all the time significantly reduces system performance, and may actually be one of the factors causing the watchdog to fire, if it is triggered by system load bursts.

#5 Updated by Robert Townley 8 months ago

Thank you for promptly analyzing the log file. Never had such a quick response from a project whether open or proprietary, so thank you.

I hope to use `zcat | egrep -i '(BMC|wdt|watchdogd)'` to grep those log files myself. I am new again to BSD and totally new to ZFS.
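As a rough sketch of what that filter does, the snippet below wraps the same pattern in a shell function and runs it over three lines copied from the dmesg output posted later in this ticket. The function name `filter_watchdog_lines` is hypothetical and only for illustration; it is not part of FreeNAS.

```shell
# Hypothetical helper (illustrative name): keep only the lines that
# mention the BMC, a watchdog timer (WDT), or the watchdogd daemon.
filter_watchdog_lines() {
    # -E: extended regular expressions (what egrep uses); -i: ignore case
    grep -E -i '(BMC|wdt|watchdogd)'
}

# Sample input: three lines taken from the dmesg output in this ticket.
sample='ipmi0: Attached watchdog
ichwd0: <Intel 63XXESB watchdog timer> at port 0x430-0x437,0x460-0x47f on isa0
ichwd0: ICH WDT present but disabled in BIOS or hardware'

matches=$(printf '%s\n' "$sample" | filter_watchdog_lines)
printf '%s\n' "$matches"
```

With this sample only the `ICH WDT present` line is printed. The other two lines say `watchdog` without a trailing `d`, so the pattern as written skips them; widening `watchdogd` to `watchdog` would catch those as well.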

The server has not rebooted on its own while in DEBUG mode, and watchdogd has been running the entire time. Besides enabling DEBUG, the other difference in the system is that the SMART configuration was changed to NOT scan removable USB drives. However, it may be too early to tell.

Is `service watchdogd onestop` supposed to be `service watchdogd stop`?
Neither `man service` nor `man watchdog` has the term onestop, but it stopped the service anyway.
I googled it.

I know it is unlikely, but for now I am going to proceed with testing the effects of SMART tests on removable media and, if the watchdog timer is set to a really low timeout, increase it.

Thank You.

#6 Updated by Robert Townley 8 months ago

Please ignore, I am putting dmesg output here for future reference.

ichwd0: ICH WDT present but disabled in BIOS or hardware,
but enabled in IPMI
ipmi0: Attached watchdog

Thing is, I never had any luck connecting to the IPMI.

warning: KLD '/boot/kernel-debug/profile.ko' is newer than the linker.hints file
lock order reversal:
1st 0xfffff800349e57c8 tmpfs (tmpfs)
/freenas-11-releng/freenas/_BE/os/sys/kern/vfs_mount.c:849
2nd 0xfffff8003457c7c8 zfs (zfs) /freenas-11-releng/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1815
stack backtrace:
#0 0xffffffff80b01120 at witness_debugger+0x70
#1 0xffffffff80b00fa2 at witness_checkorder+0xe02
#2 0xffffffff80a75bce at lockmgr_lock_fast_path+0x1ae
#3 0xffffffff81137698 at VOP_LOCK1_APV+0xe8
#4 0xffffffff80b7aee6 at _vn_lock+0x66
#5 0xffffffff804812af at zfs_root+0xcf
#6 0xffffffff80b6171a at vfs_donmount+0x120a
#7 0xffffffff80b604e2 at sys_nmount+0x72
#8 0xffffffff80f67538 at amd64_syscall+0x798
#9 0xffffffff80f464cb at Xfast_syscall+0xfb
warning: KLD '/boot/kernel-debug/smbus.ko' is newer than the linker.hints file
ipmi0: <IPMI System Interface> port 0xca2,0xca3 on acpi0
ipmi0: KCS mode found at io 0xca2 on acpi
ipmi0: IPMI device rev. 1, firmware rev. 0.65, version 2.0
ipmi0: Number of channels 5
ipmi0: Attached watchdog
ichwd0: <Intel 63XXESB watchdog timer> at port 0x430-0x437,0x460-0x47f on isa0
ichwd0: ICH WDT present but disabled in BIOS or hardware
device_attach: ichwd0 attach returned 6
hwpmc: SOFT/16/64/0x67<INT,USR,SYS,REA,WRI> TSC/1/64/0x20<REA> IAP/2/40/0x3ff<INT,USR,SYS,EDG,THR,REA,WRI,INV,QUA,PRC> IAF/3/40/0x67<INT,USR,SYS,REA,WRI>
warning: KLD '/boot/kernel-debug/t3_tom.ko' is newer than the linker.hints file
warning: KLD '/boot/kernel-debug/toecore.ko' is newer than the linker.hints file
warning: KLD '/boot/kernel-debug/t4_tom.ko' is newer than the linker.hints file

#7 Updated by Alexander Motin 8 months ago

`onestop` is just a more robust equivalent of `stop`, but either should work the same.

Situations where enabling the debug kernel hides the problem are nasty, but unfortunately they happen sometimes. :( In that case you could revert to the non-debug kernel and at least try enabling/disabling the watchdog to collect some statistics.
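One possible way to collect such statistics, sketched here purely as an illustration, is to pull the timestamp out of each "Unscheduled system reboot" notification like the one quoted in the description and note whether the watchdog was enabled at that time. The function name `extract_reboot_time` is hypothetical, and the sketch assumes the notification wording stays exactly as quoted in this ticket.

```shell
# Hypothetical helper (illustrative name): print the timestamp that follows
# "came back online at" in a notification body read from stdin.
extract_reboot_time() {
    # Capture everything up to the four-digit year that precedes the final period.
    sed -n 's/.*came back online at \(.*[0-9]\{4\}\)\..*/\1/p'
}

# Sample input: the notification body quoted in the description of this ticket.
body='freenas8.corp.eceo.us had an unscheduled system reboot.
The operating system successfully came back online at Sun Sep 9 05:33:12 2018.'

reboot_time=$(printf '%s\n' "$body" | extract_reboot_time)
printf '%s\n' "$reboot_time"
```

For the sample body this prints `Sun Sep 9 05:33:12 2018`. Appending each extracted timestamp to a file, together with the watchdog state at the time, would give a simple record to compare across enabled and disabled periods.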

#8 Updated by Alexander Motin 7 months ago

  • Status changed from Blocked to Closed
  • Target version changed from Backlog to N/A
  • Reason for Closing set to Cannot Reproduce
  • Reason for Blocked deleted (Waiting for feedback)
  • Needs QA changed from Yes to No
  • Needs Doc changed from Yes to No
  • Needs Merging changed from Yes to No

Closing this due to lack of input.

#9 Updated by Robert Townley 7 months ago

It has not rebooted on its own in 11.2, or at least not in a month, so yes, close.

#10 Updated by Dru Lavigne 7 months ago

  • File deleted (debug-freenas8-20180913125826.txz)

#11 Updated by Dru Lavigne 7 months ago

  • Private changed from Yes to No
