System ran stably for years until version 11: unscheduled system reboots.
After upgrading to 11, the system has been rebooting on its own, and I'm not sure why. It is currently running in debug mode.
I receive the following email:
SUBJECT="freenas8.corp.eceo.us: Unscheduled system reboot"
BODY="freenas8.corp.eceo.us had an unscheduled system reboot.
The operating system successfully came back online at Sun Sep 9 05:33:12 2018."
Two other identical machines have not rebooted at all, but those are still on 9.
- Status changed from Unscreened to Blocked
- Reason for Blocked set to Waiting for feedback
Hi Robert. In the IPMI event logs in the debug you provided, I see that your system is periodically rebooted by the BMC watchdog timer. This may mean either that your system periodically locks up or otherwise becomes unresponsive, or that something is wrong with the IPMI watchdog driver. You can try to find out which by stopping the watchdog daemon with `service watchdogd onestop`. If your system hangs at some point after that, then it is a real problem. If the problem disappears, then it is some kind of false positive, and we would need to know whether your system periodically experiences any kind of hiccups that could cause the watchdog to fire.
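A minimal sketch of how one might pull the watchdog entries out of the BMC event log yourself, assuming `ipmitool` is installed and can talk to the BMC locally through the `ipmi0` driver. The function name `watchdog_events` is hypothetical, and the pipe-separated row layout of `ipmitool sel list` is an assumption about that tool's output.

```sh
#!/bin/sh
# Hypothetical helper: read `ipmitool sel list` output on stdin, print
# only the watchdog-related entries, then a count of them.
# Assumed row layout:  id | date | time | sensor | event | direction
watchdog_events() {
    # grep keeps the matching rows; awk echoes them and appends a total
    # (awk still prints "0 watchdog event(s)" when grep matches nothing).
    grep -i 'watchdog' | awk '{ print } END { print NR " watchdog event(s)" }'
}
```

Usage: `ipmitool sel list | watchdog_events`. Comparing the timestamps with the "Unscheduled system reboot" emails shows whether each reboot was a watchdog reset.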
BTW, running the debug kernel may help diagnose kernel panics or hangs. If that is not the problem, as may be the case here, keeping it enabled all the time significantly reduces system performance, and may actually be one of the factors that trigger the watchdog, if the watchdog is being triggered by bursts of system load.
Thank you for promptly analyzing the log file. I have never had such a quick response from any project, open source or proprietary, so thank you.
I hope to use `zcat | egrep -i '(BMC|wdt|watchdogd)'` to grep those log files myself. I'm new again to BSD and totally new to ZFS.
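A slightly more general version of that one-liner, as a sketch: it searches both the current and the rotated logs and prefixes each hit with its file name. Rotated logs may be gzip- or bzip2-compressed depending on the newsyslog configuration, so the sketch picks the decompressor by file extension instead of relying on `zcat` alone. The default `/var/log/messages*` location and the function name `search_watchdog_logs` are assumptions.

```sh
#!/bin/sh
# Sketch: search current and rotated syslog files for BMC/watchdog
# messages. With no arguments it assumes /var/log/messages*.
search_watchdog_logs() {
    pattern='(BMC|wdt|watchdogd)'
    [ "$#" -gt 0 ] || set -- /var/log/messages*
    for f in "$@"; do
        [ -f "$f" ] || continue          # skip unmatched globs / missing files
        case "$f" in
            *.gz)  gzip -dc "$f" ;;      # gzip-rotated log
            *.bz2) bzip2 -dc "$f" ;;     # bzip2-rotated log
            *)     cat "$f" ;;           # uncompressed current log
        esac | grep -E -i "$pattern" |
        while IFS= read -r line; do
            printf '%s: %s\n' "$f" "$line"   # prefix each hit with its file
        done
    done
}
```

Run it with no arguments to scan every `/var/log/messages*` file, or pass specific files, e.g. `search_watchdog_logs /var/log/messages.0.bz2`.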
The server has not rebooted on its own while in DEBUG mode, and watchdogd has been running the entire time. Besides enabling DEBUG, the only other change to the system is that the SMART configuration no longer scans removable USB drives. However, it may be too early to tell.
Is `service watchdogd onestop` supposed to be `service watchdogd stop`?
Neither `man service` nor `man watchdog` mentions onestop, but it stopped the service anyway.
I googled it.
I know it is unlikely, but for now I am going to test the effect of running SMART tests on removable media and, if the watchdog timer is set to a very low value, increase it.
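If the timeout does turn out to be low, `watchdogd(8)` takes it as the `-t` option (in seconds), and `-s` sets how often the daemon pats the watchdog. Below is a sketch of an rc.conf fragment, assuming the daemon's options are passed through `watchdogd_flags`; the values are illustrative only, and on FreeNAS the middleware may regenerate rc.conf, so a hand edit might not persist.

```sh
# /etc/rc.conf fragment (sketch): run watchdogd with a longer timeout.
# -t: watchdog timeout in seconds; -s: seconds between pats.
# Illustrative values, not recommendations.
watchdogd_enable="YES"
watchdogd_flags="-t 256 -s 10"
```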
Please ignore, I am putting dmesg output here for future reference.
ichwd0: ICH WDT present but disabled in BIOS or hardware,
but enabled in IPMI
ipmi0: Attached watchdog
Thing is, I never had any luck connecting to the IPMI.
warning: KLD '/boot/kernel-debug/profile.ko' is newer than the linker.hints file
/freenas-11-releng/freenas/_BE/os/sys/kern/vfs_mount.c:849
lock order reversal:
1st 0xfffff800349e57c8 tmpfs (tmpfs)
2nd 0xfffff8003457c7c8 zfs (zfs)
#0 0xffffffff80b01120 at witness_debugger+0x70
#1 0xffffffff80b00fa2 at witness_checkorder+0xe02
#2 0xffffffff80a75bce at lockmgr_lock_fast_path+0x1ae
#3 0xffffffff81137698 at VOP_LOCK1_APV+0xe8
#4 0xffffffff80b7aee6 at _vn_lock+0x66
#5 0xffffffff804812af at zfs_root+0xcf
#6 0xffffffff80b6171a at vfs_donmount+0x120a
#7 0xffffffff80b604e2 at sys_nmount+0x72
#8 0xffffffff80f67538 at amd64_syscall+0x798
#9 0xffffffff80f464cb at Xfast_syscall+0xfb
warning: KLD '/boot/kernel-debug/smbus.ko' is newer than the linker.hints file
ipmi0: <IPMI System Interface> port 0xca2,0xca3 on acpi0
ipmi0: KCS mode found at io 0xca2 on acpi
ipmi0: IPMI device rev. 1, firmware rev. 0.65, version 2.0
ipmi0: Number of channels 5
ipmi0: Attached watchdog
ichwd0: <Intel 63XXESB watchdog timer> at port 0x430-0x437,0x460-0x47f on isa0
ichwd0: ICH WDT present but disabled in BIOS or hardware
device_attach: ichwd0 attach returned 6
hwpmc: SOFT/16/64/0x67<INT,USR,SYS,REA,WRI> TSC/1/64/0x20<REA> IAP/2/40/0x3ff<INT,USR,SYS,EDG,THR,REA,WRI,INV,QUA,PRC> IAF/3/40/0x67<INT,USR,SYS,REA,WRI>
warning: KLD '/boot/kernel-debug/t3_tom.ko' is newer than the linker.hints file
warning: KLD '/boot/kernel-debug/toecore.ko' is newer than the linker.hints file
warning: KLD '/boot/kernel-debug/t4_tom.ko' is newer than the linker.hints file
`onestop` is just a more robust equivalent of `stop`, but either should work the same.
Situations where enabling the debug kernel hides the problem are nasty, but unfortunately they happen sometimes. :( In that case you could revert to the non-debug kernel and at least try enabling and disabling the watchdog to collect some statistics.
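To collect those statistics, one option is to tally watchdog entries per day and compare the days with the watchdog enabled against the days with it disabled. This is a sketch assuming the same `ipmitool sel list` output format, with the date in the second `|`-separated field; `watchdog_events_per_day` is a hypothetical name.

```sh
#!/bin/sh
# Sketch: count watchdog entries per day from `ipmitool sel list`
# output on stdin (date assumed to be the 2nd '|'-separated field).
watchdog_events_per_day() {
    awk -F'|' 'tolower($0) ~ /watchdog/ {
        d = $2
        gsub(/^ +| +$/, "", d)   # trim the padding around the date
        n[d]++
    }
    END { for (d in n) print d, n[d] }' | sort
}
```

The final `sort` only orders MM/DD/YYYY dates correctly within a single year, which should be enough for a few weeks of testing.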
- Status changed from Blocked to Closed
- Target version changed from Backlog to N/A
- Reason for Closing set to Cannot Reproduce
- Reason for Blocked deleted (Waiting for feedback)
- Needs QA changed from Yes to No
- Needs Doc changed from Yes to No
- Needs Merging changed from Yes to No
Closing this due to lack of input.