Bug #28111

Random shutdowns at irregular intervals

Added by Leon Roy over 2 years ago. Updated over 2 years ago.

Status: Closed
Priority: No priority
Assignee: Alexander Motin
Category: OS
Target version:
Seen in:
Severity:
Reason for Closing: Behaves as Intended
Reason for Blocked:
Needs QA: No
Needs Doc: No
Needs Merging: No
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:

Motherboard: X9SCL+-F
RAM: 32GB ECC UDIMM (Supermicro supplied)
Boot disk: Supermicro DOM SSD-DM032-SMCMVN1
HBA: 9211-8i IT

ChangeLog Required: No

Description

After updating FreeNAS to 11.1-U1, the system started shutting itself down at random. I can see no errors in the logs. The box was stable until the update.

I've reverted the boot environment to 11.0-U3, but the system is still shutting down randomly. That would suggest a hardware issue, yet the system was rock solid until the upgrade.

Output of last reboot:

root@neptune:~ # last reboot
boot time                                  Wed Jan 31 22:04
boot time                                  Wed Jan 31 21:35
boot time                                  Tue Jan 30 12:33
shutdown time                              Tue Jan 30 12:28
boot time                                  Tue Jan 30 12:26
boot time                                  Tue Jan 30 12:21
boot time                                  Mon Jan 29 21:26
boot time                                  Mon Jan 29 21:04
boot time                                  Fri Jan 26 18:36
shutdown time                              Fri Jan 26 18:30
boot time                                  Mon Jan 22 13:42
shutdown time                              Mon Jan 22 13:39
boot time                                  Thu Jan 18 16:55
shutdown time                              Thu Jan 18 16:52
boot time                                  Thu Jan 18 16:47
shutdown time                              Thu Jan 18 14:16
boot time                                  Thu Jan 18 14:05
shutdown time                              Thu Jan 18 14:02
boot time                                  Tue Jan 16 16:13
shutdown time                              Tue Jan 16 16:10
boot time                                  Fri Jan 12 12:10
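A listing like the one above can be scanned for boots that have no matching shutdown record, which is one hint that the previous session ended in a reset rather than a clean shutdown. The sketch below is illustrative only: the function name `unclean_boots` is hypothetical, and it assumes the records are listed newest first, as in the output above.

```python
def unclean_boots(last_output):
    """Return timestamps of boots not preceded by a logged shutdown.

    Assumes records are listed newest first, so a "boot time" record whose
    next (older) record is also "boot time" followed a session that ended
    without a shutdown entry.
    """
    records = []
    for line in last_output.splitlines():
        line = line.strip()
        for kind in ("boot time", "shutdown time"):
            if line.startswith(kind):
                records.append((kind, line[len(kind):].strip()))

    result = []
    for (kind, stamp), (older_kind, _) in zip(records, records[1:]):
        if kind == "boot time" and older_kind == "boot time":
            result.append(stamp)
    return result


# A few records copied from the listing above.
sample = """\
boot time                                  Wed Jan 31 22:04
boot time                                  Wed Jan 31 21:35
boot time                                  Tue Jan 30 12:33
shutdown time                              Tue Jan 30 12:28
boot time                                  Tue Jan 30 12:26
"""

print(unclean_boots(sample))
```

The oldest record in any listing cannot be classified this way, because the record before it is not visible.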

Attached System > Advanced > Save Debug output.

History

#1 Updated by Leon Roy over 2 years ago

The system has IPMI and SNMP enabled. Both are monitored regularly by a Zabbix server. I mention this because the SNMP service crashed in 11.1-RELEASE on another identical box.

/data/crash is also empty.

#2 Updated by Leon Roy over 2 years ago

  • File deleted (debug-neptune-20180131224912.tgz)

#3 Updated by Dru Lavigne over 2 years ago

  • Private changed from No to Yes
  • Reason for Blocked set to Need additional information

Leon: please reattach the system debug as the dev will need that to start the investigation of the issue.

#4 Updated by Leon Roy over 2 years ago

  • File debug-neptune-20180131224912.tgz added

Reattached.

#5 Updated by Dru Lavigne over 2 years ago

  • Assignee changed from Release Council to Alexander Motin
  • Reason for Blocked deleted (Need additional information)

Sasha: do you see any likely culprits in the debug?

#6 Updated by Alexander Motin over 2 years ago

  • Status changed from Not Started to Blocked
  • Reason for Blocked set to Need additional information

Unfortunately there are no kernel dumps in the attached debug and nothing in the logs. Could you check your BIOS or IPMI event logs for events that correlate with the reboots?

#7 Updated by Leon Roy over 2 years ago

Alexander Motin wrote:

Unfortunately there are no kernel dumps in the attached debug and nothing in the logs. Could you check your BIOS or IPMI event logs for events that correlate with the reboots?

Checking the IPMI logs on the affected systems shows the following:

223    2018/02/01 00:48:16    Watchdog 2 #0xca    Watchdog 2    Hard Reset
222    2018/02/01 00:48:15    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt
221    2018/01/31 21:48:11    Watchdog 2 #0xca    Watchdog 2    Hard Reset
220    2018/01/31 21:48:10    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt
219    2018/01/30 19:49:19    Watchdog 2 #0xca    Watchdog 2    Hard Reset
218    2018/01/30 19:49:18    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt
217    2018/01/30 03:19:43    Watchdog 2 #0xca    Watchdog 2    Hard Reset
216    2018/01/30 03:19:42    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt
215    2018/01/29 21:23:48    Watchdog 2 #0xca    Watchdog 2    Hard Reset
214    2018/01/29 21:23:47    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt
213    2018/01/29 00:50:25    Watchdog 2 #0xca    Watchdog 2    Hard Reset
212    2018/01/29 00:50:24    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt
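To see how often the watchdog fired, the event log lines above can be parsed and the gaps between consecutive "Hard Reset" entries computed. This is a rough sketch, assuming the five-column layout shown above with columns separated by two or more spaces; the helper names `hard_resets` and `gaps_in_hours` are hypothetical.

```python
import re
from datetime import datetime


def hard_resets(sel_text):
    """Return the datetimes of 'Hard Reset' events, oldest first."""
    stamps = []
    for line in sel_text.splitlines():
        # Columns: id, timestamp, sensor, sensor type, event.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 5 and fields[4] == "Hard Reset":
            stamps.append(datetime.strptime(fields[1], "%Y/%m/%d %H:%M:%S"))
    return sorted(stamps)


def gaps_in_hours(stamps):
    """Return the time between consecutive events, in hours (1 decimal)."""
    return [round((b - a).total_seconds() / 3600, 1)
            for a, b in zip(stamps, stamps[1:])]


# The four most recent entries from the log above.
sel_sample = """\
223    2018/02/01 00:48:16    Watchdog 2 #0xca    Watchdog 2    Hard Reset
222    2018/02/01 00:48:15    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt
221    2018/01/31 21:48:11    Watchdog 2 #0xca    Watchdog 2    Hard Reset
220    2018/01/31 21:48:10    Watchdog 2 #0xca    Watchdog 2    Timer Interrupt
"""

resets = hard_resets(sel_sample)
print(len(resets), gaps_in_hours(resets))
```

Each "Timer Interrupt" entry in the log is followed one second later by a "Hard Reset", so counting the resets alone is enough to count the watchdog expirations.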

We have another identical server, with every BIOS setting identical as well. The issue doesn't occur there, and there are no watchdog events in its IPMI logs.

The watchdog is disabled in the BIOS on both systems (screenshot of the BIOS setting attached).

#8 Updated by Leon Roy over 2 years ago

Output from ipmitool:

Affected system:

root@neptune:~ # ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      137 sec
Present Countdown:      134 sec

Unaffected system:

Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      137 sec
Present Countdown:      136 sec
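Both outputs report a running timer with a Hard Reset action and a 137-second initial countdown, so the timer is armed on both machines even though the BIOS setting is disabled. A minimal sketch for comparing such outputs programmatically follows; the helper names are hypothetical, and the parser only assumes the "Key: value" layout shown above.

```python
def parse_watchdog(text):
    """Parse 'Key: value' lines from `ipmitool mc watchdog get` into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def is_armed(info):
    """True if the timer is reported as started/running."""
    return info.get("Watchdog Timer Is", "").startswith("Started")


def initial_countdown_seconds(info):
    """Extract the integer from a value such as '137 sec'."""
    return int(info["Initial Countdown"].split()[0])


# Output copied from the affected system above.
affected = """\
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      137 sec
Present Countdown:      134 sec
"""

info = parse_watchdog(affected)
print(is_armed(info), info["Watchdog Timer Actions"], initial_countdown_seconds(info))
```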

#9 Updated by Leon Roy over 2 years ago

On a hunch based on my feedback on bugs 27818 and 14723, I reduced the frequency with which our Zabbix server polls our FreeNAS box. The system no longer crashes.

Could there be a connection between the SNMP service hanging in FreeNAS (as per 27818) and the watchdog daemon not resetting the watchdog counter in time?

#10 Updated by Alexander Motin over 2 years ago

Surely it can. The whole point of a watchdog is to reset the system when it becomes unresponsive. If your system does become unresponsive for periods of time, then the watchdog is doing the right thing.
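The behaviour described here can be illustrated with a toy model: the timer counts down from its initial value, each reset by the daemon restarts the countdown, and the hardware fires if no reset arrives in time. This is only a simulation under stated assumptions. The 137-second timeout is taken from the ipmitool output earlier in this ticket; the function name `first_reset` and the example reset times are hypothetical, since the daemon's actual reset interval is not shown anywhere in the ticket.

```python
def first_reset(pat_times, timeout, end_time):
    """Return the time at which the watchdog would fire, or None.

    pat_times: times (seconds) at which the daemon resets the countdown.
    timeout:   countdown length in seconds; the timer is armed at t = 0.
    end_time:  end of the observation window in seconds.
    """
    last = 0.0
    for t in sorted(pat_times):
        if t - last >= timeout:
            # The countdown reached zero before this reset arrived.
            return last + timeout
        last = t
    if end_time - last >= timeout:
        return last + timeout
    return None


TIMEOUT = 137  # seconds, the "Initial Countdown" reported by ipmitool

# Responsive system: the counter is reset every 60 s (hypothetical interval).
print(first_reset([60, 120, 180, 240, 300], TIMEOUT, end_time=360))

# Stalled system: no reset between t = 120 s and t = 300 s.
print(first_reset([60, 120, 300], TIMEOUT, end_time=360))
```

In the second case the gap of 180 seconds exceeds the 137-second countdown, so the model fires at 120 + 137 = 257 seconds, which corresponds to a Hard Reset entry in the IPMI log.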

#11 Updated by Alexander Motin over 2 years ago

  • Category set to OS
  • Status changed from Blocked to Closed
  • Reason for Closing set to Behaves as Intended
  • Reason for Blocked deleted (Need additional information)
  • Needs QA changed from Yes to No
  • Needs Doc changed from Yes to No
  • Needs Merging changed from Yes to No

#12 Updated by Dru Lavigne over 2 years ago

  • File deleted (debug-neptune-20180131224912.tgz)

#13 Updated by Dru Lavigne over 2 years ago

  • Target version set to N/A
  • Private changed from Yes to No
