Bug #67179

Sudden reboot of Dell R610 server on which FreeNAS 11.2 has been installed

Added by Frédéric Denin 9 months ago. Updated 7 months ago.

Status: Closed
Priority: No priority
Assignee: Alexander Motin
Category: Hardware
Target version:
Severity: New
Reason for Closing: Cannot Reproduce
Reason for Blocked:
Needs QA: Yes
Needs Doc: Yes
Needs Merging: Yes
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

Hi,

since upgrading from 11.1-U5 to 11.2 (and then 11.2-U1), our FreeNAS system installed on a Dell PowerEdge R610 suddenly reboots for no apparent reason.
We've tried updating our server and our hard drives and have also reinstalled the system, but this didn't help.

In fact, the system functioned normally for a few days after the reinstall, but since 01/02/2019 it has been suffering from these sudden reboots.

This system was installed with 9.3 in 2015 and has been upgraded since.

This FreeNAS server has always been used as a repository for VMware virtual machines.

What else can I provide to help you troubleshoot?

Best Regards,

Tunables.png (10 KB), added by Frédéric Denin, 01/24/2019 04:59 AM

History

#1 Updated by Frédéric Denin 9 months ago

  • File debug-PRD-FNAS-01-20190103110638.txz added
  • Private changed from No to Yes

#2 Updated by Dru Lavigne 9 months ago

  • Category changed from Build system to Hardware
  • Assignee changed from Release Council to Alexander Motin

#3 Updated by Frédéric Denin 9 months ago

Hi,

we've found this error in the iDRAC management interface each time the system reboots:
"Watchdog sensor for System Board, hard reset by SMS/OS timer was asserted"

In addition, we will install the same release on another R610 system to test whether it also happens there.

Best regards,

#4 Updated by Alexander Motin 9 months ago

  • Status changed from Unscreened to Blocked
  • Reason for Blocked set to Need additional information from Author

Watchdog events in your board log most likely mean that the system either hung or was otherwise unresponsive for too long. We need to figure out why. You may try to stop the watchdog daemon with `service watchdogd stop` and see how the system behaves after that. Or you may try to correlate those events with system load and operations at that time. Alternatively, you may try to disable the hardware watchdog devices with the `hint.ipmi.0.disabled=1` and `hint.ichwd.0.disabled=1` loader tunables. That should cause the software watchdog to be used instead, which should leave us core dumps after a reboot.
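
To correlate the watchdog resets with system load, one option is to write a timestamped load sample to disk at a fixed interval, so that the last entries before a reset survive the reboot. A minimal sketch, assuming a POSIX shell; the function name `log_load_sample`, the log path, and the 60-second interval are illustrative and not part of FreeNAS:

```shell
#!/bin/sh
# Append one line with a UTC timestamp and the current `uptime` output
# (which includes the load averages) to the given log file.
# The function name is hypothetical, chosen for this example.
log_load_sample() {
    log_file="$1"
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uptime)" >> "$log_file"
}

# Example use (not executed here; the pool path is an assumption):
#   while :; do log_load_sample /mnt/tank/watchdog-load.log; sleep 60; done
```

After a reset, the last few lines of the log can be compared with the timestamp of the watchdog event in the board log.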

#5 Updated by Frédéric Denin 9 months ago

Hi and thanks for these explanations,

these reboots usually happen when I launch multiple Storage vMotion operations at the same time.

I will try both solutions you provided and will keep you informed.

Best Regards,

#6 Updated by Frédéric Denin 9 months ago

Hi,

applying "service watchdogd stop" seems to fix the problem.

I'm not 100 percent sure, as the system also ran smoothly for a few days at the start, but I have now moved many virtual machines and no sudden reboots have occurred.

Best Regards,

#7 Updated by Dru Lavigne 9 months ago

Frederic, as a followup: is the system still staying up? Should we close out this ticket?

#8 Updated by Frédéric Denin 9 months ago

Hi,

sadly, the system has rebooted multiple times with no significant activity and with the watchdog disabled.
I will try the second solution you provided.

Best Regards,

#9 Updated by Frédéric Denin 9 months ago

Hi,

We don't have a crash dump (and this server has had sudden reboots since we changed the tunables).

Could you tell us whether the config in the attachment is correct?

Best Regards,

#10 Updated by Alexander Motin 9 months ago

Visually they look correct, but the real proof would be a look at dmesg, where the drivers for those devices should no longer attach.
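
The dmesg check can be sketched as a small filter. This assumes the drivers announce themselves with lines starting `ipmiN:` or `ichwdN:`, as FreeBSD device drivers normally do; the function name `watchdog_drivers_attached` is made up for this example:

```shell
#!/bin/sh
# Read dmesg-style text on stdin and print any lines emitted by the
# ipmi or ichwd drivers. Exit status 0 means at least one such line
# was found (a driver still attached); 1 means none was found.
watchdog_drivers_attached() {
    grep -E '^(ipmi|ichwd)[0-9]+:'
}

# Example use on the server (not executed here):
#   dmesg | watchdog_drivers_attached
```

If the tunables took effect, the filter should print nothing.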

#11 Updated by Alexander Motin 9 months ago

You may also check the BMC/IPMI logs for new messages about the causes of the reboots. With the last tunable, though, you will not be able to do that from FreeNAS; it has to be done either from the BIOS or out of band.
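
As one out-of-band option, the System Event Log can be read over the network with `ipmitool` from another machine. This is only a sketch: it assumes IPMI over LAN is enabled on the BMC and that `ipmitool` is installed on the machine running it; the host name, user name, and the function name `bmc_watchdog_events` are placeholders:

```shell
#!/bin/sh
# List the BMC System Event Log over the lanplus interface and keep
# only entries that mention the watchdog. `-a` makes ipmitool prompt
# for the password instead of taking it on the command line.
bmc_watchdog_events() {
    bmc_host="$1"
    bmc_user="$2"
    ipmitool -I lanplus -H "$bmc_host" -U "$bmc_user" -a sel elist | grep -i 'watchdog'
}

# Example use (placeholders, not executed here):
#   bmc_watchdog_events idrac.example.net root
```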

#12 Updated by Frédéric Denin 9 months ago

Thanks for these quick replies.
In the BMC, I found the same errors as before.
I will now try changing some options in the BIOS.
If that doesn't help, I will try reinstalling 11.1.

Best Regards,

#13 Updated by Frédéric Denin 8 months ago

Hi,

since we rolled back to 11.1, we have had no more crashes like the previous ones.
I think we will keep this release installed on the server as long as possible.

Best Regards,

#14 Updated by Alexander Motin 8 months ago

Frédéric, I see no way that "the system has rebooted multiple times ... with watchdog disabled" while the BMC still logs "Watchdog sensor for System Board, hard reset by SMS/OS timer was asserted". It just doesn't fit, or those are pieces from different puzzles. Either A) the reboots with the watchdog disabled are triggered by some other means, and then you should see some other error in the BMC or nothing at all, or B) you have re-enabled the watchdog timer.

#15 Updated by Alexander Motin 8 months ago

A colleague of mine reminded me that if you disable the watchdog devices in FreeNAS, you should also make sure the watchdog timer is not enabled in the BIOS; otherwise the system may reboot cyclically a few minutes after each boot. But I suppose you would have noticed and mentioned that.

#16 Updated by Frédéric Denin 8 months ago

  • File dmesg apres tunables.txt added

Right now I really need the storage space on this server, but in a few weeks I may be able to conduct more tests with an 11.2 upgrade.
For information, I've attached the dmesg output captured after applying the tunables in 11.2.

Best Regards,

#17 Updated by Alexander Motin 8 months ago

Ah, the tunables proposed before turned out not to be enough. IPMI and its watchdog attached differently. You would also need `hint.ipmi.1.disabled=1` to disable it completely.
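
With this addition, three hints have been named in this ticket: `hint.ipmi.0.disabled`, `hint.ipmi.1.disabled`, and `hint.ichwd.0.disabled`. A quick way to confirm after a reboot that all three reached the kernel is to query the kernel environment with `kenv`. This is a sketch under the assumption that loader tunables show up there under the same names; the function name `check_watchdog_hints` is hypothetical:

```shell
#!/bin/sh
# Report, for each of the three hints from this ticket, whether the
# kernel environment holds the value 1. Returns 0 only if all are set.
check_watchdog_hints() {
    missing=0
    for hint in hint.ipmi.0.disabled hint.ipmi.1.disabled hint.ichwd.0.disabled; do
        # kenv -q suppresses the warning printed for an unset variable
        value=$(kenv -q "$hint" 2>/dev/null) || value=""
        if [ "$value" = "1" ]; then
            echo "$hint=1"
        else
            echo "$hint is NOT set"
            missing=1
        fi
    done
    return "$missing"
}

# Example use on the server (not executed here):
#   check_watchdog_hints
```

This only shows that the hints were passed to the kernel; whether the drivers actually stayed detached still has to be confirmed in dmesg.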

#18 Updated by Frédéric Denin 8 months ago

Thanks for the advice. I will do so in, I hope, 2-3 weeks.

#19 Updated by Alexander Motin 7 months ago

  • Status changed from Blocked to Closed
  • Reason for Closing set to Cannot Reproduce
  • Reason for Blocked deleted (Need additional information from Author)

Closing this for lack of activity. We are about to release FreeNAS 11.2-U3, which fixes some interactivity issues; I would recommend that you try it.

#20 Updated by Dru Lavigne 7 months ago

  • File deleted (debug-PRD-FNAS-01-20190103110638.txz)

#21 Updated by Dru Lavigne 7 months ago

  • File deleted (dmesg apres tunables.txt)

#22 Updated by Dru Lavigne 7 months ago

  • Target version changed from Backlog to N/A
  • Private changed from Yes to No
