Bug #62541

[M50] It takes 9-11 minutes from reboot to HA being enabled

Added by Bonnie Follweiler 2 months ago. Updated 7 days ago.

Status: Blocked
Priority: No priority
Assignee: Alexander Motin
Category: OS
Target version:
Severity: New
Reason for Closing:
Reason for Blocked: Need additional information from Author
Needs QA: Yes
Needs Doc: Yes
Needs Merging: Yes
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

This is on the M50 with TrueNAS-11.1-U6.2-INTERNAL2 installed.
I ran three tests.
It took roughly 17, 12, and 17 seconds to electing,
and 42, 38, and 38 seconds to import
(on the last test I also timed from reboot to login, and that took 50 seconds).
In my three tests it took 11 minutes, 9 min 3 sec, and 10 min 1 sec from reboot to HA Enabled.

While watching the console in the IPMI, I noticed that the passive node seemed to stall at the system initializing step. On my third test I timed it: it started initializing at 3 minutes 30 seconds and finally continued at 7 minutes 9 seconds.
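As a rough cross-check of the figures above, here is a minimal Python sketch that converts the reported times to seconds and derives the length of the "system initializing" stall in the third test. The helper names `to_seconds` and `fmt` are hypothetical and exist only for this illustration; the numbers are the ones quoted in the description.

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


def fmt(total):
    """Format a whole number of seconds as M:SS."""
    m, s = divmod(int(total), 60)
    return f"{m}:{s:02d}"


# Reboot to "HA Enabled" for the three tests, as reported above.
reboot_to_ha = [to_seconds(11), to_seconds(9, 3), to_seconds(10, 1)]

# Third test: initializing started at 3:30 and continued at 7:09.
stall = to_seconds(7, 9) - to_seconds(3, 30)

mean_total = round(sum(reboot_to_ha) / len(reboot_to_ha))

print("stall during initializing:", fmt(stall))
print("mean reboot-to-HA time:   ", fmt(mean_total))
```

By this arithmetic the stall in the third test accounts for a bit over three and a half minutes of the roughly ten-minute total.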

Screen Shot 2018-12-07 at 9.51.32 AM.png (71.9 KB), Bonnie Follweiler, 12/07/2018 07:13 AM
debug-20181207071410.tar (880 KB), Bonnie Follweiler, 12/07/2018 07:17 AM

History

#1 Updated by Dru Lavigne 2 months ago

  • Assignee changed from Release Council to Alexander Motin

#2 Updated by Nick Wolff 2 months ago

As a comparison, an x10 board takes about 200 seconds (3.3 minutes) from reboot to first ping.

During a reboot of an M-series, the NVDIMM first needs to dump to NAND and restore, and make sure that it is "re-armed" and that the supercapacitor is fully recharged, before it will allow the system to continue to boot.

Mav still wants to investigate but wanted to braindump for anyone reading this ticket.

#3 Updated by Alexander Motin 2 months ago

  • Status changed from Unscreened to Blocked
  • Reason for Blocked set to Need additional information from Author

3-4 minutes of waiting in that "System initializing ..." state is normal, unfortunately, since the NVDIMM needs up to 2 minutes to back up and 1 minute to restore. Maybe the remaining 5 minutes of reboot is OK, but 8 sounds somewhat less so to me. Could somebody time what other things the system is doing that take the rest of the time? Unfortunately I don't have a full M50 nearby to watch it myself.
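One possible way to collect the timing requested here, assuming the console output can be captured as a text stream (for example from a serial or IPMI SOL session piped into a script), is to timestamp each line as it arrives and then list the longest silent gaps. This is only a sketch: the function names `stamp_lines` and `slowest_gaps` are hypothetical, and how the console stream is obtained is left as an assumption.

```python
import time


def stamp_lines(lines, clock=time.monotonic):
    """Yield (elapsed_seconds, text) for each line as it is read."""
    start = clock()
    for line in lines:
        yield clock() - start, line.rstrip("\n")


def slowest_gaps(stamped, top=5):
    """Return the largest waits, each paired with the line that preceded it."""
    stamped = list(stamped)
    gaps = [(later[0] - earlier[0], earlier[1])
            for earlier, later in zip(stamped, stamped[1:])]
    return sorted(gaps, reverse=True)[:top]


# Hypothetical usage, reading console output from standard input:
#   import sys
#   for wait, text in slowest_gaps(stamp_lines(sys.stdin)):
#       print(f"{wait:7.1f}s after: {text}")
```

Each reported gap is attributed to the last message printed before the wait, which should point at the step that was running while the console was silent.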

#4 Updated by Alexander Motin 7 days ago

I've tried to time it on the passive node of the tn11 M50 from the performance team. This system also has the serial console enabled, so its boot time is predictably slightly slower than normal due to the ~11KBps console speed. I got several reboot times:

6:30 -- first reboot. For some reason it spent 3 minutes between turning off ELI swap, which was the last message, and the actual reboot, while for most of that time the system could still be pinged. I have no idea what it was, but I'm very curious. I've also found that for some reason this system had "Reset triggers ADR" disabled, so the system spent no time on NVDIMM save/restore.

4:00 -- 2nd, 3rd, 4th reboots. This time only ~30 seconds passed between swap off and reboot. I'm not sure what changed from the first try; maybe uptime somehow, or something else related to the fact that the system had just booted?

6:10 -- 6th, 7th boot. This time I enabled "Reset triggers ADR", as it should be for data safety, and it added 2 more minutes for the NVDIMMs installed there.

One more not very nice thing I've noticed is that the probe of the 3 Samsung PM1725a NVMe's takes ~20 seconds total, which is not good. But it seems to be intentional, due to a bug in the SSD controller, which requires a 2.3 second delay to reset reliably, and boot at that stage is sequential.

If I add the 3 minutes spent before reboot in the first case to 6:10, the resulting ~9 minutes is about the lowest time in Bonnie's tests. Plus, to focus on the OS side, I intentionally rebooted the passive node, not the active one, to avoid the active services stop, which I think may have other issues.
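The comparison in the previous paragraph can be written out as a short Python sketch. The helper names `to_seconds` and `fmt` are hypothetical; the inputs are the 6:10 reboot time with "Reset triggers ADR" enabled, the 3-minute pre-reboot delay from the first reboot, and the three reboot-to-HA times from the description.

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


def fmt(total):
    """Format a whole number of seconds as M:SS."""
    m, s = divmod(int(total), 60)
    return f"{m}:{s:02d}"


adr_enabled_reboot = to_seconds(6, 10)  # 6th/7th boot, ADR enabled
pre_reboot_delay = to_seconds(3)        # sporadic wait seen on the first reboot

estimate = adr_enabled_reboot + pre_reboot_delay

# Reboot to "HA Enabled" times from the description.
reported = [to_seconds(11), to_seconds(9, 3), to_seconds(10, 1)]

print("estimated worst case:", fmt(estimate))
print("lowest reported time:", fmt(min(reported)))
print("difference (seconds):", estimate - min(reported))
```

Under these assumptions the estimate lands within a few seconds of the fastest reported test, though the two measurements are not strictly comparable: the reported times run until HA is enabled on a rebooted system, while the estimate covers only a passive-node reboot.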

#5 Updated by Alexander Motin 7 days ago

So aside from the sporadic delay before reboot that happened once, I don't see anything else too wrong. We should still test a reboot of the active controller too.
