[M50] It takes 9-11 minutes from reboot to HA being enabled
This is on the M50 with TrueNAS-11.1-U6.2-INTERNAL2 installed
I did three tests
It was roughly 17, 12, and 17 seconds to electing
42, 38, and 38 seconds to import
(on the last test I timed from reboot to login and that took 50 seconds)
In my three tests it took 11 minutes, 9 min 3 sec, and 10 min 1 sec from reboot to HA Enabled
As I was watching the console over IPMI I noticed that the passive node seemed to stall at the "System initializing" step. On my third test I timed it: it started initializing at 3 minutes 30 seconds and finally continued at 7 minutes 9 seconds.
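For reference, the length of that stall follows directly from the two timestamps above. A minimal Python sketch of the arithmetic; the helper name `to_seconds` is only illustrative and not part of any tool used in these tests:

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds stopwatch reading to total seconds."""
    return minutes * 60 + seconds

# Readings from the third test, measured from the moment of reboot.
stall_start = to_seconds(3, 30)  # initializing step begins
stall_end = to_seconds(7, 9)     # boot continues

stall = stall_end - stall_start
stall_min, stall_sec = divmod(stall, 60)
print(f"stall: {stall} s ({stall_min} min {stall_sec} s)")
```

So the passive node sat at that step for a bit over three and a half minutes in that run.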
As a comparison, an X10 board takes about 200 seconds (3.3 minutes) from reboot to first ping.
During a reboot of an M-Series, the NVDIMM first needs to dump to NAND and restore, and to make sure that it is "re-armed" and that the supercapacitor is fully recharged, before it will allow the system to continue to boot.
Mav still wants to investigate but wanted to braindump for anyone reading this ticket.
- Status changed from Unscreened to Blocked
- Reason for Blocked set to Need additional information from Author
A 3-4 minute wait in that "System initializing ..." state is unfortunately normal, since the NVDIMM needs up to 2 minutes to back up and 1 minute to restore. Maybe the remaining 5 minutes of reboot are OK, but 8 sounds somewhat less so to me. Could somebody time what else the system is doing that takes the rest of the time? Unfortunately I don't have a full M50 nearby to watch it myself.
I've tried to time it on the passive node of the tn11 M50 from the performance team. This system also has the serial console enabled, so its boot is predictably slightly slower than normal due to the ~11KBps console speed. I got several reboot times:
6:30 -- first reboot. For some reason it spent 3 minutes between turning off ELI swap, which was the last message, and the actual reboot, while for most of that time the system could still be pinged. I have no idea what it was, but I'm very curious. I've also found that for some reason this system had "Reset triggers ADR" disabled, so the system spent no time on NVDIMM save/restore.
4:00 -- 2nd, 3rd, 4th reboots. This time only ~30 seconds passed between swap off and reboot. Not sure what has changed from the first try; maybe uptime, or something else related to the fact that the system had just booted?
6:10 -- 6th, 7th boot. This time I enabled "Reset triggers ADR", as it should be for data safety, and it added 2 more minutes for the NVDIMMs installed there.
One more unpleasant thing I've noticed is that probing the 3 Samsung PM1725a NVMe drives takes ~20 seconds total, which is not good. But it seems to be intentional, due to a bug in the SSD controller that requires a 2.3-second delay to reset reliably, and boot at that stage is sequential.
If I add the 3 minutes spent before reboot in the first case to 6:10, the resulting ~9 minutes is about the lower end of Bonnie's tests. Also, to focus on the OS side, I intentionally rebooted the passive node, not the active one, to avoid stopping active services, which I think may have other issues.
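The comparison can be spelled out as a quick calculation. This is only a sketch of the arithmetic, assuming that "Bonnie's tests" are the three reboot-to-HA-Enabled times from the ticket description (11 minutes, 9 min 3 sec, 10 min 1 sec); the helper name `mmss` is made up for illustration:

```python
def mmss(text):
    """Parse an 'M:SS' reading into total seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)

# Passive-node reboot with "Reset triggers ADR" enabled.
adr_reboot = mmss("6:10")
# Extra delay between swap off and the actual reboot, seen on the first reboot.
pre_reboot_delay = mmss("3:00")
estimate = adr_reboot + pre_reboot_delay

# Assumption: these are the times from the ticket description.
reported = [mmss("11:00"), mmss("9:03"), mmss("10:01")]
fastest = min(reported)

print(f"estimate: {estimate} s, fastest reported: {fastest} s")
print(f"difference: {estimate - fastest} s")
```

Under that assumption the estimate lands within a few seconds of the fastest reported run, while the slower runs remain unexplained.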