Turn disk.sync_all into a job so it won't time out while booting
During boot the following messages are displayed on the console:
/etc/rc: WARNING: failed precmd routine for vmware_guestd
Call timeout
null
After displaying null, there is a long pause before the next message, which is smartd starting. The system under test has ~144 drives, in case that has any bearing.
Do not issue unnecessary updates: they are slow on HA systems and cause severe boot delays when many drives are present.
#1 Updated by Dru Lavigne about 1 year ago
- Project changed from TrueNAS to FreeNAS
- Assignee set to William Grzybowski
- Target version changed from N/A to 11.1-U5
- Private changed from No to Yes
- Migration Needed deleted (
- Hide from ChangeLog deleted (
- Support Department Priority deleted (
Passing to William first to ensure this is not TN-only and to assess whether the target and assignee are suitable.
#5 Updated by Nick Principe about 1 year ago
root@tn02-a:~ # time /etc/ix.rc.d/ix-smartd start
6.416u 93.796s 1:36.63 103.6% 785+252k 0+0io 0pf+0w
root@tn02-a:~ # time midclt call disk.sync_all
Call timeout
0.185u 0.034s 1:00.75 0.3% 5+645k 0+0io 0pf+0w
root@tn02-a:~ # time midclt call disk.multipath_sync
null
0.206u 0.077s 0:27.31 0.9% 7+646k 0+0io 0pf+0w
#6 Updated by Vladimir Vinogradenko about 1 year ago
- Status changed from Blocked to In Progress
- Reason for Blocked deleted (Waiting for feedback)
The slow ix-smartd start happens because it polls each disk for SMART capabilities. This can be done in parallel, but parallelism would be highly inconvenient to implement in bash. I think this is the chance to begin rewriting our ix-* init scripts in Python.
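As a rough illustration of the parallel polling idea, here is a minimal Python sketch. It is not the actual ix-smartd logic: the function names, the worker count, and the check against the `smartctl -i` output text are assumptions made for illustration only.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor


def smartctl_probe(disk):
    """Return True if `smartctl -i` appears to report SMART support.

    Assumption: the output contains a line such as
    "SMART support is: Available"; the exact wording may differ per device.
    """
    try:
        proc = subprocess.run(
            ["smartctl", "-i", f"/dev/{disk}"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return re.search(r"SMART support is:\s+Available", proc.stdout) is not None


def probe_all(disks, probe=smartctl_probe, max_workers=16):
    """Probe every disk concurrently and return {disk: supported}."""
    disks = list(disks)
    # The probes are I/O-bound (waiting on the drive), so threads are enough.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(disks, pool.map(probe, disks)))
```

Calling `probe_all(["da0", "da1"])` would run the probes at the same time instead of one after another. Whether the controller(s) tolerate many simultaneous SMART requests would still need testing on real hardware.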
This should also happen in parallel, as already stated in the code:
# TODO: hack so every disk is not synced independently during boot
# This is a performance issue
However, I don't see an easy, straightforward way to do it, as the loop iterations depend on each other. There are many CPU-bound operations involved (e.g. parsing the same XML multiple times) as well as SQLite operations (which can be combined).
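To make the two optimizations concrete, the sketch below parses an XML document once into a lookup table and writes all rows in a single SQLite transaction. The XML layout, the `disks` table, and the function names are hypothetical; they do not reflect the real middleware schema or the format of the XML that sync_all parses.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML; the real document is far larger.
SAMPLE_XML = """
<disks>
  <disk><name>da0</name><ident>SERIAL0</ident></disk>
  <disk><name>da1</name><ident>SERIAL1</ident></disk>
</disks>
"""


def index_disks(xml_text):
    """Parse the XML once and return {disk name: ident}."""
    root = ET.fromstring(xml_text.strip())
    return {d.findtext("name"): d.findtext("ident") for d in root.iter("disk")}


def sync_disks(conn, idents):
    """Write all disks in one transaction instead of one commit per disk."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO disks (name, ident) VALUES (?, ?)",
            idents.items(),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE disks (name TEXT PRIMARY KEY, ident TEXT)")
sync_disks(conn, index_disks(SAMPLE_XML))
```

The per-disk loop then only does dictionary lookups, and the database sees one batched write rather than one update per drive.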
sync_all needs a careful rewrite, profiling, and testing on a huge system like the one described here, with multipaths and so on.
This requires a lot of work; however, I don't see anything that would prevent us from reducing boot time on huge systems. There might be some limits in hardware (e.g. will the controller(s) be able to perform SMART requests on all drives simultaneously?).
Right now we should fix the Call timeout on disk.sync_all, because the timeout aborts it and disk.multipath_sync might then not work properly.
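The general idea of turning a long call into a job can be sketched as follows. This is a generic illustration, not the middleware's job API: `JobManager`, its methods, and `slow_sync_all` are hypothetical names. The caller gets a job id back immediately and polls for completion, so no single request has to stay open for the full duration of the sync.

```python
import itertools
import threading
import time


class JobManager:
    """Run long calls in background threads and track their state."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, func, *args):
        """Start func(*args) in the background and return a job id at once."""
        job_id = next(self._ids)
        job = {"state": "RUNNING", "result": None, "error": None}
        with self._lock:
            self._jobs[job_id] = job

        def run():
            try:
                result = func(*args)
                with self._lock:
                    job.update(state="SUCCESS", result=result)
            except Exception as exc:
                with self._lock:
                    job.update(state="FAILED", error=str(exc))

        threading.Thread(target=run, daemon=True).start()
        return job_id

    def status(self, job_id):
        """Return a snapshot of the job's current state."""
        with self._lock:
            return dict(self._jobs[job_id])

    def wait(self, job_id, poll=0.05):
        """Poll until the job leaves the RUNNING state."""
        while True:
            snapshot = self.status(job_id)
            if snapshot["state"] != "RUNNING":
                return snapshot
            time.sleep(poll)


def slow_sync_all(n_disks):
    """Stand-in for a sync that takes longer than a call timeout."""
    time.sleep(0.2)
    return n_disks
```

A caller would run `job_id = jobs.submit(slow_sync_all, 144)` and then `jobs.wait(job_id)`; each individual status check returns quickly even though the sync itself is slow.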
- Subject changed from Unexpected failures, timeout, null during boot to Turn disk.sync_all into a job so it won't time out while booting
- Target version changed from 11.2-RC2 to 11.2-BETA1
- Private changed from Yes to No
- Needs Doc changed from Yes to No
- Needs Merging changed from Yes to No
- Status changed from Ready for Testing to Passed Testing
- Needs QA changed from Yes to No
Testing with FreeNAS Mini updated to INTERNAL12:
Rebooted the system and watched the console for timeout and null messages. Ran time midclt call disk.sync_all and time midclt call disk.multipath_sync to confirm no timeout or null messages.