Bug #30846

Turn disk.sync_all into a job so it won't time out while booting

Added by Nick Principe about 1 year ago. Updated 12 months ago.

Status:
Done
Priority:
Important
Assignee:
Vladimir Vinogradenko
Category:
Middleware
Target version:
Seen in:
TrueNAS - TrueNAS 11.1-U4
Severity:
Medium
Reason for Closing:
Reason for Blocked:
Needs QA:
No
Needs Doc:
No
Needs Merging:
No
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

During boot the following messages are displayed on the console:

/etc/rc: WARNING: failed precmd routine for vmware_guestd
Call timeout
null

After displaying null, there is a long pause before the next message, which is smartd starting. The system under test has ~144 drives, in case that has any bearing.

iKVM_capture.jpg (107 KB) Nick Principe, 03/27/2018 04:02 PM

Associated revisions

Revision 196c2336 (diff)
Added by Vladimir Vinogradenko about 1 year ago

Fix null responses from midclt call not being silenced

Ticket: #30846

Revision 6a2f95fb (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(rc): Prevent /etc/rc: WARNING: failed precmd routine for vmware_guestd

Ticket: #30846

Revision 4d5dcd3f (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(boot): Turn disk.sync_all into a job so it won't time out while booting

Ticket: #30846

Revision 785f1116 (diff)
Added by Vladimir Vinogradenko about 1 year ago

Fix null responses from midclt call not being silenced

Ticket: #30846

Revision 7c0c99d4 (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(rc): Prevent /etc/rc: WARNING: failed precmd routine for vmware_guestd

Ticket: #30846

Revision 086a85fe (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(boot): Turn disk.sync_all into a job so it won't time out while booting

Ticket: #30846

Revision e330f8c1 (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(smart): Rewrite ix-smartd to python to poll devices in parallel

Ticket: #30846

Revision 738a7d02 (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(smart): Rewrite ix-smartd to python to poll devices in parallel

Ticket: #30846

Revision 0f892935 (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(smart): Rewrite ix-smartd to python to poll devices in parallel

Ticket: #30846

Revision 845b6869 (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(smart): Rewrite ix-smartd to python to poll devices in parallel

Ticket: #30846

Revision 923cd396 (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(smart): Rewrite ix-smartd to python to poll devices in parallel

Ticket: #30846

Revision aad8b1bc (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(smart): Rewrite ix-smartd to python to poll devices in parallel

Ticket: #30846

Revision 78fcbdde (diff)
Added by Vladimir Vinogradenko about 1 year ago

Fix null responses from midclt call not being silenced

Ticket: #30846

Revision 664fcc4e (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(rc): Prevent /etc/rc: WARNING: failed precmd routine for vmware_guestd

Ticket: #30846

Revision b7ef51a3 (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(disk.sync_all):

Do not issue unnecessary updates, they are slow on HA systems and cause severe boot delays
when lots of drives are present

Ticket: #30846

Revision 934847cf (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(boot): Turn disk.sync_all into a job so it won't time out while booting

Ticket: #30846

Revision 67a45184 (diff)
Added by Vladimir Vinogradenko about 1 year ago

fix(boot): Turn disk.sync_all into a job so it won't time out while booting

Ticket: #30846

History

#1 Updated by Dru Lavigne about 1 year ago

  • Project changed from TrueNAS to FreeNAS
  • Assignee set to William Grzybowski
  • Target version changed from N/A to 11.1-U5
  • Private changed from No to Yes
  • Migration Needed deleted (No)
  • Hide from ChangeLog deleted (No)
  • Support Department Priority deleted (0)

Passing to William first to ensure this is not TN-only and to assess whether the target and assignee are suitable.

#2 Updated by William Grzybowski about 1 year ago

  • Assignee changed from William Grzybowski to Vladimir Vinogradenko

Vladimir, could you please work with Nick to get this sorted? Thanks

#3 Updated by Vladimir Vinogradenko about 1 year ago

Nick, please post output of:

  • time /etc/ix.rc.d/ix-smartd start
  • time midclt call disk.sync_all
  • time midclt call disk.multipath_sync

#4 Updated by Vladimir Vinogradenko about 1 year ago

  • Status changed from Not Started to Blocked
  • Reason for Blocked set to Waiting for feedback

#5 Updated by Nick Principe about 1 year ago

root@tn02-a:~ # time /etc/ix.rc.d/ix-smartd start
6.416u 93.796s 1:36.63 103.6%   785+252k 0+0io 0pf+0w
root@tn02-a:~ # time midclt call disk.sync_all
Call timeout
0.185u 0.034s 1:00.75 0.3%      5+645k 0+0io 0pf+0w
root@tn02-a:~ # time midclt call disk.multipath_sync
null
0.206u 0.077s 0:27.31 0.9%      7+646k 0+0io 0pf+0w

#6 Updated by Vladimir Vinogradenko about 1 year ago

  • Status changed from Blocked to In Progress
  • Reason for Blocked deleted (Waiting for feedback)

ix-smartd start
93.796s

This happens because the script polls each disk for SMART capabilities one at a time. The polling can be done in parallel, but parallelism would be highly inconvenient to implement in bash. I think this is a good chance to begin rewriting our ix-* init scripts in Python.
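As a rough illustration of the parallel polling idea, the sketch below runs a per-disk probe on a thread pool. It is not the actual ix-smartd rewrite: `poll_all`, `fake_probe`, the worker count and the disk names are all hypothetical, and the probe is a stub that only sleeps. A real probe would query the drive for SMART support instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def poll_all(disks, probe, max_workers=16):
    """Run probe(disk) for every disk concurrently and return {disk: result}."""
    disks = list(disks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input disks.
        return dict(zip(disks, pool.map(probe, disks)))


def fake_probe(disk):
    """Hypothetical stand-in for a slow per-disk SMART capability query."""
    time.sleep(0.05)  # simulate waiting on the drive
    return True


disks = [f"da{i}" for i in range(32)]  # made-up device names

started = time.monotonic()
results = poll_all(disks, fake_probe)
elapsed = time.monotonic() - started

print(f"polled {len(results)} disks in {elapsed:.2f}s")
```

Threads are enough here because each probe spends its time waiting on the device rather than on the CPU. Whether the controllers tolerate that many simultaneous requests is a separate question, so `max_workers` would need tuning on real hardware.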

disk.sync_all
Call timeout
1:00.75

This should also happen in parallel as already stated in disk.py:

# TODO: hack so every disk is not synced independently during boot
# This is a performance issue

However, I don't see an easy, straightforward way to do it, as loop iterations depend on each other. There are a lot of CPU-bound operations involved (e.g. parsing the same XML multiple times), as well as SQLite operations (which can be combined). sync_all needs a careful rewrite, profiling and testing on a huge system like the one in question, with multipaths and so on.
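To illustrate the "parsing the same XML multiple times" cost, here is a minimal sketch that parses a document once and builds a lookup table, so the per-disk loop does dictionary lookups instead of re-parsing. The XML layout, `SAMPLE_XML` and `build_index` are invented for the example and do not reflect what disk.py actually reads.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; the real document parsed by disk.py is not shown in this ticket.
SAMPLE_XML = """
<disks>
  <disk><name>da0</name><ident>SERIAL0</ident></disk>
  <disk><name>da1</name><ident>SERIAL1</ident></disk>
</disks>
""".strip()


def build_index(xml_text):
    """Parse the XML once and map each disk name to its identifier."""
    root = ET.fromstring(xml_text)
    return {d.findtext("name"): d.findtext("ident") for d in root.iter("disk")}


index = build_index(SAMPLE_XML)  # one parse, done before the loop

for name in ("da0", "da1"):
    # Each iteration is now a dict lookup rather than a fresh parse.
    print(name, index[name])
```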

This requires a lot of work; however, I don't see anything that would prevent us from reducing boot time on huge systems. There might be some obstacles in hardware (e.g. will the controller(s) be able to perform SMART requests on all drives simultaneously?).

Right now we should fix the Call timeout on disk.sync_all, because the timeout aborts the sync and disk.multipath_sync might then not work properly.
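The approach named in this ticket's subject is to run the sync as a job, so the caller gets a job ID back immediately instead of blocking until a call timeout expires. The sketch below shows that general pattern in plain Python. It is not the middleware's job implementation: `JobRunner`, its methods, the state names and `slow_sync` are all hypothetical.

```python
import itertools
import threading
import time


class JobRunner:
    """Toy job runner: submit() returns at once, wait() blocks until the job ends."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, func, *args):
        job_id = next(self._ids)
        job = {"state": "RUNNING", "result": None, "error": None,
               "done": threading.Event()}
        with self._lock:
            self._jobs[job_id] = job

        def run():
            try:
                job["result"] = func(*args)
                job["state"] = "SUCCESS"
            except Exception as exc:
                job["error"] = repr(exc)
                job["state"] = "FAILED"
            finally:
                job["done"].set()  # state is final before waiters wake up

        threading.Thread(target=run, daemon=True).start()
        return job_id

    def state(self, job_id):
        return self._jobs[job_id]["state"]

    def wait(self, job_id, timeout=None):
        job = self._jobs[job_id]
        if not job["done"].wait(timeout):
            raise TimeoutError(f"job {job_id} is still running")
        if job["state"] == "FAILED":
            raise RuntimeError(job["error"])
        return job["result"]


def slow_sync(disk_count):
    """Hypothetical stand-in for a long-running disk sync."""
    time.sleep(0.2)
    return disk_count


runner = JobRunner()
job_id = runner.submit(slow_sync, 144)  # returns immediately with an ID
result = runner.wait(job_id)            # caller decides how long to wait
final_state = runner.state(job_id)
print(job_id, final_state, result)
```

The point of the pattern is that the length of the work no longer has to fit inside a fixed call timeout: a boot script can wait for the job without any deadline, or poll its state.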

#7 Updated by Vladimir Vinogradenko about 1 year ago

  • Target version changed from 11.1-U5 to 11.2-RC2

Timeouts fixed in stable. Rescheduling performance fixes to 11.2 as they require a major rewrite.

#8 Updated by Vladimir Vinogradenko about 1 year ago

  • Severity set to Medium

#9 Updated by Vladimir Vinogradenko about 1 year ago

  • Category set to Middleware
  • Status changed from In Progress to Ready for Testing

#10 Updated by Dru Lavigne about 1 year ago

  • File deleted (debug-20180327160941.tar)

#11 Updated by Dru Lavigne about 1 year ago

  • Subject changed from Unexpected failures, timeout, null during boot to Turn disk.sync_all into a job so it won't time out while booting
  • Target version changed from 11.2-RC2 to 11.2-BETA1
  • Private changed from Yes to No
  • Needs Doc changed from Yes to No
  • Needs Merging changed from Yes to No

Master PR:

#12 Updated by Timothy Moore II 12 months ago

  • Status changed from Ready for Testing to Passed Testing
  • Needs QA changed from Yes to No

Testing with FreeNAS Mini updated to INTERNAL12:

Rebooted system and watched for timeout and null messages in the console. Ran time midclt call disk.sync_all and time midclt call disk.multipath_sync to confirm no timeout or null messages.

#13 Updated by Dru Lavigne 12 months ago

  • Status changed from Passed Testing to Done
