Fix system crash/freeze when deleting many files
I have a Plex database backup with 550,000 files and 255,000 directories. If I try to delete the top directory of the database backup, the system freezes and does not become responsive again until a hard reset.
I am able to reproduce the crash on different pools and even on multiple systems (i7-3770 and Xeon 1231v3). FWIW, I was unable to reproduce it on a Ryzen 1700 in an ESXi VM.
Booting up 11.1, I am able to successfully remove the files.
#5 Updated by Alexander Motin about 2 years ago
- Status changed from Unscreened to In Progress
- Target version changed from Backlog to 11.2-RC1
- Severity changed from New to Med High
- Reason for Blocked deleted (Need additional information from Author)
Among the few reports of hangs on file deletion, this is the first one that looks promising, thanks to a kernel dump, triggered by the software watchdog, that I found in the provided debug. As far as I can see, the system may not hang completely, but rather be very busy (inefficiently). At the moment of the panic, 7 of the 8 CPU cores were waiting on a lock:
lock_delay() at lock_delay+0x42/frame 0xfffffe0a91dfe6b0
_sx_xlock_hard() at _sx_xlock_hard+0x178/frame 0xfffffe0a91dfe760
zfs_zget() at zfs_zget+0x1f8/frame 0xfffffe0a91dfe810
zfs_unlinked_drain() at zfs_unlinked_drain+0x99/frame 0xfffffe0a91dfe9e0
taskqueue_run_locked() at taskqueue_run_locked+0x154/frame 0xfffffe0a91dfea40
taskqueue_thread_loop() at taskqueue_thread_loop+0x98/frame 0xfffffe0a91dfea70
fork_exit() at fork_exit+0x83/frame 0xfffffe0a91dfeab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0a91dfeab0
The one remaining core may still have been working, or at least its stack looked different:
arc_buf_access() at arc_buf_access+0xfa/frame 0xfffffe0a91e1c620
dbuf_hold_impl() at dbuf_hold_impl+0x79/frame 0xfffffe0a91e1c680
dbuf_hold() at dbuf_hold+0x25/frame 0xfffffe0a91e1c6b0
dnode_hold_impl() at dnode_hold_impl+0x130/frame 0xfffffe0a91e1c730
dmu_bonus_hold() at dmu_bonus_hold+0x1d/frame 0xfffffe0a91e1c760
zfs_zget() at zfs_zget+0xb4/frame 0xfffffe0a91e1c810
zfs_unlinked_drain() at zfs_unlinked_drain+0x99/frame 0xfffffe0a91e1c9e0
taskqueue_run_locked() at taskqueue_run_locked+0x154/frame 0xfffffe0a91e1ca40
taskqueue_thread_loop() at taskqueue_thread_loop+0x98/frame 0xfffffe0a91e1ca70
fork_exit() at fork_exit+0x83/frame 0xfffffe0a91e1cab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0a91e1cab0
Behind all that "activity" the system had no time to react to anything else.
#10 Updated by Alexander Motin about 2 years ago
- Needs Doc changed from Yes to No
- Needs Merging changed from Yes to No
Test case: in an empty directory, create many files, add some extended attributes to each, and delete all of them:
jot 10000 | xargs -n 1 touch
setextattr system zzz 1 *
setextattr system zzy 1 *
setextattr system zzx 1 *
rm *
A system without the patch may either hang or at least stay very busy for longer than needed. A system with the patch should handle it using only one CPU core and in much less time.
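For repeated runs, the test case could be wrapped in a small script that works in a scratch directory instead of the current one. This is only a sketch: the function name repro_unlink_test, the f<N> file names, and the guard around setextattr are assumptions made for illustration and are not part of the original test. Setting attributes in the system namespace normally requires root.

```shell
#!/bin/sh
# Sketch of the test case as a reusable function (hypothetical helper).
# Usage: repro_unlink_test DIR [COUNT]
repro_unlink_test() {
    dir=$1
    count=${2:-10000}

    mkdir -p "$dir" || return 1

    # Refuse to touch a directory that already has content,
    # since the last step deletes files.
    if [ -n "$(ls -A "$dir")" ]; then
        echo "refusing to run: $dir is not empty" >&2
        return 1
    fi

    (
        cd "$dir" || exit 1

        # Create COUNT empty files named f1 .. fCOUNT.
        i=1
        while [ "$i" -le "$count" ]; do
            : > "f$i"
            i=$((i + 1))
        done

        # Add three extended attributes to every file, as in the
        # original test. setextattr is FreeBSD-specific, so skip
        # this step where the command is unavailable.
        if command -v setextattr >/dev/null 2>&1; then
            for name in zzz zzy zzx; do
                find . -type f -name 'f*' -exec setextattr system "$name" 1 {} +
            done
        else
            echo "setextattr not found; skipping extended attributes" >&2
        fi

        # Delete everything that was created.
        find . -type f -name 'f*' -exec rm -f {} +
    )
}
```

Running it as `time repro_unlink_test /mnt/pool/scratch 10000` (the path is a placeholder) while watching top in another session should make the difference between patched and unpatched systems visible.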