Bug #33168

ZFS panic on SLOG removal

Added by Keith Nash about 1 year ago. Updated 6 months ago.

Status:
Closed
Priority:
No priority
Assignee:
Sean Fagan
Category:
OS
Target version:
N/A
Seen in:
11.1-U4
Severity:
Low Medium
Reason for Closing:
Cannot Reproduce
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

FreeNAS system 'BANDIT' : FreeNAS-11.1-U4
Virtualized on VMware ESXi v6.0U3 with 4 vCPUs and 48GiB RAM
Supermicro X9DRi-LN4F+, dual Intel Xeon E5-2660 @2.2GHz CPUs, 128GiB ECC RAM
Case: Supermicro CSE-846 4U 24-bay with BPN-SAS-846A backplane and dual 920W 'Super Quiet' PSUs
Intel X520-DA1 10GbE NIC, 4 x Intel GbE NICs
ESXi boot and local datastore for FreeNAS VM:
Dell PERC H200 (flashed to P20 IR mode LSI firmware)
2 x 120GB Intel DC S3500 drives in RAID1 array
3 x LSI 9210-8i HBAs passed through to FreeNAS VM:
Pool 1: RAIDZ2: 6 x 4TB HGST DeskStar NAS
ZIL SLOG device: 100 GB Intel DC S3700
Pool 2: 3 x 2TB mirrored vdevs (6 x 2TB HGST UltraStar 7K4000)
ZIL SLOG device: 100 GB Intel DC S3700
Pool 3: 1 x 6TB WDC Black backup disk

ChangeLog Required:
No

Description

My system rebooted unexpectedly when I used the GUI to remove the SLOG device from one of my pools.
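For reference, the GUI action is roughly equivalent to removing the log vdev at the command line; the pool name and gptid below are placeholders, not the actual configuration:

# Roughly what the SLOG removal amounts to at the zpool level;
# 'tank' and the gptid label are placeholders.
zpool remove tank gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx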

History

#1 Updated by Keith Nash about 1 year ago

  • File debug-bandit-20180506131209.txz added
  • Private changed from No to Yes

#2 Updated by Keith Nash about 1 year ago

  • Seen in changed from Unspecified to 11.1-U4
  • Hardware Configuration updated (diff)

#3 Updated by Dru Lavigne about 1 year ago

  • Assignee changed from Release Council to Alexander Motin

Sasha: please pass to William if it turns out to be a middleware issue.

#4 Updated by Alexander Motin about 1 year ago

  • Category changed from Hardware to OS
  • Status changed from Unscreened to Screened
  • Severity changed from New to Low Medium

It's not a middleware issue; the system panicked on an assertion inside the ZFS kernel code. Though, as I can see, the next removal attempt after the reboot went fine, so my guess is that it could be some rare race condition between the removal and a new metaslab load, since it didn't crash in the removal code itself:

May  6 07:23:45 bandit panic: Solaris(panic): zfs: freeing free segment (offset=0 size=4096)                                            
May  6 07:23:45 bandit cpuid = 1                                                                                                        
May  6 07:23:45 bandit KDB: stack backtrace:                                                                                            
May  6 07:23:45 bandit db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0c57d6c5b0                                   
May  6 07:23:45 bandit vpanic() at vpanic+0x186/frame 0xfffffe0c57d6c630                                                                
May  6 07:23:45 bandit panic() at panic+0x43/frame 0xfffffe0c57d6c690                                                                   
May  6 07:23:45 bandit vcmn_err() at vcmn_err+0xcf/frame 0xfffffe0c57d6c7c0                                                             
May  6 07:23:45 bandit zfs_panic_recover() at zfs_panic_recover+0x5a/frame 0xfffffe0c57d6c820                                           
May  6 07:23:45 bandit range_tree_remove_impl() at range_tree_remove_impl+0x100/frame 0xfffffe0c57d6c8e0                                
May  6 07:23:45 bandit space_map_load() at space_map_load+0x3bf/frame 0xfffffe0c57d6c970                                                
May  6 07:23:45 bandit metaslab_load() at metaslab_load+0x2b/frame 0xfffffe0c57d6c990                                                   
May  6 07:23:45 bandit metaslab_preload() at metaslab_preload+0x89/frame 0xfffffe0c57d6c9c0                                             
May  6 07:23:45 bandit taskq_run() at taskq_run+0x10/frame 0xfffffe0c57d6c9e0                                                           
May  6 07:23:45 bandit taskqueue_run_locked() at taskqueue_run_locked+0x147/frame 0xfffffe0c57d6ca40                                    
May  6 07:23:45 bandit taskqueue_thread_loop() at taskqueue_thread_loop+0xb8/frame 0xfffffe0c57d6ca70                                   
May  6 07:23:45 bandit fork_exit() at fork_exit+0x85/frame 0xfffffe0c57d6cab0                                                           
May  6 07:23:45 bandit fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0c57d6cab0                                                
May  6 07:23:45 bandit --- trap 0, rip = 0, rsp = 0, rbp = 0 ---                                                                        
May  6 07:23:45 bandit KDB: enter: panic                                                                                                
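
For context, the "freeing free segment" message comes from the overlap check in range_tree_remove_impl(); a condensed sketch of that check is below (simplified, not the exact FreeNAS 11.1 sources):

/*
 * Simplified sketch of the check in range_tree.c that fires here;
 * the surrounding code is condensed and may differ from the exact
 * FreeNAS 11.1 sources.
 */
static void
range_tree_remove_impl(range_tree_t *rt, uint64_t start, uint64_t size)
{
        range_seg_t rsearch, *rs;
        avl_index_t where;

        rsearch.rs_start = start;
        rsearch.rs_end = start + size;
        rs = avl_find(&rt->rt_root, &rsearch, &where);

        /*
         * The segment being removed must overlap a segment already in
         * the tree; if the lookup finds nothing, that space is already
         * free and zfs_panic_recover() fires.
         */
        if (rs == NULL) {
                zfs_panic_recover("zfs: freeing free segment "
                    "(offset=%llu size=%llu)",
                    (longlong_t)start, (longlong_t)size);
                return;
        }
        /* ... otherwise trim or split the overlapping segment ... */
}

In this backtrace the call reached range_tree_remove_impl() from space_map_load() during metaslab preload, not from the device-removal path, which is what points at a race rather than a bug in the removal code itself.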

#5 Updated by Alexander Motin about 1 year ago

  • Subject changed from zfs panic and reboot when removing SLOG device from pool to ZFS panic on SLOG removal

#6 Updated by Alexander Motin 10 months ago

  • Status changed from Screened to Unscreened
  • Assignee changed from Alexander Motin to Sean Fagan

Sean, please take a look at it.

#7 Updated by Sean Fagan 10 months ago

The range tree code changed a fair amount, as I recall.

#8 Updated by Alexander Motin 10 months ago

The range tree code did change, but I am not sure that what we see here isn't a consequence of something else, like a race.

#9 Updated by Sean Fagan 10 months ago

Part of what changed, though, was how the locking is done. Not sure if that's related here.

When I ported the sequential scan code to FreeBSD-HEAD, I ended up adding an SSD (to work as both cache and slog); when I rebooted after adding it, it panicked. But I've not been able to reproduce it, and that included removing the cache/slog and re-adding them. Unfortunately I can't remember how similar that panic was. (Also, it was HEAD, not what we use in FreeNAS.)

One question I have which Keith may be able to answer was whether a scrub or resilver was going on at the time.
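
For reference, the add/remove cycle described above corresponds roughly to the following sequence; the pool name and device names are placeholders:

# Add an SSD partition as log (SLOG) and another as cache (L2ARC);
# 'tank', da0p1 and da0p2 are placeholder names.
zpool add tank log da0p1
zpool add tank cache da0p2

# Remove them again (re-adding repeats the two commands above);
# neither step reproduced the panic here.
zpool remove tank da0p1
zpool remove tank da0p2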

#10 Updated by Sean Fagan 6 months ago

  • Status changed from Unscreened to Blocked
  • Reason for Blocked set to Waiting for feedback

#11 Updated by Keith Nash 6 months ago

Sean, I don't recall whether a resilver was in progress at the time; but I'm reasonably certain that one was NOT taking place.

#12 Updated by Sean Fagan 6 months ago

  • Status changed from Blocked to Screened
  • Reason for Blocked deleted (Waiting for feedback)

Ok. I still can't reproduce this, at least not with the FreeNAS Mini, or a VM. Examining the code hasn't resulted in any epiphanies about race conditions. I'm probably going to have to close this as cannot reproduce. Has it reproduced for you? (I realize that one doesn't generally go about removing log or cache devices randomly.)

#13 Updated by Keith Nash 6 months ago

It has not re-occurred. I'd hoped that the debug dump would be helpful, but this seems to be one of those singular events that you run across once in a great while.

I agree that it's probably best just to close this and move on.

#14 Updated by Dru Lavigne 6 months ago

  • File deleted (debug-bandit-20180506131209.txz)

#15 Updated by Dru Lavigne 6 months ago

  • Status changed from Screened to Closed
  • Target version changed from Backlog to N/A
  • Private changed from Yes to No
  • Reason for Closing set to Cannot Reproduce
