Bug #15714

Replication task with "Recursively replicate child dataset's snapshots" option panics/reboots target system

Added by Matthew Held over 3 years ago. Updated about 3 years ago.

Status:
Closed: Insufficient Info
Priority:
Important
Assignee:
Alexander Motin
Category:
OS
Target version:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

Target: Dell R510 with Intel E5645 32GB RAM, 12x2TB SATA

ChangeLog Required:
No

Description

Both source and target are running FreeNAS 9.10-STABLE-201605240427.
The goal is to have a subset of data on the source replicated to the target.
The source has multiple replication jobs (the subset of data) configured to go to the target, and they run successfully.
If the "Recursively replicate child dataset's snapshots" option is selected on a replication task on the source, the task causes a panic/reboot on the target when it runs.
Disabling the "Recursively replicate child dataset's snapshots" option on the replication task stops the target from rebooting; the job then completes successfully but does not replicate child datasets.
Target debug file uploaded.

This is a major problem, as there is currently no way to replicate the child datasets. If "Recursively replicate child dataset's snapshots" is selected we get instability, and the only volumes that can be added to a replication task are first-level zvols (direct children of the zpool), not children of zvols.
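
For reference, a minimal sketch of what recursive replication amounts to at the CLI level, assuming the dataset names from the screenshots (pool-01/VirtualMachines on the source, studiom-dr on the target) and a made-up snapshot name; the actual FreeNAS replication code manages snapshot naming and incremental sends itself:

# Take a recursive snapshot of the zvol and all of its children,
# then send the whole tree to the target pool.
zfs snapshot -r pool-01/VirtualMachines@manual-test
zfs send -R pool-01/VirtualMachines@manual-test | ssh stm-bar-freenas-dr zfs receive -dF studiom-dr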

Based on the volumes in the screenshot, how would one successfully replicate (using the FreeNAS GUI) the children of pool-01/VirtualMachines?

Screen Shot 2016-05-30 at 01.10.34.png (126 KB) Screen Shot 2016-05-30 at 01.10.34.png screenshot of zvol with child datasets Matthew Held, 05/29/2016 10:10 PM
Screen Shot 2016-05-30 at 01.12.17.png (58.6 KB) Screen Shot 2016-05-30 at 01.12.17.png screenshot of replication task for pool-01/VirtualMachines Matthew Held, 05/29/2016 10:12 PM
Screen Shot 2016-05-30 at 01.15.18.png (66.3 KB) Screen Shot 2016-05-30 at 01.15.18.png screenshot, only top level zvol replication seems possible Matthew Held, 05/29/2016 10:15 PM

History

#1 Updated by Matthew Held over 3 years ago

Added screenshot showing that it is not possible to select children of a zvol for replication.

#2 Updated by Matthew Held over 3 years ago

It appears that unchecking "Delete stale snapshots on remote system" allows child datasets to replicate to the target without causing the target to panic/reboot.

I'm going to let replication complete so that it is 'up to date' and then try re-enabling the "Delete stale snapshots on remote system" option.

It is possible that the "Delete stale snapshots on remote system" option does not handle partially replicated datasets on the target and causes the issue, OR that the "Delete stale snapshots on remote system" functionality is simply broken.
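
For what it's worth, the kind of cleanup that option performs can be approximated by hand from the target's shell; this is only an illustrative sketch (the target-side dataset path and snapshot name are assumptions, and this is not the actual FreeNAS cleanup code):

# List the snapshots present on the target for the replicated dataset,
# then destroy one that no longer exists on the source.
zfs list -H -t snapshot -o name -r studiom-dr/VirtualMachines
zfs destroy studiom-dr/VirtualMachines@auto-20160529.0000-2w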

#3 Updated by Jordan Hubbard over 3 years ago

What would probably be most useful here is a crash dump from /data/crash on the remote system - obviously a panic / reboot shouldn't happen under any circumstances, and if we can get the crash logs attached to this ticket, we can start working on that independently of debugging the replication code path. Thanks!
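
As a hedged example of how that data could be collected on the target (file names under /data/crash depend on how many panics savecore(8) has recorded):

# On the target system, after a panic:
ls -l /data/crash                       # see which dump/info files were written
tar -czf /tmp/crash.tgz -C /data crash  # bundle them up for attaching to the ticket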

#4 Updated by Wojciech Kloska over 3 years ago

  • Status changed from Unscreened to Screened

#5 Updated by Matthew Held over 3 years ago

  • File crash.zip added

Uploaded contents of /data/crash from target system.

#6 Updated by Wojciech Kloska over 3 years ago

I've tried to reproduce this case in various ways on 9.10-STABLE-201605240427, but without any success.

Are you sure that your disks and target pool are in good shape? Have you tried replicating to another machine, checking the health of the disks, or recreating the target pool?
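
For the disk-health part of that check, a minimal sketch using standard tools (the device name below is only an example and will differ on the actual system):

# SMART health summary for one pool member; repeat for each disk in the pool.
smartctl -a /dev/da1
# Pool-level view of any read/write/checksum errors ZFS has already recorded.
zpool status -v studiom-dr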

#7 Updated by Josh Paetzel over 3 years ago

The kernel panic on the target machine is a null pointer dereference; whether it's a bug being tripped by some sort of "event" (like a corrupt or unhealthy pool) or just a plain old software bug isn't clear. The output of zpool status -v from the target machine would be helpful.

#8 Updated by Matthew Held over 3 years ago

[root@stm-bar-freenas-dr] ~# zpool status -v
  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        freenas-boot    ONLINE       0     0     0
          da0p2         ONLINE       0     0     0

errors: No known data errors

  pool: studiom-dr
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        studiom-dr                                      ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/432b5dba-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/43e2b85e-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/449eb955-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/4549ca68-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/45faa2e5-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/46a853f8-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/47785df9-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/483c7081-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/48e9d2c6-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/499e88d5-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/4a4aacfe-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/4af99f5e-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0

errors: No known data errors
[root@stm-bar-freenas-dr] ~#

#9 Updated by Josh Paetzel over 3 years ago

I'd run a scrub of studiom-dr
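
For completeness, the corresponding commands on the target would be along these lines:

# Start a scrub of the pool, then check its progress and any errors it finds.
zpool scrub studiom-dr
zpool status studiom-dr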

#10 Updated by Matthew Held over 3 years ago

Thanks Josh, I am now running a scrub and will test the behaviour below once the scrub is complete. The target pool was re-created upon installation of the latest build. New replication tasks were created, and the panic happened upon the first synchronization of the VirtualMachines zvol with its child datasets.

An update on behaviour:

I was able to successfully replicate the child datasets of the VirtualMachines zvol with the "Recursively replicate child dataset's snapshots" option enabled and the "Delete stale snapshots on remote system" option disabled.

When I re-enable the "Delete stale snapshots on remote system" option on the replication task, the target system panics as soon as the replication job starts.

#11 Updated by Matthew Held over 3 years ago

Results of zpool status

[root@stm-bar-freenas-dr] ~# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        freenas-boot    ONLINE       0     0     0
          da0p2         ONLINE       0     0     0

errors: No known data errors

  pool: studiom-dr
 state: ONLINE
  scan: scrub repaired 0 in 18h35m with 0 errors on Wed Jun 1 06:54:22 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        studiom-dr                                      ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/432b5dba-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/43e2b85e-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/449eb955-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/4549ca68-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/45faa2e5-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/46a853f8-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/47785df9-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/483c7081-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/48e9d2c6-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/499e88d5-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/4a4aacfe-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0
            gptid/4af99f5e-1ead-11e6-8b42-d4ae526f2dd0  ONLINE       0     0     0

errors: No known data errors

#12 Updated by Wojciech Kloska about 3 years ago

  • Assignee changed from Wojciech Kloska to Kris Moore

Off to Kris to redistribute within the 9.x team.

#13 Updated by Kris Moore about 3 years ago

  • Assignee changed from Kris Moore to Josh Paetzel
  • Target version set to 9.10.1-U2

#14 Updated by Josh Paetzel about 3 years ago

  • Priority changed from No priority to Important

This is a clear software bug in ZFS. I'll look at it.

#15 Updated by Kris Moore about 3 years ago

  • Target version changed from 9.10.1-U2 to 9.10.1-U3

#16 Updated by Josh Paetzel about 3 years ago

  • Status changed from Screened to Unscreened
  • Assignee changed from Josh Paetzel to Alexander Motin

Alexander,

Another ZFS bug for you.

#17 Updated by Alexander Motin about 3 years ago

  • Category changed from 59 to 200
  • Status changed from Unscreened to 15
  • Target version changed from 9.10.1-U3 to 9.10.2

I've reviewed the provided kernel dumps and unfortunately cannot see how this could happen. The kernel panicked on an attempt to remove a "duplicate" item from an nvlist that simply could not be there, since the nvlist was allocated shortly before that and should contain only a few elements. My best (and only) guess is that this is the result of some modify-after-free kind of error, where some other thread modifies our memory. Unfortunately I cannot diagnose that kind of problem with the available information, at least without obtaining a full crash dump.

I would recommend, if this problem still persists, updating FreeNAS to the latest version and then, if the problem persists after that, setting up and obtaining a full kernel dump for a deeper autopsy. Matthew, please tell us whether this is doable; otherwise I tend to close this with "Insufficient information" status.
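
For reference, a rough sketch of how a full (non-mini) kernel dump can be requested on a FreeBSD-based system; on FreeNAS these settings would normally go through the System -> Tunables page rather than the shell, and the dump device below is only a placeholder:

# Request a full memory dump instead of the default minidump.
sysctl debug.minidump=0
# Point the kernel at a swap/dump device large enough to hold the RAM contents.
dumpon /dev/<swap-device>
# After the next panic, savecore(8) saves the vmcore into the crash directory during boot.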

#18 Updated by Alexander Motin about 3 years ago

  • Status changed from 15 to Closed: Insufficient Info

#19 Updated by Dru Lavigne about 2 years ago

  • File deleted (debug-stm-bar-freenas-dr-20160530005224.tgz)

#20 Updated by Dru Lavigne about 2 years ago

  • File deleted (crash.zip)
