Bug #16901

Umbrella #27076: Replication and automatic snapshotting features rewrite to the new middleware

Periodic snapshots - autosnap - generate large demand on ARC metadata.

Added by Wojciech Kruzel about 4 years ago. Updated over 1 year ago.

Status: Closed
Priority: Nice to have
Assignee: Vladimir Vinogradenko
Category: Middleware
Target version:
Severity: Low
Reason for Closing: Not Applicable
Reason for Blocked:
Needs QA: Yes
Needs Doc: Yes
Needs Merging: Yes
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

I have configured periodic snapshot tasks to run Mon-Fri, between 9:00 and 18:00, and generate a snapshot once every two hours. Snapshots are deleted after 2 weeks.

We have a LOT of datasets under this rule, and currently have 42900 snapshots.

During the period configured for automatic snapshots, I noticed an ever-increasing number of ARC requests for demand_metadata. This occurs from 9:00 to 18:00 Mon-Fri and at no other time.
The request rate rose after each snapshot was taken, but between snapshots it stayed constant.

After two weeks the increases stopped, presumably because old snapshots were being deleted.

Looking at the graph with the shortest timespan, the number of requests appears to reach about 175k, but the reported average is 261k/s.

I could not figure out why this was happening until I looked at debug.log: the following happens every minute.

Aug 18 16:39:00 hd02mlt autosnap.py: [tools.autosnap:66] Popen()ing: /sbin/zfs list -t snapshot -H -o name
Aug 18 16:39:10 hd02mlt autorepl.py: [tools.autorepl:233] Autosnap replication started
Aug 18 16:39:10 hd02mlt autorepl.py: [tools.autorepl:234] temp log file: /tmp/repl-97191
Aug 18 16:39:10 hd02mlt autorepl.py: [tools.autorepl:617] Autosnap replication finished
Aug 18 16:40:00 hd02mlt autosnap.py: [tools.autosnap:66] Popen()ing: /sbin/zfs list -t snapshot -H -o name
Aug 18 16:40:10 hd02mlt autorepl.py: [tools.autorepl:233] Autosnap replication started
Aug 18 16:40:10 hd02mlt autorepl.py: [tools.autorepl:234] temp log file: /tmp/repl-97390
Aug 18 16:40:10 hd02mlt autorepl.py: [tools.autorepl:617] Autosnap replication finished

This may not be much of a problem with only a few datasets, but with a large number of them it becomes an issue.

arc_request1.png (12.8 KB) Wojciech Kruzel, 08/18/2016 08:44 AM
arc_request2.png (12.3 KB) Wojciech Kruzel, 08/18/2016 08:44 AM
arc_request3.png (12.7 KB) Wojciech Kruzel, 08/19/2016 02:25 AM

Related issues

Blocked by FreeNAS - Bug #16429: Integrate Middlewared (Resolved, 2016-04-25)

Associated revisions

Revision baab2ecc (diff)
Added by William Grzybowski almost 4 years ago

fix(autosnap): does not let it run more than it is supposed to

Running every minute will poke zfs even if we won't snapshot at all. Instead compare the time last time autosnap ran and the shortest interval amongst all snapshot tasks.

Ticket: #16901
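
For readers following along, the check described in that commit message could look roughly like the sketch below. This is not the code from revision baab2ecc: the function name `should_run_autosnap` and its parameters are hypothetical, and task intervals are assumed to be expressed in minutes.

```python
from datetime import datetime, timedelta


def should_run_autosnap(now, last_run, task_intervals_minutes):
    """Decide whether autosnap needs to do any work on this invocation.

    Hypothetical helper: compares the time since the last run with the
    shortest interval among all configured snapshot tasks.
    """
    if not task_intervals_minutes:
        # No snapshot tasks configured, so there is nothing to snapshot.
        return False
    if last_run is None:
        # Autosnap has never run before; let it run once.
        return True
    shortest = timedelta(minutes=min(task_intervals_minutes))
    return now - last_run >= shortest


if __name__ == "__main__":
    now = datetime(2016, 8, 18, 16, 40)
    # One minute after the previous run, with 2-hour and 1-hour tasks: skip.
    print(should_run_autosnap(now, now - timedelta(minutes=1), [120, 60]))
```

Note that this sketch looks only at intervals. It does not take each task's begin/end time window or weekdays into account.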

History

#1 Updated by Wojciech Kruzel about 4 years ago

  • File debug-hd02mlt-20160818164430.txz added

#2 Updated by Wojciech Kruzel about 4 years ago


#3 Updated by Wojciech Kruzel about 4 years ago


#4 Updated by Vaibhav Chauhan about 4 years ago

BRB: what is the issue. please elaborate.

#5 Updated by Kris Moore about 4 years ago

  • Status changed from Unscreened to Screened

#6 Updated by Wojciech Kruzel about 4 years ago


Vaibhav Chauhan wrote:

BRB: what is the issue. please elaborate.

The issue is 260k ARC requests per second, every minute from 9:00 to 18:00, while snapshots are actually created only at 9:00, 11:00, 13:00, 15:00 and 17:00.
autosnap.py and autorepl.py run every minute within that period, when they should only need to run on the hour.
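
To make the schedule concrete, here is a minimal sketch of deciding whether a given minute is a snapshot slot for the task described in this ticket (Mon-Fri, 9:00 to 18:00, every two hours). The function `is_snapshot_due` and its parameters are hypothetical, and it assumes slots are counted from the task's begin time, which matches the 9, 11, 13, 15, 17 pattern reported above.

```python
from datetime import datetime, time


def is_snapshot_due(now, begin=time(9, 0), end=time(18, 0),
                    interval_minutes=120, weekdays=(0, 1, 2, 3, 4)):
    """Return True if `now` falls on a snapshot slot of the task.

    Hypothetical helper. Weekdays use datetime.weekday() numbering
    (Monday is 0). Slots are assumed to start at `begin`.
    """
    if now.weekday() not in weekdays:
        return False
    if not (begin <= now.time() <= end):
        return False
    minutes_since_begin = (now.hour - begin.hour) * 60 + (now.minute - begin.minute)
    return minutes_since_begin % interval_minutes == 0


if __name__ == "__main__":
    # 2016-08-18 was a Thursday; list the hours at which a snapshot is due.
    due = [h for h in range(24) if is_snapshot_due(datetime(2016, 8, 18, h, 0))]
    print(due)  # [9, 11, 13, 15, 17]
```

With a check like this, every other minute in the window (for example 16:39 or 16:40 from the log above) could return early without listing snapshots.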

#7 Updated by Kris Moore about 4 years ago

  • Seen in changed from Unspecified to 9.10-U1

#8 Updated by Kris Moore almost 4 years ago

  • Category changed from 3 to 200
  • Status changed from Screened to Unscreened
  • Assignee changed from Kris Moore to Josh Paetzel
  • Priority changed from No priority to Nice to have

#9 Updated by Kris Moore almost 4 years ago

  • Assignee changed from Josh Paetzel to William Grzybowski
  • Target version set to 9.10.1-U3

BRB: William, is there any tuning we can do here that might relieve some of the pressure on the ARC?

#10 Updated by Kris Moore almost 4 years ago

  • Seen in changed from 9.10-U1 to 9.10-STABLE-201606270534

#11 Updated by William Grzybowski almost 4 years ago

  • Status changed from Unscreened to Screened

We could run autosnap only at the shortest interval among the configured tasks, since it queries all snapshots every time it runs.
That is not a trivial task, but it is not rocket science either.

#12 Updated by William Grzybowski almost 4 years ago

  • Status changed from Screened to 19
  • Target version changed from 9.10.1-U3 to 9.10.2

#13 Updated by William Grzybowski almost 4 years ago

  • Blocked by Bug #16429: Integrate Middlewared added

#14 Updated by William Grzybowski almost 4 years ago

  • Status changed from 19 to Needs Developer Review

#15 Updated by William Grzybowski almost 4 years ago

  • Assignee changed from William Grzybowski to Josh Paetzel

#16 Updated by Josh Paetzel almost 4 years ago

  • Status changed from Needs Developer Review to Reviewed
  • Assignee changed from Josh Paetzel to William Grzybowski

#17 Updated by Vaibhav Chauhan almost 4 years ago

  • Status changed from Reviewed to Ready For Release

#18 Updated by William Grzybowski over 3 years ago

  • Status changed from Ready For Release to Screened
  • Target version changed from 9.10.2 to 9.10.2-U2

Reverted, as it has some issues with snapshot tasks that use time frames.

#19 Updated by William Grzybowski over 3 years ago

  • Target version changed from 9.10.2-U2 to 9.10.3

#21 Updated by Kris Moore over 3 years ago

  • Target version changed from 9.10.4 to 11.1

#22 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-hd02mlt-20160818164430.txz)

#24 Updated by William Grzybowski almost 3 years ago

  • Status changed from Screened to Unscreened
  • Assignee changed from William Grzybowski to Bartosz Prokop

Bartosz, is this something that will be addressed in your rewrite?

#26 Updated by Bartosz Prokop almost 3 years ago

  • Status changed from Unscreened to Screened

#27 Updated by Bartosz Prokop almost 3 years ago

  • Related to Umbrella #27076: Replication and automatic snapshotting features rewrite to the new middleware added

#28 Updated by Kris Moore over 2 years ago

  • Target version changed from 11.2-BETA1 to 11.3

#29 Updated by Kris Moore over 2 years ago

  • Status changed from Screened to Not Started

#31 Updated by Alexander Motin over 2 years ago

  • Category changed from OS to Middleware
  • Assignee changed from Alexander Motin to William Grzybowski

I am not sure what could be done here on the OS side. Snapshot listing is known to be a heavy operation. If the metadata is not cached in ARC, it requires many disk head seeks (in which case I'd like to see some kind of prefetch there, but there is still none). If it is cached, as I suppose is the case here, then the only way to optimize is to not list the snapshots at all, or at least to list only names and not other properties, which may require additional data access.
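
As an illustration of the names-only listing, the sketch below wraps the same command that appears in the debug.log excerpt in the description (`/sbin/zfs list -t snapshot -H -o name`). The helper names `parse_snapshot_names` and `list_snapshot_names` are hypothetical, the dataset names in the example are made up, and this is not the middleware's actual code.

```python
import subprocess

# Same command as in the debug.log excerpt: names only, no header line.
ZFS_LIST_NAMES = ["/sbin/zfs", "list", "-t", "snapshot", "-H", "-o", "name"]


def parse_snapshot_names(output):
    """Split names-only `zfs list -H` output into a list of snapshot names."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def list_snapshot_names():
    """Run the names-only listing and return the snapshot names.

    Requires a host with ZFS; raises CalledProcessError if zfs fails.
    """
    result = subprocess.run(ZFS_LIST_NAMES, check=True,
                            capture_output=True, text=True)
    return parse_snapshot_names(result.stdout)


if __name__ == "__main__":
    # Hypothetical sample output, so the parser can be tried without ZFS.
    sample = "tank/a@auto-20160818.0900-2w\ntank/b@auto-20160818.0900-2w\n"
    print(parse_snapshot_names(sample))
```

Even with names only, every call still walks all snapshots, so calling it less often remains the main saving.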

#32 Updated by Ben Gadd over 2 years ago

  • Target version changed from 11.3 to Backlog

#33 Updated by William Grzybowski over 2 years ago

  • Severity set to Low

#34 Updated by William Grzybowski about 2 years ago

  • Assignee changed from William Grzybowski to Vladimir Vinogradenko
  • Parent task set to #27076

#37 Updated by Vladimir Vinogradenko over 1 year ago

  • Status changed from Not Started to Ready for Testing
  • Target version changed from Backlog to 11.3

#38 Updated by Dru Lavigne over 1 year ago

  • Target version changed from 11.3 to 11.3-BETA1

#39 Updated by Dru Lavigne over 1 year ago

  • Status changed from Ready for Testing to Closed
  • Target version changed from 11.3-BETA1 to N/A
  • Private changed from Yes to No
  • Reason for Closing set to Not Applicable

This should no longer be an issue in 11.3 due to the replication rewrite.
