Bug #26313

Merge in additional upstream scrub commits

Added by Dru Lavigne about 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
Critical
Assignee:
Sean Fagan
Category:
OS
Target version:
11.1-RC1
Seen in:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
No
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

As of Oct 19, there are two more patches upstream.


Related issues

Related to FreeNAS - Bug #25531: Add scrub sorting feature (Resolved, 2017-08-10)
Related to FreeNAS - Bug #13630: Prevent large file delete from starving out write operations (Resolved, 2016-02-13)

History

#1 Updated by Dru Lavigne about 1 year ago

  • Related to Bug #25531: Add scrub sorting feature added

#2 Updated by Dru Lavigne about 1 year ago

  • Status changed from Untriaged to Unscreened

#4 Updated by Sean Fagan about 1 year ago

  • Status changed from Screened to Needs Developer Review
  • Assignee changed from Sean Fagan to Alexander Motin

Okay, that went quickly. I've created the branch sef-zfs-sorted-scrub3 which has the two commits. (The second one applied cleanly, which was a bit of a surprise.)

I'm still doing tests, but it can be looked at now.

#5 Updated by Sean Fagan about 1 year ago

Also pull request 37 over on the freenas/os repo.

#6 Updated by Richard Kojedzinszky about 1 year ago

I would like to give feedback on this.

We are testing the branch on one of our backup servers. It just receives replication streams from other boxes. While the scrub speed seems to have increased, the zfs receive processes seem to be blocked frequently, so our hourly replication tasks are lagging behind. Unfortunately, the hourly tasks do not finish within an hour.

The zfs receive command seems to be processing the stream in bursts, and after some time it gets into the D state:

# procstat -k -k 4121
PID TID COMM TDNAME KSTACK
4121 101984 zfs - mi_switch+0xe5 sleepq_wait+0x3a _cv_wait+0x15b bqueue_enqueue+0x7b dmu_recv_stream+0x7a0 zfs_ioc_recv+0xadf zfsdev_ioctl+0x6d5 devfs_ioctl_f+0x128 kern_ioctl+0x281 sys_ioctl+0x15c amd64_syscall+0xa4a Xfast_syscall+0xfb

Sometimes it stays in this state for 10-20 seconds; after that, it simply resumes receiving the stream.

This is observed with this branch only.

#7 Updated by Sean Fagan about 1 year ago

Is there a scrub (or resilver) going on during the receive?

If you do "sysctl -w vfs.zfs.zfs_scan_legacy=1", does it continue to have this behaviour?

#8 Updated by Richard Kojedzinszky about 1 year ago

Sorry, I forgot to mention that this happens while a scrub is running. I will report back after trying the sysctl you've mentioned.

#9 Updated by Sean Fagan about 1 year ago

Okay. That may be a design limitation, then, although a pause of up to 20 seconds seems unreasonable. (Essentially, it looks like the scanning code is putting a bunch of i/o on a queue, and the recv code is doing the same.)

#10 Updated by Sean Fagan about 1 year ago

The loop in bqueue_enqueue() is:

        while (q->bq_size + item_size > q->bq_maxsize) {
                cv_wait(&q->bq_add_cv, &q->bq_lock);
        }
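
For context: the enqueue side blocks on bq_add_cv whenever the queue is full, and it is the consumer (dequeue) side that signals that condition variable after it has drained data. A simplified sketch of that counterpart is below; it is not the verbatim upstream code, and bqueue_item_size() is a stand-in for the per-item size bookkeeping:

/*
 * Simplified sketch of the consumer side (not the verbatim upstream code;
 * bqueue_item_size() is a hypothetical stand-in for the size bookkeeping).
 */
void *
bqueue_dequeue(bqueue_t *q)
{
        void *item;

        mutex_enter(&q->bq_lock);
        /* Wait until the producer has queued at least one item. */
        while (list_is_empty(&q->bq_list))
                cv_wait(&q->bq_pop_cv, &q->bq_lock);
        item = list_remove_head(&q->bq_list);
        q->bq_size -= bqueue_item_size(q, item);
        mutex_exit(&q->bq_lock);
        /* This wakeup is what releases a producer stuck in bqueue_enqueue(). */
        cv_signal(&q->bq_add_cv);
        return (item);
}

So a zfs receive thread blocked in bqueue_enqueue() stays asleep until the thread consuming that queue catches up.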

#11 Updated by Richard Kojedzinszky about 1 year ago

Checking the stalls again, they may last 10-15 seconds rather than 20; at first sight they looked that long to me. Either way, it still slows down the entire process, resulting in our replications not finishing in time.

Changing the sysctl during a running scrub did not change anything. Does it only take effect when a scrub is started?

#12 Updated by Richard Kojedzinszky about 1 year ago

Here is a long stall. This is the sending side of the replication, with a ps taken once per second:

Fri Oct 27 20:37:59 CEST 2017
3422 - S 0:00.49 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 25122592/7657722328) (zfs)
Fri Oct 27 20:38:00 CEST 2017
3422 - S 0:00.52 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 39464040/7657722328) (zfs)
Fri Oct 27 20:38:01 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:02 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:03 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:04 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:05 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:06 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:07 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:08 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:09 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:10 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:11 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:12 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:13 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:14 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:15 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:16 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:17 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:18 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:19 CEST 2017
3422 - S 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:20 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:21 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:22 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:23 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:24 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:25 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:26 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:27 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:28 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:29 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:30 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:31 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:32 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:33 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:34 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:35 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:36 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:37 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:38 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:39 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:40 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:41 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:42 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:44 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:45 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:46 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:47 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:48 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:49 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:50 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:51 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:52 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:53 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:54 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:55 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:56 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:57 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:58 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:38:59 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:00 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:01 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:02 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:03 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:04 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:05 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:06 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:07 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:08 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:09 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:10 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:11 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:12 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:13 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:14 CEST 2017
3422 - I 0:00.53 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:15 CEST 2017
3422 - S 0:00.56 zfs: sending pool/dataset@auto-20171027.2000-2w (0%: 62871608/7657722328) (zfs)
Fri Oct 27 20:39:16 CEST 2017
3422 - S 0:00.62 zfs: sending pool/dataset@auto-20171027.2000-2w (1%: 100581000/7657722328) (zfs)
Fri Oct 27 20:39:17 CEST 2017
3422 - S 0:00.71 zfs: sending pool/dataset@auto-20171027.2000-2w (2%: 201221456/7657722328) (zfs)
Fri Oct 27 20:39:18 CEST 2017
3422 - S 0:00.73 zfs: sending pool/dataset@auto-20171027.2000-2w (3%: 306163520/7657722328) (zfs)
Fri Oct 27 20:39:19 CEST 2017
3422 - S 0:00.73 zfs: sending pool/dataset@auto-20171027.2000-2w (3%: 306163520/7657722328) (zfs)
Fri Oct 27 20:39:20 CEST 2017
3422 - S 0:00.73 zfs: sending pool/dataset@auto-20171027.2000-2w (3%: 306163520/7657722328) (zfs)

Later, with this dataset, again a long stall:

Fri Oct 27 20:42:30 CEST 2017
3422 - S 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:31 CEST 2017
3422 - S 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:32 CEST 2017
3422 - S 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:33 CEST 2017
3422 - S 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:34 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:35 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:36 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:37 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:38 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:39 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:40 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:41 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:42 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:43 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:44 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:45 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:46 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:47 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:48 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:49 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:50 CEST 2017
3422 - I 0:01.56 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:51 CEST 2017
3422 - S 0:01.58 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1316904192/7657722328) (zfs)
Fri Oct 27 20:42:52 CEST 2017
3422 - S 0:01.58 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1346337328/7657722328) (zfs)
Fri Oct 27 20:42:53 CEST 2017
3422 - S 0:01.58 zfs: sending pool/dataset@auto-20171027.2000-2w (17%: 1346337328/7657722328) (zfs)
Fri Oct 27 20:42:54 CEST 2017
3422 - S 0:01.62 zfs: sending pool/dataset@auto-20171027.2000-2w (18%: 1384050776/7657722328) (zfs)
Fri Oct 27 20:42:56 CEST 2017
3422 - S 0:01.62 zfs: sending pool/dataset@auto-20171027.2000-2w (18%: 1384050776/7657722328) (zfs)

#13 Updated by Sean Fagan about 1 year ago

The sysctl doesn't take effect until a scrub/resilver is started, so setting it in the middle of one won't have an impact.

#14 Updated by Sean Fagan about 1 year ago

Also, to be clear: this does not happen with the current Nightlies build, but only on a build based on the sef-zfs-sorted-scrub3 branch?

#15 Updated by Richard Kojedzinszky about 1 year ago

Unfortunately I don't know. I've not tried NIGHTLY builds.

Earlier I tried a build based on 11.1-BETA1 (sorted2?), but that was much worse: with that one, even the scrub stalled for periods. That experiment was very short, though; the box sometimes had a load average of 50, so I switched to the sorted3 branch.

#16 Updated by Sean Fagan about 1 year ago

That makes me think it's more of an unavoidable (for now, anyway) issue with how it's implemented. (The main thing it does is sort the blocks in each transaction group and send them off to a thread to be processed.)
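
To illustrate just that idea in isolation, here is a toy sketch in plain C (not the ZFS code): block pointers found while scanning metadata are buffered, sorted by on-disk offset, and only then issued, so the scrub reads hit the disk mostly sequentially.

/* Toy illustration of "gather, sort by offset, then issue"; not the ZFS implementation. */
#include <stdio.h>
#include <stdlib.h>

struct scan_io {
        unsigned long long offset;      /* on-disk offset of the block */
        unsigned long long size;        /* block size in bytes */
};

static int
cmp_offset(const void *a, const void *b)
{
        const struct scan_io *x = a, *y = b;

        return ((x->offset > y->offset) - (x->offset < y->offset));
}

int
main(void)
{
        /* Blocks in the order the metadata scan happened to find them. */
        struct scan_io batch[] = {
                { 9000, 4096 }, { 128, 4096 }, { 4200, 8192 }, { 64, 512 },
        };
        size_t i, n = sizeof (batch) / sizeof (batch[0]);

        /* Sort the batch by offset before issuing, so the reads are mostly sequential. */
        qsort(batch, n, sizeof (batch[0]), cmp_offset);

        for (i = 0; i < n; i++)
                printf("issue read: offset=%llu size=%llu\n",
                    batch[i].offset, batch[i].size);
        return (0);
}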

Scrubs can be paused, and that may end up being a necessary thing. (Resilvers can't be paused, however.)

I'm curious what Alexander thinks about all this.

#17 Updated by Richard Kojedzinszky about 1 year ago

Since the main point of this scrub algorithm is to issue many sorted reads, does it only contend with zfs receive? Does it affect normal I/O in the same way? Does it have such a negative impact on services?

#18 Updated by Sean Fagan about 1 year ago

I haven't noticed anything nearly as drastic as you have -- in general, the scrubs are lower priority than normal I/O, but this change does refactor how the scrubs are done.

I've done recv's to a system with the changes, but my data churn is low enough that I've never seen any issues. (I have seen a slowdown in performance, but that's been due to the pattern of the I/O: the scrub reads are sequentially sorted, but the new data coming from the recv [or any other I/O!] is not, which means a lot more seeking going on. When I looked at it on my system, it seemed pretty comparable to the non-sorted method; I checked this by comparing the output of "zpool iostat -v Storage 60" in both cases.)

#19 Updated by Richard Kojedzinszky about 1 year ago

With this scrub, the legacy sysctls vfs.zfs.scrub_delay and vfs.zfs.scan_idle have disappeared. Is that intentional? Is there any way to control the scrub process? I assume that when I switch back to legacy scrub via the sysctl, I will not be able to prioritize the scrub at all. Since the former sysctls have been removed completely, this looks like a change to the scrub algorithm that is not tunable and that is more aggressive toward normal I/O. Can it be tuned or controlled somehow? I've seen some memory limits in the code; could they be used for that?

#20 Updated by Sean Fagan about 1 year ago

It should be vfs.zfs.zfs_scan_legacy (changed from "scrub" to "scan"). The idle one is gone.

#21 Updated by Richard Kojedzinszky about 1 year ago

This is still very odd to me. I have just started a scrub again. All the disks are only ~60% busy, and still the zfs receive is stalled, lasting 20 or more minutes; it finished in 1-2 minutes with the legacy implementation. It seems to me that real disk contention is not the reason zfs receive is lagging: the receiving zfs processes sit in the D state for the long periods with no CPU time accumulating. I don't think this is production-ready right now. I also miss having tunables to control the scrub, optimized or not.

I've just played a bit with vfs.zfs.top_maxinflight without any success; lowering it to 2 or raising it to 128 did not affect the receive process.

If I start the scrub in legacy mode with vfs.zfs.zfs_scan_legacy=1, heavy write I/O is sent to the pool, with no scrub progress at all:

Sat Oct 28 02:43:10 CEST 2017

pool: main
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub in progress since Sat Oct 28 02:36:57 2017
0 scanned at 0/s, 0 issued at 0/s, 36.0T total, (scan is slow, no estimated time)
0 repaired, 0.00% done
config:
NAME          STATE     READ WRITE CKSUM
main          ONLINE       0     0     0
  raidz2-0    ONLINE       0     0     0
    da0p2     ONLINE       0     0     0
    da1p2     ONLINE       0     0     0
    da2p2     ONLINE       0     0     0
    da3p2     ONLINE       0     0     0
    da4p2     ONLINE       0     0     0
    da5p2     ONLINE       0     0     0
    da6p2     ONLINE       0     0     0
    da7p2     ONLINE       0     0     0
  raidz2-1    ONLINE       0     0     0
    da8p2     ONLINE       0     0     0
    da9p2     ONLINE       0     0     0
    da10p2    ONLINE       0     0     0
    da11p2    ONLINE       0     0     0
    da12p2    ONLINE       0     0     0
    da13p2    ONLINE       0     0     0
    da14p2    ONLINE       0     0     0
    da15p2    ONLINE       0     0     0
logs
  mirror-2    ONLINE       0     0     0
    ada0p1    ONLINE       0     0     0
    ada1p1    ONLINE       0     0     0
cache
  ada0p2      ONLINE       0     0     0
  ada1p2      ONLINE       0     0     0

errors: No known data errors

A 5-second zpool iostat -v:

              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
main        36.0T  51.0T      0  1.14K      0  47.5M
  raidz2    18.0T  25.5T      0    557      0  23.9M
    da0p2       -      -      0    203      0  4.47M
    da1p2       -      -      0    204      0  4.49M
    da2p2       -      -      0    203      0  4.50M
    da3p2       -      -      0    198      0  4.47M
    da4p2       -      -      0    204      0  4.47M
    da5p2       -      -      0    204      0  4.50M
    da6p2       -      -      0    203      0  4.49M
    da7p2       -      -      0    196      0  4.49M
  raidz2    18.0T  25.5T      0    605      0  23.6M
    da8p2       -      -      0    199      0  4.49M
    da9p2       -      -      0    197      0  4.48M
    da10p2      -      -      0    199      0  4.48M
    da11p2      -      -      0    195      0  4.49M
    da12p2      -      -      0    195      0  4.48M
    da13p2      -      -      0    196      0  4.48M
    da14p2      -      -      0    198      0  4.49M
    da15p2      -      -      0    200      0  4.49M
logs            -      -      -      -      -      -
  mirror     404K  3.97G      0      0      0      0
    ada0p1      -      -      0      0      0      0
    ada1p1      -      -      0      0      0      0
cache           -      -      -      -      -      -
  ada0p2    41.7G  47.5G      0      0      0      0
  ada1p2    43.8G  45.3G      0      0      0      0

What is happening here?

#22 Updated by Sean Fagan about 1 year ago

I've no idea. That's not the behaviour I see at all.

#23 Updated by Richard Kojedzinszky about 1 year ago

Whenever the stall happens, the receiving thread blocks waiting on txg_wait_open():

# procstat -k -k -a | grep -E "dmu_recv_stream|receive_writer_thread"
16 102397 zfskern solthread 0xfffffff mi_switch+0xe5 sleepq_wait+0x3a _cv_wait+0x15b txg_wait_open+0xa5 dmu_free_long_range+0x20a receive_writer_thread+0x1d5 fork_exit+0x85 fork_trampoline+0xe
16256 103175 zfs - mi_switch+0xe5 sleepq_wait+0x3a _cv_wait+0x15b bqueue_enqueue+0x7b dmu_recv_stream+0x7a0 zfs_ioc_recv+0xadf zfsdev_ioctl+0x6d5 devfs_ioctl_f+0x128 kern_ioctl+0x281 sys_ioctl+0x15c amd64_syscall+0xa4a Xfast_syscall+0xfb

Hope this helps.

#24 Updated by Richard Kojedzinszky about 1 year ago

I am now playing with vfs.zfs.scan_min_time_ms; I have set it to 10 from the default of 1000, and so far it looks promising with regard to the stalls mentioned above. I don't know what effect it has on total scrub time, but in my view the main benefit of the sorted scrub is that it does not put a random read load on the disks. So even if the scrub took the same time as before but the disks were less busy, that would be a win. For now, though, the scrub ETA is also around 1/3 of the legacy scrub times. I will monitor this and report back soon.

#25 Updated by Alexander Motin about 1 year ago

Looking at the "txg_wait_open+0xa5 dmu_free_long_range+0x20a receive_writer_thread+0x1d5" part of the stack trace brought me to another change introduced after the FreeNAS 11.0 release: "6569 large file delete can starve out write ops". It seems like your replication stream includes many deletes (Sean, maybe you can't reproduce it because you have few deletes in your replication stream?). If too many deletes are done in one transaction group, the process is frozen until that transaction group is closed and another one is opened. I am not sure how transaction-group closing is related to the running scrub; it may be just the disk load, or it may be something else.

Richard, you may try setting the loader tunable vfs.zfs.per_txg_dirty_frees_percent to 0 to disable this delete-throttling mechanism, as it was at the time of the FreeNAS 11.0 release, so that we debug one thing at a time.
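
For reference, the throttle that 6569 introduced works roughly like the sketch below (paraphrased and simplified, not the verbatim source; dirty_frees_this_txg stands in for the counter the real code keeps). Once the data freed in the currently open transaction group exceeds the configured percentage of zfs_dirty_data_max, the freeing thread calls txg_wait_open() and sleeps until the next transaction group opens, which matches the wait in the stack trace above. A percentage of 0 makes the threshold zero, so the check is skipped entirely:

/* Rough sketch of the 6569 delete throttle inside the long-range free loop. */
uint64_t threshold = zfs_dirty_data_max *
    zfs_per_txg_dirty_frees_percent / 100;

if (threshold != 0 && dirty_frees_this_txg >= threshold) {
        /* Too much has been freed in this txg already; wait for the next one. */
        txg_wait_open(dp, 0);
        continue;       /* retry the free in the new transaction group */
}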

#26 Updated by Alexander Motin about 1 year ago

  • Status changed from Needs Developer Review to Reviewed by Developer

Apart from one more extra tab character, which has been fixed, I have no objections.

#27 Updated by Alexander Motin about 1 year ago

  • Assignee changed from Alexander Motin to Sean Fagan

#28 Updated by Richard Kojedzinszky about 1 year ago

It seems that setting vfs.zfs.per_txg_dirty_frees_percent to 0 restores the original behaviour: we see the original replication times during a scrub. So Alexander is right; 6569 causes the stalls in our case. Indeed, we are replicating streams that may contain many deletes.

Regards,

#29 Updated by Kris Moore about 1 year ago

  • Status changed from Reviewed by Developer to Ready For Release
  • Target version changed from 11.1 to 11.1-RC1

#30 Updated by Dru Lavigne about 1 year ago

  • Status changed from Ready For Release to Resolved

#31 Updated by Dru Lavigne about 1 year ago

  • Related to Bug #13630: Prevent large file delete from starving out write operations added

#32 Updated by Nick Wolff 12 months ago

  • Needs QA changed from Yes to No
  • QA Status Test Fails FreeNAS added
  • QA Status deleted (Not Tested)

Test Passes

Initial testing has been done. A more comprehensive ZFS test plan is being worked on.
