Bug #24405

Add viamillipede to build

Added by Nick Wolff almost 2 years ago. Updated 10 months ago.

Status:
Done
Priority:
Expected
Assignee:
Ash Gokhale
Category:
Middleware
Target version:
Seen in:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
No
Needs Doc:
No
Needs Merging:
No
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

FreeNAS replication tasks seem to be slowed by the dd commands used. This may not be noticeable on all systems, since a disk or network bottleneck may become apparent before this one.
By first doing a zfs send to /dev/null we establish what throughput is possible; then, by testing various combinations, we can see the impact of each part of our replication process.

During testing, dtrace was used to investigate this issue, and it appears dd is doing extremely small reads.
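
A minimal way to reproduce that observation (a sketch; run it while a replication pipeline is active and press Ctrl-C to print the histogram; quantize() buckets the requested read sizes by powers of two):

dtrace -n 'syscall::read:entry /execname == "dd"/ { @["dd read size (bytes)"] = quantize(arg2); }'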

The " /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null| /usr/local/bin/pipewatcher $$" below is pulled out of autorepl.py as the commands piped together during a replication.

Reproduction of the replication problem:
Setup

zfs create $ZPOOL/$PARENTDATASET/test-zfs-send-rand
cd /mnt/$ZPOOL/$PARENTDATASET/test-zfs-send-rand
dd if=/dev/random of=rand.1 bs=10240000 count=1024
zfs snap $ZPOOL/$PARENTDATASET/test-zfs-send-rand@10g
cp rand.1 rand.2; cp rand.1 rand.3; cp rand.1 rand.4; cp rand.1 rand.5
zfs snap $ZPOOL/$PARENTDATASET/test-zfs-send-rand@50g
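
(For reference: bs=10240000 count=1024 writes 10,240,000 × 1,024 bytes ≈ 10.5 GB, hence the @10g snapshot; the four extra copies bring the dataset to ≈ 52 GB for @50g.)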

Testing
sh
dataset="Palantiri/test-zfs-send-rand" 
( time /sbin/zfs send -V -p ${dataset}@10g |cat > /dev/null)
( time /sbin/zfs send -V -p ${dataset}@10g |/usr/local/bin/lz4c |cat > /dev/null)
( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null| /usr/local/bin/pipewatcher $$|cat > /dev/null)
( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null| /usr/local/bin/pipewatcher $$| /usr/local/bin/lz4c |cat > /dev/null)
( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null  |cat > /dev/null)
( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null |/bin/dd obs=1m 2> /dev/null  |cat > /dev/null)
( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null |/bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null |cat > /dev/null)
( time /sbin/zfs send -V -p ${dataset}@10g |  /usr/local/bin/pipewatcher $$ |cat > /dev/null)
( time /sbin/zfs send -V -p ${dataset}@50g |  cat > /dev/null)
( time /sbin/zfs send -V -p ${dataset}@50g |  /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null| /usr/local/bin/pipewatcher $$ |  cat > /dev/null)
( time /sbin/zfs send -V -p ${dataset}@50g |  /bin/dd obs=1m 2> /dev/null | /usr/local/bin/pipewatcher $$ |  cat > /dev/null)

[root@zoltan-a] /mnt/Palantiri/test-zfs-send-rand# sh
dataset="Palantiri/test-zfs-send-rand"
  1. ( time /sbin/zfs send -V -p ${dataset}@10g |cat > /dev/null)
    load: 1.11 cmd: zfs 82379 [running] 2.55r 0.00u 1.88s 16% 3656k
    12.27 real 0.00 user 8.92 sys
  2. ( time /sbin/zfs send -V -p ${dataset}@10g |/usr/local/bin/lz4c |cat > /dev/null)
    54.06 real 0.00 user 10.23 sys
  3. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null| /usr/local/bin/pipewatcher $$|cat > /dev/null)
    57.69 real 0.00 user 8.84 sys
  4. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null| /usr/local/bin/pipewatcher $$| /usr/local/bin/lz4c |cat > /dev/null)
    85.91 real 0.00 user 10.03 sys
  5. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null |cat > /dev/null)
    38.96 real 0.00 user 11.12 sys
  6. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null |/bin/dd obs=1m 2> /dev/null |cat > /dev/null)
    56.91 real 0.00 user 8.64 sys
  7. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null |/bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null |cat > /dev/null)
    64.48 real 0.00 user 9.13 sys
  8. ( time /sbin/zfs send -V -p ${dataset}@10g | /usr/local/bin/pipewatcher $$ |cat > /dev/null)
    11.20 real 0.00 user 8.29 sys
  9. ( time /sbin/zfs send -V -p ${dataset}@50g | cat > /dev/null)
    59.83 real 0.00 user 43.40 sys
  10. ( time /sbin/zfs send -V -p ${dataset}@50g | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null| /usr/local/bin/pipewatcher $$ | cat > /dev/null)
    281.54 real 0.02 user 44.57 sys
  11. ( time /sbin/zfs send -V -p ${dataset}@50g | /bin/dd obs=1m 2> /dev/null | /usr/local/bin/pipewatcher $$ | cat > /dev/null)
replication before.jpg (158 KB) - Joe Maloney, 09/06/2017 05:00 AM
replication after.jpg (161 KB) - Joe Maloney, 09/06/2017 05:03 AM

Related issues

Related to FreeNAS - Bug #22777: Replication is slower in 9.10 than in 9.2.10 (Closed: Insufficient Info, 2017-03-25)

Associated revisions

Revision cd319019 (diff)
Added by William Grzybowski almost 2 years ago

fix(repl): remove double dd (it make things slower now with pipewatcher)

Ticket: #24405

Revision e0e6579f (diff)
Added by William Grzybowski over 1 year ago

fix(repl): remove double dd (it make things slower now with pipewatcher)

Ticket: #24405
(cherry picked from commit cd3190192bd3af6c73b9e346be59a3400889595f)

History

#1 Updated by William Grzybowski almost 2 years ago

  • Status changed from Unscreened to Screened
  • Target version set to 11.0-U1

#2 Updated by Ash Gokhale almost 2 years ago

During replication, the dd and pipewatcher processes serially process replication traffic, and this limits final replication throughput on systems with >1Gbps replication expectations. Watching the read/write flow shows a constant multiplication of operations, with no buffering gains due to multiple rebuffering operations.

The input and output block sizes of dd and pipewatcher are far from optimal; looking at the size histograms for both calls:

/* Count read sizes requested by dd and pipewatcher (arg2 = nbyte). */
syscall::read:entry / execname == "dd" || execname == "pipewatcher" /
{
        @read[execname, arg2] = count();
}

/* Count write sizes requested by dd and pipewatcher. */
syscall::write:entry / execname == "dd" || execname == "pipewatcher" /
{
        @write[execname, arg2] = count();
}

Shows this for reads:

  pipewatcher                                                     122                1
  pipewatcher                                                     128                1
  dd                                                              122                2
  dd                                                              128                2
  pipewatcher                                                 1048576           160383
  dd                                                              512         41088680 <-- dd's reads are tiny! (512 bytes per read = 2,048 syscalls per MiB)

And this for writes:

  dd                                                               42                1
  dd                                                               46                1
  pipewatcher                                                   41688                1
  dd                                                               70                2
  dd                                                           893656                2
  dd                                                          1048576            20046
  pipewatcher                                                   65536           160381 <-- pipewatcher is reblocking to 64k writes

It's time to reevaluate the benefits of this previous optimization.
Ideas to fix this (see the mbuffer sketch below):
- tune dd's parameters
- optionally drop dd/pipewatcher
- provide a user-tunable replication pipeline string
- replace dd with a more appropriate tool such as mbuffer
- replace dd/pipewatcher with a bespoke tool that provides the exact behavior needed
- fix zfs send/receive if there is still a starvation problem
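
To illustrate the mbuffer idea (a sketch only; mbuffer is not part of the current pipeline, and the block/buffer sizes below are assumptions, not tested values):

( time /sbin/zfs send -V -p ${dataset}@10g | mbuffer -q -s 128k -m 1G | cat > /dev/null )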

#3 Updated by William Grzybowski almost 2 years ago

  • Target version changed from 11.0-U1 to 11.1

#4 Updated by Nick Wolff almost 2 years ago

  • Description updated (diff)

#5 Updated by Nick Wolff almost 2 years ago

  • Description updated (diff)

#6 Updated by Nick Wolff almost 2 years ago

  • Description updated (diff)

#7 Updated by William Grzybowski almost 2 years ago

So is removing dd and leaving pipewatcher good enough, since pipewatcher does the buffering the dd used to do?
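
For reference, that would change the replication pipeline roughly as follows (a sketch based on the commands quoted in the description; the ssh/zfs receive tail is abbreviated):

# before: zfs send ... | dd obs=1m | dd obs=1m | pipewatcher $$ | ssh ... "zfs receive ..."
# after:  zfs send ... | pipewatcher $$ | ssh ... "zfs receive ..."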

#8 Updated by Nick Wolff almost 2 years ago

  • Description updated (diff)

#10 Updated by Cyber Jock almost 2 years ago

This data is from a dataset generated from /dev/zero and compression=off on my Z20-HA system running 9.10.2-U5 on a zpool that is an 8-disk RAIDZ2.

  1. ( time /sbin/zfs send -V -p ${dataset}@10g |cat > /dev/null)
    12.16 real 0.00 user 8.79 sys
  2. ( time /sbin/zfs send -V -p ${dataset}@10g |/usr/local/bin/lz4c |cat > /dev/null)
    18.29 real 0.00 user 10.19 sys
  3. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null | /usr/local/bin/pipewatcher $$|cat > /dev/null)
    57.71 real 0.00 user 8.72 sys
  4. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null | /usr/local/bin/pipewatcher $$| /usr/local/bin/lz4c |cat > /dev/null)
    56.97 real 0.00 user 8.78 sys
  5. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null |cat > /dev/null)
    37.26 real 0.01 user 10.85 sys
  6. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null |/bin/dd obs=1m 2> /dev/null |cat > /dev/null)
    56.58 real 0.00 user 8.39 sys
  7. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null |/bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null |cat > /dev/null)
    63.30 real 0.00 user 9.46 sys
  8. ( time /sbin/zfs send -V -p ${dataset}@10g | /usr/local/bin/pipewatcher $$ |cat > /dev/null)
    10.49 real 0.00 user 8.02 sys
  9. ( time /sbin/zfs send -V -p ${dataset}@50g | cat > /dev/null)
    59.71 real 0.00 user 43.15 sys
  10. ( time /sbin/zfs send -V -p ${dataset}@50g | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null| /usr/local/bin/pipewatcher $$ | cat > /dev/null)
    273.40 real 0.00 user 44.22 sys
  11. ( time /sbin/zfs send -V -p ${dataset}@50g | /bin/dd obs=1m 2> /dev/null | /usr/local/bin/pipewatcher $$ | cat > /dev/null)
    189.39 real 0.00 user 45.68 sys

This is the same system with /dev/random as the source on a dataset with lz4 compression.

  1. ( time /sbin/zfs send -V -p ${dataset}@10g |cat > /dev/null)
    17.57 real 0.00 user 8.72 sys
  2. ( time /sbin/zfs send -V -p ${dataset}@10g |/usr/local/bin/lz4c |cat > /dev/null)
    54.48 real 0.00 user 8.91 sys
  3. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null| /usr/local/bin/pipewatcher $$|cat > /dev/null)
    57.47 real 0.00 user 8.89 sys
  4. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null| /usr/local/bin/pipewatcher $$| /usr/local/bin/lz4c |cat > /dev/null)
    85.90 real 0.00 user 10.07 sys
  5. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null |cat > /dev/null)
    38.42 real 0.00 user 11.40 sys
  6. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null |/bin/dd obs=1m 2> /dev/null |cat > /dev/null)
    55.59 real 0.00 user 8.62 sys
  7. ( time /sbin/zfs send -V -p ${dataset}@10g | /bin/dd obs=1m 2> /dev/null |/bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null |cat > /dev/null)
    65.38 real 0.00 user 9.16 sys
  8. ( time /sbin/zfs send -V -p ${dataset}@10g | /usr/local/bin/pipewatcher $$ |cat > /dev/null)
    11.53 real 0.00 user 8.67 sys
  9. ( time /sbin/zfs send -V -p ${dataset}@50g | cat > /dev/null)
    57.61 real 0.00 user 41.99 sys
  10. ( time /sbin/zfs send -V -p ${dataset}@50g | /bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null| /usr/local/bin/pipewatcher $$ | cat > /dev/null)
    279.39 real 0.00 user 44.37 sys
  11. ( time /sbin/zfs send -V -p ${dataset}@50g | /bin/dd obs=1m 2> /dev/null | /usr/local/bin/pipewatcher $$ | cat > /dev/null)
    188.18 real 0.01 user 55.73 sys

Thanks.

#11 Updated by William Grzybowski almost 2 years ago

  • Status changed from Screened to Ready For Release

#12 Updated by Nick Wolff almost 2 years ago

William, regarding your question about whether removing dd is good enough: I'm not sure about all the corner cases. My understanding is that the buffering is meant to help in corner cases such as many small datasets, among other things. I don't believe testing has been done to validate that; all testing so far has been to show that a general issue exists. A sketch of how that corner case could be tested follows.
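
One way to check that corner case would be to repeat the timing method from the description against many small datasets instead of a few large files (a hypothetical sketch; pool and dataset names are examples):

zfs create $ZPOOL/test-small
for i in $(seq 1 200); do
    zfs create $ZPOOL/test-small/ds$i
    dd if=/dev/random of=/mnt/$ZPOOL/test-small/ds$i/rand bs=1m count=10 2> /dev/null
done
zfs snap -r $ZPOOL/test-small@many
( time /sbin/zfs send -R $ZPOOL/test-small@many | cat > /dev/null )
( time /sbin/zfs send -R $ZPOOL/test-small@many | /usr/local/bin/pipewatcher $$ | cat > /dev/null )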

#14 Updated by William Grzybowski almost 2 years ago

  • Related to Bug #22777: Replication is slower in 9.10 then in 9.2.10 added

#15 Updated by Shantur Rathore almost 2 years ago

Can pipewatcher be replaced with the timeout command-line utility already present in FreeNAS? That way nothing would need to read and write the pipe again.

#16 Updated by William Grzybowski almost 2 years ago

Shantur Rathore wrote:

Can pipewatcher be replaced with the timeout command-line utility already present in FreeNAS? That way nothing would need to read and write the pipe again.

No, it's more complicated than that. It is not about connection timeout.
Also, pipewatcher does the job of buffering (you can see that replication with pipewatcher is faster than without it).

#18 Avatar?id=14398&size=24x24 Updated by Kris Moore over 1 year ago

  • Status changed from Ready For Release to 42
  • Target version changed from 11.1 to 11.0-U3

William: can you make a PR for this fix please? We need to get this into -U3 if possible.

#19 Updated by William Grzybowski over 1 year ago

#20 Updated by Vaibhav Chauhan over 1 year ago

  • Status changed from 42 to 47

#21 Updated by Vaibhav Chauhan over 1 year ago

  • Assignee changed from William Grzybowski to Joe Maloney

#22 Updated by Joe Maloney over 1 year ago

  • Assignee changed from Joe Maloney to Bonnie Follweiler

#23 Updated by Dru Lavigne over 1 year ago

  • Subject changed from Freenas Replication Slower then expected to Speed improvement for replication

#24 Updated by Joe Maloney over 1 year ago


Bonnie set up the environment with 2 x FreeNAS Minis and Chelsio T520-SO-CR 10G cards.

For monitoring I used (the :scale command is typed at the systat prompt to show rates in megabits):

systat -ifstat 1
:scale mbit

I then created 10GB of random data using dd:

dd if=/dev/urandom of=test1 bs=1M count=10240

I then created a dataset called largefiles, created a periodic snapshot task to run every 5 minutes, and used the script below to copy the 10GB file 100 times into the dataset.

#!/bin/bash
# Copy the 10GB test file into the dataset 100 times,
# zero-padding the file number to three digits.
for x in `seq 1 100`; do
    if [[ $x -lt 10 ]]; then cp test1 /mnt/tanka/largefiles/file-00$x.img;
    elif [[ $x -lt 100 ]]; then cp test1 /mnt/tanka/largefiles/file-0$x.img;
    else cp test1 /mnt/tanka/largefiles/file-$x.img;
    fi
done

Locally in autorepl.py I commented out the changed line:

# replcmd = '%s%s/usr/local/bin/pipewatcher $$ | %s "%s/sbin/zfs receive -F -d \'%s\' && echo Succeeded"' % (compress, throttle, sshcmd, decompress, remotefs)

Then added the old dd pipeline back:

replcmd = '%s%s/bin/dd obs=1m 2> /dev/null | /bin/dd obs=1m 2> /dev/null | /usr/local/bin/pipewatcher $$ | %s "%s/sbin/zfs receive -F -d \'%s\' && echo Succeeded"' % (compress, throttle, sshcmd, decompress, remotefs)

As shown in replication before.jpg: 497.933Mb/s.

I then restored autorepl.py with only the change:

replcmd = '%s%s/usr/local/bin/pipewatcher $$ | %s "%s/sbin/zfs receive -F -d \'%s\' && echo Succeeded"' % (compress, throttle, sshcmd, decompress, remotefs)

As shown in replication after.jpg: 851.139Mb/s.

No stability issues were encountered when replicating around 500GB of data.

#25 Updated by William Grzybowski over 1 year ago

  • Assignee changed from Bonnie Follweiler to William Grzybowski

#26 Updated by Dru Lavigne over 1 year ago

  • Status changed from Ready For Release to Resolved

#28 Updated by Sam Fourman over 1 year ago

  • Status changed from Resolved to 50
  • Assignee changed from William Grzybowski to Ash Gokhale
  • Target version changed from 11.0-U3 to 11.1

#29 Updated by Dru Lavigne over 1 year ago

  • Target version changed from 11.1 to 11.1-U1

#30 Avatar?id=14398&size=24x24 Updated by Kris Moore over 1 year ago

  • Target version changed from 11.1-U1 to 11.2-BETA1

#31 Updated by Ash Gokhale over 1 year ago

  • Status changed from 50 to 51

My parallel TCP stream multiplexer may suit some of this need; it's in alpha. Not sure how to set priority for the backlog, so: 4.

#32 Updated by Dru Lavigne over 1 year ago

  • Status changed from 51 to Fix In Progress

#33 Updated by Ash Gokhale over 1 year ago

I've written a scatter-gather parallel TCP multiplexer that uses as much of the wire as possible; this is a replacement for 'netcat' and handles transport, not crypto or compression.
It uses multiple network links, preferring a set destination, and fans out if the transport does not keep up with the source.
It's here:
https://github.com/agokhale/cantrips/tree/master/libexec/viamillipede

#34 Updated by Dru Lavigne about 1 year ago

  • Status changed from Fix In Progress to In Progress

#35 Avatar?id=13649&size=24x24 Updated by Ben Gadd about 1 year ago

  • Due date set to 03/09/2018

#36 Updated by Ash Gokhale about 1 year ago

The viamillipede port is in the nightlies; however, the UI and the replication logic are not yet able to use it.
https://github.com/freenas/ports/pull/82

#37 Updated by Dru Lavigne about 1 year ago

  • Subject changed from Speed improvement for replication to Add viamillipede to build
  • Status changed from In Progress to Done

#38 Updated by Dru Lavigne about 1 year ago

  • Needs QA changed from No to Yes
  • Needs Doc changed from Yes to No
  • Needs Merging changed from Yes to No

#39 Updated by Dru Lavigne 12 months ago

  • Status changed from Done to Ready for Testing

#40 Updated by William Grzybowski 10 months ago

  • Severity set to New
  • Needs QA changed from Yes to No

#41 Updated by Dru Lavigne 10 months ago

  • Status changed from Ready for Testing to Done

#42 Updated by Michael Reynolds 10 months ago

viamillipede is now included.
Confirmed with NickW that this ticket is closeable if viamillipede is present.

[root@freenas ~]# which viamillipede
/usr/local/bin/viamillipede
