
Feature #26131

Add "Auxiliary arguments" field to middleware

Added by Joshua Sirrine almost 2 years ago. Updated 11 months ago.

Status:
Done
Priority:
Important
Assignee:
Vladimir Vinogradenko
Category:
Middleware
Target version:
Estimated time:
Severity:
Med High
Reason for Closing:
Reason for Blocked:
Needs QA:
No
Needs Doc:
No
Needs Merging:
No
Needs Automation:
No
Support Suite Ticket:
MGA-767-84967
Hardware Configuration:

Description

Two customers have asked if S3 sync can be made any faster. One customer I worked with today has a 20TB dataset he is syncing to S3, and it takes more than a week to run through everything. He was hoping to do a "quick" sync after the initial fill, but that doesn't seem to be happening, and it still takes nearly a week.

This is creating a secondary problem: the server gets so busy with IOPS from the S3 sync that ZFS replication to a second TrueNAS is running at 15MB/sec despite a 10Gb pipe between the two servers (iperf reported >8Gb/sec between source and destination while the server was in production). The customer I worked with today is currently stuck between a rock and a hard place, as they are supposed to have onsite backups to a secondary TrueNAS and offsite backups to S3. Unfortunately, the S3 storage is syncing but taking an extremely long time (almost a week), and the ZFS replication cannot keep up with the workload. The customer is currently evaluating options, but would like to be able to do both S3 sync and replication (though perhaps not scheduled to run at the exact same time without slowdown). The support ticket is MGA-767-84967.

If there are options to do things like skipping checksums and comparing by date/time stamps, file sizes, or other criteria that customers may want, we should consider making those options available in the WebGUI.
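For reference, rclone already exposes comparison-strategy flags along these lines. A minimal sketch (flag names as documented by rclone; paths and the remote name are placeholders):

```shell
# Compare by file size only -- skips both checksums and modification times:
rclone sync --size-only /local/path remote:bucket/path

# Force checksum comparison instead of the default size + mtime check
# (exact, but slower):
rclone sync --checksum /local/path remote:bucket/path
```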

This may be short notice, but I think this should be considered for 11.1, since this issue is likely to get much bigger as time goes on, and I don't want us to be in a situation where this is slated for 11.2 and some big-ticket customer with a massive server demands it be added immediately. This issue is clearly for large-scale servers, which are our bread-and-butter.

Thanks.


Related issues

Related to FreeNAS - Feature #35107: Allow passing auxiliary arguments to rclone (Closed)
Related to FreeNAS - Bug #36523: Hide Auxiliary Arguments from Cloud Sync task (Done)
Related to FreeNAS - Bug #40968: Fix traceback when trying to add a Cloud Sync task (Done)
Related to FreeNAS - Bug #52507: Add Transfers field to Cloud Sync in legacy UI (Ready for Testing)

Associated revisions

Revision 0d57a8e1 (diff)
Added by Vladimir Vinogradenko about 1 year ago

feat(cloud_sync): Allow passing auxiliary arguments to rclone

Ticket: #26131

Revision 1165f6e3 (diff)
Added by Vladimir Vinogradenko about 1 year ago

feat(cloud_sync): Allow passing auxiliary arguments to rclone

Ticket: #26131

Revision fdcc6764 (diff)
Added by Vladimir Vinogradenko about 1 year ago

feat(cloud_sync): Allow passing auxiliary arguments to rclone

Ticket: #26131

Revision 5ff265e4 (diff)
Added by Dru Lavigne about 1 year ago

Doc Auxiliary arguments for cloud sync.
Ticket: #26131

History

#1 Updated by Dru Lavigne almost 2 years ago

  • Status changed from Untriaged to Unscreened

#2 Updated by Bartosz Prokop almost 2 years ago

  • Assignee changed from Bartosz Prokop to Dru Lavigne
  • Priority changed from Nice to have to Important

It looks like rclone does not support incremental backups. Maybe it's worth using a different tool for this task?

RC: Please reassign. This is one of the last tickets without a proper 'L3' status.

#3 Updated by Dru Lavigne almost 2 years ago

  • Assignee changed from Dru Lavigne to Marcelo Araujo

#4 Updated by Marcelo Araujo over 1 year ago

  • Status changed from Unscreened to Screened
  • Target version set to TrueNAS 11.1-U2

#5 Updated by Marcelo Araujo over 1 year ago

  • Assignee changed from Marcelo Araujo to Bartosz Prokop

Bartosz, do you mind taking a look at it? Feel free to send it back to me if you are busy with other things.

#6 Updated by Dru Lavigne over 1 year ago

  • Status changed from Screened to Unscreened

#7 Updated by Joshua Sirrine over 1 year ago

How do we go backwards in this process?

#8 Updated by Bartosz Prokop over 1 year ago

  • Status changed from Unscreened to Screened

#9 Updated by Warren Block over 1 year ago

  • Subject changed from add option to choose between MD5 hashes and date/time tamps for S3 sync to add option to choose between MD5 hashes and date/time stamps for S3 sync

#10 Updated by Ben Gadd over 1 year ago

  • Due date set to 02/12/2018

#11 Updated by Dru Lavigne over 1 year ago

  • Project changed from TrueNAS to FreeNAS
  • Category changed from OS to OS
  • Status changed from Screened to Not Started
  • Target version changed from TrueNAS 11.1-U2 to 11.2-RC2
  • Hide from ChangeLog deleted (No)
  • Support Department Priority deleted (0)

#12 Updated by Eric Loewenthal over 1 year ago

I hate to pile on more work, but this seems to be begging for zfs diff.

#13 Updated by Ben Gadd over 1 year ago

  • Due date deleted (02/12/2018)

#14 Updated by William Grzybowski over 1 year ago

  • Category changed from OS to Middleware
  • Assignee changed from Bartosz Prokop to Vladimir Vinogradenko

Load balancing Bartosz tickets.

#15 Updated by Vladimir Vinogradenko over 1 year ago

  • Status changed from Not Started to Blocked
  • Reason for Blocked set to Waiting for feedback

https://rclone.org/docs/

Normally rclone will look at modification time and size of files to see if they are equal.

I've checked this, and it is correct:

$ rclone --config config -vv --stats 1s sync /mnt/data/aws remote:ixsystems/
...
2018/03/23 23:57:33 DEBUG : test: Size and modification time the same (differ by 0s, within tolerance 1µs)
2018/03/23 23:57:33 DEBUG : test: Unchanged skipping

The file on remote is not updated if I update it locally but preserve size and mtime. The reason for slow sync must be something else.

Can you please run verbose rclone by hand to see what is taking it so long? You'll need to create a config file:

[remote]
type = s3
env_auth = false
access_key_id = ...
secret_access_key = ...
region = 

and run it like this:

rclone --config /path/to/config/file -vv --stats 1s sync /local/path remote:BUCKETNAME/BUCKET/PATH

#16 Updated by Joshua Sirrine over 1 year ago

I'll see if I can reach out to the customer about this one.

Stay tuned.

#17 Updated by Vladimir Vinogradenko about 1 year ago

We might try official AWS CLI S3 tool

Install it with

# pip install awscli

Put your credentials into ~/.aws/credentials:

[default]
aws_secret_access_key = ...
aws_access_key_id = ...

Run it with

# aws s3 sync /local/path s3://bucket-name/path

#18 Updated by Vladimir Vinogradenko about 1 year ago

  • Severity set to Med High

#19 Updated by Joshua Sirrine about 1 year ago

So we tested a directory with 109,915 files. No updates were needed in either case.

rclone:
real = 7m46s
user = 2m1s
sys = 0m25s

aws cli:
real = 3m45s
user = 1m52s
sys = 0m23s

So the time was halved.

We then tested a directory that had few files, but the files were large (200+ GB each); the total was almost 600GB.

rclone:

We didn't let it finish because the throughput was 8MB/sec and slowly decreasing, with the estimated completion being >6 hours.

aws cli:

Consistently 20MB/sec+, typically 24-29MB/sec.

In my opinion, we should consider switching from rclone to aws cli.

Thanks.

#20 Updated by Vladimir Vinogradenko about 1 year ago

That's great!

I think we should add a "use awscli" option for S3 targets (which would also prohibit using rclone-specific features like encryption). William, what do you think?

#21 Updated by William Grzybowski about 1 year ago

Talked to Vladimir via RC.

We are not going to move away from rclone on the first problem found. It is a fantastic tool and works with dozens of different backends.

We will try to work with maintainers to track down the issue.

If performance is a big issue s3cmd can be used as a workaround for these people in the meantime.

#22 Updated by Joshua Sirrine about 1 year ago

William Grzybowski wrote:

If performance is a big issue s3cmd can be used as a workaround for these people in the meantime.

I'm trying to schedule a test of a new rclone package to see how it does, but s3cmd is a no-go for my customer.

With a warning like this when you run it:

# s3cmd
WARNING: !!!!!!! Support for python3 is currently in a 'Work In Progress' state.
Please don't use s3cmd with python3 on production tasks or with sensitive data as unexpected behaviors could occur !!!!!!!

The customer isn't even interested in using it for testing purposes.

#23 Updated by William Grzybowski about 1 year ago

Joshua Sirrine wrote:

I'm trying to schedule a test of a new rclone package to see how it does, but s3cmd is a no-go for my customer. The customer isn't even interested in using it for testing purposes.

We can add awscli to the image if that's an issue.

#24 Updated by Joshua Sirrine about 1 year ago

Vlad gave me an install of rclone 1.41. I installed it, and the time was 7m15s with no changes on the same dataset I had done above. So faster, but only about 30 seconds faster.

Unfortunately, the dataset with the large files has been running since midnight, so we couldn't run ours since the other was still running. So I was not able to do any raw throughput tests.

Installed 1.38 on the customer's box at the end of the call.

#25 Updated by William Grzybowski about 1 year ago

Joshua Sirrine wrote:

Vlad gave me an install of rclone 1.41. I installed it, and the time was 7m15s with no changes on the same dataset I had done above. So faster, but only about 30 seconds faster.

Did you use the --use-server-modtime flag?

#26 Updated by William Grzybowski about 1 year ago

William Grzybowski wrote:

Did you use the --use-server-modtime flag?

Ping?

#27 Updated by Joshua Sirrine about 1 year ago

William Grzybowski wrote:

Did you use the --use-server-modtime flag?

Ping?

I did not. The message I got in RC made no mention of using that flag.

#28 Updated by William Grzybowski about 1 year ago

Joshua Sirrine wrote:

I did not. The message I got in RC made no mention of using that flag.

I think it would be worth a try.

#29 Updated by Vladimir Vinogradenko about 1 year ago

Joshua, please run the same as in https://redmine.ixsystems.com/issues/26131#note-15 but with the --use-server-modtime flag.
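Concretely, using the same config file and placeholders as the earlier command, the re-run would look something like:

```shell
rclone --config /path/to/config/file -vv --stats 1s --use-server-modtime \
    sync /local/path remote:BUCKETNAME/BUCKET/PATH
```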

Also, did this problem

We didn't let it finish because the throughput was 8MB/sec and slowly decreasing, with the estimated completion being >6 hours.

persist after the upgrade to 1.41?

#30 Updated by Joshua Sirrine about 1 year ago

Vladimir Vinogradenko wrote:

Also, did this problem

We didn't let it finish because the throughput was 8MB/sec and slowly decreasing, with the estimated completion being >6 hours.

persist after the upgrade to 1.41?

I wasn't able to test it last time because the system had a job that had been running all night. On a prior call the time of day was late enough that we were able to test, but not on the most recent call.

I will reach out to the customer to test this.

#31 Updated by Ben Gadd about 1 year ago

  • Support Suite Ticket changed from n/a to MGA-767-84967

#32 Updated by Joshua Sirrine about 1 year ago

Vladimir,

Is there anything else we want to test? This will be my third call with the customer to test things. If you have more things you'd like tested, I'd prefer to do that rather than a 4th call. You can also choose to be on the call with me if you'd like.

Let me know. I'm still trying to schedule something with the customer.

#33 Updated by Vladimir Vinogradenko about 1 year ago

Joshua, no, that's it; after these tests I'll file a bug with rclone.

#34 Updated by Joshua Sirrine about 1 year ago

So the customer and I tried to run rclone 1.41 with --use-server-modtime, and things didn't go well. Files that hadn't been updated in months (and sometimes many years) were resyncing. A screenshot is attached showing the behavior. I informed Vladimir about this issue and the fact that the test had been running for almost 2 hours without completing, and he recommended we try the --no-update-modtime parameter instead. Here are the results:

With --no-update-modtime I got the following results on 2 runs:

Transferred: 0 Bytes (0 Bytes/s)
Errors: 0
Checks: 135742
Transferred: 0
Elapsed time: 8m17.9s

Transferred: 0 Bytes (0 Bytes/s)
Errors: 0
Checks: 135742
Transferred: 0
Elapsed time: 9m34.6s

So the awscli test we did some weeks ago was still significantly faster.

Thanks.

#35 Updated by Vladimir Vinogradenko about 1 year ago

Joshua, it's suspicious to see:

a) a time increase compared to regular rclone
b) a big time difference between two identical runs

Dataset or network conditions might have changed since we ran rclone without these options. Can you please also repeat the old test under the current dataset and network conditions?

#36 Updated by William Grzybowski about 1 year ago

We can't keep bugging our customers to do science experiments all the time.

I think we need to try to replicate the scenario locally. Joshua, could you help us set up a test scenario?

#37 Updated by Vladimir Vinogradenko about 1 year ago

Joshua, have you been running your last tests with -vv? It's my mistake: using -vv affects performance (especially on slow SSH connections). It would be great if you could re-run the test with just -v. On my synthetic test:

  • old rclone: 5m40s
  • awscli: 44s
  • new rclone: 7s

#38 Updated by Vladimir Vinogradenko about 1 year ago

--checkers=128 helped us speed up syncs with lots of small files.

The problem with speed dropping while transferring large files still remains. It's not only --s3-disable-checksum; it's something more.

#39 Updated by Vladimir Vinogradenko about 1 year ago

Low-speed upload bug fixed. My PR to rclone: https://github.com/ncw/rclone/pull/2351

#40 Updated by Vladimir Vinogradenko about 1 year ago

  • Related to Feature #35107: Allow passing auxiliary arguments to rclone added

#41 Updated by Vladimir Vinogradenko about 1 year ago

  • Status changed from Blocked to In Progress

A new rclone with this fix was released yesterday.

A new field will appear in Cloud Sync: "Auxiliary arguments".

For the case with many small files, the user should add: --use-server-modtime --no-update-modtime --checkers=128

For huge files, the user should add: --s3-upload-concurrency 16 (plus --s3-disable-checksum if desired)
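As a sketch, assuming the field simply appends these arguments to the rclone command line (paths and the remote name are placeholders), the two recommended sets of auxiliary arguments would correspond to:

```shell
# Many small files: compare against the server-side modtime, skip writing
# modtime metadata back to the remote, and raise the number of parallel
# checkers:
rclone sync --use-server-modtime --no-update-modtime --checkers=128 \
    /local/path remote:bucket/path

# Few huge files: raise S3 multipart upload concurrency, and optionally
# skip checksum calculation on upload:
rclone sync --s3-upload-concurrency 16 --s3-disable-checksum \
    /local/path remote:bucket/path
```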

#42 Updated by Dru Lavigne about 1 year ago

  • Subject changed from add option to choose between MD5 hashes and date/time stamps for S3 sync to Add "Auxiliary arguments" field to Cloud Sync
  • Target version changed from 11.2-RC2 to 11.1-U6
  • Needs Merging changed from Yes to No

#43 Updated by Vladimir Vinogradenko about 1 year ago

  • Status changed from In Progress to Ready for Testing

#44 Updated by Dru Lavigne about 1 year ago

  • Needs Doc changed from Yes to No

Legacy UI doc commit: https://github.com/freenas/freenas-docs/commit/5ff265e4a3a2f2bc70b9be83573bc4e82f036a1f
The new UI doc commit will occur in the related ticket, once the field is added to the new UI.

#45 Updated by Dru Lavigne about 1 year ago

  • Related to Bug #36523: Hide Auxiliary Arguments from Cloud Sync task added

#46 Updated by William Grzybowski about 1 year ago

  • Subject changed from Add "Auxiliary arguments" field to Cloud Sync to Add "Auxiliary arguments" field to middleware
  • Status changed from Ready for Testing to Unscreened

Vladimir, as discussed, we will hide this from the UI for now.

Thanks!

#47 Updated by Vladimir Vinogradenko about 1 year ago

  • Status changed from Unscreened to Ready for Testing

#49 Updated by Bonnie Follweiler 11 months ago

  • Status changed from Ready for Testing to Blocked

This test is blocked by https://redmine.ixsystems.com/issues/40968, since I can't create a Cloud Sync task until that ticket is done.

#50 Updated by Bonnie Follweiler 11 months ago

  • Related to Bug #40968: Fix traceback when trying to add a Cloud Sync task added

#51 Updated by Bonnie Follweiler 11 months ago

  • Reason for Blocked changed from Waiting for feedback to Dependent on a related task to be completed

#52 Updated by Dru Lavigne 11 months ago

  • Status changed from Blocked to Ready for Testing

#53 Updated by Bonnie Follweiler 11 months ago

  • Status changed from Ready for Testing to Passed Testing
  • Needs QA changed from Yes to No

Test Passed in FreeNAS-11.1-U6-INTERNAL4

#54 Updated by Dru Lavigne 11 months ago

  • Status changed from Passed Testing to Done
  • Reason for Blocked deleted (Dependent on a related task to be completed)

#55 Updated by Vladimir Vinogradenko 9 months ago

  • Related to Bug #52507: Add Transfers field to Cloud Sync in legacy UI added
