Bug #24000

Improve FHA locality control for NFS read/write requests

Added by Cyber Jock over 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
Critical
Assignee:
Alexander Motin
Category:
OS
Target version:
Seen in:
TrueNAS-9.10.2-U3
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
No
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
Migration Needed:
No
Hide from ChangeLog:
No
ChangeLog Required:
No
Support Department Priority:
0

Description

A pre-production TrueNAS customer has complained of poor performance. Through troubleshooting I've determined that his slogs seem to be the bottleneck (in particular when sync writes are used). Doing dd write tests we saw a max of about 290MB/sec. The customer bought 2 devices and was unsure whether the plan was for a mirrored or striped slog. With the two slog devices striped we were only able to get to about 450MB/sec.
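For reference, a minimal sketch of the two SLOG layouts being compared (pool and device names are illustrative, not the customer's):

zpool add tank log mirror da12 da13    # mirrored slog
zpool add tank log da12 da13           # two separate (striped) slog devices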

I have since tested the same device in zoltan, and got the same performance characteristics (only had 1 disk, so couldn't try with 2 devices).

Here's some basic SSD tests from zoltan:

Read test (not overly useful, as slogs are write-only except when an slog has to be played back during things such as zpool import, but included for completeness):

[root@zoltan-a] ~# dd if=/dev/da12 of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 22.548099 secs (465039646 bytes/sec)

Write test:
[root@zoltan-a] ~# dd if=/dev/zero of=/dev/da12 bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 26.272926 secs (399108956 bytes/sec)

Device information for Zoltan:
Vendor: HGST
Product: HUSMH8010BSS200
Revision: A360
Compliance: SPC-4
User Capacity: 100,030,242,816 bytes [100 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x5000cca0496f8ec0
Serial number: 0HWZAY0A

Device performance info from the manufacturer (the device is 12Gb SAS, but Z-series systems are only 6Gb SAS):

Performance
Read Throughput (max MB/s, sequential 64k) 1100
Write Throughput (max MB/s, sequential 64K) 765
Read IOPS (max IOPS, random 4k) 130,000
Write IOPS (max IOPS, random 4k) 110,000

When the disk is used as an slog device, results are in screenshot 2017-05-05_16h13_42.jpg (max was around 200MB/sec with a block size of 128KB); dd tests are always the same speed as or faster than slog writes.

I performed the same test with a StecRAM drive (this was our "top of the line slog" device before they were discontinued), and got a maximum speed of 161MB/sec. See screenshot 2017-05-16_11h45_34.jpg for performance tests.

These numbers seem abnormally low. The StecRAM drive used to nearly saturate our 6Gb SAS, and now somehow we are getting far less performance.

At present, 2 customers are dealing with this issue.

2017-05-05_16h13_42.jpg (688 KB) 2017-05-05_16h13_42.jpg Cyber Jock, 05/16/2017 03:18 PM
2017-05-16_11h45_34.jpg (694 KB) 2017-05-16_11h45_34.jpg Cyber Jock, 05/16/2017 03:19 PM
fioplay.sh (263 Bytes) fioplay.sh the test harness Ash Gokhale, 05/23/2017 10:18 AM
iolatency.dtrace (454 Bytes) iolatency.dtrace latency analyzer pinned to the psync fio engine back end Ash Gokhale, 05/23/2017 10:19 AM
slogbench.c (2.44 KB) slogbench.c Alexander Motin, 06/13/2017 10:48 AM
slogbench (10.4 KB) slogbench Alexander Motin, 06/13/2017 10:48 AM

History

#1 Updated by Kris Moore over 1 year ago

  • Assignee set to Alexander Motin

Sasha, any thoughts on these performance characteristics?

#2 Updated by Alexander Motin over 1 year ago

  • Status changed from Unscreened to Screened

I am not ready to say whether things got worse without directly comparing with some old version where it was "good" on the same hardware. But generally I would say it is not really surprising to me to see performance values lower than the vendor specifications. AFAIK vendors typically use the maximum possible (or at least a significant) request queue depth to reach maximum throughput and IOPS, and don't use cache flushes on every write. dd on a raw device, as used here, creates a queue depth of only one request. You may run a few of them at the same time to see how it scales, or use fio. Combining short queues and cache flushes, which is typical for SLOG, may produce results very different from the specifications, since in that case performance is limited more by various latencies than by peak throughput or IOPS. The NVDIMMs we have been fighting for for quite some time for the next generation of hardware are supposed to address exactly that problem.

The latest versions of TrueNAS actually received some ZIL improvements that are supposed to improve performance, in preparation for NVDIMM support. But if there is indeed a suspected regression, I would appreciate a live demo which I could analyze. It would save me a huge amount of very limited time.
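A minimal sketch of the parallel-dd idea (device name, sizes, and offsets are illustrative, and this is destructive to whatever is on the device):

# four concurrent writers at non-overlapping offsets to raise the effective queue depth
for i in 0 1 2 3; do
    dd if=/dev/zero of=/dev/da12 bs=1M count=2500 seek=$((i * 2500)) &
done
wait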

#3 Updated by Cyber Jock over 1 year ago

Any update on this?

I have a customer that tried to call me multiple times on Friday (I was out of the office), and no doubt he's looking for a solution (sooner rather than later).

#4 Updated by Joe Maloney over 1 year ago

  • Assignee changed from Alexander Motin to Ash Gokhale

I reviewed this ticket and related tickets with Sam and Ash. The conclusion I have come to on my own is that SMB is outperforming NFS for a customer; as for the dd tests from zoltan's console, I recall Josh Paetzel mentioning that those are not an accurate way to measure performance with ZFS pools. Ash agreed to investigate this further, and Kris suggested reassigning the ticket to Ash for now.

#5 Updated by Ash Gokhale over 1 year ago

I ran more benchmarks and found some disturbing results.
I varied the following parameters and found results that tracked; the A360-based ZIL never definitively outperformed other media for writethrough latency:
- rw mix from 50/50 to full write,
- thread count from 1 to 192,
- block sizes from 4k to 1M

The psync engine underperforms posixaio by >20%; however, it's easier to analyze with dtrace.

This A360 is performing horribly in this system. While the bulk of its latency is more tightly bounded than that of the other media, the ZIL device is suffering a 1msec penalty on every transaction at the 1st percentile.

I will be happy to marshal another benchmark with this framework if anyone wants other load generation parameters.

for the HGST:

 
[root@zoltan-b] ~/ash# ./fioplay.sh da12
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUSMH8010BSS200
Revision:             A360
Compliance:           SPC-4
User Capacity:        100,030,242,816 bytes [100 GB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5000cca0496f8ec0
Serial number:        0HWZAY0A
dtrace dispatched as 54288
dtrace: script 'iolatency.dtrace' matched 2 probes
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
meph: (g=0): rw=rw, bs=128K-128K/128K-128K/128K-128K, ioengine=psync, iodepth=1
...
fio-2.14
Starting 8 threads
Jobs: 8 (f=8): [M(8)] [100.0% done] [208.7MB/211.5MB/0KB /s] [1669/1691/0 iops] [eta 00m:00s]
meph: (groupid=0, jobs=8): err= 0: pid=101041: Tue May 23 10:09:31 2017
  read : io=2104.4MB, bw=215402KB/s, iops=1682, runt= 10004msec
    clat (usec): min=611, max=13104, avg=2878.50, stdev=1087.26
     lat (usec): min=611, max=13104, avg=2878.86, stdev=1087.28
    clat percentiles (usec):
     |  1.00th=[ 1400],  5.00th=[ 1656], 10.00th=[ 1864], 20.00th=[ 1928],
     | 30.00th=[ 2128], 40.00th=[ 2320], 50.00th=[ 2576], 60.00th=[ 2928],
     | 70.00th=[ 3376], 80.00th=[ 3792], 90.00th=[ 4320], 95.00th=[ 4832],
     | 99.00th=[ 6112], 99.50th=[ 6688], 99.90th=[ 9536], 99.95th=[10048],
     | 99.99th=[12480]
  write: io=2078.2MB, bw=212715KB/s, iops=1661, runt= 10004msec
    clat (usec): min=834, max=9834, avg=1878.59, stdev=594.83
     lat (usec): min=841, max=9838, avg=1890.26, stdev=594.76
    clat percentiles (usec):
     |  1.00th=[ 1080],  5.00th=[ 1208], 10.00th=[ 1400], 20.00th=[ 1448],
     | 30.00th=[ 1640], 40.00th=[ 1672], 50.00th=[ 1720], 60.00th=[ 1880],
     | 70.00th=[ 1928], 80.00th=[ 2064], 90.00th=[ 2576], 95.00th=[ 3088],
     | 99.00th=[ 4048], 99.50th=[ 4448], 99.90th=[ 6368], 99.95th=[ 6752],
     | 99.99th=[ 8512]
    lat (usec) : 750=0.01%, 1000=0.34%
    lat (msec) : 2=49.60%, 4=41.66%, 10=8.36%, 20=0.03%
  cpu          : usr=0.53%, sys=1.25%, ctx=33511, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=16835/w=16625/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=2104.4MB, aggrb=215401KB/s, minb=215401KB/s, maxb=215401KB/s, mint=10004msec, maxt=10004msec
  WRITE: io=2078.2MB, aggrb=212714KB/s, minb=212714KB/s, maxb=212714KB/s, mint=10004msec, maxt=10004msec

  lat                                               
           value  ------------- Distribution ------------- count    
          262144 |                                         0        
          524288 |                                         162   <------no writethrough in <500usec    
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         13334    
         2097152 |@@@@@@@                                  3017     
         4194304 |                                         110      
         8388608 |                                         2        
        16777216 |                                         0        

For a regular old spindle, performance is not greatly different, and the minimum latency is lower:

[root@zoltan-b] ~/ash# ./fioplay.sh da1
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS726020AL4210
Revision:             A519
Compliance:           SPC-4
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca245018524
Serial number:        N4G0UXDK
Device type:          disk
dtrace dispatched as 54420
dtrace: script 'iolatency.dtrace' matched 2 probes
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
meph: (g=0): rw=rw, bs=128K-128K/128K-128K/128K-128K, ioengine=psync, iodepth=1
...
fio-2.14
Starting 8 threads
Jobs: 8 (f=8): [M(8)] [100.0% done] [152.3MB/156.4MB/0KB /s] [1218/1251/0 iops] [eta 00m:00s]
meph: (groupid=0, jobs=8): err= 0: pid=101020: Tue May 23 10:09:48 2017
  read : io=1613.9MB, bw=165211KB/s, iops=1290, runt= 10003msec
    clat (usec): min=354, max=429256, avg=3123.43, stdev=11425.83
     lat (usec): min=354, max=429256, avg=3123.79, stdev=11425.83
    clat percentiles (usec):
     |  1.00th=[  564],  5.00th=[ 1576], 10.00th=[ 1800], 20.00th=[ 2008],
     | 30.00th=[ 2096], 40.00th=[ 2352], 50.00th=[ 2544], 60.00th=[ 2736],
     | 70.00th=[ 2832], 80.00th=[ 2896], 90.00th=[ 3152], 95.00th=[ 4128],
     | 99.00th=[13120], 99.50th=[18304], 99.90th=[68096], 99.95th=[419840],
     | 99.99th=[428032]
  write: io=1597.8MB, bw=163561KB/s, iops=1277, runt= 10003msec
    clat (usec): min=459, max=429471, avg=3085.23, stdev=10310.13
     lat (usec): min=466, max=429478, avg=3096.17, stdev=10310.01
    clat percentiles (usec):
     |  1.00th=[ 1032],  5.00th=[ 1736], 10.00th=[ 1864], 20.00th=[ 2064],
     | 30.00th=[ 2128], 40.00th=[ 2320], 50.00th=[ 2544], 60.00th=[ 2768],
     | 70.00th=[ 2864], 80.00th=[ 3024], 90.00th=[ 3184], 95.00th=[ 3664],
     | 99.00th=[12352], 99.50th=[17280], 99.90th=[150528], 99.95th=[214016],
     | 99.99th=[428032]
    lat (usec) : 500=0.34%, 750=0.89%, 1000=0.44%
    lat (msec) : 2=15.61%, 4=78.10%, 10=2.88%, 20=1.33%, 50=0.28%
    lat (msec) : 100=0.04%, 250=0.04%, 500=0.06%
  cpu          : usr=0.39%, sys=1.00%, ctx=25717, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=12911/w=12782/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=1613.9MB, aggrb=165211KB/s, minb=165211KB/s, maxb=165211KB/s, mint=10003msec, maxt=10003msec
  WRITE: io=1597.8MB, aggrb=163560KB/s, minb=163560KB/s, maxb=163560KB/s, mint=10003msec, maxt=10003msec

  lat                                               
           value  ------------- Distribution ------------- count    
          131072 |                                         0        
          262144 |                                         42       <-- first writethrough in <500usec      
          524288 |                                         104      
         1048576 |@@@@@@@@@@@                              3403     
         2097152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              8741     
         4194304 |@                                        244      
         8388608 |@                                        183      
        16777216 |                                         36       
        33554432 |                                         14       
        67108864 |                                         2        
       134217728 |                                         7        
       268435456 |                                         6        
       536870912 |                                         0    

And for a mini with a stock zil:

[root@badger] ~/zilslow# ./fioplay.sh ada6
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron MX1/2/300, M5/600, 1100 Client SSDs
Device Model:     Micron_M600_MTFDDAK128MBF
Serial Number:    1613123BDBA7
LU WWN Device Id: 5 00a075 1123bdba7
Firmware Version: MU04
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
dtrace dispatched as 10059
dtrace: script 'iolatency.dtrace' matched 2 probes
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
meph: (g=0): rw=rw, bs=128K-128K/128K-128K/128K-128K, ioengine=psync, iodepth=1
...
fio-2.14
Starting 8 threads
Jobs: 8 (f=8): [M(8)] [100.0% done] [144.4MB/144.2MB/0KB /s] [1155/1153/0 iops] [eta 00m:00s]
meph: (groupid=0, jobs=8): err= 0: pid=101625: Tue May 23 10:20:54 2017
  read : io=1437.8MB, bw=147152KB/s, iops=1149, runt= 10005msec
    clat (usec): min=432, max=11586, avg=5130.12, stdev=1079.14
     lat (usec): min=433, max=11587, avg=5130.49, stdev=1079.14
    clat percentiles (usec):
     |  1.00th=[ 2384],  5.00th=[ 3312], 10.00th=[ 3792], 20.00th=[ 4256],
     | 30.00th=[ 4576], 40.00th=[ 4896], 50.00th=[ 5152], 60.00th=[ 5408],
     | 70.00th=[ 5664], 80.00th=[ 5984], 90.00th=[ 6496], 95.00th=[ 6880],
     | 99.00th=[ 7712], 99.50th=[ 8096], 99.90th=[ 9408], 99.95th=[10048],
     | 99.99th=[10944]
  write: io=1418.2MB, bw=145143KB/s, iops=1133, runt= 10005msec
    clat (usec): min=278, max=7039, avg=1837.06, stdev=915.95
     lat (usec): min=282, max=7043, avg=1845.48, stdev=915.97
    clat percentiles (usec):
     |  1.00th=[  290],  5.00th=[  474], 10.00th=[  700], 20.00th=[  956],
     | 30.00th=[ 1256], 40.00th=[ 1544], 50.00th=[ 1784], 60.00th=[ 2024],
     | 70.00th=[ 2288], 80.00th=[ 2640], 90.00th=[ 3152], 95.00th=[ 3440],
     | 99.00th=[ 3920], 99.50th=[ 4128], 99.90th=[ 4704], 99.95th=[ 5088],
     | 99.99th=[ 6496]
    lat (usec) : 500=3.26%, 750=3.87%, 1000=3.96%
    lat (msec) : 2=18.36%, 4=26.49%, 10=44.03%, 20=0.03%
  cpu          : usr=0.28%, sys=0.63%, ctx=22896, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=11502/w=11345/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=1437.8MB, aggrb=147152KB/s, minb=147152KB/s, maxb=147152KB/s, mint=10005msec, maxt=10005msec
  WRITE: io=1418.2MB, aggrb=145143KB/s, minb=145143KB/s, maxb=145143KB/s, mint=10005msec, maxt=10005msec

  lat                                               
           value  ------------- Distribution ------------- count    
          131072 |                                         0        
          262144 |@@@                                      903      <<--- even the freenas sata micron zil can writethrough in <500usec
          524288 |@@@@@@                                   1755     
         1048576 |@@@@@@@@@@@@@@@@                         4525     
         2097152 |@@@@@@@@@@@@@@@                          4119     
         4194304 |                                         43       
         8388608 |                                         0 

And, remarkably, for my local box (cooked file on a hybrid pool, kernel 11-stable, not representative):

#./fioplay.sh fleh                                                                                        :root:/dozer/yslo:17:13:42:34kaylee
dtrace dispatched as 69112
dtrace: script 'iolatency.dtrace' matched 2 probes
meph: (g=0): rw=rw, bs=128K-128K/128K-128K/128K-128K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 8 processes
dtrace: 19679 dynamic variable drops with non-empty dirty list
dtrace: 23941 dynamic variable drops with non-empty dirty list
dtrace: 25594 dynamic variable drops with non-empty dirty list8K/49.2K/0 iops] [eta 00m:06s]
dtrace: 27639 dynamic variable drops with non-empty dirty list9K/49.1K/0 iops] [eta 00m:05s]
dtrace: 22871 dynamic variable drops with non-empty dirty list4K/51.2K/0 iops] [eta 00m:04s]
dtrace: 23648 dynamic variable drops with non-empty dirty list8K/46.1K/0 iops] [eta 00m:03s]
dtrace: 29769 dynamic variable drops with non-empty dirty list1K/47.4K/0 iops] [eta 00m:02s]
dtrace: 27659 dynamic variable drops with non-empty dirty list8K/56.7K/0 iops] [eta 00m:01s]
Jobs: 3 (f=3): [_(1),M(2),_(1),E(1),_(2),M(1)] [100.0% done] [5996MB/6036MB/0KB /s] [47.1K/48.3K/0 iops] [eta 00m:00s]
meph: (groupid=0, jobs=8): err= 0: pid=69115: Tue May 23 17:13:54 2017
  read : io=49106MB, bw=6193.2MB/s, iops=49551, runt=  7928msec
    clat (usec): min=5, max=217828, avg=77.90, stdev=847.81
     lat (usec): min=5, max=217829, avg=78.00, stdev=847.82
    clat percentiles (usec):
     |  1.00th=[   11],  5.00th=[   14], 10.00th=[   16], 20.00th=[   21],
     | 30.00th=[   28], 40.00th=[   34], 50.00th=[   40], 60.00th=[   45],
     | 70.00th=[   50], 80.00th=[   62], 90.00th=[  105], 95.00th=[  157],
     | 99.00th=[  548], 99.50th=[  812], 99.90th=[ 4512], 99.95th=[ 9920],
     | 99.99th=[35584]
  write: io=49198MB, bw=6205.7MB/s, iops=49644, runt=  7928msec
    clat (usec): min=10, max=217254, avg=76.45, stdev=1127.91
     lat (usec): min=10, max=217255, avg=78.99, stdev=1128.36
    clat percentiles (usec):
     |  1.00th=[   13],  5.00th=[   15], 10.00th=[   16], 20.00th=[   19],
     | 30.00th=[   24], 40.00th=[   27], 50.00th=[   31], 60.00th=[   35],
     | 70.00th=[   40], 80.00th=[   52], 90.00th=[   94], 95.00th=[  151],
     | 99.00th=[  572], 99.50th=[  948], 99.90th=[ 5216], 99.95th=[11072],
     | 99.99th=[35584]
    lat (usec) : 10=0.29%, 20=18.63%, 50=54.76%, 100=16.30%, 250=7.36%
    lat (usec) : 500=1.42%, 750=0.63%, 1000=0.19%
    lat (msec) : 2=0.22%, 4=0.09%, 10=0.07%, 20=0.03%, 50=0.02%
    lat (msec) : 100=0.01%, 250=0.01%
  cpu          : usr=2.56%, sys=39.55%, ctx=517319, majf=0, minf=8
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=392847/w=393585/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=49106MB, aggrb=6193.2MB/s, minb=6193.2MB/s, maxb=6193.2MB/s, mint=7928msec, maxt=7928msec
  WRITE: io=49198MB, aggrb=6205.7MB/s, minb=6205.7MB/s, maxb=6205.7MB/s, mint=7928msec, maxt=7928msec

dtrace: 5650 dynamic variable drops with non-empty dirty list

  lat                                               
           value  ------------- Distribution ------------- count    
            4096 |                                         0        
            8192 |@@@@@@@                                  34872    
           16384 |@@@@@@@@@@@@@@@@@                        79028    
           32768 |@@@@@@@@@@                               48796    
           65536 |@@@                                      14965    
          131072 |@                                        5361     
          262144 |                                         2109     
          524288 |                                         1117     
         1048576 |                                         359      
         2097152 |                                         189      
         4194304 |                                         147      
         8388608 |                                         66       
        16777216 |                                         48       
        33554432 |                                         21       
        67108864 |                                         3        
       134217728 |                                         2        
       268435456 |                                         0 

#6 Updated by Alexander Motin over 1 year ago

Ash, while I support your idea of starting measurements from the raw SLOG device, your test set looks strange to me. I don't see a real purpose in 1) testing with a mixed read/write workload, 2) testing only large request sizes and deep queues, and 3) testing without flushing the on-disk write caches, which is the heaviest/most important operation for SLOG (it probably explains why the HDD shows those numbers in your test). What I would test here is the time of synchronous writes (with a disk cache flush following every batch, the same as ZFS does it) of different sizes, starting from 4KB and going up to a few megabytes (writes smaller than 128KB should have a queue depth of 1, while bigger ones should have a deeper queue, respecting 128KB I/Os). I am not sure how to make fio flush disk caches, so you may need some other tool, or run it on top of ZFS with sync=always to make ZFS do it in its regular way.

Plus, AFAIK we previously under-provisioned our SLOG SSDs to let them optimize their bookkeeping activity, which should help them do their main duties fast. Are we still doing that, or was it reconsidered?
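A minimal sketch of the ZFS-level variant suggested above (pool/dataset names and sizes are illustrative; sweep --bs from 4k up to a few megabytes):

zfs create -o sync=always -o compression=off tank/slogtest
fio --name=slogtest --directory=/mnt/tank/slogtest --rw=write --ioengine=psync \
    --iodepth=1 --bs=4k --size=1g --runtime=30 --time_based --group_reporting
zfs destroy tank/slogtest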

#7 Updated by Ash Gokhale over 1 year ago

Mav: using your suggested changes (write-only workload, 4M block size, sync after every write) we see a ~420MB/s peak and much improved latency. Thanks for the feedback. We will add the fio incantations to the tribal knowledge.

[root@zoltan-a] ~/ash# ./fioplay.sh da12
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUSMH8010BSS200
Revision:             A360
Compliance:           SPC-4
User Capacity:        100,030,242,816 bytes [100 GB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5000cca0496f8ec0
Serial number:        0HWZAY0A
dtrace dispatched as 790
dtrace: script 'iolatency.dtrace' matched 2 probes
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
meph: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=posixaio, iodepth=16
...
fio-2.14
Starting 8 threads
Jobs: 5 (f=5): [W(3),_(1),W(1),_(1),W(1),_(1)] [27.5% done] [0KB/446.7MB/0KB /s] [0/111/0 iops] [eta 00m:29s]
meph: (groupid=0, jobs=8): err= 0: pid=100258: Tue May 23 14:47:27 2017
  write: io=4540.0MB, bw=427963KB/s, iops=104, runt= 10863msec
    slat (usec): min=84, max=475, avg=322.03, stdev=78.18
    clat (msec): min=291, max=1272, avg=1184.40, stdev=215.92
     lat (msec): min=291, max=1272, avg=1184.72, stdev=215.93
    clat percentiles (msec):
     |  1.00th=[  293],  5.00th=[  523], 10.00th=[  947], 20.00th=[ 1254],
     | 30.00th=[ 1254], 40.00th=[ 1254], 50.00th=[ 1254], 60.00th=[ 1254],
     | 70.00th=[ 1254], 80.00th=[ 1254], 90.00th=[ 1270], 95.00th=[ 1270],
     | 99.00th=[ 1270], 99.50th=[ 1270], 99.90th=[ 1270], 99.95th=[ 1270],
     | 99.99th=[ 1270]
    lat (msec) : 500=4.23%, 750=2.82%, 1000=3.96%, 2000=88.99%
  cpu          : usr=0.44%, sys=0.29%, ctx=8529, majf=0, minf=0
  IO depths    : 1=0.7%, 2=1.4%, 4=2.8%, 8=62.2%, 16=32.9%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.5%, 8=1.8%, 16=4.7%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1135/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=4540.0MB, aggrb=427962KB/s, minb=427962KB/s, maxb=427962KB/s, mint=10863msec, maxt=10863msec

[root@zoltan-a] ~/ash# cat fioplay.sh
#!/bin/sh
smartctl -a /dev/$1 | head -16
dtrace -s iolatency.dtrace &
dpid=$!
echo dtrace dispatched as $dpid
sleep 1
fio --filename=/dev/$1 --rw=write --ioengine=posixaio  --iodepth=16 \
        --bs=4M --sync=1 --numjobs=8 --runtime=10 --group_reporting --name=meph
kill  $dpid
wait $dpid

On an FN-certified QA system provided by Joe (sorry for blowing up ada0): a bit better, ~450MB/s.
The posixaio backend was not working, so I switched to psync.

root@freenas-z30ref:~/ash # ./fioplay.sh da0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               STEC
Product:              ZeusRAM
Revision:             C023
Compliance:           SPC-4
User Capacity:        8,000,000,000 bytes [8.00 GB]
Logical block size:   512 bytes
Rotation Rate:        Solid State Device
Form Factor:          3.5 inches
Logical Unit id:      0x5000a7203009f424
Serial number:        STM000199402
Device type:          disk
Transport protocol:   SAS (SPL-3)
dtrace dispatched as 24261
dtrace: script 'iolatency.dtrace' matched 2 probes
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
meph: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=psync, iodepth=16
...
fio-2.14
Starting 8 threads
Jobs: 8 (f=8): [W(8)] [100.0% done] [0KB/428.5MB/0KB /s] [0/107/0 iops] [eta 00m:00s]
meph: (groupid=0, jobs=8): err= 0: pid=100799: Tue May 23 14:52:52 2017
  write: io=4396.0MB, bw=448357KB/s, iops=109, runt= 10040msec
    clat (msec): min=10, max=74, avg=72.74, stdev= 8.42
     lat (msec): min=10, max=75, avg=72.97, stdev= 8.43
    clat percentiles (usec):
     |  1.00th=[18816],  5.00th=[70144], 10.00th=[74240], 20.00th=[74240],
     | 30.00th=[74240], 40.00th=[74240], 50.00th=[74240], 60.00th=[74240],
     | 70.00th=[74240], 80.00th=[74240], 90.00th=[74240], 95.00th=[74240],
     | 99.00th=[75264], 99.50th=[75264], 99.90th=[75264], 99.95th=[75264],
     | 99.99th=[75264]
    lat (msec) : 20=1.00%, 50=2.37%, 100=96.63%
  cpu          : usr=0.35%, sys=0.92%, ctx=35171, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1099/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=4396.0MB, aggrb=448356KB/s, minb=448356KB/s, maxb=448356KB/s, mint=10040msec, maxt=10040msec

  lat
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |                                         2
        16777216 |@                                        21
        33554432 |@                                        30
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1046
       134217728 |                                         0

If these rates are expected then there may not be a problem here.

#8 Updated by Kris Moore over 1 year ago

This sounds much more in line with expected performance, so as you say there may be no issue here (from slog perspective anyway). Ash I believe you have reviewed the original customer tickets? Anything jump out there that indicates it may not be a slog issue? Cyber, what were the visible symptoms which led you down this path?

#9 Updated by Alexander Motin over 1 year ago

As I told Ash before, I believe fio is simply incapable of flushing caches on raw disk devices, which makes the second set of tests also not very relevant -- they show maximal SSD throughput in async write mode, but say nothing about another characteristic that is often more important for SLOG: synchronous write latency.

I see two further ways to measure this case: 1) write a small custom benchmark that emulates SLOG operation patterns with different transaction sizes, allowing us to measure commit latency; or 2) properly set up a synthetic benchmark on top of ZFS to measure latency at that level. In the ideal case those two numbers would be equal, but since ZFS also has some overhead they likely won't be, and then the next question arises -- how much do they differ, and has that regressed or can it be improved? I believe we should run tests like that for every new SSD we qualify for SLOG.

An alternative approach is to start all this from some higher-level test, check it for regressions against some older TrueNAS version, and then try to explain the results obtained, the same as we previously did for the SMB directory listing problem.

#10 Updated by Cyber Jock over 1 year ago

Kris Moore wrote:

This sounds much more in line with expected performance, so as you say there may be no issue here (from slog perspective anyway). Ash I believe you have reviewed the original customer tickets? Anything jump out there that indicates it may not be a slog issue? Cyber, what were the visible symptoms which led you down this path?

What led me down this path was this:

1. Customer complained that NFS performance was poor. He claimed to get only about 275MB/sec throughput on writes, but read performance was excellent.
2. Did some dd write tests on a test dataset; even with large 100GB+ write tests we got good numbers.
3. Had the customer reproduce his "test" workload while looking at zpool iostat -v 1 and systat -if 1.
4. Saw the slog only going to 250-275MB/sec every second, with no faster numbers.
5. Disabled sync writes on the customer's dataset and saw speed immediately jump to saturation of 10Gb (of course this is expected unless the zpool itself performs poorly for an unrelated reason).
6. Did some tests with mirrored versus "striped" slog devices and was still very unimpressed with the performance.
7. Did my own dd write tests (removing NFS from the equation) on a test dataset, turning sync writes on and off, and noticed that as soon as sync writes are turned on, performance peaks at about 300MB/sec (450MB/sec when doing 2x striped slogs), and when they are turned off it goes to 700MB+/sec (the expected result for his zpool with async writes).

I then tried doing sync write tests on Zoltan with the same device (Alexander Motin: we are still underprovisioning the devices). I got nearly the same performance numbers for sync writes. I then did the same with the stecram drive, which was still very poor.

When I did some slog testing on my own home-built system, I saw that the slog appears to never write blocks larger than 32KB or 64KB (I forget which), so the 4MB block size test is probably not useful.

Can someone explain to me why creating a dataset with sync=enabled and compression=off, and then doing something like dd if=/dev/zero of=/mnt/tank/syncwritesenabledonthedataset/testfile bs=1M count=100000 is NOT a valid way to see the true throughput of our slog devices? This has been our standard of measurement for an slog's peak performance since before I started at iXsystems. Josh Paetzel even has a thread in the forums about testing precisely this way. To me this isn't much different from the throughput tests we do on a zpool to see if it is giving expected numbers: we do a dd test, but without deliberately enabling sync, and we make it 3-4x larger than system RAM to ensure that the performance isn't artificially high due to async write caching.
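For reference, a sketch of that standard test (the ZFS property value that forces this behavior is sync=always; pool and dataset names are illustrative):

# create a test dataset that forces every write through the slog
zfs create -o sync=always -o compression=off tank/synctest
dd if=/dev/zero of=/mnt/tank/synctest/testfile bs=1M count=100000
zfs destroy tank/synctest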

To me, the fact that the test I performed got us just 160MB/sec from a stecram drive is a pretty big sign that something is very, very wrong -- either our testing methodology is flawed (I'm open to a better one if someone has it) or our expectations for performance are overzealous.

#11 Updated by Alexander Motin over 1 year ago

Cyber Jock wrote:

Can someone explain to me why creating a dataset with sync=enabled and compression=off, and then doing something like dd if=/dev/zero of=/mnt/tank/syncwritesenabledonthedataset/testfile bs=1M count=100000 is NOT a valid way to see the true throughput of our slog devices?

I didn't say that this specific test is invalid, but there should be an understanding of how the SLOG device works and what exactly this test measures. Your original tests were done with dd with a block size of 128KB and less. With sync set to always, that means that for every block of data (128KB or less) written, ZFS must ask the SLOG device to do the write, after which it must request a device cache flush. Such an access pattern, with an effective queue depth of 1, is by definition many times slower than what SSD vendors show in their datasheets, so comparing those numbers is pointless. And adding a second striped SLOG device to this pattern may not really help if there is not enough data in each single commit to efficiently split between the two SLOGs, since the problem here is latency, not throughput (9 women can't make a baby in a month). Using a ZeusRAM drive is indeed supposed to help there, since it is all RAM and so should have lower latency and not need to bother with cache flushing. For that part I have no ready answer; I have never had that drive in my lab, and it is worth investigating.
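A back-of-the-envelope sketch of that queue-depth-1 bound (the 0.5ms per-block round trip is an illustrative assumption, not a measured number): each 128KB block costs one SLOG write plus one cache flush before the next block can be issued, so throughput is capped at roughly the block size divided by that round-trip latency.

# 128KB per 0.5ms round trip = 128KB * 2000 per second, converted to MB/s
echo $((128 * 2000 / 1024))    # prints 250 (MB/s), in the range of the observed 250-275MB/sec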

Doing the same dd test over NFS requires even more factors to be considered, since NFS chunks large requests into shorter ones, typically of 64-128KB, which may be synced in different ways, and that may affect the SLOG operation pattern. Just yesterday, doing NFS testing with VMware, I found some oddities in SLOG operation which are not there in the case of iSCSI, so there may indeed be some problem in our NFS<->ZFS cooperation, but it is difficult to say now whether it is the same problem the customer sees or not.

PS: I am not saying there is no problem, merely asking for one to be demonstrated to me in a controllable environment, and to check whether it is a regression or was always there.

#12 Updated by Joe Maloney over 1 year ago

I fully agree with Mav that we need to prove that this ever worked. Is anyone planning to reinstall zoltan, or could QA take the slog device? Who do we need to ask?

#13 Updated by Cyber Jock over 1 year ago

Alexander Motin,

Thank you for the explanation. Kris Moore and I talked about that earlier, and I hadn't thought of it from quite that angle (queue depth, that is). I was trying to figure out what the queue depth is for an slog when the slog is a high-performance SSD. zilstat doesn't work on my system (zilstat freaks out if you have more than 1 zpool in a server, and I have 3 at present). But some of the material I've read about slog characteristics said that you shouldn't normally see much more than a queue depth of 1 unless you have a large number of simultaneously active sync writes from multiple users. I have no way of proving whether or not that is the case, though.

I cannot do reinstalls of TrueNAS on zoltan (or any iX hardware, for that matter) because the VPN is too slow for the remote storage Java applet to work properly. I was last instructed by Chiu that I had done my part by filing the bug ticket and that this should be handled by the dev team and Ross as necessary, as I'm at the end of my knowledge and experience with this. There is a lot more I can imagine doing, but I don't know how to actually do those things (such as deep queue depths, checking SAS latency, etc.). I'd be more than willing to learn if someone wants to share some knowledge and commands.

Joe Maloney wrote:

I fully agree with Mav that we need to prove that this ever worked. Is anyone planning to reinstall zoltan, or could QA take the slog device? Who do we need to ask?

AFAIK the stecram drive and the HGST are "signed out" to the support team (and probably to my name specifically, since I had asked for them in Zoltan). Neither one is normally in Zoltan, so I don't see any reason why you can't have them removed and used elsewhere. The only thing I ask is that we keep some kind of paper trail on them, so that if the RMA team (they technically own the 2 drives in question) come asking the support team for their hardware back, we have a chain of custody to get the hardware back. After the testing is done, if you can return them to Shawn Cox or Nick Bettencourt so that they can be returned to RMA, I would appreciate it. If you do remove them, please let me know; that way if someone else asks for them I can send them to you.

#14 Updated by Ash Gokhale over 1 year ago

Cyber, zilstat is a dtrace wrapper; does it operate correctly if you specify -p <poolname>? If not, I might be able to fix it. How does it freak out?

#15 Updated by Ash Gokhale over 1 year ago

I've repeated the first benchmark under 9.3 on the same hardware and found results that are worse; I don't suspect an OS performance regression in this data:

 
fioplay.sh: 11 lines, 267 characters
[root@truenas] ~/ash# ./fioplay.sh da12
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUSMH8010BSS200
Revision:             A360
Compliance:           SPC-4
User Capacity:        100,030,242,816 bytes [100 GB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5000cca0496f8ec0
Serial number:        0HWZAY0A
dtrace dispatched as 9978
dtrace: script 'iolatency.dtrace' matched 2 probes
fio: this platform does not support process shared mutexes, forcing use of threads. Use the 'thread' option to get rid of this warning.
meph: (g=0): rw=rw, bs=128K-128K/128K-128K/128K-128K, ioengine=psync, iodepth=1
...
fio-2.1.9
Starting 8 threads
Jobs: 8 (f=8): [MMMMMMMM] [100.0% done] [197.7MB/220.7MB/0KB /s] [1581/1764/0 iops] [eta 00m:00s]
meph: (groupid=0, jobs=8): err= 0: pid=101403: Thu May 25 08:21:14 2017
  read : io=2055.3MB, bw=210394KB/s, iops=1643, runt= 10003msec
    clat (usec): min=672, max=15564, avg=2923.77, stdev=1112.92
     lat (usec): min=673, max=15564, avg=2924.06, stdev=1112.92
    clat percentiles (usec):
     |  1.00th=[ 1384],  5.00th=[ 1640], 10.00th=[ 1848], 20.00th=[ 1912],
     | 30.00th=[ 2128], 40.00th=[ 2352], 50.00th=[ 2608], 60.00th=[ 2992],
     | 70.00th=[ 3440], 80.00th=[ 3888], 90.00th=[ 4448], 95.00th=[ 4960],
     | 99.00th=[ 5984], 99.50th=[ 6496], 99.90th=[ 8384], 99.95th=[10688],
     | 99.99th=[14272]
    bw (KB  /s): min=21972, max=30914, per=12.49%, avg=26282.47, stdev=2064.11
  write: io=1979.2MB, bw=202602KB/s, iops=1582, runt= 10003msec
    clat (usec): min=851, max=10521, avg=1994.58, stdev=671.59
     lat (usec): min=873, max=10540, avg=2006.35, stdev=671.42
    clat percentiles (usec):
     |  1.00th=[ 1096],  5.00th=[ 1224], 10.00th=[ 1400], 20.00th=[ 1544],
     | 30.00th=[ 1656], 40.00th=[ 1704], 50.00th=[ 1880], 60.00th=[ 1928],
     | 70.00th=[ 2064], 80.00th=[ 2320], 90.00th=[ 2832], 95.00th=[ 3312],
     | 99.00th=[ 4320], 99.50th=[ 4960], 99.90th=[ 6816], 99.95th=[ 7264],
     | 99.99th=[10432]
    bw (KB  /s): min=20439, max=29184, per=12.50%, avg=25323.44, stdev=2214.01
    lat (usec) : 750=0.01%, 1000=0.27%
    lat (msec) : 2=43.35%, 4=46.69%, 10=9.64%, 20=0.04%
  cpu          : usr=0.50%, sys=1.27%, ctx=32290, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=16442/w=15833/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=2055.3MB, aggrb=210394KB/s, minb=210394KB/s, maxb=210394KB/s, mint=10003msec, maxt=10003msec
  WRITE: io=1979.2MB, aggrb=202601KB/s, minb=202601KB/s, maxb=202601KB/s, mint=10003msec, maxt=10003msec

  lat                                               
           value  ------------- Distribution ------------- count    
          262144 |                                         0        
          524288 |                                         119      
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            11289    
         2097152 |@@@@@@@@@@@                              4228     
         4194304 |                                         194      
         8388608 |                                         3        
        16777216 |                                         0        

#16 Updated by Alexander Motin over 1 year ago

Ash, I thought we agreed that fio numbers are useless for SLOG, since it can't flush caches, which is the main bottleneck there. Since you are on 9.3 now, can you run some different SLOG-oriented benchmarks over ZFS so we can compare them with 9.10.2 and 11.0 results later?

#17 Updated by Ash Gokhale over 1 year ago

As soon as I have the regression bench set up, I'll do some sort of NFS torture test.

#18 Updated by Cyber Jock over 1 year ago

@Ash,

I emailed you about the zilstat. zilstat -p <poolname> doesn't work. I don't want to pollute this bug ticket with that problem.

#19 Updated by Cyber Jock over 1 year ago

I was just emailed this link in the forums. It seems someone else thinks the same thing is going on. I have no idea if this is actually related, but it smells like it.

https://forums.freenas.org/index.php?threads/slog-bottleneck-on-sync-writes-with-smaller-block-sizes.54675/

#20 Updated by Joe Maloney over 1 year ago

Seems it has been a known issue for many years that NFS in particular is always much slower with a ZIL?

https://forums.freenas.org/index.php?threads/why-is-my-nfs-write-performance-this-bad.10433/

#21 Updated by Joe Maloney over 1 year ago

This is interesting. Someone patched NFS and recompiled FreeNAS over here:

https://www.ateamsystems.com/tech-blog/solved-performance-issues-with-freebsd-zfs-backed-esxi-storage-over-nfs/

#22 Updated by Kris Moore over 1 year ago

TL;DR: The above "fix" simply tells NFS to ignore sync-to-disk requests, speeding up operations but at the risk of data integrity.

Interesting patch, but I find it a bit scary that they recommend only using this if you are connected to a good UPS. Oracle even has a blog post on this particular speed-up method. They go through some really good reasons why this is probably a bad idea from an integrity standpoint, which probably explains why this patch isn't more widespread right now.

https://blogs.oracle.com/roch/nfs-and-zfs,-a-fine-combination

----snip-----

But tar is single threaded, so what is actually going on here? The need to COMMIT frequently means that our thread must frequently pause for a full server side I/O latency. Because our single threaded tar is blocked, nothing is able to process the rest of our workload. If we allow the server to ignore COMMIT operations, then NFS responses will be sent earlier allowing the single thread to proceed down the tar file at greater speed. One must realise that the extra performance is obtained at the risk of causing corruption from the client's point of view in the event of a crash.

#23 Updated by Joe Maloney over 1 year ago

I agree after rereading, Kris. I think I can conclude, simply by doing searches for "nfs zil" and "nfsv4 zil", that the performance with NFS has never been there. Poor performance results can be found all over for various platforms, including Nexenta. All signs seem to point to iSCSI being a better choice for VM backend storage when a ZIL is involved. I would say that unless one is simply concerned about protecting against data loss during a power failure, a ZIL should not be used for general NFS sharing. I do not think we should be encouraging use of a ZIL to boost NFS performance at all. Maybe there are other workarounds, such as adjusting NFS write speeds, which I have found, but even those articles conclude that performance will not be boosted by much. Of course it is not a bad idea to continue to prove this out with proper benchmarking, but I do not have high hopes that issues with NFS performance with a ZIL can be resolved properly at all.

#24 Updated by Cyber Jock over 1 year ago

So I disagree, but I cannot prove it. I don't think that poor NFS performance for sync writes has "always been there". I'm pretty sure we had a customer saturating NFS (Rising Sun Pictures Studio) and had to disable sync writes because 6Gb/sec SAS was bottlenecking him. I can't find the ticket at the moment though. I remember seeing 500MB/sec+ on sync writes from ESXi hosts in tests before, but I have no screenshots or a ticket to link to prove it. So I will totally agree that I have little information to support my claim.

We have been recommending with iSCSI that sync=always be set if data integrity needs to be ensured, and a feature request exists somewhere that was supposed to make a zvol have sync enabled when an iSCSI disk is pointed at it. If memory serves me right it was created by Josh Paetzel, but I have no idea where that actually went. I also have no idea if iSCSI performance will be better than NFS in the same situation.

For one of the customers, their client demands sync writes at all times with NFS and they can switch to Samba but only if we enable sync writes on Samba (or the zpool). I explained that CIFS, as a protocol, has no sync write functionality and so the only option is to set it on the applicable datasets. I've talked to the customer and they understand what sync writes are, and they need that level of data integrity on their z35 HA at all times. I'd bet most of our customers would demand the same thing if they understood the issue thoroughly. They're contemplating using Samba and simply forcing sync=always, but I have no actual performance numbers to share as to how well that performs.

#25 Updated by Alexander Motin over 1 year ago

  • Status changed from Screened to Investigation

With some investigation I found two bad factors for VMware NFS and ZIL performance:

1) By default the FreeBSD NFS code has a feature called File Handle Affinity (FHA), which binds I/Os for the same file to the same kernel threads. It helps the file system code properly identify the access pattern and do more reasonable prefetch, plus IIRC for UFS it simplified cache management by not requiring, or at least not heavily stressing, range locks. Unfortunately, on sync writes it makes successive requests execute strictly one at a time, tightly bounding throughput to ZIL latency. There is a sysctl, vfs.nfsd.fha.enable; setting it to zero allows write performance to improve, but it will quite likely hurt reads. Some code tuning is probably needed there.

2) Another factor is probably the number of NFS threads. For some reason FreeNAS defaults to using only 4 NFS server threads. That means it cannot run more than 4 requests at the same time, which means that even if the FHA discussed above is disabled, it can aggregate no more than 4 write requests into one ZIL transaction, again bounding throughput to ZIL latency.

Does anybody in support remember why we have such a low default number of threads? The FreeBSD default now is up to 8 NFS threads per CPU core, which can be overkill for big systems, but I believe 4 threads is too few even for small installations.
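For reference, a sketch of the two knobs discussed above on a stock FreeBSD/FreeNAS system (values illustrative; test before deploying to a customer):

sysctl vfs.nfsd.fha.enable       # 1 = FHA enabled (the default)
sysctl vfs.nfsd.fha.enable=0     # disable FHA: may help sync writes, will likely hurt reads
# the NFS server thread count is nfsd's -n value (exposed as the number-of-servers
# setting in the FreeNAS NFS service configuration)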

#26 Updated by Cyber Jock over 1 year ago

Alexander Motin wrote:

With some investigation I found two bad factors for VMware NFS and ZIL performance:

1) By default the FreeBSD NFS code has a feature called File Handle Affinity (FHA), which binds I/Os for the same file to the same kernel threads. It helps the file system code properly identify the access pattern and do more reasonable prefetch, plus IIRC for UFS it simplified cache management by not requiring, or at least not heavily stressing, range locks. Unfortunately, on sync writes it makes successive requests execute strictly one at a time, tightly bounding throughput to ZIL latency. There is a sysctl, vfs.nfsd.fha.enable; setting it to zero allows write performance to improve, but it will quite likely hurt reads. Some code tuning is probably needed there.

I have a ticket open with a customer where Ash, due to an issue this customer was having, specifically asked to set vfs.nfsd.fha.enable to zero. The customer has not been able to perform the failover to verify it resolved the issue, but Ash said "...there is a theoretical performance tradeoff, but it is nanoseconds per hit, so you shouldn't have any serious consequence with this change." Obviously on extremely large systems, a few nanoseconds per hit could add up to a serious loss of performance. Maybe we should consider leaving the default at zero unless it is a Z35 or something? We could make it one of our autotune values.

2) Another factor is probably the number of NFS threads. For some reason FreeNAS defaults to using only 4 NFS server threads. That means it cannot run more than 4 requests at the same time, which means that even if the FHA discussed above is disabled, it can aggregate no more than 4 write requests into one ZIL transaction, again bounding throughput to ZIL latency.
Does anybody in support remember why we have such a low default number of threads? The FreeBSD default now is up to 8 NFS threads per CPU core, which can be overkill for big systems, but I believe 4 threads is too few even for small installations.

Honestly, I don't remember. I know that back in the day we had serious problems where people would set it astronomically high (256+). We did have one customer that ran it really high (256+) because he had something like 2000 clients and it was the only way to get good performance. As far as I know the default has always been 4. Perhaps it is time to consider bumping it higher? I know that for single-user workloads 3 seems to be optimal everywhere I've seen them.

The FreeNAS and TrueNAS documentation used to say to never set the value higher than the number of physical cores, but we removed that a year or more ago. I've heard conflicting information that the physical core recommendation was never a legitimate recommendation to start with.

Personally, I'm not against bumping it to 8 or 16 as the new default based on my personal experience of tuning NFS for customers. I probably wouldn't recommend 32 or higher as we rarely need that many threads and sometimes it causes performance to drop.

#27 Updated by Alexander Motin over 1 year ago

Cyber Jock wrote:

Obviously on extremely large systems, a few nanoseconds per hit could add up to a serious loss of performance. Maybe we should consider leaving the default at zero unless it is a Z35 or something? We could make it one of our autotune values.

I guess Ash's comment was related to the CPU time spent on FHA code processing, which is indeed likely nanoseconds. But in the case of synchronous writes here, the problem rises to hundreds of microseconds of SLOG device synchronous write latency. I can accept that for some systems like the Z50, under some very random access patterns, it may be a benefit to disable FHA completely, but for most things I think we would get more benefit from keeping it enabled for reads but disabled for writes, which unfortunately can't be done at run time now. I'll investigate this more. Just to note, our iSCSI target code in CTL also has an equivalent of this logic, and by default it is enabled only for reads, the same as I'd like to see it here.

Personally, I'm not against bumping it to 8 or 16 as the new default based on my personal experience of tuning NFS for customers. I probably wouldn't recommend 32 or higher as we rarely need that many threads and sometimes it causes performance to drop.

OK. Then I propose to bump the default to 16. Our iSCSI target code in CTL now uses up to 14 threads per LUN.

#28 Updated by Alexander Motin over 1 year ago

A deeper look showed that FHA for writes was intentionally left enabled three years ago by ken@ to optimize misaligned file rewrites, specifically in the ZFS case, at least when sync is not forced. When sync is forced (by VMware or by sync=always) it is not exactly clear to me now whether it is better to "optimize" or "not optimize" here -- both cases have downsides.

#29 Updated by Kris Moore over 1 year ago

If the thread count is already a tunable, can we not just try adjusting it up on some of those customers and see if it improves their performance? Josh, is that something you can get us some feedback on? If it doesn't do much, that would be helpful information to know as well.

#30 Updated by Alexander Motin over 1 year ago

Multiple threads won't help sequential I/O if FHA forces handling to be single-threaded. Though it may help random I/O up to a point on a pool with multiple vdevs, since in that case FHA allows requests to run in parallel and a wide pool can benefit from multiple simultaneous I/Os.

#31 Updated by Alexander Motin over 1 year ago

  • Assignee changed from Ash Gokhale to Alexander Motin

I'll take this ticket back. After some discussions at BSDCan it seems it may not be a problem to disable FHA for writes, which should be good for NFS+SLOG performance. I'll try to make a patch and test it as soon as I have time.

#33 Updated by Alexander Motin over 1 year ago

As I said above, just increasing the number of threads likely won't help unless you also set sysctl vfs.nfsd.fha.enable=0, but setting that will likely hurt read performance. There is currently no way to disable FHA for writes while keeping it for reads.

#35 Updated by Alexander Motin over 1 year ago

In ticket #24451 I did some benchmarking of SSDs in the SLOG role, which may be interesting for people to see.

#36 Updated by Cyber Jock over 1 year ago

@Alexander,

That bug ticket is private, so nobody in the support team will be able to see it. Can you take off the private flag please?

#37 Updated by Alexander Motin over 1 year ago

That ticket is not marked private; you probably have no access to the "Hardware Certification" project. On the other hand, after more tests my results there may not be correct (the same SSD as SLOG on a Z35 shows much higher latency than on an X10). So I'll take it back until I find the source of that difference, or at least can compare apples to apples. Sorry for the noise so far; I'll come back later.

#38 Updated by Alexander Motin over 1 year ago

I've finally written a synthetic benchmarking tool that emulates roughly the worst case of ZFS SLOG behavior (synchronous random writes of different sizes). Here are some numbers I measured with it:

HGST HUSMH8010BSS200 on X10 (12Gbps):
root@truenas-b:~ # ./slogbench /dev/da2
 0.5 KB:   101.6 usec /    4.8 MB/s
   1 KB:   102.7 usec /    9.5 MB/s
   2 KB:   103.2 usec /   18.9 MB/s
   4 KB:   102.4 usec /   38.2 MB/s
   8 KB:   110.4 usec /   70.8 MB/s
  16 KB:   127.5 usec /  122.5 MB/s
  32 KB:   148.0 usec /  211.1 MB/s
  64 KB:   191.7 usec /  326.1 MB/s
 128 KB:   302.0 usec /  413.8 MB/s
 256 KB:   610.6 usec /  409.4 MB/s
 512 KB:  1214.7 usec /  411.6 MB/s
1024 KB:  2404.2 usec /  415.9 MB/s
2048 KB:  4779.9 usec /  418.4 MB/s
4096 KB:  9574.8 usec /  417.8 MB/s
8192 KB: 19246.9 usec /  415.7 MB/s

Micron S655DC-200 on X10 (12Gbps):
root@truenas-b:~ # ./slogbench /dev/da12
 0.5 KB:    85.4 usec /    5.7 MB/s
   1 KB:    87.3 usec /   11.2 MB/s
   2 KB:    86.7 usec /   22.5 MB/s
   4 KB:    89.2 usec /   43.8 MB/s
   8 KB:    95.0 usec /   82.2 MB/s
  16 KB:   101.0 usec /  154.7 MB/s
  32 KB:   118.4 usec /  264.0 MB/s
  64 KB:   182.6 usec /  342.2 MB/s
 128 KB:   266.3 usec /  469.5 MB/s
 256 KB:   508.5 usec /  491.7 MB/s
 512 KB:  1000.7 usec /  499.7 MB/s
1024 KB:  2001.6 usec /  499.6 MB/s
2048 KB:  3829.7 usec /  522.2 MB/s
4096 KB:  7264.2 usec /  550.6 MB/s
8192 KB: 14753.6 usec /  542.2 MB/s

STEC ZeusRAM on Z35 (z35ref, 6Gbps):
root@freenas:~ # ./slogbench /dev/da0
 0.5 KB:    73.5 usec /    6.6 MB/s
   1 KB:    74.5 usec /   13.1 MB/s
   2 KB:    73.3 usec /   26.7 MB/s
   4 KB:    76.9 usec /   50.8 MB/s
   8 KB:    86.3 usec /   90.6 MB/s
  16 KB:   103.4 usec /  151.1 MB/s
  32 KB:   139.0 usec /  224.8 MB/s
  64 KB:   209.6 usec /  298.2 MB/s
 128 KB:   353.9 usec /  353.2 MB/s
 256 KB:   653.5 usec /  382.6 MB/s
 512 KB:  1265.2 usec /  395.2 MB/s
1024 KB:  2467.9 usec /  405.2 MB/s
2048 KB:  4776.0 usec /  418.8 MB/s
4096 KB:  9478.5 usec /  422.0 MB/s
8192 KB: 18866.7 usec /  424.0 MB/s

Micron S655DC-200 on Z20 (zfcref, 6Gbps):
[root@zfcref-a] ~# ./slogbench /dev/da8
 0.5 KB:   109.4 usec /    4.5 MB/s
   1 KB:   111.6 usec /    8.7 MB/s
   2 KB:   113.1 usec /   17.3 MB/s
   4 KB:   116.6 usec /   33.5 MB/s
   8 KB:   125.0 usec /   62.5 MB/s
  16 KB:   142.1 usec /  110.0 MB/s
  32 KB:   169.8 usec /  184.0 MB/s
  64 KB:   284.4 usec /  219.8 MB/s
 128 KB:   415.1 usec /  301.1 MB/s
 256 KB:   734.4 usec /  340.4 MB/s
 512 KB:  1305.6 usec /  383.0 MB/s
1024 KB:  2367.2 usec /  422.4 MB/s
2048 KB:  4527.4 usec /  441.8 MB/s
4096 KB:  8748.6 usec /  457.2 MB/s
8192 KB: 17116.7 usec /  467.4 MB/s

Comments or suggestions about the tool are welcome.
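
For anyone who wants to reproduce these comparisons, a trivial wrapper in the spirit of the attached fioplay.sh might look like the sketch below; the device list and output directory are placeholders, and slogbench writes to the raw devices, so only point it at disks whose contents can be destroyed:

#!/bin/sh
# Sketch only: run the attached slogbench against each candidate SLOG device
# and keep the output for side-by-side comparison.
DEVICES="da2 da12"            # placeholder: candidate SLOG devices
OUTDIR=/tmp/slogbench.$$      # placeholder: where results are stored
mkdir -p "$OUTDIR"
for d in $DEVICES; do
    echo "=== /dev/$d ==="
    ./slogbench "/dev/$d" | tee "$OUTDIR/$d.txt"
done
echo "Results saved in $OUTDIR"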

#39 Updated by Kris Moore over 1 year ago

Interesting; those all seem to perform about the same as or better than the STEC ZeusRAM, so there doesn't appear to be a huge regression there. Was this tested on 11.0 or 9.10.X? Looking at the HGST's rated maximum, it appears to be 765MB/s, which these numbers aren't quite approaching, but I would expect that to be a best-case scenario, and as you said, this is a worst-case test, so perhaps not out of the realm of reason.

https://www.hgst.com/products/solid-state-solutions/ultrastar-ssd800mhb

Does anybody here have alternative information to bring to the table to demonstrate a regression of some sort?

#40 Updated by Alexander Motin over 1 year ago

Kris Moore wrote:

Interesting; those all seem to perform about the same as or better than the STEC ZeusRAM, so there doesn't appear to be a huge regression there. Was this tested on 11.0 or 9.10.X?

The X10 was obviously running a recent TN 11.0, the Z20 a recent 9.10.2-U4, and the Z35 some older FN 11.0. But for this test it should not matter much, since it is very low level. VB has now reinstalled the Z35 with TN 9.10.2-U5, so I am going to run some more tests against it.

Looking at the HGST's rated maximum, it appears to be 765MB/s, which these numbers aren't quite approaching, but I would expect that to be a best-case scenario, and as you said, this is a worst-case test, so perhaps not out of the realm of reason.

https://www.hgst.com/products/solid-state-solutions/ultrastar-ssd800mhb

Numbers in specifications usually assume asynchronous I/O with a full request queue, which is not typical for a SLOG, since SLOG traffic is very transaction-oriented with many cache flushes. So those figures mean little for SLOG use.

Does anybody here have alternative information to bring to the table to demonstrate a regression of some sort?

I provided those numbers mostly as a reference point for what SSD SLOGs can deliver, and to illustrate how SLOG performance depends on the workload (the size of the transaction group to commit). I am not saying there cannot be a regression somewhere, but considering the above, I would say a much more realistic explanation for the difference is a difference in workloads, in protocol-specific code (NFS, SMB, etc.), or in ZFS code. Above I've already confirmed and explained the problems NFS currently has with writes. I don't think that is a regression, but I am still on it, thinking about a better way to fix it.

To illustrate what a really bad SLOG looks like, here are numbers from a desktop Intel 530-series SSD on an LSI HBA. I am not sure whether the problem is the SSD itself or some bad interaction with the HBA, but while it is fast enough for normal async operations, it simply does not work as a SLOG because of its insanely long cache flush time:

# ./slogbench /dev/da3
 0.5 KB:  9542.9 usec /    0.1 MB/s
   1 KB:  9591.1 usec /    0.1 MB/s
   2 KB:  9721.8 usec /    0.2 MB/s
   4 KB: 10012.2 usec /    0.4 MB/s
   8 KB: 10053.4 usec /    0.8 MB/s
  16 KB: 10071.9 usec /    1.6 MB/s
  32 KB: 10116.4 usec /    3.1 MB/s
  64 KB: 10692.5 usec /    5.8 MB/s
 128 KB: 11267.4 usec /   11.1 MB/s
 256 KB: 12271.2 usec /   20.4 MB/s
 512 KB: 13631.2 usec /   36.7 MB/s
1024 KB: 15862.8 usec /   63.0 MB/s
2048 KB: 23413.2 usec /   85.4 MB/s
4096 KB: 34475.7 usec /  116.0 MB/s
8192 KB: 52904.7 usec /  151.2 MB/s

#41 Updated by Cyber Jock over 1 year ago

Alexander Motin,

That slogbench tool is amazing. Any chance we can get it added to FreeNAS and TrueNAS for diagnostic purposes? I can file a feature request if that is necessary. Good work!

One of our customers has decided to accept that the numbers are lower than he would have expected. His workloads are virtual machines, which typically don't do significant quantities of writes.

Another customer said that because of their intended use case they must have sync writes (they were forcing them on the client side, which is how they found this problem to begin with). They're getting a bit antsy because it's been a month since they found the bottleneck, and the current performance is inadequate for their needs. I've discussed not using sync writes, but they said that when they bought the server they expected the performance numbers to work with their current workload; we're just not there, and not using sync writes is a non-starter.

It is definitely nice to have a tool we can use to compare different devices and different block sizes.

Correct me if I'm wrong, but aren't we limited to 128KB block sizes on ZFS, unless the 1MB block feature flag is enabled and recordsize is set to 1MB?

#42 Updated by Alexander Motin over 1 year ago

Cyber Jock wrote:

That slogbench tool is amazing. Any chance we can get it added to FreeNAS and TrueNAS for diagnostic purposes? I can file a feature request if that is necessary. Good work!

Thanks. I do think we should add it to FN/TN in some way; I just haven't decided whether to add that functionality to diskinfo or to ship it as a separate tool. I'll think about it when the dust settles a bit.

Correct me if I'm wrong, but aren't we limited to 128KB block sizes on ZFS, unless the 1MB block feature flag is enabled and recordsize is set to 1MB?

SLOG writes are not really related to dataset/zvol record/block sizes. Writes to the SLOG happen on a per-user-request basis, so they can be both smaller and bigger than the block size, depending on the workload. While one application may write individual 512-byte sectors via iSCSI to a zvol with a 16KB block size, another may write the same zvol in large 1MB chunks; the SLOG will receive tiny 512-byte (or 4-8KB) write requests in the first case, and a simultaneous burst of 128K writes sharing a common cache flush in the second (which is what my tool actually simulates for commit sizes above 128K).

Another factor that is supposed to improve the SLOG's life is request aggregation when many requests are submitted in parallel. For example, if an application writes data in 16KB chunks but runs a dozen writes at the same time, the SLOG should be written not in 16KB chunks but in much bigger ones. This is actually the part that is not working for NFS right now.
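
A rough way to observe that aggregation locally, without NFS in the path, is to compare one synchronous writer against a dozen running in parallel on a sync=always dataset; this is a sketch only, and the dataset name and sizes are placeholders:

zfs create tank/synctest            # placeholder dataset for the test
zfs set sync=always tank/synctest
# single 16KB-chunk synchronous writer
dd if=/dev/zero of=/mnt/tank/synctest/one bs=16k count=20000
# a dozen 16KB-chunk writers in parallel; with aggregation working, the SLOG
# should see fewer, larger writes and aggregate throughput should scale well
for i in $(seq 1 12); do
    dd if=/dev/zero of=/mnt/tank/synctest/par.$i bs=16k count=20000 &
done
wait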

#43 Updated by Alexander Motin over 1 year ago

“Curiouser and curiouser!” Cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English).

I've compared the performance of the same Z35 box (z35ref) running TN 9.10.2-U5 and today's 11.0, using my tool on the raw SLOG devices and dd on top of ZFS with sync=always:

9.10.2-U5:

ZeusRAM:
[root@truenas] ~# ./slogbench /dev/da0
 0.5 KB:    94.1 usec /    5.2 MB/s
   1 KB:    95.2 usec /   10.3 MB/s
   2 KB:    98.1 usec /   19.9 MB/s
   4 KB:   104.4 usec /   37.4 MB/s
   8 KB:   121.0 usec /   64.6 MB/s
  16 KB:   131.9 usec /  118.5 MB/s
  32 KB:   171.3 usec /  182.4 MB/s
  64 KB:   243.8 usec /  256.4 MB/s
 128 KB:   389.2 usec /  321.2 MB/s
 256 KB:   685.5 usec /  364.7 MB/s
 512 KB:  1275.0 usec /  392.2 MB/s
1024 KB:  2456.8 usec /  407.0 MB/s
2048 KB:  4815.7 usec /  415.3 MB/s
4096 KB:  9507.1 usec /  420.7 MB/s
8192 KB: 18888.2 usec /  423.5 MB/s
[root@truenas] ~# dd if=/dev/zero of=/mnt/tank/zzz bs=512 count=100000
100000+0 records in
100000+0 records out
51200000 bytes transferred in 17.462837 secs (2931941 bytes/sec)
ZFS write latency: 171us

Micron:
[root@truenas] ~# ./slogbench /dev/da13
 0.5 KB:   121.2 usec /    4.0 MB/s
   1 KB:   123.4 usec /    7.9 MB/s
   2 KB:   124.6 usec /   15.7 MB/s
   4 KB:   127.1 usec /   30.7 MB/s
   8 KB:   135.9 usec /   57.5 MB/s
  16 KB:   159.6 usec /   97.9 MB/s
  32 KB:   189.7 usec /  164.7 MB/s
  64 KB:   281.3 usec /  222.2 MB/s
 128 KB:   409.8 usec /  305.0 MB/s
 256 KB:   698.9 usec /  357.7 MB/s
 512 KB:  1316.1 usec /  379.9 MB/s
1024 KB:  2387.7 usec /  418.8 MB/s
2048 KB:  4517.6 usec /  442.7 MB/s
4096 KB:  8681.3 usec /  460.8 MB/s
8192 KB: 17023.1 usec /  469.9 MB/s
[root@truenas] ~# dd if=/dev/zero of=/mnt/tank/zzz bs=512 count=100000
100000+0 records in
100000+0 records out
51200000 bytes transferred in 21.872256 secs (2340865 bytes/sec)
ZFS write latency: 218us

11.0:

ZeusRAM:
root@truenas:~ # ./slogbench /dev/da0
 0.5 KB:    72.3 usec /    6.8 MB/s
   1 KB:    74.6 usec /   13.1 MB/s
   2 KB:    74.9 usec /   26.1 MB/s
   4 KB:    79.5 usec /   49.1 MB/s
   8 KB:    89.4 usec /   87.4 MB/s
  16 KB:   107.2 usec /  145.7 MB/s
  32 KB:   144.1 usec /  216.9 MB/s
  64 KB:   210.2 usec /  297.4 MB/s
 128 KB:   350.1 usec /  357.1 MB/s
 256 KB:   640.7 usec /  390.2 MB/s
 512 KB:  1242.8 usec /  402.3 MB/s
1024 KB:  2473.7 usec /  404.2 MB/s
2048 KB:  4828.4 usec /  414.2 MB/s
4096 KB:  9507.6 usec /  420.7 MB/s
8192 KB: 18882.1 usec /  423.7 MB/s
root@truenas:~ # dd if=/dev/zero of=/mnt/tank/zzz bs=512 count=100000
100000+0 records in
100000+0 records out
51200000 bytes transferred in 11.134166 secs (4598458 bytes/sec)
ZFS write latency: 111us

Micron:
root@truenas:~ # ./slogbench /dev/da13
 0.5 KB:    97.8 usec /    5.0 MB/s
   1 KB:    99.7 usec /    9.8 MB/s
   2 KB:    99.0 usec /   19.7 MB/s
   4 KB:   101.5 usec /   38.5 MB/s
   8 KB:   109.8 usec /   71.1 MB/s
  16 KB:   125.8 usec /  124.2 MB/s
  32 KB:   158.2 usec /  197.5 MB/s
  64 KB:   250.2 usec /  249.8 MB/s
 128 KB:   386.6 usec /  323.3 MB/s
 256 KB:   679.1 usec /  368.2 MB/s
 512 KB:  1263.8 usec /  395.6 MB/s
1024 KB:  2354.2 usec /  424.8 MB/s
2048 KB:  4503.1 usec /  444.1 MB/s
4096 KB:  8747.1 usec /  457.3 MB/s
8192 KB: 17154.8 usec /  466.3 MB/s
root@truenas:~ # dd if=/dev/zero of=/mnt/tank/zzz bs=512 count=100000
100000+0 records in
100000+0 records out
51200000 bytes transferred in 13.688985 secs (3740233 bytes/sec)
ZFS write latency: 136us

Either 11.0 is so much faster for some unclear but welcome reason, or something is indeed very wrong with 9.10.2, and the customers are complaining for a reason.

#45 Updated by Alexander Motin over 1 year ago

  • Status changed from Investigation to Fix In Progress

I ran some experiments and have possible explanations for why TN 11.0 is faster, but I haven't spent enough time to really prove them. We may just accept that it is. TN 11.0 should be released sometime soon, so that should help with some aspects of the problem. But the biggest problem, I believe, is on the NFS server side. I am working on it.

#46 Updated by Cyber Jock over 1 year ago

Gladstone has been informed of the situation based on the most recent explanation. They are asking for any kind of tuning that can improve performance on their 9.3 build. They are not interested in upgrading at the present time; they've had so many problems that they need solutions that can be applied to their current install rather than an upgrade.

Are there any tuning values we can apply?

Thanks.

#47 Updated by Alexander Motin over 1 year ago

I have no magic solution in my pocket. As I said above, setting the sysctl vfs.nfsd.fha.enable to zero allows improving write performance, especially when combined with an increased number of nfsd threads, but it will quite likely hurt reads.

#48 Updated by Alexander Motin over 1 year ago

I've added my SLOG benchmark to the diskinfo tool in FreeBSD head: https://svnweb.freebsd.org/changeset/base/320683 . I will merge it into FN 11.1 after some time.
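
Assuming the option letters stay as committed, it should then be runnable directly from the base system, for example:

# destructive: the synchronous-write (SLOG) test needs -w and writes to the
# disk, so only run it against a device whose contents can be discarded
diskinfo -wS /dev/da2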

#49 Updated by Alexander Motin over 1 year ago

  • Category changed from Middleware to 162
  • Target version set to TrueNAS 11.1-U1

#50 Updated by Alexander Motin over 1 year ago

  • Status changed from Fix In Progress to 19

I've committed to the nightly train a patch adding two more sysctls to control NFS FHA. I expect that with that patch, setting vfs.nfsd.fha.write=0 should dramatically improve synchronous write performance with multiple parallel requests, especially if the NFS server is configured with a sufficiently large number of threads.
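
For anyone trying a nightly with that patch, a sketch of how I expect it to be used; only vfs.nfsd.fha.write is named above, so check the vfs.nfsd.fha sysctl tree on your build for the exact set of new knobs:

# keep FHA (and therefore read locality) enabled globally, but stop
# serializing writes to the same file handle onto a single nfsd thread
sysctl vfs.nfsd.fha.write=0
sysctl vfs.nfsd.fha             # list all FHA knobs and their current values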

#51 Updated by Dru Lavigne about 1 year ago

  • Status changed from 19 to 47

#52 Updated by Dru Lavigne about 1 year ago

  • Subject changed from slog devices seem to be slow across the board to Increase write speed for synchronous operations

#53 Updated by Dru Lavigne about 1 year ago

  • Subject changed from Increase write speed for synchronous operations to Improve FHA locality control for NFS read/write requests

#54 Updated by Dru Lavigne about 1 year ago

  • 1 added project (FreeNAS)

#55 Updated by Dru Lavigne about 1 year ago

  • Description updated (diff)
  • Support Suite Ticket deleted (TIG-571-29993, WUU-204-76593)

#56 Updated by Dru Lavigne about 1 year ago

  • Target version changed from TrueNAS 11.1-U1 to 11.1

#57 Updated by Nick Wolff about 1 year ago

  • Needs QA changed from Yes to No

Clearing QA

May be target for future benchmarking.

#58 Updated by Dru Lavigne 12 months ago

  • Status changed from 47 to Resolved
