Enhance compressed ARC performance
It seems that since compressed ARC was merged into FreeBSD 10, the maximum that can be achieved on a single LUN, regardless of CPU power, is roughly 130,000 8K read IOPS in FreeNAS 9.10 and later. For high-transaction databases such as Elasticsearch this could hurt performance. It can currently be worked around by setting vfs.zfs.compressed_arc_enabled=0 in /boot/loader.conf in 11-RC (due to a bug in 9.10.2, the workaround does not work there).
I didn't know which category to put this in exactly so I put it in iSCSI - please move as necessary.
MFC r331711: MFV 331710:
9188 increase size of dbuf cache to reduce indirect block decompression
With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect
blocks, under a workload of random cached reads. To reduce this decompression
cost, we would like to increase the size of the dbuf cache so that more
indirect blocks can be stored uncompressed.
If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the
dbuf cache as well.)
In real world workloads, this won't help as dramatically as the example
above, but we think it's still worth it because the risk of decreasing
performance is low. The potential negative performance impact is that we
will be slightly reducing the size of the ARC (by ~3%).
Reviewed by: Dan Kimmel <firstname.lastname@example.org>
Reviewed by: Prashanth Sreenivasa <email@example.com>
Reviewed by: Paul Dagnelie <firstname.lastname@example.org>
Reviewed by: Sanjay Nadkarni <email@example.com>
Reviewed by: Allan Jude <firstname.lastname@example.org>
Reviewed by: Igor Kozhukhov <email@example.com>
Approved by: Garrett D'Amore <firstname.lastname@example.org>
Author: George Wilson <email@example.com>
(cherry picked from commit 3b7774b01772fe050d4f69bf97497815f3010af9)
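The 1/64th figure in the commit message above can be checked with a little arithmetic. This is only a sketch: it assumes 128-byte block pointers and 128K indirect blocks, which are the usual ZFS values but are not stated in this ticket, and the function name is made up for illustration. It also counts only the first level of indirect blocks, since higher levels add a negligible amount.

```python
from fractions import Fraction

BLKPTR_SIZE = 128                  # bytes per block pointer (assumption)
INDIRECT_BLOCK_SIZE = 128 * 1024   # bytes per indirect block (assumption)


def indirect_to_data_ratio(recordsize,
                           indirect_block_size=INDIRECT_BLOCK_SIZE,
                           blkptr_size=BLKPTR_SIZE):
    """Memory used by level-1 indirect blocks relative to the data they map.

    Hypothetical helper: one indirect block holds
    indirect_block_size / blkptr_size pointers, and each pointer
    refers to one data block of `recordsize` bytes.
    """
    pointers_per_indirect = indirect_block_size // blkptr_size
    data_covered = pointers_per_indirect * recordsize
    return Fraction(indirect_block_size, data_covered)


ratio_8k = indirect_to_data_ratio(8 * 1024)
dbuf_fraction = Fraction(1, 32)    # proposed dbuf cache share of memory

print("indirect/data ratio at recordsize=8K:", ratio_8k)
print("dbuf cache fraction covers it:", dbuf_fraction > ratio_8k)
```

Under these assumptions, larger record sizes make the ratio smaller still (each pointer maps more data), so recordsize=8K is close to the worst case the commit message is sizing for.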
#1 Updated by Alexander Motin almost 2 years ago
- Tracker changed from Umbrella to Feature
- Subject changed from Enhance compressed ARC / iSCSI locking to Enhance compressed ARC performance
- Category changed from 89 to 200
- Status changed from Unscreened to Screened
- Target version set to 11.1
- % Done set to 0
From your data I haven't seen significant lock congestion, so let's narrow the topic down to testing compressed ARC performance.
#2 Updated by Alexander Motin over 1 year ago
- Priority changed from No priority to Nice to have
- Target version changed from 11.1 to 11.2-BETA1
Compressed ARC performance is more a question for upstream FreeBSD or even OpenZFS. It is out of scope for FreeNAS, but I'll leave it here, hoping to play with it later.
#4 Updated by Nick Wolff about 1 year ago
The CTL source code lists a potential performance improvement (see below): pulling data directly from the ARC to avoid a second buffer. This gets complicated by compressed ARC, but we need to make sure we are not adding additional memcpys to the code path, and ideally it would be nice to do this when the ARC isn't compressed.
ZFS ARC backend for CTL. Since ZFS copies all I/O into the ARC (Adaptive Replacement Cache), running the block/file backend on top of a ZFS-backed zdev or file will involve an extra set of copies. The optimal solution for backing targets served by CTL with ZFS would be to allocate buffers out of the ARC directly, and DMA to/from them directly. That would eliminate an extra data buffer allocation and copy.
Attached is a flame graph of a system doing about 400MBPS of sequential reads (light load). The entire left half can be ignored, as it is all local VMs, but the memcpy calls above icl_sof_conn are related either to this ticket or to a memcpy that could be removed by the iSCSI DMA offload on Chelsio cards being worked on in #17698.
- Tracker changed from Feature to Bug
- Status changed from Not Started to In Progress
- Target version changed from 11.3 to 11.2-RC2
- Severity set to Low
- Reason for Blocked deleted (Dependant on a related task to be completed)
- Seen in set to 11.1-U4
- ChangeLog Required set to No
I've merged a patch (it will be in 11.1-U5 and 11.2) that removes the 100MB limit on the maximum amount of decompressed ARC data, allowing it to reach up to 3% of ARC. It should reduce the decompression overhead for metadata and for a small part of very hot data.
We may also consider tuning the default indirect block size, but that is still being investigated.
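To get a feel for what removing the 100MB cap means on a large system, here is a rough sketch. It assumes the old limit behaved like min(100MB, ARC/32) and the new one like ARC/32 (about 3%); the exact formula is not given in this ticket, and the function names and the ARC sizes used are purely illustrative.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

OLD_CAP_BYTES = 100 * MIB   # the 100MB limit mentioned in the comment above
DBUF_SHIFT = 5              # 1/32 of ARC, roughly 3% (assumption)


def dbuf_limit_capped(arc_bytes):
    """Hypothetical old behaviour: ARC/32, but never more than 100MB."""
    return min(OLD_CAP_BYTES, arc_bytes >> DBUF_SHIFT)


def dbuf_limit_uncapped(arc_bytes):
    """Hypothetical new behaviour: ARC/32 with no fixed cap."""
    return arc_bytes >> DBUF_SHIFT


# Illustrative ARC sizes only; not measurements from this ticket.
for arc_gib in (2, 16, 64, 256):
    arc = arc_gib * GIB
    print(f"ARC {arc_gib:>3} GiB: "
          f"capped {dbuf_limit_capped(arc) // MIB} MiB, "
          f"uncapped {dbuf_limit_uncapped(arc) // MIB} MiB")
```

If those assumptions hold, the cap only starts to matter once the ARC exceeds about 3.2GB, which is why the change is mostly visible on systems with a lot of memory.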
- Needs Doc changed from Yes to No
- Needs Merging changed from Yes to No