Enhance compressed ARC performance
It seems that since compressed ARC was merged into FreeBSD 10, the maximum that can be achieved on a LUN, regardless of CPU power, is roughly 130,000 8k read IOPS in FreeNAS 9.10 and higher. For high-transaction databases like Elasticsearch this could hurt performance. It can currently be worked around by setting vfs.zfs.compressed_arc_enabled=0 in /boot/loader.conf in 11-RC (due to a bug in 9.10.2, the workaround does not work there).
I didn't know which category to put this in exactly so I put it in iSCSI - please move as necessary.
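For anyone checking whether the workaround above is actually in place, here is a minimal sketch that scans loader.conf-style text for the tunable. The helper names (loader_tunable, compressed_arc_disabled) are hypothetical, and the parsing is deliberately simplified: it assumes one name=value pair per line and treats everything after a # as a comment.

```python
def loader_tunable(conf_text, name):
    """Return the last value assigned to `name` in loader.conf-style text, or None.

    Simplified on purpose: one name=value per line, '#' starts a comment,
    surrounding double quotes are stripped from the value.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == name:
            value = val.strip().strip('"')  # a later assignment overrides an earlier one
    return value


def compressed_arc_disabled(conf_text):
    """True if the text sets vfs.zfs.compressed_arc_enabled to 0."""
    return loader_tunable(conf_text, "vfs.zfs.compressed_arc_enabled") == "0"


if __name__ == "__main__":
    # Path taken from the ticket text; reading it requires a FreeBSD/FreeNAS host.
    try:
        with open("/boot/loader.conf") as fh:
            print("compressed ARC disabled:", compressed_arc_disabled(fh.read()))
    except FileNotFoundError:
        print("/boot/loader.conf not found on this system")
```

This only inspects the file. It does not confirm that the running kernel picked the value up, which matters given the 9.10.2 bug mentioned above.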
#1 Updated by Alexander Motin over 3 years ago
- Tracker changed from Umbrella to Feature
- Subject changed from Enhance compressed ARC / iSCSI locking to Enhance compressed ARC performance
- Category changed from 89 to 200
- Status changed from Unscreened to Screened
- Target version set to 11.1
- % Done set to 0
From your data I haven't seen significant lock congestion, so let's narrow the topic down to testing compressed ARC performance.
#2 Updated by Alexander Motin almost 3 years ago
- Priority changed from No priority to Nice to have
- Target version changed from 11.1 to 11.2-BETA1
Compressed ARC performance is more a question for upstream FreeBSD or even OpenZFS. It is out of scope for FreeNAS, but I'll leave it here, hoping to play with it later.
#4 Updated by Nick Wolff over 2 years ago
The CTL source code lists a potential performance improvement (see below): pulling data directly from the ARC to avoid a second buffer. This gets complicated by compressed ARC, but we need to make sure we are not adding additional memcpys to the code path, and ideally it would be nice to do this when the ARC isn't compressed.
ZFS ARC backend for CTL. Since ZFS copies all I/O into the ARC (Adaptive Replacement Cache), running the block/file backend on top of a ZFS-backed zdev or file will involve an extra set of copies. The optimal solution for backing targets served by CTL with ZFS would be to allocate buffers out of the ARC directly, and DMA to/from them directly. That would eliminate an extra data buffer allocation and copy.
Attached is a flamegraph of a system doing about 400MBPS of sequential reads (light load). The entire left half can be ignored as it's all local VMs, but the memcpy calls above icl_sof_conn are related either to this ticket or to a memcpy that could be removed by the iSCSI DMA offload on Chelsio cards, which is being worked on in #17698.
#5 Updated by Alexander Motin over 2 years ago
- Tracker changed from Feature to Bug
- Status changed from Not Started to In Progress
- Target version changed from 11.3 to 11.2-RC2
- Severity set to Low
- Reason for Blocked deleted (Dependant on a related task to be completed)
- Seen in set to 11.1-U4
- ChangeLog Required set to No
I've merged a patch (will be in 11.1-U5 and 11.2) that removes the 100MB limit on the maximum amount of decompressed ARC data, allowing it to reach up to 3% of ARC. It should reduce the decompression overhead for metadata and for a small part of very hot data.
We may also consider tuning the default indirect block size, but that is still being investigated.
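To give a rough sense of what the change from a fixed 100MB limit to 3% of ARC means in practice, here is a small arithmetic sketch. The ARC sizes are hypothetical examples, "100MB" is treated as 100 MiB, and the 3% is assumed to apply to the total ARC size; none of these details are confirmed by the patch description above.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

OLD_CAP_BYTES = 100 * MIB  # fixed limit before the patch (assuming MB means MiB)
NEW_CAP_PERCENT = 3        # limit after the patch, as a percentage of ARC


def new_cap_bytes(arc_bytes):
    """Decompressed-data cap after the patch: 3% of ARC, rounded down."""
    return arc_bytes * NEW_CAP_PERCENT // 100


if __name__ == "__main__":
    # Hypothetical ARC sizes, chosen only for illustration.
    for arc_gib in (8, 64, 256):
        cap = new_cap_bytes(arc_gib * GIB)
        print(f"ARC {arc_gib:>3} GiB: old cap {OLD_CAP_BYTES / MIB:.0f} MiB, "
              f"new cap {cap / MIB:.0f} MiB ({cap / OLD_CAP_BYTES:.1f}x)")
```

Under these assumptions the new cap exceeds the old one for any ARC larger than roughly 3.3 GiB, so the gain grows with the amount of memory given to the ARC.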
#9 Updated by Dru Lavigne over 2 years ago
- Needs Doc changed from Yes to No
- Needs Merging changed from Yes to No