Bug #32628

Unstable write performance and session death, due to spacemap IO on large HDDs

Added by Stilez y 10 months ago. Updated 8 days ago.

Status:
Blocked
Priority:
Important
Assignee:
Alexander Motin
Category:
OS
Target version:
Severity:
Low Medium
Reason for Closing:
Reason for Blocked:
Waiting for feedback
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

This is a confirmed issue, whose cause has been checked with the FreeBSD-fs team and is very likely a known upstream ZFS bug related to a dirty data accounting issue. It is having a severe effect on my FreeNAS system. Dru asked me to file a bug report [1].

Scenario and symptoms as follows:

Hardware/setup:

  • 22 TB raw capacity pool, 55% full, extreme amounts of duplicate data (4.1x dedup ratio), hence dedup enabled and the server specced for dedup use.
  • Fast and appropriate server for a dedup pool (Xeon v4 3.5+ GHz, 10G network with Chelsio T4 cards, pool built on 7200 rpm enterprise drives, 128GB 2400 ECC RAM, NVMe ZIL + L2ARC, 1600W PSU + UPS, all new)
  • LAN config simple: PC, ESXi server and FreeNAS server all connected to ports on the same 10G switch
  • Largely single-user usage, via Samba (Windows) and iSCSI (ESXi/Windows)

Symptoms, troubleshooting, outcome:

The basic symptom is that large file writes across the LAN via Samba and iSCSI work at high speed, but then consistently time out with TCP zero window after some tens of seconds or minutes. This happens even if the only activity on the LAN or NAS is a single 100GB file being sent across Samba/iSCSI. It does not affect disk reads, nor writes that are local to the NAS rather than over the network.

Tcpdump, netstat, gstat and iostat (and a bit of listening to disk activity!) show that the issue originates in the server, that it relates to the network RCV buffer filling up, and that the TCP window is progressively reduced as a result. The window routinely ends up pushed to zero or a single-digit size for many seconds - as soon as it tries to recover, it's overwhelmed again. The behaviour can be seen with millisecond resolution: the network RCV buffer fills linearly and is not being emptied correctly. When it can empty, it empties very fast - within a millisecond - but, as described, it quickly gets pushed back down to zero.
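
(For reference, the kind of commands used for these observations - a sketch; the interface name is just an example:)

  # Watch per-socket receive queues (Recv-Q) fill while a copy runs
  netstat -an -p tcp

  # Capture zero-window advertisements leaving the NAS (TCP window field is bytes 14-15)
  tcpdump -ni cxgbe0 'tcp[14:2] = 0'

  # Per-disk IO load and latency at 1-second intervals
  gstat -p -I 1s
  iostat -x 1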

At the client end, if an SMB send gets a zero-window reply, it waits for a second, then another second, then 2, 4, 8, 20 and 20 seconds, then decides the link has failed and drops the Samba session after about 58 seconds of "hanging". At the NAS end, the buffers are not emptied promptly by ZFS, so they fill aggressively, and statistically it becomes very likely that at some point all 5 or 6 of the client's send attempts will land while a buffer is full - and that is the point at which the session can't recover.

Troubleshooting shows that the issue is almost certainly within ZFS - it's not a network issue, although that's where it is visible. The network itself is doing exactly what it should: when a buffer fills, reduce the window; when it empties, increase the window. If, for testing, I allow more dirty data to build up in RAM so that TXG sizes and write intervals increase considerably, I can hear the disks idling, then clattering as they flush the ZFS TXG data, and can also watch the TXG flushing with iostat. The network RCV buffer filling and suddenly emptying is tied to that sound - when the RCV buffer does empty according to netstat/tcpdump, it's audibly around the same time that a write flush starts and I can see heavy disk writes begin in iostat.
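
(For anyone repeating this test: the knobs that stretch the TXG cycle this way on FreeBSD/FreeNAS are, to my understanding, the ones below - values purely illustrative, and if they aren't writable at runtime they go in /boot/loader.conf instead:)

  # Allow more dirty data to accumulate in RAM per TXG (normally auto-sized, capped around 4GB)
  sysctl vfs.zfs.dirty_data_max=8589934592

  # Lengthen the interval between TXG syncs (default 5 seconds)
  sysctl vfs.zfs.txg.timeout=30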

Basic troubleshooting is all OK - switching ports, cables, cards, transceivers, switch and PCIe slots makes no difference, nor does reinstalling/testing with a clean install of client or server. I can't test on 1G, because 1G wouldn't saturate the RCV buffer in the first place, but there's no reason to believe it's a networking issue at all.

The basic suspicion is that ZFS is not pulling data from the network buffers correctly. In simple terms, what SHOULD happen is that data is always received smoothly into the current txg; when that txg is written out/flushed to disk, data from the network should be read smoothly into the next txg. That's not happening. What appears to be happening is that ZFS simply stops pulling data from the network at some points in its cycle. If at any point ZFS doesn't pull data from the network smoothly, at 10G speeds it would only take a few milliseconds for the RCV buffer to fill, and the symptoms noticed would be the expected outcome.

Results of asking on FreeBSD-fs (file system) mailing list:

I was pointed to this upstream "known bug" in ZFS as the very likely cause. It describes an issue specific to the ZFS deduplication write pathway, where dirty data accounting for txgs is not propagated correctly, so txgs are not flushed and filled as they should be. The bug description states:

"The [present] action is taken to handle various rounding errors and rare scenarios. However, it appears to mask problems with dirty data accounting in not so rare scenarios. For example, it seems that zio_ddt_write does not propagate io_physdone callback to any of child zios that it creates. We observe that during heavy writing to datasets with dedup enabled the dirty data size stays at its maximum value most of the time before dropping to almost zero and quickly growing back again. ... There are also other scenarios when a buffer is not deducted from the dirty size until the big cleanup. ... It is also not adjusted in the syncing context because there is no physical write."

"Logical i/os that write to more than one top-level vdev (i.e. those with more than one DVA, aka ditto blocks, which all metadata are) would only be accounted when both DVAs were written. This tended to cause the dirty accounting for metadata writes to jump at the end of the txg, rather than completing smoothly as physical i/o completes."

The reporter of the bug confirmed that this sounded like the likely reason for the issues observed on my NAS. I also posted a lot of the diagnostic/troubleshooting data on the forum, if it's helpful, though it's probably less useful now that a root cause seems to have been identified.

The upstream bug hasn't seen action for a long time, although the area needing a fix is described and it's only rated "medium" for difficulty. Can something be done to progress it?


Related issues

Related to FreeNAS - Bug #69699: Sysctl improvements related to zfs space map handling (Screened)

History

#1 Updated by Dru Lavigne 10 months ago

  • Assignee changed from Release Council to Alexander Motin
  • Target version changed from Backlog to 11.2-RC2

#2 Updated by Alexander Motin 10 months ago

  • Status changed from Unscreened to Screened
  • Priority changed from No priority to Important
  • Severity set to Low Medium

Thank you for the heads-up. I was looking at the code around dirty buffer accounting recently, but with different goals, and was not aware of that bug report. It needs another look.

#3 Updated by Alexander Motin 8 months ago

  • Subject changed from Upstream ZFS write pathway bug causing severe issues to Unstable write performance, possibly caused by ZFS
  • Status changed from Screened to Blocked
  • Target version changed from 11.2-RC2 to Backlog
  • Reason for Blocked set to Need additional information from Author

I had a conversation with Matt Ahrens about the mentioned bug, and he does not believe it should cause such significant problems. ZFS has a limit on the dirty data kept in ARC, which for large systems is about 4GB. Compared to that amount, whether the dirty amount is measured at the logical or the physical ZIO level should not matter, since it barely reaches even 1MB anyway.

So I would say that we need some additional input for this issue: a FreeNAS debug archive, created while the system is exhibiting the problem and attached to the ticket, would be a minimal starting point. Some visual observations of disk (gstat -I 1s -p), CPU (top -SHIz) and memory usage (also top, or other means) over time would also be helpful. Otherwise we may need to reproduce it ourselves, which will take more time.

#4 Updated by Dru Lavigne 8 months ago

  • Status changed from Blocked to Closed
  • Target version changed from Backlog to N/A

#5 Updated by Stilez y about 1 month ago

Please reopen this issue. I know what's causing it, now.

The previous interpretation of the issue was incorrect. The problem in the OP isn't dirty data cache freeing, as freebsd-fs list replies suggested. It's an unreported effect of space map processing, when faced with large modern HDDs. And it can kill even modest file write sessions.

I'm 100% confident about what's going on, at this point. To jump ahead, after documenting the points below, I found a BSDCan 2016 talk by Matt Ahrens this morning, which all but confirms this update - see below.

CURRENT TROUBLESHOOTING WORK

The original problem persisted through to FreeNAS 11.2-REL, and in December 2018 I decided to teach myself a bit more than basic dtrace to dig into it. The result was immediately revealing.

  • The pool is 4 striped 3-way mirrors of 8TB enterprise SAS-3 7200 rpm drives on an LSI 9311 HBA (PCIe3 x8), and is 57% full (14TB allocated / 10TB free, standard compression, 128K blocksize, etc). It should have very good async IO performance.
  • But upon quiescing the server and 10G LAN, then feeding it a single monolithic 50GB async write, dtrace unexpectedly showed an intense load of predominantly 4K (up to ~128K) IO being generated at device level. A single monolithic write shouldn't create 4K IOs in such profusion - at least not unless the pool is quite full, and this pool had over 10TB of free space across 4 vdevs. Additionally, the 1M ~ 64M writes for the data itself should be visible, but those are mostly missing, or at best form a very small, sporadic part of the disk requests.
  • It's also instantly apparent that if ZFS were somehow generating comparatively many 4K IO requests, it would explain the stalling of writes described in the original post - the TXG cycle never clears the disk IO queues at pool-device level quickly enough to cope. This immediately suggested some kind of fragmentation issue, probably metadata-related, because, as zdb -mmm confirmed, the pool has never held nearly enough data to have trouble finding a few contiguous 50G chunks. It had many empty metaslabs.
  • The proximate cause of the file write stall now seemed clear. It looked like some form of fragmentation, but the pool was far, far below the level where most documentation says fragmentation kicks in (85% "best fit"). I pulled the very similar HDDs and HBAs from a backup server, added them to the server showing the symptom, created a comparable empty pool (also 4-way striped), and performed the identical write to that pool - and the issue completely vanished. Same client, same source, same system, almost identical HBA/disks, empty pool. Dtrace obligingly showed that, as expected, this correlated with write sizes/volumes coming back to expected values. Almost all writes to pool #2 were 1M ~ 64M as expected, with only a modest amount of 4K (or 4K ~ 128K) IO. The 50GB monolithic file write had no stalling issues at all.
  • I tried numerous sysctl approaches suggested by Delphix's blog on write throttle tuning (especially the last part), but none helped much, even at extreme values. For example, I doubled and redoubled delay_scale from 500K eventually up to 50M (x100), and when that achieved nothing, I took the dirty data max/min percentage parameters down gradually from 60% to as low as 5% ~ 15%, as suggested. Even that wasn't enough to fix it. (A sketch of the tunables involved follows this list.)
  • The ZFS write throttle should respond when disk IO is choking the system or "maxing out", but clearly there's a secondary problem: for unknown reasons the write throttle is not invoked, or not adequately effective, in this scenario. Presumably either the 4K flurry originates somewhere the write throttle isn't triggered, or the write throttle isn't designed for this scenario and falls short.
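
A sketch of the write-throttle tunables I mean, per the Delphix article. Exactly which of these maps to the "max/min params" mentioned above is from memory, and the values shown are only illustrative end-points of the ranges I tried:

  # FreeBSD sysctl names; any that aren't writable at runtime go in /boot/loader.conf instead
  # Scale factor for write-throttle delay (default 500000); raised in steps up to 100x
  sysctl vfs.zfs.delay_scale=50000000

  # Dirty-data percentage at which delays start (default 60)
  sysctl vfs.zfs.delay_min_dirty_percent=15

  # Dirty-data percentages that ramp async write IOs from min to max active (defaults 30 / 60)
  sysctl vfs.zfs.vdev.async_write_active_min_dirty_percent=5
  sysctl vfs.zfs.vdev.async_write_active_max_dirty_percent=15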

DTRACE OUTPUT
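
(For anyone wanting to reproduce this kind of per-device breakdown, a minimal io-provider one-liner along these lines is a starting point - a sketch, not the exact script used for the output below, which also splits requests into Read/Write/Other:)

  # Distribution of device-level IO request sizes (bytes), keyed by disk
  dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'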

Typical dtrace output for pool #1. The histograms show a predominance of 4K (~ 128K) writes:

da12 Write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@                      113
            8192 |@@@@@@@@@@@                              68
           16384 |@@@@@@@                                  42
           32768 |@@                                       12
           65536 |                                         2
          131072 |                                         0

da14 Write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@                        97
            8192 |@@@@@@@@@@@@                             65
           16384 |@@@@@@@@                                 46
           32768 |@@                                       13
           65536 |                                         1
          131072 |                                         1
          262144 |                                         0

da13 Write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@                       115
            8192 |@@@@@@@@@@@@                             76
           16384 |@@@@@@@@                                 47
           32768 |@@                                       10
           65536 |                                         2
          131072 |                                         0

da7 Write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@                     67
            8192 |@@@@@@@@@                                31
           16384 |@@@@@@                                   19
           32768 |@@                                       8
           65536 |@                                        4
          131072 |@@                                       8
          262144 |                                         0

da4 Write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@                  102
            8192 |@@@@@@@                                  29
           16384 |@@@@                                     18
           32768 |@@@                                      12
           65536 |@@                                       9
          131072 |@                                        5
          262144 |                                         0

da5 Write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@                     75
            8192 |@@@@@@@@                                 28
           16384 |@@@@@                                    19
           32768 |@@                                       9
           65536 |@@@                                      12
          131072 |@@                                       6
          262144 |                                         0

da8 Write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@                     95
            8192 |@@@@@@@@@                                42
           16384 |@@@@@                                    23
           32768 |@@@                                      15
           65536 |@@                                       8
          131072 |@@                                       9
          262144 |                                         0

da1 Write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@                     105
            8192 |@@@@@@@@                                 43
           16384 |@@@@                                     23
           32768 |@@@@                                     21
           65536 |@@                                       12
          131072 |@@                                       9
          262144 |                                         0

da6 Write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@                          65
            8192 |@@@@@@@                                  31
           16384 |@@@@@                                    23
           32768 |@@@@@@                                   25
           65536 |@@@                                      14
          131072 |@@                                       10
          262144 |                                         0

da3 Write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@                       59
            8192 |@@@@@@                                   20
           16384 |@@@@                                     13
           32768 |@@@@                                     12
           65536 |@@@@                                     13
          131072 |@@@@@                                    16
          262144 |                                         0

ada1 Write
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@                                       13
            1024 |@@                                       17
            2048 |@@@@@@                                   52
            4096 |@@@@@                                    41
            8192 |@@@                                      25
           16384 |@@@@@                                    46
           32768 |@                                        7
           65536 |@                                        8
          131072 |@@@@@@@@@@@@@@@                          126
          262144 |                                         0

ada0 Write
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@                                        10
            1024 |@@                                       16
            2048 |@@@@@@                                   50
            4096 |@@@@@                                    37
            8192 |@@@@                                     29
           16384 |@@@@@                                    44
           32768 |@                                        6
           65536 |@                                        9
          131072 |@@@@@@@@@@@@@@@                          126
          262144 |                                         0

da5 Write
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@                           51
          131072 |@@@@@@@@@@@@@@@@@@@@@                    77
          262144 |@@@                                      12
          524288 |@@                                       6
         1048576 |                                         1
         2097152 |@                                        2
         4194304 |                                         0

da4 Write
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |@                                        3
           32768 |                                         1
           65536 |@@@@@@@@@@@@@@@@@@@@                     87
          131072 |@@@@@@@@@@@@@@                           60
          262144 |@@@                                      14
          524288 |@@                                       7
         1048576 |                                         0
         2097152 |                                         2
         4194304 |                                         1
         8388608 |                                         0

da12 Write
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@                                  44
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             164
          262144 |@@@@                                     23
          524288 |@                                        4
         1048576 |                                         2
         2097152 |                                         0

da13 Write
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        207
          262144 |@@@@@                                    34
          524288 |@                                        7
         1048576 |                                         2
         2097152 |                                         0

da1 Write
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |@@                                       8
           65536 |@@@@@@@                                  35
          131072 |@@@@@@@@@@@@@@@@@@@@                     109
          262144 |@@@@@@@                                  39
          524288 |@@@                                      17
         1048576 |@                                        4
         2097152 |                                         0
         4194304 |                                         1
         8388608 |                                         0

da8 Write
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |                                         2
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@                118
          262144 |@@@@@@@@@@                               46
          524288 |@@@@                                     19
         1048576 |@                                        6
         2097152 |                                         0
         4194304 |                                         0
         8388608 |                                         1
        16777216 |                                         0

da0 Write
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@@@@@@                             49
          131072 |@@@@@@@@@@@@@@@@@@                       76
          262144 |@@@@@                                    20
          524288 |@@@@                                     19
         1048576 |                                         0
         2097152 |                                         2
         4194304 |                                         1
         8388608 |                                         2
        16777216 |                                         0

Dtrace for pool #2 on the same server shows the complete opposite - the expected results: almost no small IO, and the predominant large writes now showing up as well:

da14 Read
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@                     1
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@                     1
        67108864 |                                         0

ada0 Other
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |@@@                                      1
           65536 |@@@                                      1
          131072 |                                         0
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        10
         8388608 |                                         0

ada1 Other
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@                                  2
          131072 |                                         0
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        10
         8388608 |                                         0

da0 Read
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@@@@                               1
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |@@@@@@@@@@@@@@@@@@@@                     2
         8388608 |                                         0
        16777216 |                                         0
        33554432 |@@@@@@@@@@                               1
        67108864 |                                         0

da12 Read
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@@@@@@@@@@@@@@                     2
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |                                         0
         8388608 |@@@@@@@@@@                               1
        16777216 |                                         0
        33554432 |@@@@@@@@@@                               1
        67108864 |                                         0

da0 Other
           value  ------------- Distribution ------------- count
        33554432 |                                         0
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
       134217728 |                                         0

da1 Other
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
        67108864 |                                         0

da3 Other
           value  ------------- Distribution ------------- count
        33554432 |                                         0
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
       134217728 |                                         0

da6 Other
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
        67108864 |                                         0

da13 Other
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
        67108864 |                                         0

da14 Other
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
        67108864 |                                         0

da8 Other
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@                     1
        67108864 |@@@@@@@@@@@@@@@@@@@@                     1
       134217728 |                                         0

A bit more digging led me to intense reading on metaslabs and spacemaps, and their fragmentation.

NOTE: Since posting this, I've created a "heat map" of metaslab usage and fragmentation (see below). It shows that of the 4 vdevs, the worst fragmentation was 2 vdevs at just <= ~55% FRAG, and every vdev had > ~25% of metaslabs with 50% or less usage, confirming my impression that the pool just shouldn't be exhibiting fragmentation issues or finding it hard to write such data.

While this pool's spacemaps don't look very fragmented at all (per zdb -mm/-mmm, charting the result per metaslab), apparently they are fragmented enough - on the evidence that they can choke ZFS to a total standstill.
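
(The per-metaslab view came from zdb output along these lines - the pool name is an example, and the -mm/-mmm output format varies between versions:)

  # Per-metaslab allocation, free space and fragmentation for each vdev
  zdb -mm tank

  # Add per-segment spacemap detail (very verbose on a pool this size)
  zdb -mmm tank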

CLIENT SCREENSHOT AND DISCUSSION

Here's a quick screenshot comparing the same write to the two pools, on the same system (Windows client -> Samba):

The top shows 2 x 30GB files copying to pool #1 (30TB, 57% full, presumed fragmented); the bottom shows the same 2 files copying to pool #2 (30TB, newly created/empty). The copies were started within a couple of seconds of each other and ran simultaneously. Same client, same files, same LAN link, same Samba daemon, same system, same FreeNAS box at the other end. On the NAS, 11.2-U1 REL with two pools: same HBA, same structure, very similar disks. One pool is established with 14TB of 25TB used; the other is 25TB and empty.

For a while both transfers run fine. They share the bandwidth and both write at 200 MB/s. But after about a minute, the first transfer plummets to zero - and never recovers. The second transfer, however, not only continues but is able to use the spare bandwidth to double its speed.

So, as stated, the issue isn't in the client OS, the LAN, or the network stack. The only differentiator is that the two pools both have plenty of free space but different fragmentation - and dtrace shows the first is apparently being killed by 4K IO that the second isn't experiencing. Or at least, it's experiencing both a large amount of small IO and a backing-up of disk IO requests that stalls the source data pipeline, which amounts to the same thing.

DISCUSSION + FOLLOWUP SINCE ABOVE

Firstly, this is clearly an upstream ZFS issue, so running it by someone who knows ZFS intimately will be helpful.

I've emailed a summary "out of the blue" to a couple of people this last week, but no replies yet. I probably hit spam/"unknown sender" filters, or didn't explain it well.

I've also tried to take things further. Spacemap management in ZFS is not well documented; I've gone through metaslab.c, but there are no hints beyond rebuilding the pool (zfs send > recv) to defragment it. I'm doing that. It will probably work, but the problem is likely to return periodically.
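
(The rebuild itself is just the standard replication approach - a sketch with hypothetical pool/snapshot names:)

  # Snapshot everything, then replicate the whole hierarchy to the freshly created pool
  zfs snapshot -r tank@migrate
  zfs send -R tank@migrate | zfs receive -F tank2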

MATT AHRENS' 2016 TALK

Finally - and what prompted this bug report update - I found apparent confirmation of the scope for exactly the issue I've described, in a talk by Matt Ahrens at BSDCan 2016. The linked PDF discusses allocation performance, and I draw the following comments from it (see page 20, "Writing spacemaps", onward):

  • "Dynamic behaviour: ~600 i/os per vdev per txg"
  • _"Problems: Reading spacemap is slow, and loaded spacemaps use lots of RAM
  • "... space_map_blksz=4KB" [This fits the "predominantly 4K" IO seen]
    "... space_map_blksz=128KB would mean writing 25MB/vdev/txg"
    "... frees are scattered across all metaslabs"
    "... 1TB vdev => spacemap up to 5MB (1,280 blocks)" (so for 30TB.....?)
    "... metaslabs cover huge range (50GB on 10TB vdev)"
    "... range tree uses up to 320MB RAM (on 10TB vdev) - 5 million segments per metaslab"

This seems to describe exactly what this bug reports.

What that paper doesn't mention, or conclude, is that this constitutes an issue capable of killing fairly modest and unremarkable file server sessions and choking the server to a standstill, even on fast, adequately specced enterprise hardware under very low load, such as a single client writing a single 30GB file to a ~55% full pool.

WORKAROUNDS AND MITIGATION HOPES?

Since a proper fix will be complex, perhaps there are workarounds. I can think of three that might be helpful to work on until the underlying problem has a better answer:

  1. BETTER/MORE PRECISE DATA VIA DTRACE?: As the paper suggests, there may be workarounds in terms of metaslab parameters or other tunables, or appropriate dtrace scripts that could be used to identify an optimal metaslab setup.
  2. TUNABLES AND DEFAULTS?: What would appropriate metaslab-related tunables be when a pool comprises 8 - 20 TB vdevs? Clearly ZFS spacemaps were not envisaged to cover 100 GB each, which is the size expected when HAMR HDDs launch next year. Right now a spacemap covers 0.5% of the vdev, however large that is. What should the min and max metaslab sizes, and the spacemap block size, be as common individual HDD sizes grow from 8TB towards 24TB, to ensure sanity? (A sketch of candidate tunables follows this list.)
  3. INVOKE EFFECTIVE WRITE THROTTLE?: A more immediately promising mitigation could be to look at why write throttling isn't being triggered when spacemap fragmentation leads ZFS to start choking on 4K and other small IO. A good first step would be to ensure that this issue can't get as far as paralysing not just ZFS but the entire networking or other session (if it's a file write over the LAN), without the write throttle kicking in much earlier to slow down the source.
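
On point 2, the candidate knobs I'm aware of on FreeBSD/FreeNAS are below. The names and defaults should be double-checked for this release, the values are purely illustrative, and my understanding is that the metaslab count is fixed when a vdev is created, while the spacemap block size only applies to newly (re)written spacemaps:

  # In /boot/loader.conf, before creating the pool/vdevs
  vfs.zfs.space_map_blksz="131072"         # spacemap block size (default 4096)
  vfs.zfs.vdev.metaslabs_per_vdev="800"    # target metaslab count per vdev (default 200)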

If anything can be done on these last points, please do. Right now this file server - despite being built specifically for ZFS's needs (even heavily over-specced) - is simply unusable without this kind of help.

It may start to hit others more often too, especially 10G LAN commercial users buying into HAMR in 2019/2020 - exactly the same scenario. Also, the longer ZFS is out there, the more pools there will be that have had time to develop this issue.

Thanks for reading, sorry it was long!

#6 Updated by Stilez y about 1 month ago

  • Subject changed from Unstable write performance, possibly caused by ZFS to Unstable write performance and session death, due to spacemap IO on large HDDs

#7 Updated by Alexander Motin about 1 month ago

  • Status changed from Closed to Unscreened

Thanks for your input, I'll look at it more closely as soon as I can. Reopening the ticket for now.

In the context of the mentioned spacemaps on large disks: there is a new spacemap format, support for which should appear in FreeNAS 11.3. It is supposed to store large sequential chunks of free space more efficiently. It won't help much for fragmented pools, where the spacemap is big for a reason, but it is at least something.

Also, there were a few bugs fixed during the 11.2 process related to choosing the metaslab to write to, which should reduce the chance of synchronous spacemap loads severely hurting write performance. That significantly improved a number of our internal tests with heavy fragmentation, but maybe your case is different.

#8 Updated by Stilez y about 1 month ago

  • Seen in changed from 11.1-U4 to 11.2-RELEASE-U1

Thanks Alex.

The above report was on 11.2-U1, so I've updated the "Seen in" field to make clear it's seen on the current system. Please revert if the version should be left at 11.1-U4.

The 11.2 fixes didn't seem to improve anything - I didn't realise there had been any changes. It's hard to see a reason why my pool would be especially pathological - if anything, I'd expect it to be less heavily churned and to have fewer issues than an average 2-year-old pool, if I had to guess at some kind of norm for usage.

BASIC SPACEMAP ANALYSIS (per zfs + zdb output):

VDEV+ POOL STATS

VDEV        SIZE      ALLOC      FREE      FRAG      CAP        NUM. OF       VDEV          METASLAB
                                                                METASLABS     SIZE (MB)     SIZE (MB)

0           7.25T      4.12T     3.13T      41%      56%        232            7602176      32768
1           5.44T      3.83T     1.61T      56%      70%        174            5704253      32783
2           5.44T      3.82T     1.62T      56%      70%        174            5704253      32783
3           (OMITTED: ZIL)
4           7.27T      2.70T     4.56T      29%      37%        465            7623148      16394

Pool total  25.4T      14.5T     10.9T      43%      57%

METASLAB SPACE USED (HEAT MAP)

VDEV              0      1      2      4

COUNT OF METASLABS BY USAGE (0%=UNUSED)

0% - 4.9%        47      8      4    186
5% - 9.9%         8      2      3     13
10% - 14.9%       8      4      5     12
15% - 19.9%       4      4      2     10
20% - 24.9%       5      5      5      5
25% - 29.9%       3      5      4      9
30% - 34.9%       4      6      5     18
35% - 39.9%       7      5      4     37
40% - 44.9%       2      4      5      6
45% - 49.9%       4      4      4     10
50% - 54.9%       3      4      6      5
55% - 59.9%       3      3      7     10
60% - 64.9%       8      9      8      4
65% - 69.9%       5      4      6      6
70% - 74.9%      10      7     12      8
75% - 79.9%       7      3      2      7
80% - 84.9%      15      6     16      6
85% - 89.9%      19     20     20     18
90% - 94.9%      28     17     20     48
>=95%            42     54     36     47

TOTAL           232    174    174    465

SUMMARY:

<50%             40%    27%    24%    66%
50-69%            8%    11%    16%     5%
70-84%           14%     9%    17%     5%
85-89%            8%    11%    11%     4%
90%+             30%    41%    32%    20%

                100%   100%   100%   100%

TOTAL SPACEMAP RECORDS FOR POOL:  ~400M  (per zdb -mmmm)

As you can see, there is some fragmentation, but as I described, certainly not what would seem excessive or enough to pathologically kill ZFS async write processing. Would any other info help at this point?
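
(For reference, the per-vdev SIZE/ALLOC/FREE/FRAG/CAP figures above come from standard tool output along these lines, and the per-metaslab usage buckets were tallied from zdb -mm; the pool name is an example:)

  # Per-vdev capacity and fragmentation summary
  zpool list -v tank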

#9 Updated by Stilez y about 1 month ago

See also request: bug #69699 - sysctl change requests that may help somewhat.

#10 Updated by Dru Lavigne about 1 month ago

  • Target version changed from N/A to Backlog
  • Reason for Blocked deleted (Need additional information from Author)

#11 Updated by Alexander Motin about 1 month ago

  • Related to Bug #69699: Sysctl improvements related to zfs space map handling added

#12 Updated by Alexander Motin about 1 month ago

  • Status changed from Unscreened to Blocked
  • Reason for Blocked set to Waiting for feedback

A colleague reminded me that in the original message you mentioned the use of dedup. As I see in ddt_zap.c, it also uses 4KB blocks for both indirect and data blocks. Maybe that is what you are really seeing as 4KB disk reads, not the spacemap. Spacemap reads, while indeed 4KB now, are usually done in bursts with significant prefetch and queue depth, which means they should not block I/O often or for very long. Spacemaps, unlike dedup tables, are easily prefetchable, and in many cases should be preloaded well in advance of being needed, so they should not block I/O. In our tests with heavily thrashed pools we could see I/O blocked by spacemap loads, but at that point the system was visibly CPU-bound processing what was read, rather than doing the reads.

So, if you are still using dedup, please disable it, or create a new dataset with it disabled, and let's see what happens. While that won't do anything for already-deduped data, new writes after the change should not depend on the dedup table, IIRC.
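
For example, something along these lines (dataset names are just examples):

  # New dataset that bypasses dedup for new writes
  zfs create -o dedup=off tank/nodedup-test

  # Or turn it off for an existing dataset; existing blocks stay dedupped
  zfs set dedup=off tank/mydataset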

#13 Updated by Stilez y about 1 month ago

I'm trying it the other way around - defragging the pool via send -R | recv to an identically structured but newly created pool on the same server (same HDD models, same vdev+pool layout), but with 3 changes: the defrag itself, large spacemap blocks, and smaller metaslabs (8GB each => 698 metaslabs on the 1 x 6TB vdev and 931 metaslabs on each of the 3 x 8TB vdevs, versus ~200 per vdev previously). The old pool used the standard defaults of 4K spacemap blocks and 200 metaslabs per vdev (=> 28GB metaslabs on the 1 x 6TB vdev and 37GB on each of the 3 x 8TB vdevs).

(Regarding dedup: the pool currently obtains a ratio of about 3.2x on 14TB of data, so the space and cost saving is huge and the config is appropriate. The server was specced from scratch for dedup use, with a fast AES-NI CPU, 250GB NVMe L2ARC, enough fast ECC RAM to hold the entire DDT in ARC with 80GB to spare, and sysctls set to ensure >= 85GB of ARC is reserved for metadata, which is well above the ~40GB RAM size of the DDT, so the DDT is likely to be both held and infrequently evicted.)

The send/recv has a few days to run; then I'll test the old and new pools and discover whether the fault persists on identical pools and datasets with only those differences between them, with both the default and the modified loader variables for these sysctls. Please hold the ticket open for the next week for this.

I have a debug request: if you can, perhaps append to this some quick and crude suggested dtrace scripts that will provide (a) a histogram of spacemap reads and their sizes, (b) a histogram of spacemap writes and their sizes, and (c) DDT reads and their sizes, so we can obtain certainty.
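
For illustration, the kind of thing I mean - assuming these ZFS kernel functions exist as fbt probes on this build and aren't inlined, which I haven't verified:

  # Rough counts of spacemap loads/writes and DDT lookups during a test write
  dtrace -n '
      fbt::space_map_load:entry  { @calls["space_map_load"] = count(); }
      fbt::space_map_write:entry { @calls["space_map_write"] = count(); }
      fbt::ddt_lookup:entry      { @calls["ddt_lookup"] = count(); }
  '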

Last for the moment: whatever the cause, there does seem to be a clear write-pathway loophole/bug whereby, if large numbers of small metadata IO requests are generated "on the side" (from spacemaps or otherwise), ZFS doesn't trigger the necessary/expected write throttling or gracefully signal the source to slow down when the disk system becomes choked/overwhelmed. Can you check whether your colleague tentatively concurs or has an alternative understanding, and again, whether there are any suggestions for how to confirm if there is another explanation?

#14 Updated by Alexander Motin 8 days ago

Have you had any results from your new pool or other updates?

Unfortunately I have no preexisting scripts to analyze the I/O you've requested.
