Project

General

Profile

Bug #55728

mountd queue overflowing

Added by Dave Flowers about 1 month ago. Updated 2 days ago.

Status:
Ready for Testing
Priority:
No priority
Assignee:
Sean Fagan
Category:
OS
Target version:
Seen in:
Severity:
Medium
Reason for Closing:
Reason for Blocked:
Other: make a note in comments
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

I have a compute cluster (Ubuntu, current) that uses several freenas's as fileservers. Directories (both data and user) are automounted from the cluster. When someone launches a job across a large number of machines at once (40-50), attempts to mount either home or data directories can fail with "permission denied". Looking at the servers, this is because the mountd listen queue is overflowing:

sonewconn: pcb 0xfffff801e5278570: Listen queue overflow: 193 already in queue awaiting acceptance (48 occurrences)

One time, I watched the results from netstat -Lan to confirm this was from mountd and saw the queue size reach 188 (out of 128).

There does not appear to be any way to either increase the size of the queue (either by kernel variable, ala kern.ipc.soacceptqueue, or by command line option), nor any way to improve mountd's ability to serve queue requests (such as by giving it more threads). I have found historical references to kern.ipc.somaxconn, but they're generally around eight years old, and the code for mountd that I pulled from a current version of freebsd has the listen queue limited to SOMAXCONN, which is hard coded to 128.

This would appear to be a general problem of using a freenas to serve files to a compute cluster.

Associated revisions

Revision 1e9c76f0 (diff)
Added by Sean Fagan 7 days ago

Another fix for the mountd problem reported by Dave Flowers;
this one reduces the number of DNS calls made.

I'll be doing a differential revision for FreeBSD as soon as
my build for that finishes and I can test it.

Ticket: #55728

Revision 9e7a34ca (diff)
Added by Sean Fagan 6 days ago

Use 80 column lines; this again matches the freebsd differential
proposal.

Ticket: #55728

Revision 5aa097bc (diff)
Added by Sean Eric Fagan 5 days ago

Another fix for the mountd problem reported by Dave Flowers; (#158)

  • Another fix for the mountd problem reported by Dave Flowers;
    this one reduces the number of DNS calls made.

I'll be doing a differential revision for FreeBSD as soon as
my build for that finishes and I can test it.

Ticket: #55728

  • Use 80 column lines; this again matches the freebsd differential
    proposal.

Ticket: #55728

History

#1 Updated by Dru Lavigne about 1 month ago

  • Category changed from Services to OS
  • Assignee changed from Release Council to Alexander Motin

#3 Updated by Alexander Motin about 1 month ago

  • Assignee changed from Alexander Motin to Sean Fagan

It does look like a mountd limitation without run-time workaround. Use of the hard value of SOMAXCONN does not allow to use of increased kern.ipc.soacceptqueue, if it is. Quickly grepping through the code I see some base system daemons pass -1 there instead, which defaults to present kern.ipc.soacceptqueue value. If I see correctly it is a violation of POSIX, but on FreeBSD it works.

Sean, could you investigate this deeper, discuss with FreeBSD people and fix? It should be a low-hanging fruit unless somebody object it formally.

#4 Updated by Sean Fagan about 1 month ago

Last I saw, there was a push to move even the NFSv3 mount code into the kernel, so I do expect there to be some pushback.

Changing the listen to use -1 is easy, obviously, and makes sense. Increasing kern.ipc.soacceptqueue for mountd might have some issues with impacting the rest of the system. Making it use multiple threads looks a bit tricky, as the RPC code is involved.

I don't have a compute cluster so that makes testing it hard. Dave, I can easily build you a version for freenas that changes the listen() call, to see if that helps; would you be willing to try that? After that, things get harder.

#5 Updated by Alexander Motin about 1 month ago

Pushback on what? Kernel socked API has exactly the same backlog handling, so no matter where the code is running, it would not affect the overflow handling. The only argument against I may guess is POSIX violation, or some specific reason to have a hard limit here.

#6 Updated by Sean Fagan about 1 month ago

Pushback on any major change to mountd :).

Making listen use -1 is an easier change to mountd than adding options, so I don't expect much pushback on that. Figuring out how to add threads is a larger change, and thus likelier to be objected to.

#7 Updated by Sean Fagan about 1 month ago

  • Status changed from Unscreened to Blocked
  • Reason for Blocked set to Waiting for feedback

The change works, but I have no idea if it fixes the problem, since I can't reproduce it.

#8 Updated by Sean Fagan 14 days ago

This was commited as r340442 on FreeBSD HEAD.

#9 Updated by Sean Fagan 12 days ago

Pull request #157 for os.

#10 Updated by Dave Flowers 11 days ago

  • File tcpdump_output added

I pulled the code to mountd.c from github and compiled it on one of our freenas's. It works, in that the queue size can now be set by kern.ipc.soacceptqueue.

Sadly, this has not solved my problem, although it has helped me to get more information on it, since mountd is (eventually) processing all the requests it gets, so I don't have to consider stuff being dropped.

I'm not sure if this is something I should put in a new bug report or not, so I'll post it here and you can direct me to repost it to a new issue as need be.

The current behavior of mountd under the conditions described above (getting near simultaneous mount requests from 47 machines) is roughly the same, with this patch in place - most of the mount requests still fail. The problem is not lost or dropped packets, but timeouts - mountd is simply taking too long to serve mount requests.

As I had access to source, I put in some more logging to try and figure out what was going on, adding syslog entries for each pass through mountd's main serving loop.

The general flow for a request was

client (on random port) -> server : null request
client (on low port) -> server : null request (solaris skips this step)
client (on same low port) -> server : actual mount request

The trouble was that mountd was only serving 11 requests every 5 seconds. This meant that the server's reply to a given request usually came too late, and the client wouldn't accept the response, having sent out another initiating null request. After a few minutes of this, the clients would give up and mark the disk as inaccessible.

I don't know why mountd was serving requests so slowly, but it was very evident in the system logs - every 5 seconds, 11 requests would get (nearly instantly) processed. This is obviously incorrect behavior, but I have no idea why it's happening. nfs service in general works fine, even to very high speeds. I've done tcp logging on the server, and the 5 second delay is present there, too - after an initial ~1.4 second flurry there's a 3.8s gap (I've attached the tcpdump data; 807 is the mountd port), and that repeats (albeit with less activity).

It looks a lot like mountd is being rate limited in some way, but I have no idea what could be doing that.

#11 Updated by Sean Fagan 11 days ago

  • Status changed from Blocked to In Progress

Hrm, neither do I. The analysis helps though.

mountd itself doesn't rate limit. Unless there's a firewall rule in place doing rate limiting, the kernel prints about limited rates (but that's only for ICMP). If there are dropped packets, that might explain it (since it's TCP, with backoff, but not that high).

11:27:49.894864 IP 172.26.23.59.807 > 172.26.22.41.37826: Flags [.], ack 45, win 1025, options [nop,nop,TS val 2567334464 ecr 527664253], length 0
11:27:51.988069 IP 172.26.23.59.807 > 172.26.22.77.40045: Flags [P.], seq 1:29, ack 46, win 1026, options [nop,nop,TS val 4082121105 ecr 1065761289], length 28

seems to be a good example here: there is a 2.093205 second gap between the responses, and there's nothing else coming in on port 807.

mountd itself does a fair amount of work, including system calls, between getting the request, and sending out a response. Which means it's possible the issue is the filesystem. I do note there are lots of requests coming in. That said, a 2.1second delay doesn't seem right, either.

#12 Updated by Dave Flowers 8 days ago

With a bit of low-tech profiling, I found the issue - getnameinfo(), when called the first time in mntsrv() to perform a reverse name lookup, sometimes takes 5 seconds to return (instead of a small fraction of a second). This may be an issue with our local setup, but it also shows that getnameinfo() is being called a lot more than it needs to be - each pass through mntsrv(), even though only the MNT, UMNT, and UMNTALL cases actually use the host name it returns. I suggest moving getnameinfo() into those switch cases rather than putting it up front (replace all the !lookup_failed checks with the call to getnameinfo()). Most calls to mntsrv() are for the NULL case, which doesn't need it.

This does solve the problem for me, as it allows the calls to NULL to quickly return, keeping the handshake active. The actual mounts are slow, though, because of the call in the MNT case, so I've gone and removed it and instead will just use the numeric host address (which completely solved the problem).

When I have time, I might look into why the dns lookup is sometimes giving a five second pause, but this is not quite as urgent at this point.

#13 Updated by Sean Fagan 8 days ago

  • Severity changed from New to Medium

I thought about DNS, but was tracing the wrong path. Thanks for finding that.

diff --git a/usr.sbin/mountd/mountd.c b/usr.sbin/mountd/mountd.c
index a39b4a73222..08d0b301690 100644
--- a/usr.sbin/mountd/mountd.c
+++ b/usr.sbin/mountd/mountd.c
@@ -1024,8 +1024,13 @@ mntsrv(struct svc_req *rqstp, SVCXPRT *transp)
                syslog(LOG_ERR, "request from unknown address family");
                return;
        }
-       lookup_failed = getnameinfo(saddr, saddr->sa_len, host, sizeof host, 
-           NULL, 0, 0);
+       switch (rqstp->rq_proc) {
+       case MOUNTROC_MNT:
+       case MOUNTPROC_UMNT:
+       case MOUNTPROC_UMNTALL:
+               lookup_failed = getnameinfo(saddr, saddr->sa_len, host, sizeof host, 
+                   NULL, 0, 0);
+       }
        getnameinfo(saddr, saddr->sa_len, numerichost,
            sizeof numerichost, NULL, 0, NI_NUMERICHOST);
        switch (rqstp->rq_proc) {

would seem to do that.

My main blocker in getting this checked in is reproduction. I can think of a few reasons why it might be happening for you -- obviously, check your name server list, and ensure that each of them can do the reverse lookup. I assume you're not using NIS :).

#14 Updated by Dave Flowers 8 days ago

Ahh! Sure enough, one of the machines in /etc/resolv.conf was incorrect (it no longer runs a dns server), and the default for RES_TIMEOUT is, of course, five seconds. So, whenever that machine was queried, there was a five second pause.

Still, best to avoid unneeded dns traffic, right? ;)

#15 Updated by Sean Fagan 8 days ago

Well I can probably reproduce that easily enough then, I think. If you want to try the patch as well, that'd be swell.

#16 Updated by Dave Flowers 7 days ago

Your patch has one typo, but otherwise seems to work fine.

MOUNTROC_MNT should be MOUNTPROC_MNT

#17 Updated by Sean Fagan 7 days ago

Yeah, I found that in my build log. Thanks!

#18 Updated by Sean Fagan 7 days ago

  • Status changed from In Progress to Blocked
  • Reason for Blocked changed from Waiting for feedback to Other: make a note in comments

Pull request 158 for freenas/os/11-stable.

#19 Updated by Sean Fagan 5 days ago

  • Status changed from Blocked to Ready for Testing
  • Target version changed from Backlog to 11.2-U2

Merged (and also into FreeBSD).

#20 Updated by Dru Lavigne 2 days ago

  • File deleted (tcpdump_output)

Also available in: Atom PDF