Bug #927

NFS lockd locks up and hoses clients too

Added by Joe Greco over 6 years ago. Updated about 1 year ago.

Status:
Closed: Cannot reproduce
Priority:
Important
Assignee:
-
Category:
Middleware
Target version:
-
Start date:
Due date:
% Done:

0%

Severity:
Backlog Priority:
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Hardware Configuration:
ChangeLog Required:
No
QA Status:
Not Tested

Description

I'm running across a problem that seems very similar to

http://lists.freebsd.org/pipermail/freebsd-net/2009-July/022443.html

rpc.lockd appears to be getting stuck. When this happens, all other NFS clients eventually hang as soon as they attempt a lock.

On the [[FreeNAS]] side, the problem is characterized by rpc.lockd being stuck in "rpcrec", and a stream of kernel messages.

% ps agxlww | grep lockd
0 1619 1 0 44 0 7956 804 rpcrec Ds ?? 0:02.45 /usr/sbin/rpc.lockd
% dmesg
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
NLM: failed to contact remote rpcbind, stat = 5, port = 28416
etc

On the client side, it's a stream of

"nfs server $foo:$mnt: lockd not responding"

The rpc.lockd process is not killable via kill -9; a reboot temporarily clears the condition. If you simultaneously reboot the client that apparently triggered the condition, the problem sometimes goes away. Rebooting both the server and the client fixes it, until the client triggers it again.

The client that is causing this is Apple OS X, and it seems to have been brought on by trying to share our photo archive and letting iPhoto create its library on the NFS share.

I realize that this is not strictly a [[FreeNAS]] problem and is most likely [[FreeBSD]]'s lockd implementation. However, I'm reporting it here because it's happening to me on [[FreeNAS]] and it's more likely to impact [[FreeNAS]] users.

History

#1 Updated by Anonymous about 6 years ago

The only way to work around this issue on NFSv3 is to use soft mount points. Unfortunately, this isn't an available option from the NFS server side.

This will most likely be a non-issue with NFSv4 because NFSv4 is a stateful protocol -- unlike v2 and v3 -- but I could be wrong.
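As a sketch of the soft-mount workaround on the client side (the server name, export path, and timeout values here are illustrative, not taken from this report), the mount would look something like:

```shell
# Linux client: with "soft", NFS calls error out after the retry
# budget is exhausted instead of hanging the process forever
mount -t nfs -o soft,timeo=30,retrans=3 freenas.example.local:/mnt/tank /mnt/photos

# FreeBSD client equivalent via mount_nfs
mount_nfs -o soft,retrans=3 freenas.example.local:/mnt/tank /mnt/photos
```

Note the trade-off: soft mounts can surface I/O errors to applications during transient outages, which is why they cannot simply be forced from the server side.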

#2 Updated by nabbebx - about 6 years ago

So if the only workaround is not an option, and NFSv4 is not due until [[FreeNAS]] 9.0, what do we do in the meantime?

#3 Updated by Anonymous about 6 years ago

Thanks for the reminder. I'll need to look into this again if I have time in the next couple weeks.

#4 Updated by nabbebx - about 6 years ago

Any chance we can get this marked for Release 8.1 or lower?

#5 Updated by nabbebx - almost 6 years ago

How about targeting this for the 8.2 release? It is a bit of a crucial feature for a NAS...

#6 Updated by Xin Li almost 6 years ago

Hi, I have a theory. Could you please help us by disabling the firewall on the client side and seeing if the problem goes away?

Thanks in advance!

#7 Updated by nabbebx - almost 6 years ago

Sorry to disappoint, there is no client side firewall enabled.

#8 Updated by Xin Li almost 6 years ago

Hi,

Replying to [comment:7 nabbebx]:

Sorry to disappoint, there is no client side firewall enabled.

Since you are not the original submitter, are you hitting the same problem? E.g., do you see the same "NLM: failed to contact remote rpcbind" messages on the server side?

#9 Updated by Xin Li almost 6 years ago

Some observations from testing on OS X with the firewall enabled and blocking all incoming traffic:

import fcntl
f = open("testfile", "w")
fcntl.flock(f, fcntl.LOCK_EX)

The script will stream a few timeout messages until the firewall is turned off. The NFS locking protocol requires the server to talk to the client side's lock manager. Note that this may or may not be the problem the user hit; until we get confirmation from the reporter, I am just recording it here for future reference.
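For reference, the lock conflict that flock models can be reproduced locally with the same fcntl call; over NFSv3 it is the LOCK_EX request that generates the NLM traffic back to the client's lock manager. A minimal self-contained sketch (file path is a scratch file, not from this report):

```python
import fcntl
import tempfile

# Scratch file to lock; on an NFSv3 mount this same call would
# trigger NLM traffic between server and client lock managers.
path = tempfile.mkstemp()[1]

f1 = open(path, "w")
fcntl.flock(f1, fcntl.LOCK_EX)  # first exclusive lock succeeds

# A second open file description on the same file cannot take the
# exclusive lock; LOCK_NB turns the would-be wait into an error.
f2 = open(path, "w")
try:
    fcntl.flock(f2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    conflict = False
except OSError:
    conflict = True

print(conflict)  # the second lock attempt is refused
```

Without LOCK_NB the second call would block, which is exactly the state a hung rpc.lockd leaves NFS clients in.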

#10 Updated by Joe Greco almost 6 years ago

No, because even if there had been a firewall on one of the clients, that should not cause the server lockd process to go insane and to take down service for other clients. The environment in question had actually been working fine for a short time with iPhoto, but fell apart after two instances of iPhoto tried to access the share, if I recall correctly. We had had no problems with NFS up until that point.

#11 Updated by Xin Li almost 6 years ago

Replying to [comment:10 jgreco]:

No, because even if there had been a firewall on one of the clients, that should not cause the server lockd process to go insane and to take down service for other clients. The environment in question had actually been working fine for a short time with iPhoto, but fell apart after two instances of iPhoto tried to access the share, if I recall correctly. We had had no problems with NFS up until that point.

No, I'm not saying the server should behave this way; I just wanted to confirm whether the firewall caused the problem, otherwise we might be fixing something different.

#12 Updated by Jordan Hubbard almost 4 years ago

  • Status changed from Unscreened to Closed: Cannot reproduce
  • Seen in set to

Not seeing this.

#13 Updated by Chris C almost 4 years ago

We are seeing issues very similar to the original post. We are using FreeNAS-8.0.3-RELEASE-p1-x64. In the log messages we start seeing the error:

"Failed to contact local NSM - rpc error 5"

Soon after, we noticed that lockd goes into an uninterruptible state and CentOS clients start seeing the following message and hang:

"lockd: FreeNASserver name not responding, still trying"

We also observed that the rpc.lockd process is not killable via kill -9. The only course of action we have found is a reboot, which seems to clear the condition temporarily, but the issue comes back intermittently. We are currently working around this by having clients use the nolock mount option. There definitely seems to be an issue with lockd or statd on this version of FreeNAS. We have enabled additional logging by turning on debug mode on /usr/sbin/rpc.statd and /usr/sbin/rpc.lockd.
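For reference, the nolock workaround mentioned above is applied on the Linux/CentOS client side and disables the NLM sideband entirely, so flock/fcntl locks become local to each client (server name and paths here are illustrative):

```shell
# CentOS client: skip NLM entirely; locks never reach the
# server's rpc.lockd, so a hung server lockd cannot stall us
mount -t nfs -o nolock freenas.example.local:/mnt/tank /mnt/data

# or persistently via /etc/fstab:
# freenas.example.local:/mnt/tank  /mnt/data  nfs  nolock  0  0
```

The obvious caveat is that locks are no longer coordinated across clients, so this is only safe when clients do not contend for the same files.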

We are looking to upgrade to a newer release, but wanted to see if others know whether this issue might be addressed in later releases.

#14 Updated by Josh Paetzel almost 4 years ago

Chris,

I recommend you update to the 9.2.1.4 release that went out today. There have been a number of improvements to the NFS subsystem.

Of particular interest to you is the ability of the kernel to print out the IP address of a client that is giving the server lockd problems. In addition, we have resolved some issues that would cause the server lockd to stall.

You'll also get a nice performance increase. Latency is much improved.

#15 Updated by Frank Wall about 3 years ago

I guess I'm seeing the same issue with FreeNAS 9.3-BETA :-(

FreeNAS Server:

Failed to contact local NSM - rpc error 5

FreeBSD Clients:

nfs server XXX:/mnt/myvolume: lockd not responding
nfs server XXX:/mnt/myvolume: lockd is alive again
nfs server XXX:/mnt/myvolume: lockd not responding
nfs server XXX:/mnt/myvolume: lockd not responding
nfs server XXX:/mnt/myvolume: lockd is alive again

#16 Updated by Rich Maclannan over 2 years ago

  • Seen in changed from to 9.3-STABLE-201505130355

Sorry to bump an old thread, but I saw this exact scenario today, and the only solution was to reboot.

It seems to be caused by a client running an rpmbuild against data shared over NFSv3. The error was identical to https://bugzilla.redhat.com/show_bug.cgi?id=494042

Unfortunately, I didn't get a chance to apply the latest round of updates to the filer before rebooting, so I'll be doing that tonight.

However:

- Should I be thinking about NFSv4 only?
- Has there been progress on this error in another (more recent) thread?

Thanks!

#17 Updated by Ash Gokhale over 1 year ago

Related: HPD-939-45858
The attached collateral includes cores and extensive debugging output.

#18 Updated by Ash Gokhale over 1 year ago

And again with CentOS: OWF-338-92811

#19 Updated by Ash Gokhale about 1 year ago

  • Fence Lizard changed from No to Yes

#20 Updated by Nicholas Bettencourt about 1 year ago

VHY-748-66332.

#21 Updated by Kris Moore about 1 year ago

Guys, it might be worthwhile opening a new ticket for this one. I almost missed it myself, and a new ticket will make sure it gets assigned and tracked by the correct people going forward.

#22 Updated by Nicholas Bettencourt about 1 year ago

Will work with the team to create a new ticket. Also need to attach JYR-608-70399 to this.
