Project

General

Profile

Bug #20125

NFS lockd locks up and hoses clients too

Added by Nicholas Bettencourt about 2 years ago. Updated 5 days ago.

Status:
Closed
Priority:
Expected
Assignee:
Sean Fagan
Category:
OS
Target version:
Seen in:
TrueNAS - 9.3-STABLE-201605241903
Severity:
New
Reason for Closing:
Cannot Reproduce
Reason for Blocked:
Waiting for feedback
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Related projects 1 project

Description

This was created in reference to FreeNAS issue 927.

related: HPD-939-45858
included collateral includes cores and extensive debugging.


Related issues

Related to FreeNAS - Bug #927: NFS lockd locks up and hoses clients tooClosed: Cannot reproduce

History

#1 Updated by Nicholas Bettencourt about 2 years ago

  • Related to Bug #927: NFS lockd locks up and hoses clients too added

#2 Updated by Bonnie Follweiler about 2 years ago

  • Assignee set to Kris Moore

#3 Avatar?id=14398&size=24x24 Updated by Kris Moore about 2 years ago

  • Assignee changed from Kris Moore to Chris Torek
  • Target version set to TrueNAS-9.10.2-U2

Over to Torek! Ash I believe can provide more gory details if you need anything

#5 Updated by Bonnie Follweiler about 2 years ago

  • Private changed from No to Yes

#6 Updated by Bonnie Follweiler about 2 years ago

  • Private changed from Yes to No

#8 Updated by Bonnie Follweiler about 2 years ago

  • File debug-truenas1-20160526100540..tgz added
  • File debug-truenas1-20160826145047..tgz added
  • File debug-truenas1-20160826145047.(1).tgz added
  • File debug-truenas1-20160826145047.(2).tgz added
  • File debug-truenas1-20160826161132_during_nfs_problems..tgz added
  • File nber_truenas1_network_problems_lockd_issue_20160526.tgz added
  • File truenas1-TrueNAS-9.3-STABLE-201602032047-20160826145305.db added
  • File truenas1-TrueNAS-9.3-STABLE-201602032047-20160826145305(1).db added
  • File truenas1-TrueNAS-9.3-STABLE-201602032047-20160826145305(2).db added
  • File truenas1_followup_info_20160601.txt added

#9 Updated by Bonnie Follweiler about 2 years ago

  • Private changed from No to Yes

#10 Updated by Gary Graham almost 2 years ago

  • Private changed from Yes to No

#11 Updated by Gary Graham almost 2 years ago

  • Private changed from No to Yes

#12 Avatar?id=14398&size=24x24 Updated by Kris Moore almost 2 years ago

  • Target version changed from TrueNAS-9.10.2-U2 to TrueNAS-9.10.3

#13 Avatar?id=14398&size=24x24 Updated by Kris Moore over 1 year ago

  • Target version changed from TrueNAS-9.10.3 to 11.0

#14 Avatar?id=14398&size=24x24 Updated by Kris Moore over 1 year ago

  • Target version changed from 11.0 to TrueNAS 11.1-U1

#15 Avatar?id=14398&size=24x24 Updated by Kris Moore over 1 year ago

  • Assignee changed from Chris Torek to Ash Gokhale

#16 Updated by Ash Gokhale over 1 year ago

#17 Updated by Ash Gokhale over 1 year ago

finally - a break and a customer willing to diag

rpc/clnt_dg.c: unmodified: line 460
     /*
         * Try to get a place in the congestion window.
         */
        while (cu->cu_sent >= cu->cu_cwnd) {
                cu->cu_cwnd_wait = TRUE;
                error = msleep(&cu->cu_cwnd_wait, &cs->cs_lock,
                    cu->cu_waitflag, "rpccwnd", 0);
                if (error) {
                        errp->re_errno = error;
                        if (error == EINTR || error == ERESTART)
                                errp->re_status = stat = RPC_INTR;
                        else
                                errp->re_status = stat = RPC_CANTSEND;
                        goto out;
                }
        }
** or ***
rpc/clnt_dg.c: unmodified: line 562
get_reply:
        for (;;) {
                /* Decide how long to wait. */
                if (next_sendtime < timeout)
                        tv = next_sendtime;
                else
                        tv = timeout;
                tv -= time_waited;

                if (tv > 0) {
                        if (cu->cu_closing || cu->cu_closed) {
                                error = 0;
                                cr->cr_error = ESHUTDOWN;
                        } else {
                                error = msleep(cr, &cs->cs_lock,   
                                    cu->cu_waitflag, cu->cu_waitchan, tv);
                        }
                } else {
                        error = EWOULDBLOCK;
                }

We enter a sleep from either a high congestion condition
.. super fishy tunings exist for entering the congestion window here:

#define CWNDSCALE       256
#define MAXCWND         (32 * CWNDSCALE)

this is old deep magic to grok about rpc congestion window and scaling - this technique might not apply to >10Gb attach servers

*or

#18 Avatar?id=14398&size=24x24 Updated by Kris Moore over 1 year ago

Good findings! Go ahead and keep investigating, I've added Mav / Sean as watches, perhaps they can comment on tunings here as well.

#19 Updated by Nicholas Bettencourt over 1 year ago

  • Support Department Priority changed from 0 to 2

#20 Updated by Alexander Motin over 1 year ago

Considering that it sleeps in sleepq_timedwait(), I suppose it should be the second place, since first place has no timeout and so would sleep in sleepq_wait() instead. It waits for some RPC response, I guess inside nlm_host_unmonitor().

#21 Updated by Dru Lavigne about 1 year ago

  • Target version changed from TrueNAS 11.1-U1 to 11.1

#23 Updated by Dru Lavigne about 1 year ago

  • Target version changed from 11.1-U1 to 11.1-U1

#24 Updated by Dru Lavigne about 1 year ago

  • Status changed from Screened to 46

Sasha: do you see this going into U1? If not, please punt to 11.2.

#25 Updated by Alexander Motin about 1 year ago

  • Target version changed from 11.1-U1 to TrueNAS 11.1-U2

I don't have anything to add for now.

#27 Updated by Dru Lavigne about 1 year ago

  • Status changed from 46 to Screened

#28 Avatar?id=13649&size=24x24 Updated by Ben Gadd 12 months ago

  • Due date set to 02/12/2018

#29 Updated by Dru Lavigne 12 months ago

  • 1 deleted project (FreeNAS)
  • Project changed from TrueNAS to FreeNAS
  • Category changed from Middleware to Middleware
  • Status changed from Screened to Not Started
  • Target version changed from TrueNAS 11.1-U2 to 11.2-RC2
  • Support Suite Ticket deleted (HPD-939-45858,OWF-338-92811,VHY-748-66332.JYR-608-70399, WOW-910-90479)
  • Migration Needed deleted (No)
  • Hide from ChangeLog deleted (No)
  • Support Department Priority deleted (2)

#31 Avatar?id=13649&size=24x24 Updated by Ben Gadd 11 months ago

  • Due date deleted (02/12/2018)
  • Assignee changed from Alexander Motin to Benno Rice

Benno, can you investigate

#32 Updated by Ash Gokhale 11 months ago

  • File debug-20180305164133.tar added

One more whale attack:
Recap

Locking hangs via NFS, from java application
RPC error 5 
[root@truenas-dev-002] ~# dtrace -n 'fbt::nfsrpc*: { @[probefunc,probename] = count()}'
dtrace: description 'fbt::nfsrpc*: ' matched 84 probes
^C^C
*crkets *
CORRECT>ps -axuw | grep rpc (y|n|e|a)? no
greo: Command not found.
[root@truenas-dev-002] ~# ps -axuw | grep rpc
root 6344 0.0 0.0 18280 2096 ?? Ss 22Aug16 77:39.75 /usr/sbin/rpcbind -h 192.168.205.4
root 6374 0.0 0.0 366656 6084 ?? Ss 22Aug16 9:03.63 /usr/sbin/rpc.statd -h 192.168.205.4 -p 662
root 6385 0.0 0.0 63292 1616 ?? Ds 22Aug16 17:16.46 /usr/sbin/rpc.lockd -p 5999
root 15388 0.0 0.0 16284 1972 1 S+ 3:48PM 0:00.00 grep rpc
[root@truenas-dev-002] ~# procstat -kk 6385
PID TID COMM TDNAME KSTACK
6385 100884 rpc.lockd NLM: master mi_switch+0x194 sleepq_timedwait+0x42 _sleep+0x1c9 clnt_dg_call+0x7dd clnt_call_private+0xe8 nlm_check_idle+0x1ca nlm_find_host_by_name+0xb8 nlm_do_lock+0x75 nlm4_lock_4_svc+0x11 nlm_prog_4+0x300 svc_run_internal+0x6be svc_run+0x16f sys_nlm_syscall+0x63c amd64_syscall+0x5ea Xfast_syscall+0xf7

rpcinfo:
100021 0 udp 5999 nlockmgr
100021 0 tcp 5999 nlockmgr
100021 1 udp 5999 nlockmgr
100021 1 tcp 5999 nlockmgr
100021 3 udp 5999 nlockmgr
100021 3 tcp 5999 nlockmgr
100021 4 udp 5999 nlockmgr
100021 4 tcp 5999 nlockmgr

[root@truenas-dev-002] ~# netstat -an | grep 5999
tcp4 4020 0 192.168.205.4.5999 192.168.200.123.787 ESTABLISHED
tcp4 66560 0 192.168.205.4.5999 10.250.0.144.666 ESTABLISHED
tcp4 8576 0 192.168.205.4.5999 10.250.0.41.956 ESTABLISHED
tcp4 18224 0 192.168.205.4.5999 10.250.0.105.909 ESTABLISHED
tcp4 66560 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 22780 0 192.168.205.4.5999 10.250.0.247.813 ESTABLISHED
tcp4 66560 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 28320 0 192.168.205.4.5999 192.168.201.178.942 ESTABLISHED
tcp4 35072 0 192.168.205.4.5999 10.250.0.88.688 ESTABLISHED
tcp4 41888 0 192.168.205.4.5999 10.250.0.113.917 ESTABLISHED
tcp4 66560 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 50652 0 192.168.205.4.5999 10.250.0.96.921 ESTABLISHED
tcp4 66560 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 55476 0 192.168.205.4.5999 10.250.0.94.802 ESTABLISHED
tcp4 64588 0 192.168.205.4.5999 10.250.1.1.725 ESTABLISHED
tcp4 66608 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 66560 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 66560 0 192.168.205.4.5999 10.250.0.144.666 CLOSED <<----- mark of the beast? this client is showing the locking erro, no record on the client except 'not responding'
tcp4 66576 0 192.168.205.4.5999 10.250.0.144.666 CLOSED <<--- really large input queue depth
tcp4 66576 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 66608 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 66608 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 66608 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 66576 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 66608 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 66560 0 192.168.205.4.5999 10.250.0.144.666 CLOSED
tcp4 66552 0 192.168.205.4.5999 192.168.201.178.672 CLOSED
tcp4 66592 0 192.168.205.4.5999 10.250.0.41.817 CLOSED
tcp4 66592 0 192.168.205.4.5999 10.250.0.96.708 CLOSED
tcp4 66592 0 192.168.205.4.5999 10.250.1.1.866 CLOSED
tcp4 66592 0 192.168.205.4.5999 10.250.0.247.671 CLOSED
tcp4 66560 0 192.168.205.4.5999 10.250.0.113.816 CLOSED
tcp4 66560 0 192.168.205.4.5999 10.250.0.105.966 CLOSED
tcp4 3660 0 192.168.205.4.5999 10.250.0.77.784 CLOSED
tcp4 0 0 *.5999 *.* LISTEN
tcp6 0 0 *.5999 *.* LISTEN

is related ??clnt_dg.c: unmodified, readonly: line 957 ... which is a jump structure so does not have a call entry path
while (cu->cu_closing)
msleep(cu, &cs->cs_lock, 0, "rpcclose", 0);
KASSERT(cu->cu_closed, ("client should be closed"));
mtx_unlock(&cs->cs_lock);

uname -a
root@build3.ixsystems.com:/tank/home/stable-builds/TN/objs/os-base/amd64/tank/home/stable-builds/TN/FreeBSD/src/sys/TRUENAS.amd64

[root@truenas-dev-002] ~# procstat -t 6385
PID TID COMM TDNAME CPU PRI STATE WCHAN
6385 100884 rpc.lockd NLM: master 20 120 sleep nlmrcv <-=----- wait channel 

[root@truenas-dev-002] ~# procstat -v 6385
PID START END PRT RES PRES REF SHD FL TP PATH
6385 0x400000 0x40e000 r-x 14 0 1 0 CN-- vn /usr/sbin/rpc.lockd
6385 0x60d000 0x800000 rw- 0 0 1 0 C--- df
6385 0x80060d000 0x800626000 r-x 25 0 58 0 CN-- vn /libexec/ld-elf.so.1
6385 0x800626000 0x800649000 rw- 0 0 1 0 C--- df
6385 0x800825000 0x800827000 rw- 0 0 1 0 C--- df
6385 0x800827000 0x800830000 r-x 7 0 4 2 CN-- vn /usr/lib/librpcsvc.so.5
6385 0x800830000 0x800a2f000 --- 0 0 1 0 CN-- df
6385 0x800a2f000 0x800a30000 rw- 0 0 1 0 CN-- vn /usr/lib/librpcsvc.so.5
6385 0x800a30000 0x800a40000 r-x 16 0 77 38 CN-- vn /lib/libutil.so.9
6385 0x800a40000 0x800c40000 --- 0 0 1 0 CN-- df
6385 0x800c40000 0x800c41000 rw- 0 0 1 0 CN-- vn /lib/libutil.so.9
6385 0x800c41000 0x800c43000 rw- 0 0 0 0 ---- --
6385 0x800c43000 0x800d73000 r-x 304 0 112 54 CN-- vn /lib/libc.so.7
6385 0x800d73000 0x800f72000 --- 0 0 1 0 CN-- df
6385 0x800f72000 0x800f7c000 rw- 0 0 1 0 C--- vn /lib/libc.so.7
6385 0x800f7c000 0x800f96000 rw- 0 0 1 0 C--- df
6385 0x801000000 0x804400000 rw- 21 0 1 0 C--- df
6385 0x7ffffffdf000 0x7ffffffff000 rw- 0 0 1 0 C--D df
6385 0x7ffffffff000 0x800000000000 r-x 1 0 61 0 ---- ph

Booting controller 2; returning to service after failover.

relatted:
https://lists.freebsd.org/pipermail/freebsd-net/2009-July/022443.html
https://wiki.freebsd.org/WaitChannels
https://www.freebsd.org/cgi/man.cgi?query=locking&sektion=9&manpath=freebsd-release-ports

-Ash
Technical Support, iXsystems

#33 Updated by Benno Rice 10 months ago

Have we been able to reproduce this in-house? Maybe by adding packet drops or the like?

Reading the code and looking at the wait channel provided we're definitely in the second code path that Ash identified. The NLM client sets its wait channel to "nlmrcv" which is what we're seeing there. This implies that we're both losing packets and failing to set the timeout properly.

#34 Updated by Bill O'Hanlon 10 months ago

We have not reproduced in house. I may have time to give it a try.

The customer has seen another event today. Looks to be the same as previous events.

#35 Updated by Benno Rice 10 months ago

rmacklem@ disowns NLM entirely. The code itself came from Doug Rabson via Isilon a fair ol' while ago. I know personally that Isilon no longer uses this code and instead uses a userspace NFS server.

This is going to be fun.

#36 Updated by William Grzybowski 9 months ago

  • Category changed from Middleware to OS

#38 Updated by Dru Lavigne 7 months ago

  • Assignee changed from Benno Rice to Alexander Motin

#39 Updated by Alexander Motin 7 months ago

  • Status changed from Not Started to Unscreened
  • Assignee changed from Alexander Motin to John Hixson

#40 Updated by John Hixson 7 months ago

  • Status changed from Unscreened to Screened

I attempted some facts gathering but was unable to get a clear understanding of issue. "lockd hangs sometimes under unknown circumstances" is how I have come to understand this. I will wait and see if anyone sees this in the wild and I will also wait to see if this can be reproduced. Until then, I can't really do anything here but sit on this ticket.

#42 Updated by John Hixson 6 months ago

  • Category changed from OS to Services

#43 Updated by Dru Lavigne 6 months ago

  • Target version changed from 11.2-RC2 to 11.2-BETA3

#44 Updated by John Hixson 6 months ago

  • Target version changed from 11.2-BETA3 to 11.2-U2

#45 Updated by Dru Lavigne 3 months ago

  • Assignee changed from John Hixson to William Grzybowski

#46 Updated by William Grzybowski 3 months ago

  • Category changed from Services to OS
  • Status changed from Screened to Unscreened
  • Assignee changed from William Grzybowski to Alexander Motin

Alexander, one more NFS ticket, over for your team.

#47 Updated by Alexander Motin 3 months ago

  • Assignee changed from Alexander Motin to Sean Fagan

#48 Avatar?id=14398&size=24x24 Updated by Kris Moore 2 months ago

  • Severity changed from Med High to New

#49 Updated by Sean Fagan 8 days ago

  • Status changed from Unscreened to Blocked
  • Reason for Blocked set to Waiting for feedback

This is now a relatively old bug, so I have to ask: is it still happening?

#50 Updated by Dru Lavigne 5 days ago

  • File deleted (debug-truenas1-20160526100540..tgz)

#51 Updated by Dru Lavigne 5 days ago

  • File deleted (debug-truenas1-20160826145047..tgz)

#52 Updated by Dru Lavigne 5 days ago

  • File deleted (debug-truenas1-20160826145047.(1).tgz)

#53 Updated by Dru Lavigne 5 days ago

  • File deleted (debug-truenas1-20160826145047.(2).tgz)

#54 Updated by Dru Lavigne 5 days ago

  • File deleted (debug-truenas1-20160826161132_during_nfs_problems..tgz)

#55 Updated by Dru Lavigne 5 days ago

  • File deleted (nber_truenas1_network_problems_lockd_issue_20160526.tgz)

#56 Updated by Dru Lavigne 5 days ago

  • File deleted (truenas1-TrueNAS-9.3-STABLE-201602032047-20160826145305(1).db)

#57 Updated by Dru Lavigne 5 days ago

  • File deleted (truenas1-TrueNAS-9.3-STABLE-201602032047-20160826145305.db)

#58 Updated by Dru Lavigne 5 days ago

  • File deleted (truenas1-TrueNAS-9.3-STABLE-201602032047-20160826145305(2).db)

#59 Updated by Dru Lavigne 5 days ago

  • File deleted (truenas1_followup_info_20160601.txt)

#60 Updated by Dru Lavigne 5 days ago

  • File deleted (debug-20180305164133.tar)

#61 Updated by Dru Lavigne 5 days ago

  • Status changed from Blocked to Closed
  • Target version changed from 11.2-U2 to N/A
  • Private changed from Yes to No
  • Reason for Closing set to Cannot Reproduce

Also available in: Atom PDF