Bug #34204

Strengthen locking for the NFSv4.1 server DestroySession operation

Added by Patrik Kernstock 11 months ago. Updated 9 months ago.

Status: Done
Priority: No priority
Assignee: John Hixson
Category: Services
Target version:
Seen in:
Severity: Medium
Reason for Closing:
Reason for Blocked:
Needs QA: No
Needs Doc: No
Needs Merging: No
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

Hi,

I hope this is the correct place to report a kernel panic I experienced about a month ago. Unfortunately I don't have exact steps to reproduce it, nor did I actively try to reproduce it, as the FreeNAS instance is in "production-homelab" use for all my virtual machines. I do not have enough resources to build a second test system, and I don't want to risk data loss by provoking kernel panics.

This is currently the setup I have running:
- iSCSI has been running fine for about 4 months, with no crashes even under high load (I/O benchmarks); 2 ESXi hosts run virtual machines from that storage.
- I'm using multipath: 4 NICs, each on its own subnet, to the FreeNAS instance. So basically 4 paths.

What I have done, and why:
- I wanted to configure a dedicated "ISOStore" NFS share, mounted as a datastore on the ESXi hosts: a central place to keep various ISOs.
- I created an NFSv3 share on FreeNAS and mounted it on ESXi just fine, using the 5th NIC (management NIC).
- Some minutes later I thought about going for NFSv4 and using multipathing for a bit more performance and redundancy. Why not?
- I enabled NFSv4 on FreeNAS for the share I had used before, configured the NFSv4 share at the cluster level of my 2 ESXi hosts, and used the existing iSCSI NICs.
- Mounting took a while and succeeded on host A. Then nothing happened... I waited... still nothing happened. I reloaded the vSphere Web Client and the page did not load, nor did any other server on the cluster.
- Some moments of screaming/crying/panic later I figured out that the FreeNAS machine had completely crashed due to a kernel panic and was automatically rebooting. I extracted the kernel panic dumps.
- After the reboot everything was working "fine". Luckily there was no data loss and no corrupted virtual machines (Thanks ZFS!)

Unfortunately I'm not really familiar with the magic behind FreeBSD/FreeNAS/NFS, so I have no idea how and why this actually happened. I just thought I should send this in as a bug, as this may be a very, very critical thing if it happens in production, especially in companies. (And it's definitely related to NFS, as I can read in the kernel panic text.)

This was the kernel panic message I was able to find in msgbuf.txt:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address    = 0x2f0
fault code        = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff80980438
stack pointer            = 0x28:0xfffffe0a936fddd0
frame pointer            = 0x28:0xfffffe0a936fde20
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 42415 (nfsd: master)

If you need to know anything more, please feel free to ask.


Related issues

Related to FreeNAS - Bug #34125: [nfs] kernel panic in nfsrv_checksequence (Closed)
Related to FreeNAS - Bug #27859: nfsrv_checksequence - trap 9: general protection fault while in kernel mode (Closed)
Related to FreeNAS - Bug #33654: Fatal Trap 12 in nfsd, frequent random reboots after Kernel crash, version 11.1 (Closed)

History

#1 Updated by Dru Lavigne 11 months ago

  • Assignee changed from Release Council to Alexander Motin

#2 Updated by Alexander Motin 11 months ago

  • Status changed from Unscreened to Screened
  • Assignee changed from Alexander Motin to Benno Rice
  • Severity changed from New to Medium

kgdb for nfsrv_checksequence+0x208 gave me this line in NFSv4 source code:

(kgdb) l * nfsrv_checksequence+0x208
0xffffffff807ad118 is in nfsrv_checksequence (/sources/iX/os/sys/fs/nfsserver/nfs_nfsdstate.c:5840).
5835            if (sep->sess_clp->lc_req.nr_client != NULL &&
5836                (sep->sess_crflags & NFSV4CRSESS_CONNBACKCHAN) != 0) {
5837                    savxprt = sep->sess_cbsess.nfsess_xprt;
5838                    SVC_ACQUIRE(nd->nd_xprt);
5839                    nd->nd_xprt->xp_p2 =
5840                        sep->sess_clp->lc_req.nr_client->cl_private;
5841                    nd->nd_xprt->xp_idletimeout = 0;        /* Disable timeout. */
5842                    sep->sess_cbsess.nfsess_xprt = nd->nd_xprt;
5843                    if (savxprt != NULL)
5844                            SVC_RELEASE(savxprt);

That may need verification, though.

Benno, could you take a look? Maybe something there has already been changed in head.
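The listing above tests `nr_client` for NULL at line 5835 and dereferences it again at line 5840. One plausible reading, consistent with this issue's final subject, is that another thread tearing down the session could invalidate that pointer between the check and the use. The user-space C sketch below illustrates only that general pattern and how taking a shared lock closes the window. The `struct client` and `struct session` types, the `lock` field, and all function names are hypothetical stand-ins, not the kernel's definitions, and this is not the actual patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (assumption, not real NFS types). */
struct client {
    void *cl_private;
};

struct session {
    pthread_mutex_t lock;  /* assumed lock that guards 'clp' */
    struct client  *clp;   /* may be cleared by a concurrent teardown */
};

/* Racy pattern: the pointer is checked and then read again without any lock,
 * so a concurrent teardown can clear it between the check and the dereference. */
void *get_private_racy(struct session *s)
{
    if (s->clp != NULL)
        return s->clp->cl_private;
    return NULL;
}

/* Safer pattern: the check and the dereference happen under the same lock
 * that the teardown path takes, so the pointer cannot change in between. */
void *get_private_locked(struct session *s)
{
    void *p = NULL;

    pthread_mutex_lock(&s->lock);
    if (s->clp != NULL)
        p = s->clp->cl_private;
    pthread_mutex_unlock(&s->lock);
    return p;
}

/* Teardown clears the pointer while holding the lock. */
void session_teardown(struct session *s)
{
    pthread_mutex_lock(&s->lock);
    s->clp = NULL;
    pthread_mutex_unlock(&s->lock);
}

/* Single-threaded usage example; returns 0 if the checks pass. */
int example_usage(void)
{
    int token = 7;
    struct client c = { &token };
    struct session s;

    pthread_mutex_init(&s.lock, NULL);
    s.clp = &c;
    assert(get_private_locked(&s) == &token);
    session_teardown(&s);
    assert(get_private_locked(&s) == NULL);
    pthread_mutex_destroy(&s.lock);
    return 0;
}
```

The sketch only shows why an unlocked check-then-use is fragile; whether this is the exact failure mode here still needs the verification mentioned above.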

#3 Updated by Alexander Motin 11 months ago

  • Status changed from Screened to Unscreened

#4 Updated by Patrik Kernstock 11 months ago

Recently, after some research, I found someone else who also had a kernel panic in combination with NFS, but I'm not sure whether it's related to the crash I experienced. As that bug was hidden from the public, I wasn't able to keep track of it, and therefore I opened a new one. Hopefully this is not a duplicate and still helps locate the issue! The issue I'm speaking about (which I can't access; I got it from my browser history) was: https://redmine.ixsystems.com/issues/34125 (The person also filed a bug report at FreeBSD here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=228497)

#5 Updated by Dru Lavigne 11 months ago

  • Related to Bug #34125: [nfs] kernel panic in nfsrv_checksequence added

#6 Updated by Alexander Motin 11 months ago

This does look like a duplicate of #34125. But the reference to FreeBSD is interesting, since it seems Rick Macklem (the NFS code maintainer) has already created a patch there that we could look at and possibly backport.

#7 Updated by Alexander Motin 11 months ago

  • Related to Bug #27859: nfsrv_checksequence - trap 9: general protection fault while in kernel mode added

#8 Updated by Benno Rice 11 months ago

I've reached out to Rick Macklem to see if he has plans to commit the patch in the PR referenced above or if he needs additional testing.

#9 Updated by Benno Rice 11 months ago

rmacklem committed his patch:

https://svnweb.freebsd.org/base?view=revision&revision=334396

And I've merged it into our main tree:

https://github.com/freenas/os/pull/123

It's not yet pulled into a release, though. I'll need to work out which one it goes into.

#10 Updated by Dru Lavigne 11 months ago

  • Status changed from Unscreened to In Progress

#11 Updated by Alexander Motin 11 months ago

  • Related to Bug #33654: Fatal Trap 12 in nfsd, frequent random reboots after Kernel crash, version 11.1 added

#12 Updated by Dru Lavigne 10 months ago

  • Assignee changed from Benno Rice to Alexander Motin

#13 Updated by Alexander Motin 10 months ago

  • Assignee changed from Alexander Motin to John Hixson

#14 Updated by John Hixson 10 months ago

  • Category changed from OS to Services

#15 Updated by Dru Lavigne 10 months ago

  • Subject changed from Kernel panic when using NFS4 to Strengthen locking for the NFSv4.1 server DestroySession operation

#17 Updated by John Hixson 10 months ago

  • Subject changed from Strengthen locking for the NFSv4.1 server DestroySession operation to Kernel panic when using NFS4

Patrik, could you upgrade to 11.2BETA? Several NFS patches have gone in and I'd like to verify if they fix your problem or not.

#18 Updated by Dru Lavigne 9 months ago

  • Subject changed from Kernel panic when using NFS4 to Strengthen locking for the NFSv4.1 server DestroySession operation
  • Status changed from In Progress to Ready for Testing
  • Target version changed from Backlog to 11.2-BETA2
  • Needs Doc changed from Yes to No
  • Needs Merging changed from Yes to No

#20 Updated by Dru Lavigne 9 months ago

  • File deleted (ddb.txt)

#21 Updated by Dru Lavigne 9 months ago

  • File deleted (version.txt)

#22 Updated by Dru Lavigne 9 months ago

  • File deleted (msgbuf.txt)

#23 Updated by Dru Lavigne 9 months ago

  • File deleted (config.txt)

#24 Updated by Bonnie Follweiler 9 months ago

  • Status changed from Ready for Testing to Passed Testing
  • Needs QA changed from Yes to No

We don't have the resources available to test this at this time.

#25 Updated by Dru Lavigne 9 months ago

  • Status changed from Passed Testing to Done

#26 Updated by Patrik Kernstock 9 months ago

@John Hixson: Sorry for the delayed response on this! Unfortunately I'd prefer not to install a beta release on my "production-like" HomeLab machine, as it's my main iSCSI server for pretty much everything at home. Is there any way I can backport only the NFSv4-related changes to the upcoming 11.1-U6, e.g. by compiling the NFS server binary and manually replacing it as a workaround? (Yes, of course unsupported.) That would be the best way for me to test just the NFS-related changes in the next maintenance window.

IMO the changes made to the source code shouldn't cause any conflicts: https://github.com/freenas/os/commits/freenas/master/sys/fs/nfsserver
