Bug #17240

Spin lock held too long, kernel panic

Added by Jelle Rijnboutt about 4 years ago. Updated almost 3 years ago.

Status: Closed: User Config Issue
Priority: Important
Assignee: Josh Paetzel
Category: OS
Target version:
Seen in:
Severity: New
Reason for Closing:
Reason for Blocked:
Needs QA: Yes
Needs Doc: Yes
Needs Merging: Yes
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:

Intel Core i3-4170
Supermicro x10sl7-f
Crucial CT102472BD160B x 4

ChangeLog Required: No

Description

My boot USB stick was giving CAM read and write errors, so I had to replace it with a different one. I did a fresh install on another USB stick, all went well, and I successfully booted into the fresh install. Then I uploaded my old config, and when I tried to boot into FreeNAS using that config I got a "panic: spin lock held too long" error. I can no longer boot into my system or access my data! I do not know which spin lock it is complaining about. Does it have something to do with a disk not starting up?

My system consists of an Intel i3-4170, a Supermicro X10SL7-F motherboard, 32GB of Crucial CT102472BD160B ECC RAM and 8 disks: 1 Seagate 2TB disk, 1 WD Green 2TB disk (in a mirror with the Seagate disk), and 6 WD Red 3TB disks in a RAIDZ2. The motherboard, processor, RAM modules and 3 of the WD Red disks are brand new; the rest of the hardware is at least 2 years old. Previously I had a Dell motherboard with an Intel E5300 processor and 8GB of DDR2 non-ECC RAM, with 3 WD Red disks in RAIDZ1. I deemed RAIDZ1 not safe enough and decided to upgrade to RAIDZ2 with 3 new disks. I was confident this setup would work, and it did for about a month, but now I just cannot find the root of the problem.

I already ran Memtest for 5 passes with no errors. I could run some more passes, but frankly the RAM is brand new and has ECC, so I do not believe the RAM is to blame. I also managed to look at the S.M.A.R.T. data of all the disks and found nothing out of the ordinary. I have tried the fresh install and config upload method probably 10 times and I always get the same issue.

I read some old posts about similar errors where the culprit was multithreading. However, that was in FreeNAS 7, and the conclusion was that it would be fixed in future updates. Two generations later, I cannot imagine this is the same problem.

1.png (170 KB) Picture of the error at startup Jelle Rijnboutt, 09/02/2016 05:46 AM
FREENAS_error.jpg (92.1 KB) ERRORS Jelle Rijnboutt, 10/15/2016 07:49 AM
2.jpg (98.1 KB) ERRORS Jelle Rijnboutt, 10/15/2016 07:49 AM

History

#1 Updated by Bonnie Follweiler about 4 years ago

  • Assignee set to Josh Paetzel

#2 Updated by Jelle Rijnboutt about 4 years ago

It managed to boot successfully once and I could see my pools were healthy, which is a relief, but quite soon after it crashed again, in the same manner as in my other bug report, 17140. I am now quite certain this is the same problem, especially because I checked the data in the crash folder and found that the panic file also contained "spin lock held too long". So bugs 17140 and 17240 most likely have the same root cause, but I am still completely clueless about what is causing it. It is especially weird that the error occurs at random times: most of the time it fails during boot, but this time I was able to briefly boot into the system before it crashed anyway.

#3 Updated by Josh Paetzel about 4 years ago

  • Status changed from Unscreened to 15
  • Priority changed from No priority to Important

While it's impossible to make guesses without more data, this type of thing is generally not a software problem. It's usually some piece of hardware going out to lunch and the software eventually gives up and says, "I can't wait forever here, so I guess I'll just pull the ripcord"

/data/crash will have a more complete crash dump. The fact that it was stable until you reloaded the config, and the fact that the panic happened while doing ZFS'y things, lead me to believe the issue is likely with your storage subsystem.

If you pull the disks out it won't be able to import the pool and will likely be stable and allow you to copy off /data/crash and attach it to this ticket.

#4 Updated by Jelle Rijnboutt about 4 years ago

  • File crash.zip added

You are likely right, because the first time it booted successfully. I attached the crash folder. While you look through it for a possible root cause, I will try to find out whether a defective cable is to blame; I have some spare SATA cables lying around.

#5 Updated by Josh Paetzel about 4 years ago

Ok, I'd like to try something. It looks like it's an encrypted pool with a bunch of disks and a two core system.
This will create hundreds of geli threads.

It looks like there's either a bug in geli, or the hardware is dropping interrupts.

First, can you confirm the pool is encrypted and what CPU you are using?

If the pool is encrypted:

Try booting the system without the disks, then add the following tunable of type sysctl

kern.geom.eli.threads

with a value of 4

Then shut the system down, put the disks back in, and see if it's stable.

#6 Updated by Jelle Rijnboutt about 4 years ago

There is an encrypted pool, but it is not in use. The disk belonging to that pool is no longer in use and I just never deleted the pool. If that is causing trouble I am more than willing to delete it. In any case, the other pools, which are in use and are important, are not encrypted. The CPU I am using is an Intel Core i3-4170, which is a dual-core with hyper-threading. It also supports Intel AES New Instructions, which I believe is a kind of hardware acceleration for encryption, but as I said, I do not use encryption, so it is not really relevant anyway. As for that 'ghost pool' I still have and don't use, I am just going to delete it, because I should have done that anyway, and then it can be excluded from the list of potential causes.

#7 Updated by Josh Paetzel about 4 years ago

Ok, that works. Let me know if that restores stability to your system. (It might)

#8 Updated by Jelle Rijnboutt about 4 years ago

It did not. But I am now fairly certain that either one of my cables or one of my disks is defective, even though there were no SMART errors. I have tried booting the system with a different pair of disks in the RAIDZ2 pool disconnected each time, and through that I believe I now know which disk is causing it. First I am going to try a new cable, because obviously that would be easier to replace. If that fails to solve the issue, I would be forced to remove this disk from the pool and RMA it. Maybe I'll just wipe it first and resilver the pool, since perhaps it just got corrupted, but if that also fails then I'd really have no choice.

#9 Updated by Jelle Rijnboutt about 4 years ago

So the disk I thought was the problem was not the problem, because without it I got the errors again. Now I am trying to boot with other disks disconnected, but with only 4 of the 6 disks it can't even open the pool. When I try the command 'zpool import mypool' it says "I/O error: Destroy and recreate pool from backup". That is not supposed to happen. It must mean one of those disks is corrupt. But the disks that are connected now are the ones I thought were the problem and apparently were not, so they must be causing these import problems. So perhaps they are corrupt, but not causing the original errors. I am back at the beginning. Any suggestions, Josh?

#10 Updated by Jelle Rijnboutt about 4 years ago

I am starting to think this pool is somehow corrupted and that is causing all the problems. I do not have a backup so that would be a big problem.

#11 Updated by Josh Paetzel about 4 years ago

Would it be possible to try the disks in a different system? It seems unlikely that a disk going out to lunch would cause these problems. It's more likely a controller is causing the issues.

#12 Updated by Jelle Rijnboutt about 4 years ago

Now I am facing an entirely different problem: when I try to start my server it fails to get past the BIOS screen where it says 'system initializing' and shows the number 15 in the bottom right corner. Apparently POST code 15 indicates problems with my memory, but I do not believe any of it. I tried booting after removing different RAM modules, but it never got past that screen. On top of that, I let memtest86 run for a whole night yesterday with no errors found. I know for sure my memory is compatible, as it performed well for about a month, and frankly, with all the trouble I had before this, the motherboard just seems a much more likely culprit than the memory. It would explain everything if the motherboard is defective. Don't suppose you have any ideas about this?

#13 Updated by Josh Paetzel about 4 years ago

I agree the motherboard seems like a likely culprit. Something is generating spurious interrupts, and that could do it.

#14 Updated by Kris Moore about 4 years ago

  • Status changed from 15 to Closed: User Config Issue

#15 Updated by Jelle Rijnboutt almost 4 years ago

Well, I got a replacement for my motherboard and now at least it boots again, but the underlying issue with FreeNAS has not disappeared. I am still getting a lot of different errors, all concerning communication with my disks, I think. I added some screenshots of the errors I am getting now. I tried a fresh install (9.10.1-U2) again, but that did not solve the issue; it crashed too. So even when the volume is not mounted, the disks cause problems. At least we now know that the issue is not caused by a defect in the motherboard, or at least that this would be very unlikely. Could it be a compatibility issue with the SAS controller or something like that? It is so odd that my old Dell computer ran more reliably than this server-grade motherboard. I thought the X10SL7-F was a good choice for FreeNAS. I would love to hear some new ideas.

#16 Updated by Dru Lavigne almost 3 years ago

  • File deleted (crash.zip)

#17 Updated by Dru Lavigne almost 3 years ago

  • Target version changed from 9.10.1-U2 to N/A
