Spin lock held too long, kernel panic
Intel Core i3-4170
Crucial CT102472BD160B x 4
My boot USB stick was giving CAM read and write errors, so I had to replace it with a different one. I did a fresh install on another USB stick, all went well, and I successfully booted into the fresh install. Then I uploaded my old config, and when I tried to boot into FreeNAS with it I got a "panic: spin lock held too long" error. I can no longer boot into my system or access my data! I do not know which spin lock it is complaining about. Does it have something to do with a disk not starting up?
My system consists of an Intel i3-4170, a Supermicro X10SL7-F motherboard, 32GB of Crucial CT102472BD160B ECC RAM and 8 disks: 1 Seagate 2TB disk, 1 WD Green 2TB disk (in a mirror with the Seagate disk) and 6 WD Red 3TB disks in a RAIDZ2. The motherboard, processor, RAM modules and 3 of the WD Red disks are brand new; the rest of the hardware is at least 2 years old. Before this I had a Dell motherboard with an Intel E5300 processor and 8GB of DDR2 non-ECC RAM, with 3 WD Red disks in RAIDZ1. I deemed RAIDZ1 not safe enough and decided to upgrade to RAIDZ2 with 3 new disks. I was confident this setup would work, and it did for about a month, but now I just cannot find the root of the problem.
I already ran Memtest for 5 passes with no errors. I could run some more passes, but frankly the RAM is brand new and has ECC, so I do not believe the RAM is to blame. I also managed to look at the S.M.A.R.T. data of all the disks and found nothing out of the ordinary. I have tried the fresh install and config upload probably 10 times and I always get the same issue.
I read some old posts about similar errors where the culprit was multithreading. But that was in FreeNAS 7, and the conclusion was that it would be fixed in future updates. Two generations later, I cannot imagine this is the same problem.
#2 Updated by Jelle Rijnboutt about 4 years ago
It managed to boot successfully once and I could see my pools were healthy, which is a relief, but quite soon after it crashed again, in the same manner as in my other bug report, 17140. I am now quite certain this is the same problem, especially because I checked the data in the crash folder and found that the panic file also contained "spin lock held too long". So bugs 17140 and 17240 most likely have the same root problem, but I am still completely clueless about what is causing it. It is especially weird that the error occurs at random times. Most of the time it fails during boot, but this time I was able to briefly boot into the system before it crashed anyway.
#3 Updated by Josh Paetzel about 4 years ago
- Status changed from Unscreened to 15
- Priority changed from No priority to Important
While it's impossible to make guesses without more data, this type of thing is generally not a software problem. It's usually some piece of hardware going out to lunch, and the software eventually gives up and says, "I can't wait forever here, so I guess I'll just pull the ripcord."
/data/crash will have a more complete crash dump. The fact that it was stable until you reloaded the config, and the fact that the panic happened while doing ZFS'y things, lead me to believe the issue is likely with your storage subsystem.
If you pull the disks out it won't be able to import the pool and will likely be stable and allow you to copy off /data/crash and attach it to this ticket.
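Before attaching the folder, it can be worth confirming that the dump really contains the panic string from this ticket. A minimal sketch in Python, assuming the files under /data/crash are plain text (the file names inside that folder are not given in this ticket, so the script simply walks everything it finds). `find_panic_lines` is a hypothetical helper written for this note, not part of FreeNAS, and it does not unpack compressed archives:

```python
from pathlib import Path

PANIC_TEXT = "spin lock held too long"  # panic message quoted in this ticket


def find_panic_lines(crash_dir, needle=PANIC_TEXT):
    """Return (file name, line number, line) for every line containing needle."""
    hits = []
    for path in sorted(Path(crash_dir).rglob("*")):
        if not path.is_file():
            continue
        # Crash data may contain non-text bytes, so decode leniently.
        text = path.read_text(errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path.name, number, line.strip()))
    return hits


if __name__ == "__main__":
    # /data/crash is the location named above; adjust if the folder was copied elsewhere.
    for name, number, line in find_panic_lines("/data/crash"):
        print(f"{name}:{number}: {line}")
```

An empty result only means the string was not found in readable text; it does not rule out a panic recorded in a compressed or binary file.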
#4 Updated by Jelle Rijnboutt about 4 years ago
- File crash.zip added
You are likely right, because the first time it booted successfully. I attached the crash folder. While you look through it for a possible root of the problem, I will try to find out whether a defective cable is to blame; I have some spare SATA cables lying around.
#5 Updated by Josh Paetzel about 4 years ago
Ok, I'd like to try something. It looks like it's an encrypted pool with a bunch of disks on a two-core system.
This will create hundreds of geli threads.
It looks like there's either a bug in geli, or the hardware is dropping interrupts.
First, can you confirm the pool is encrypted and what CPU you are using?
If the pool is encrypted:
Try booting the system without the disks, then add the following tunable of type sysctl
with a value of 4
Then shut the system down, put the disks back in, and see if it's stable.
#6 Updated by Jelle Rijnboutt about 4 years ago
There is an encrypted pool, but it is not in use. The disk belonging to that pool is no longer in use and I just never deleted the pool. If that is causing trouble, I am more than willing to delete it. In any case, the other pools, which are in use and are important, are not encrypted. The CPU I am using is an Intel Core i3-4170, which is a dual-core with hyper-threading. It also supports Intel AES New Instructions, which I believe is a kind of hardware acceleration for encryption, but as I said, I do not use encryption, so it is not really relevant anyway. Actually, I am just going to delete that 'ghost pool' I still have and don't use, because I should have done it anyway, and then it can be excluded from the list of potential causes.
#8 Updated by Jelle Rijnboutt about 4 years ago
It did not. But I am now fairly certain that either one of my cables or one of my disks is defective, even though there were no SMART errors. I have tried booting the system several times, each time with 2 different disks of the RAIDZ2 pool disconnected, and through that I believe I now know which disk is causing it. First I am going to try a new cable, because obviously that would be easier to replace. If that fails to solve the issue, I will be forced to remove this disk from the pool and RMA it. Maybe I'll just wipe it first and resilver the pool, in case it just got corrupted, but if that also fails then I'd really have no choice.
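The elimination described above (booting with two disks disconnected at a time) can be written down as a small bookkeeping sketch in Python. It rests on two strong assumptions that this ticket does not establish: exactly one disk or cable is at fault, and the crash shows up on every boot in which that disk is connected. The disk labels `d1` to `d6` and the helper `narrow_suspects` are hypothetical, chosen only for illustration:

```python
def narrow_suspects(disks, boots):
    """Narrow down the faulty disk from a series of test boots.

    disks: labels of all disks in the pool.
    boots: list of (removed_disks, crashed) pairs, one per test boot.
    Assumes a single culprit that crashes every boot it takes part in.
    """
    candidates = set(disks)
    for removed, crashed in boots:
        removed = set(removed)
        if crashed:
            # The culprit was still connected, so the removed disks are cleared.
            candidates -= removed
        else:
            # The system was stable, so the culprit is among the removed disks.
            candidates &= removed
    return sorted(candidates)


if __name__ == "__main__":
    disks = ["d1", "d2", "d3", "d4", "d5", "d6"]
    boots = [
        (("d1", "d2"), True),   # crashed with d1 and d2 unplugged
        (("d3", "d4"), False),  # stable with d3 and d4 unplugged
        (("d3", "d5"), True),   # crashed with d3 and d5 unplugged
    ]
    print(narrow_suspects(disks, boots))  # ['d4']
```

Because the panic in this ticket appears at random times, a single stable boot is weak evidence. An empty result means the test boots contradict the single-culprit assumption.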
#9 Updated by Jelle Rijnboutt about 4 years ago
So the disk I thought was the problem was not the problem, because without it I got the errors again. Now I am trying to boot with different disks disconnected, but with 4 of the 6 disks connected it can't even open the pool. When I try the command 'zpool import mypool' it says I/O error: Destroy and recreate pool from backup. That is not supposed to happen. It must mean one of those disks is corrupt. But the disks that are connected now are the ones I thought were the problem and apparently were not, and they must be causing this import failure. So perhaps they are corrupt, but not causing the panics. I am back at the beginning. Any suggestions, Josh?
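For reference, the expectation that 4 of 6 disks should be enough follows from RAIDZ2 keeping two disks' worth of parity per vdev. A minimal sketch of that arithmetic in Python; `vdev_should_import` is a hypothetical helper written for this note, not a ZFS command, and it only counts missing disks. It says nothing about disks that are connected but returning bad data:

```python
def vdev_should_import(total_disks, parity, connected):
    """True if the number of missing disks is within the vdev's parity level."""
    missing = total_disks - connected
    if missing < 0:
        raise ValueError("more disks connected than the vdev contains")
    return missing <= parity


if __name__ == "__main__":
    # The pool in this ticket: 6 disks in RAIDZ2 (parity 2), 4 connected.
    print(vdev_should_import(total_disks=6, parity=2, connected=4))  # True
    print(vdev_should_import(total_disks=6, parity=2, connected=3))  # False
```

With exactly two disks missing there is no redundancy left, so an I/O error at import would be consistent with at least one of the four connected disks, or its path to the controller, also failing to deliver good data.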
#12 Updated by Jelle Rijnboutt about 4 years ago
Now I am facing an entirely different problem. When I try to start my server, it fails to get past the BIOS screen where it says 'system initializing' and shows the number 15 in the bottom right corner. Apparently POST code 15 would indicate problems with my memory, but I do not believe that. I tried booting after removing different RAM modules, but it never got past that screen. On top of that, I let memtest86 run for a whole night yesterday and it found no errors. I know for sure my memory is compatible, as it performed well for about a month, and frankly, with all the trouble I had before this, the motherboard just seems a much more likely culprit than the memory. It would explain everything if the motherboard were defective. Don't suppose you have any ideas about this?
#15 Updated by Jelle Rijnboutt almost 4 years ago
- File FREENAS_error.jpg added
- File 2.jpg added
- Target version set to 9.10.1-U2
Well, I got a replacement for my motherboard and now at least it boots again, but the underlying issue with FreeNAS has not disappeared. I am still getting a lot of different errors, all concerning communication with my disks, I think. I added some screenshots of the errors I am getting now. I tried a fresh install (9.10.1-U2) again, but that did not solve the issue; it crashed too. So even when the volume is not mounted, the disks cause problems. At least we now know that the issue is not caused by a defect in the motherboard, or at least that this would be very unlikely. Could it be a compatibility issue with the SAS controller or something like that? It is so odd that my old Dell computer ran more reliably than this server-grade motherboard. I thought the X10SL7-F was a good choice for FreeNAS. I would love to hear some new ideas.