Project

General

Profile

Bug #25636

regular unexpected reboots

Added by James N. about 3 years ago. Updated about 3 years ago.

Status:
Closed: User Config Issue
Priority:
No priority
Assignee:
Alexander Motin
Category:
OS
Target version:
Seen in:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

Fractal Design Node 804
Intel Xeon E5-2620 V4 2.1 GHz LGA 2011-3 85W BX80660E52620V4 Server Processor
ASRock X99M Extreme4
Crucial 32GB (2 x 16GB) 288-Pin DDR4 SDRAM ECC DDR4 2133 (PC4 17000) Server Memory Model CT2K16G4RFD4213
8x HDD 8TB WD Red
Crucial 128GB SSD (Used for Jails)
EVGA 550 B1 PSU
Boot volume is dual 8GB Kingston DTSE9 G2 USB 3.0 flash drives
-generic video card for console management when needed

ChangeLog Required:
No

Description

I have been attempting to troubleshoot this issue all day today. I was running 9.10.2-U2 until this AM. I started getting unexpected reboots consistently. While troubleshooting I checked everything I could think of and followed several pieces of advice on forums. I have disabled extra features and stopped/ disabled all my jails. Only active resource at this point is my iscsi targets. I found what appears to be a related bug entry Bug #21850. After several forum posts suggested upgrading to latest stable I have now upgraded to U6.

Shortly after upgrading and rebooting a few times to ensure it was stable/ booting on the correct version I experience another crash.

I have attached some of the debug exports I did earlier today.

I am hoping for stability during business hours tomorrow as this storage is currently being used as a iscsi target for my Hyper-V failover cluster and the contained VHDX holds most of the user file data.

systat-vm.PNG (87.8 KB) systat-vm.PNG James N., 08/20/2017 11:01 PM
systat-vm-2.PNG (106 KB) systat-vm-2.PNG James N., 08/21/2017 06:15 PM
temp check.PNG (141 KB) temp check.PNG James N., 08/22/2017 11:29 AM
IMG_20170822_201311.jpg (2.94 MB) IMG_20170822_201311.jpg PSU Hardware diagnostic James N., 08/22/2017 10:19 PM
IMG_20170823_011218.jpg (5.22 MB) IMG_20170823_011218.jpg KDB enter panic James N., 08/22/2017 11:14 PM
IMG_20170823_034713.jpg (4.04 MB) IMG_20170823_034713.jpg James N., 08/23/2017 01:53 AM
12196
12210
12218
12228
12230
12232

History

#1 Updated by James N. about 3 years ago

  • File debug-freenas-20170820235142.txz added

#2 Updated by James N. about 3 years ago

  • File post_django_restart_iscsi_remained_up_plugins_disabled_-_debug-freenas-20170820202911.tgz added

#3 Updated by James N. about 3 years ago

  • File debug-freenas-20170820091049.tgz added

#4 Updated by James N. about 3 years ago

  • File debug-freenas-20170820010229.tgz added

#5 Updated by James N. about 3 years ago

12196

FreeNAS rebooted again with no new info in log. I was monitoring systat -vm during the reboot and took a screenshot which is attached.

At this point I know the data is good according to the smart scans, scrubs etc, however I can't keep the system running long enough to move the data to new storage. Looking for any help possible to figure out what is causing the reboots.

#6 Updated by James N. about 3 years ago

  • Hardware Configuration updated (diff)

Edit* Added hardware config in case it is needed.

System was stable for weeks prior to this issue starting lat Friday night/ Saturday morning.

#7 Updated by Dru Lavigne about 3 years ago

  • Assignee changed from Release Council to Alexander Motin

Sasha: from the logs, upsmon seems to be part of the issue.

#8 Updated by James N. about 3 years ago

Dru Lavigne wrote:

Sasha: from the logs, upsmon seems to be part of the issue.

I disabled the UPS service because of all the UPSmon logs. Do you think it might still be the cause of the issue? I didn't disable it until late yesterday so there are errors in logs for UPSmon going through the earlier logs. I believe I disabled UPS service between 8 am CST and 12 pm CST on sunday 8.20. debug-freenas-20170820235142.txz should be the most recent logs, anything after 12:00 should be post the UPS service disable.

#9 Updated by James N. about 3 years ago

  • File debug-freenas-20170821072238.tgz added

I am attaching an updated debug from this AM. Logs have rolled over since yesterday and the UPSmon errors are no longer present but the system is still unstable.

searching through the messages log there appears to be the same repeating pattern

Aug 21 03:09:41 freenas proftpd2220: 127.0.0.1 - ProFTPD 1.3.5a (maint) (built Tue Jul 25 2017 00:16:58 UTC) standalone mode STARTUP
Aug 21 03:09:42 freenas root: /etc/rc: WARNING: failed precmd routine for vmware_guestd
Aug 21 03:09:43 freenas smbd: dnssd_clientstub ConnectToServer: connect()-> No of tries: 1
Aug 21 03:09:44 freenas smbd: dnssd_clientstub ConnectToServer: connect()-> No of tries: 2
Aug 21 03:09:45 freenas smbd: dnssd_clientstub ConnectToServer: connect()-> No of tries: 3
Aug 21 03:09:46 freenas smbd: dnssd_clientstub ConnectToServer: connect() failed path:/var/run/mdnsd Socket:36 Err:-1 Errno:2 No such file or directory
Aug 21 03:10:00 freenas alert.py: [common.pipesubr:66] Popen()ing: /usr/local/sbin/dmidecode -s system-product-name
Aug 21 03:10:00 freenas alert.py: [common.pipesubr:66] Popen()ing: /usr/local/sbin/dmidecode -s baseboard-product-name
Aug 21 03:46:53 freenas syslog-ng1475: syslog-ng starting up; version='3.6.4'

it looks like the system is running as you would expect and then there is an alert that syslog-ng starting up. I looked into the alert.py that keeps repeating in the logs but that appears to be something we can ignore? Since nothing is really being logged in messages, any suggestions on where else to look?

#10 Updated by James N. about 3 years ago

After throttling back external processes and limiting over all actions against the storage it appears to be remaining stable for a longer period of time.

It looks as though high levels of iscsi calls are causing the storage to crash. the iscsi debug seems to be void of useful information and I don't see any system alerts indicating it could be hardware/ I see no indication of any hardware tests needed.

Suggestions would be appreciated at this point.

#11 Updated by James N. about 3 years ago

12210

The server crashed again when the disk load was very low. There was almost no iSCSI traffic active at the time of this system crash.

#12 Updated by Alexander Motin about 3 years ago

  • Status changed from Unscreened to 15
  • Target version set to N/A

Unfortunately your debug data mention only one kernel panic three months ago. Have you seen what happened on console during server reboot? Any messages? Failed attempts to dump the core? If system just instantly reboots without any messages, then I would guess some hardware issue.

#13 Updated by Alexander Motin about 3 years ago

If it is a software problem, to get any more input you may enable debugging kernel. Also it would be good to migrate to 11.0 branch, since we are not going to fix 9.10 any way, even if we find something there.

#14 Updated by James N. about 3 years ago

Alexander Motin wrote:

Unfortunately your debug data mention only one kernel panic three months ago. Have you seen what happened on console during server reboot? Any messages? Failed attempts to dump the core? If system just instantly reboots without any messages, then I would guess some hardware issue.

I have not seen any messages as I have not been in front of the monitor during any of the crashes. Is there any way to force monitor output to be viewed via shell?

If you think it is hardware, where would you start? I am not seeing any errors in smart etc. The systat values don't imply that it is crashing due to high mem or cpu utilization best I can tell (let me know if I am mistaken).

#15 Updated by James N. about 3 years ago

Alexander Motin wrote:

If it is a software problem, to get any more input you may enable debugging kernel. Also it would be good to migrate to 11.0 branch, since we are not going to fix 9.10 any way, even if we find something there.

Is there any risk to migrating to the 11.0 branch on a production system with active pool? I have never changed branches with FreeNAS before, I have only ever run upgrades to current branches. It appears to be an easy swap, I just don't want to end up in a failed state. I installed 9.10.2 because I was under the impression that coral wasn't supported going forward and 9.10.2 was the next latest supported stable branch.

#16 Updated by Alexander Motin about 3 years ago

James Nicholas wrote:

I have not seen any messages as I have not been in front of the monitor during any of the crashes. Is there any way to force monitor output to be viewed via shell?

Closest alternative could be IPMI console (if you have it) or serial console.

If you think it is hardware, where would you start? I am not seeing any errors in smart etc. The systat values don't imply that it is crashing due to high mem or cpu utilization best I can tell (let me know if I am mistaken).

I'd think first about power supply or temperature. If system has any log event in BIOS or IPMI, I'd check it for anything related (may be cleared it first and then let it repeat once more).

#17 Updated by Alexander Motin about 3 years ago

James Nicholas wrote:

Is there any risk to migrating to the 11.0 branch on a production system with active pool? I have never changed branches with FreeNAS before, I have only ever run upgrades to current branches. It appears to be an easy swap, I just don't want to end up in a failed state. I installed 9.10.2 because I was under the impression that coral wasn't supported going forward and 9.10.2 was the next latest supported stable branch.

If you worry, you can dump configuration, do fresh install on other media and reload the configuration. Though even with simple update you should be able to revert back to previous boot environment. FreeNAS does not update ZFS pool automatically, so you should always be able to revert back until you request pool update explicitly.

#18 Updated by James N. about 3 years ago

12218

Alexander Motin wrote:

Closest alternative could be IPMI console (if you have it) or serial console.

I'd think first about power supply or temperature. If system has any log event in BIOS or IPMI, I'd check it for anything related (may be cleared it first and then let it repeat once more).

I will go sit in front of it to watch output tonight.

It is in a climate controlled room right now and system is not reporting high temps, so I think it is unlikely temp related but I will double check that tonight. Hard drives are all reporting around 30C (this is consistent). The Crucial drive listed at 20C right now is an SSD used for my Jails - which are all disabled during this troubleshooting. CPU is stable around 17C. Screenshot attached.

Power supply is worth checking. I will change over to a different UPS and re-establish UPSMon tonight as well. Any suggestions on utilities to specifically check the internal power supply?

I will run a memtest86 tonight to confirm memory as well even though there is nothing really pointing at memory right now either.

#19 Updated by James N. about 3 years ago

Alexander Motin wrote:

If you worry, you can dump configuration, do fresh install on other media and reload the configuration. Though even with simple update you should be able to revert back to previous boot environment. FreeNAS does not update ZFS pool automatically, so you should always be able to revert back until you request pool update explicitly.

I will just move over to 11.0 then. As long as there is no risk to the data pool it shouldn't be a problem. Is there a specific patch version on 11.0 I should ensure I get updated to, or is just switching to the 11.0 branch sufficient?

#20 Updated by Alexander Motin about 3 years ago

James Nicholas wrote:

I will just move over to 11.0 then. As long as there is no risk to the data pool it shouldn't be a problem. Is there a specific patch version on 11.0 I should ensure I get updated to, or is just switching to the 11.0 branch sufficient?

Switching branch will automatically update you to latest version.

#21 Updated by James N. about 3 years ago

Are there any hardware tests (besides memtest) you would advise running during tonight's maintenance?

#22 Updated by James N. about 3 years ago

12228
12230

I have performed a physical hardware test on the power supply. It passed perfectly. I have attached a picture of the results. I additionally replaced the UPS the server was attached to.

I attempted to change to the 11.0 branch but the system crashed each time I attempted it from a normal upgrade.

Now after rebooting it appears to have lost it's configured interfaces.

I ran a scrub and found check sum on one of the USB drives after the failed install. I then rebooted the server with that USB detached. It appeared to detect correct settings but then crashed during the boot with no errors on the screen of any kind, just went from loading to black screen (as all the other crashes have been as well) After booting again I noticed that the reason it previously lost network config is it was booting the initial install image very old version. I rebooted and this time selected 9.10.2-U6 and it loaded appropriately.

Shortly after it was all up and running, when the iscsi initiator on my VM host attempted to connect to the FreeNAS target the FreeNAS rebooted again.

Again I booted the server again selecting 9.10.2-U6. It appears to have booted correctly and is running once again.

This was the error I received:
ug. 23, 2017, 12:52 a.m. - The boot volume state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

If the issue is not Temp and not power related, what else could be causing this? These shutdowns appear to be becoming more frequent with each troubleshooting step.

I have attached an updated debug. After exporting the debug, the system crashed. I have uploaded a photo of what appears to have been a kernal panic on the console during this crash.

At this point, I am looking for options to just get the data off of this system before something happens to corrupt/ lose the data. I will continue to troubleshoot this but I need to ensure there isn't a risk on the data given how much it is crashing. If I were to remove each physical disk and place them in an external enclosure and slave it into a vm, would I be able to import my FreeNAS db config and then import the volumes and copy out the data?

#23 Updated by James N. about 3 years ago

  • File debug-freenas-20170823034025.tgz added

Shortly after posting my last update the system crashed again, and when it came back online was unable to boot into the OS at all. After some consideration I figured the issue at hand might be related to the USB drives I am using, and so I have now taken fresh USB stick and done a clean install of 11.0. I imported the config, and reconfigured iscsi and all services are now up and running.

However after booting a VM's who's VHDX is located on the iscsi volume, the system again crashed. It appears to be booting and performing better than before, however it did still crash so I am now uploading again the debug detail.

The system is staying up and running much better than it was, but as it did crash the one time I believe the issue may be present still. Since it passed my prior PSU test, please advise what you would suggest checking next.

#24 Updated by James N. about 3 years ago

12232

Right as I was posting my last update the console readout indicated an issue with an object call back. I have attached a picture of that output.

#25 Updated by Alexander Motin about 3 years ago

  • Status changed from 15 to Closed: User Config Issue

I still think it must be something with the hardware. Aside of power supply, there is also power converters on motherboard (which you barely can test). I generally don't exactly trust desktop motherboard from ASRock in server role. The only panic during boot you've shown does not look like an OS bug that could cause your later spontaneous reboots. It may be caused by BIOS issue in CPU initialization, I had alike few years back on desktop board from MSI, which could not start unless it had USB mouse inserted. :)

FreeNAS does not store pool configuration in its database or tie it to specific disks or controllers. ZFS pool can be imported by any other FreeNAS installation of the same or newer version. Restoring configuration allow you to keep other things like network and services configuration, etc, but it is not necessary to copy the data.

#26 Updated by Alexander Motin about 3 years ago

James Nicholas wrote:

Right as I was posting my last update the console readout indicated an issue with an object call back. I have attached a picture of that output.

It is a harmless bug of performance statistics daemon. It happen on many systems and should not be a problem.

#27 Updated by James N. about 3 years ago

Alexander Motin wrote:

I still think it must be something with the hardware. Aside of power supply, there is also power converters on motherboard (which you barely can test). I generally don't exactly trust desktop motherboard from ASRock in server role. The only panic during boot you've shown does not look like an OS bug that could cause your later spontaneous reboots. It may be caused by BIOS issue in CPU initialization, I had alike few years back on desktop board from MSI, which could not start unless it had USB mouse inserted. :)

FreeNAS does not store pool configuration in its database or tie it to specific disks or controllers. ZFS pool can be imported by any other FreeNAS installation of the same or newer version. Restoring configuration allow you to keep other things like network and services configuration, etc, but it is not necessary to copy the data.

I have not had a system crash since the upgrade to 11.0. I will be going on-site tomorrow night monitor as I attempt to take a full backup of the data. If the system crashes again then I will replace the PSU as a test (even though it past hardware diagnostic). If it does not crash then maybe the upgrade resolved whatever the issue was. I will report back the results.

#28 Updated by James N. about 3 years ago

James Nicholas wrote:

I have not had a system crash since the upgrade to 11.0. I will be going on-site tomorrow night monitor as I attempt to take a full backup of the data. If the system crashes again then I will replace the PSU as a test (even though it past hardware diagnostic). If it does not crash then maybe the upgrade resolved whatever the issue was. I will report back the results.

The system crashed last night at 2:35 AM. I will be doing the prior mentioned maintenance at 8 PM tonight.

#29 Updated by James N. about 3 years ago

  • Hardware Configuration updated (diff)

James Nicholas wrote:

James Nicholas wrote:

I have not had a system crash since the upgrade to 11.0. I will be going on-site tomorrow night monitor as I attempt to take a full backup of the data. If the system crashes again then I will replace the PSU as a test (even though it past hardware diagnostic). If it does not crash then maybe the upgrade resolved whatever the issue was. I will report back the results.

The system crashed last night at 2:35 AM. I will be doing the prior mentioned maintenance at 8 PM tonight.

I pulled the power supply and noticed when I pulled it that the EVGA 550 B1 has an eco button on the inside of the power supply which was turned on. I am wondering since it passed hardware diagnostic test if the eco setting was causing the problem. I have replaced the PSU and it appears stable right now and has passed all other tests I have thrown at it. I will monitor for a bit to ensure it is stable but right now it is looking like eco button was causing it.

#30 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-freenas-20170820235142.txz)

#31 Updated by Dru Lavigne about 3 years ago

  • File deleted (post_django_restart_iscsi_remained_up_plugins_disabled_-_debug-freenas-20170820202911.tgz)

#32 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-freenas-20170820091049.tgz)

#33 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-freenas-20170820010229.tgz)

#34 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-freenas-20170821072238.tgz)

#35 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-freenas-20170823010310.tgz)

#36 Updated by Dru Lavigne about 3 years ago

  • File deleted (debug-freenas-20170823034025.tgz)

#37 Updated by Dru Lavigne about 3 years ago

  • Private changed from Yes to No

Also available in: Atom PDF