Project

General

Profile

Bug #29133

Fix" Unauthorized Reboot" emails when system did not actually reboot

Added by T F over 3 years ago. Updated over 3 years ago.

Status:
Done
Priority:
No priority
Assignee:
Vladimir Vinogradenko
Category:
Middleware
Target version:
Seen in:
Severity:
Medium
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

Reporting this as a bug from: https://forums.freenas.org/index.php?threads/unauthorized-reboot-11-1-release.60260

NOTE that some users in that thread are reporting their systems ARE actually rebooting. Thankfully for me that is NOT the case. For me, it seems the middleware python PID is crashing & dumping core which is in turn touching the /data/sentinels/unauthorized-reboot file which is triggering the reboot email. So far since upgrading various systems from various prior FN releases (9.10 & 10.x I believe) I have seen this on 3 different systems - 2 of which had been running for ~1 year with no such issues/emails and the third had also been running quietly for ~6months.

Copying & pasting my latest forum post with the details:
Mine still aren't actually rebooting, but I received another email from a new system today. Tracked it down to this:

[CODE]Mar 6 03:45:03 hostname1 kernel: pid 262 (python3.6), uid 0: exited on signal 11 (core dumped)[/CODE]
So I checked for python processes started around that time and found this:

[CODE]
root 13531 0.0 0.1 74884 65288 - I 03:45 0:02.68 /usr/local/bin/python /etc/find_alias_for_smtplib.py -FCronDaemon -odi -oem -oi -t (python3.6)
root 13665 0.0 0.2 188560 138304 - S 03:45 0:12.82 python3.6: middlewared (python3.6)[/CODE]
Went back and found the same thing on another system:
[CODE]Feb 28 12:19:08 hostname2 kernel: pid 250 (python3.6), uid 0: exited on signal 11 (core dumped)[/CODE]
And again:
[CODE]root 42507 0.0 0.2 194096 136816 - S Wed12 22:34.33 python3.6: middlewared (python3.6)[/CODE]
hey [USER=21716]@Scharbag[/USER] any updates on your bug? I still can't see it. I'm about ready to log one of mine own too because my hosts thankfully aren't actually rebooting ... they're just emailing me telling me they did erroneously.

The bug referenced above is 28028, but I still get a 403 forbidden error when I try to view it.


Related issues

Copied to FreeNAS - Bug #29785: Add a worker process to use with non thread-safe modulesDone

Associated revisions

Revision e1233663 (diff)
Added by William Grzybowski over 3 years ago

feat(middlewared): add `run_in_proc` Ticket: #29133

Revision c3a022b8 (diff)
Added by William Grzybowski over 3 years ago

fix(middlewared/alert): only issue an email if uptime is lower than 3m Ticket: #29133

Revision e75a77bc (diff)
Added by William Grzybowski over 3 years ago

fix(middlewared/alert): only issue an email if uptime is lower than 3m Ticket: #29133 (cherry picked from commit c3a022b8720d0db0698ae450755cb562b3f84dc2)

Revision 98c50072 (diff)
Added by Vladimir Vinogradenko over 3 years ago

fix(middlewared/alert): only issue an email if middleware starts for the first time Ticket: #29133

Revision 8dc60903 (diff)
Added by Vladimir Vinogradenko over 3 years ago

fix(middlewared/alert): only issue an email if middleware starts for the first time Ticket: #29133

Revision 4d9bfa2c (diff)
Added by William Grzybowski over 3 years ago

feat(middlewared): add `run_in_proc` Ticket: #29133

Revision 096802d8 (diff)
Added by William Grzybowski over 3 years ago

feat(middlewared): add `run_in_proc` Ticket: #29133

History

#1 Updated by Dru Lavigne over 3 years ago

  • Private changed from No to Yes
  • Reason for Blocked set to Need additional information

Please attach a debug to this ticket (System -> Advanced -> Save Debug).

#2 Updated by T F over 3 years ago

Dru - is there any way to programmatically sanitize the data produced by save debug? The last time I checked it disclosed way more information than I expected - including the system DB complete with password hashes and internal usernames/groups/LDAP-info/IPaddresses/etc

#3 Updated by Dru Lavigne over 3 years ago

Let's start with /var/log/messages and /var/log/middleware, and see if the dev needs anything more than that once they start investigating.

#5 Updated by William Grzybowski over 3 years ago

  • Status changed from Unscreened to Not Started
  • Assignee changed from Release Council to William Grzybowski

It seems middlewared is segfaulting.

Can you attach /var/db/system/cores/python3.6.core ?

#6 Updated by William Grzybowski over 3 years ago

  • Status changed from Not Started to Blocked
  • Reason for Blocked changed from Need additional information to Waiting for feedback

#7 Updated by T F over 3 years ago

  • File python3.6.core.gz added

attached

#8 Updated by William Grzybowski over 3 years ago

The process is segfault'ing in OpenSSL code, which seems strange at the least.

How many times did that happen? And can you reliably reproduce it somehow?

#9 Updated by T F over 3 years ago

For me it has happened one time on each of 3 machines. 1 of the 3 is actively used for AFP sharing, while the other two are replication targets that no users connect to directly at this time. I haven't been able to discern any behavior patterns that cause it to occur and the times when it has happened have been widespread:

t02 - System booted at Tue Feb 27 04:27:55 2018 was not shut down properly
t01 - System booted at Wed Feb 28 12:19:15 2018 was not shut down properly
n02 - System booted at Tue Mar 6 03:45:10 2018 was not shut down properly

I haven't tried to reproduce it and the systems were likely mostly/completely idle at all of those times. I can try and grab the core dumps from the other two systems to see if it's the same thing.

#10 Updated by William Grzybowski over 3 years ago

T F wrote:

For me it has happened one time on each of 3 machines. 1 of the 3 is actively used for AFP sharing, while the other two are replication targets that no users connect to directly at this time. I haven't been able to discern any behavior patterns that cause it to occur and the times when it has happened have been widespread:

t02 - System booted at Tue Feb 27 04:27:55 2018 was not shut down properly
t01 - System booted at Wed Feb 28 12:19:15 2018 was not shut down properly
n02 - System booted at Tue Mar 6 03:45:10 2018 was not shut down properly

I haven't tried to reproduce it and the systems were likely mostly/completely idle at all of those times. I can try and grab the core dumps from the other two systems to see if it's the same thing.

Yeah, lets get them to make sure its the same thing.

This happened when trying to send an email, so one way is try to send test emails in them.

#11 Updated by T F over 3 years ago

  • File t01-python.core.gz added
  • File t02-python.core.gz added

attached

#12 Updated by T F over 3 years ago

also are you sure about it being an email that triggered this? the system sent an email after the fact to alert me for sure, but I didn't receive another email prior to this (though maybe it crashed before it could send me a prior one). I sent a test email from 2 of the 3 systems and both arrived without python appearing to have crashed

#13 Updated by William Grzybowski over 3 years ago

T F wrote:

also are you sure about it being an email that triggered this? the system sent an email after the fact to alert me for sure, but I didn't receive another email prior to this (though maybe it crashed before it could send me a prior one). I sent a test email from 2 of the 3 systems and both arrived without python appearing to have crashed

Yes, the thread that did tracecback originated on the mail send call, while starting SSL TLS routines.

I wonder if that would happen again after a reboot?

#14 Updated by William Grzybowski over 3 years ago

William Grzybowski wrote:

T F wrote:

also are you sure about it being an email that triggered this? the system sent an email after the fact to alert me for sure, but I didn't receive another email prior to this (though maybe it crashed before it could send me a prior one). I sent a test email from 2 of the 3 systems and both arrived without python appearing to have crashed

Yes, the thread that did tracecback originated on the mail send call, while starting SSL TLS routines.

I wonder if that would happen again after a reboot?

The other two segfaults are something else entirely.

#15 Updated by William Grzybowski over 3 years ago

Also, any chance you have been using iocage (jails in new UI?)

#16 Updated by T F over 3 years ago

William Grzybowski wrote:

Also, any chance you have been using iocage (jails in new UI?)

Negative ghost rider

#17 Updated by William Grzybowski over 3 years ago

That is very strange, can you also attach messages and middlewared.log of these other machines as well?

#18 Updated by William Grzybowski over 3 years ago

Also one more question, were all of these running 11.1-U2, 100% sure?

#19 Updated by William Grzybowski over 3 years ago

  • Target version set to 11.1-U3

ping

#20 Updated by T F over 3 years ago

100% sure all are running 11.1-U2. let me go back and grab the logs from t02 & n02

#21 Updated by T F over 3 years ago

t02 messages:
Feb 27 04:27:13 t02 daemon6339: 2018/02/27 04:27:13 [WARN] agent: Check 'freenas_health' is now warning
Feb 27 04:27:47 t02 kernel: pid 250 (python3.6), uid 0: exited on signal 10 (core dumped)
Feb 27 04:29:13 t02 daemon6339: 2018/02/27 04:29:13 [WARN] agent: Check 'freenas_health' is now warning

t02 middleware.log:
[2018/02/26 01:54:29] (DEBUG) ServiceService._simplecmd():255 - Calling: restart(collectd)
[2018/02/27 04:27:50] (DEBUG) raven.base.Client.set_dsn():243 - Configuring Raven for host: <raven.conf.remote.RemoteConfig object at 0x80ba20c50>
[2018/02/27 04:27:50] (DEBUG) middlewared.__plugins_load():671 - Loading plugins from /usr/local/lib/python3.6/site-packages/middlewared/plugins
[2018/02/27 04:27:54] (WARNING) middlewared._loop_monitor_thread():997 - Task seems blocked: File "/usr/local/lib/python3.6/asyncio/coroutines.py", line 94, in init
self._source_traceback = traceback.extract_stack(sys._getframe(1))
File "/usr/local/lib/python3.6/traceback.py", line 207, in extract_stack
stack = StackSummary.extract(walk_stack(f), limit=limit)
File "/usr/local/lib/python3.6/traceback.py", line 358, in extract
f.line
File "/usr/local/lib/python3.6/traceback.py", line 282, in line
self._line = linecache.getline(self.filename, self.lineno).strip()
File "/usr/local/lib/python3.6/linecache.py", line 16, in getline
lines = getlines(filename, module_globals)
File "/usr/local/lib/python3.6/linecache.py", line 47, in getlines
return updatecache(filename, module_globals)
File "/usr/local/lib/python3.6/linecache.py", line 136, in updatecache
with tokenize.open(fullname) as fp:
File "/usr/local/lib/python3.6/tokenize.py", line 454, in open
encoding, lines = detect_encoding(buffer.readline)
File "/usr/local/lib/python3.6/tokenize.py", line 423, in detect_encoding
first = read_or_stop()
File "/usr/local/lib/python3.6/tokenize.py", line 381, in read_or_stop
return readline()

[2018/02/27 04:27:55] (DEBUG) mDNSBrowserService.start():466 - mDNSBrowserService: start()
[2018/02/27 04:27:55] (DEBUG) middlewared.__plugins_load():730 - All plugins loaded
[2018/02/27 04:27:55] (DEBUG) middlewared.debug():161 - DiscoverThread: name=_net-assistant flags=0x00000003 error=0
[2018/02/27 04:27:55] (DEBUG) middlewared.debug():161 - DiscoverThread: name=_printer flags=0x00000003 error=0
[2018/02/27 04:27:55] (DEBUG) middlewared.debug():161 - DiscoverThread: name=_ipp flags=0x00000003 error=0
  • SNIP: just continues logging all the things it discovers via bonjour on the LAN **

n02 messages:
Mar 6 00:00:00 n02 syslog-ng9113: Configuration reload request received, reloading configuration;
Mar 6 03:45:03 n02 kernel: pid 262 (python3.6), uid 0: exited on signal 11 (core dumped)
Mar 7 00:00:00 n02 syslog-ng9113: Configuration reload request received, reloading configuration;

n02 middleware.log:
[2018/02/26 02:40:41] (DEBUG) middleware.notifier._pipeopen():185 - Popen()ing: /sbin/zfs list -H -o name -t volume,filesystem
[2018/02/26 02:40:42] (DEBUG) middleware.notifier._pipeopen():185 - Popen()ing: /usr/local/bin/net groupmap list
[2018/02/26 02:40:42] (DEBUG) middleware.notifier._pipeopen():185 - Popen()ing: /usr/local/bin/net groupmap list
Exception in thread Thread-58:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 549, in run
self.register()
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 546, in register
self._register(self.hostname, self.regtype, self.port)
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 536, in _register
pybonjour.DNSServiceProcessResult(ref)
File "/usr/local/lib/python3.6/site-packages/pybonjour.py", line 895, in DNSServiceProcessResult
_DNSServiceProcessResult(sdRef)
File "/usr/local/lib/python3.6/site-packages/pybonjour.py", line 286, in _errcheck
raise cls(result)
pybonjour.BonjourError: (-65563, 'unknown')

Exception in thread Thread-56:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 549, in run
self.register()
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 546, in register
self._register(self.hostname, self.regtype, self.port)
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 536, in _register
pybonjour.DNSServiceProcessResult(ref)
File "/usr/local/lib/python3.6/site-packages/pybonjour.py", line 895, in DNSServiceProcessResult
_DNSServiceProcessResult(sdRef)
File "/usr/local/lib/python3.6/site-packages/pybonjour.py", line 286, in _errcheck
raise cls(result)
pybonjour.BonjourError: (-65563, 'unknown')

Exception in thread Thread-60:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 549, in run
self.register()
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 546, in register
self._register(self.hostname, self.regtype, self.port)
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 536, in _register
pybonjour.DNSServiceProcessResult(ref)
File "/usr/local/lib/python3.6/site-packages/pybonjour.py", line 895, in DNSServiceProcessResult
_DNSServiceProcessResult(sdRef)
File "/usr/local/lib/python3.6/site-packages/pybonjour.py", line 286, in _errcheck
raise cls(result)
pybonjour.BonjourError: (-65563, 'unknown')

Exception in thread Thread-54:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 549, in run
self.register()
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 546, in register
self._register(self.hostname, self.regtype, self.port)
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 536, in _register
pybonjour.DNSServiceProcessResult(ref)
File "/usr/local/lib/python3.6/site-packages/pybonjour.py", line 895, in DNSServiceProcessResult
_DNSServiceProcessResult(sdRef)
File "/usr/local/lib/python3.6/site-packages/pybonjour.py", line 286, in _errcheck
raise cls(result)
pybonjour.BonjourError: (-65563, 'unknown')

Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.6/site-packages/middlewared/plugins/mdns.py", line 201, in run
pybonjour.DNSServiceProcessResult(ref)
File "/usr/local/lib/python3.6/site-packages/pybonjour.py", line 895, in DNSServiceProcessResult
_DNSServiceProcessResult(sdRef)
File "/usr/local/lib/python3.6/site-packages/pybonjour.py", line 286, in _errcheck
raise cls(result)
pybonjour.BonjourError: (-65563, 'unknown')

[2018/02/26 02:40:46] (DEBUG) ServiceService._simplecmd():255 - Calling: restart(collectd)
[2018/02/26 02:40:46] (DEBUG) ServiceService._simplecmd():255 - Calling: restart(collectd)
[2018/02/26 02:40:46] (DEBUG) ServiceService._simplecmd():255 - Calling: reload(disk)
[2018/02/26 02:40:46] (DEBUG) ServiceService._simplecmd():255 - Calling: reload(disk)
[2018/02/26 02:40:47] (DEBUG) ServiceService._simplecmd():255 - Calling: restart(collectd)
[2018/02/26 02:40:47] (DEBUG) ServiceService._simplecmd():255 - Calling: restart(collectd)
[2018/03/06 03:45:06] (DEBUG) raven.base.Client.set_dsn():243 - Configuring Raven for host: <raven.conf.remote.RemoteConfig object at 0x80ba20c50>
[2018/03/06 03:45:06] (DEBUG) middlewared.__plugins_load():671 - Loading plugins from /usr/local/lib/python3.6/site-packages/middlewared/plugins
[2018/03/06 03:45:09] (DEBUG) mDNSBrowserService.start():466 - mDNSBrowserService: start()
[2018/03/06 03:45:10] (DEBUG) middlewared.debug():161 - DiscoverThread: name=_net-assistant flags=0x00000003 error=0
[2018/03/06 03:45:10] (DEBUG) middlewared.debug():161 - DiscoverThread: name=_printer flags=0x00000003 error=0
[2018/03/06 03:45:10] (DEBUG) middlewared.debug():161 - DiscoverThread: name=_http flags=0x00000003 error=0
  • SNIP: just continues logging all the things it discovers via bonjour on the LAN **

#22 Updated by Dru Lavigne over 3 years ago

  • Subject changed from "Bogus" Unauthorized Reboot Emails (systems are not actually rebooting) in FN 11.1-U2 to Fix" Unauthorized Reboot Emails when system did not actually reboot

#23 Updated by Dru Lavigne over 3 years ago

  • File deleted (messages)

#24 Updated by Dru Lavigne over 3 years ago

  • File deleted (middlewared.log)

#25 Updated by Dru Lavigne over 3 years ago

  • Subject changed from Fix" Unauthorized Reboot Emails when system did not actually reboot to Fix" Unauthorized Reboot" emails when system did not actually reboot

#26 Updated by William Grzybowski over 3 years ago

  • Copied to Bug #29785: Add a worker process to use with non thread-safe modules added

#27 Updated by William Grzybowski over 3 years ago

  • Status changed from Blocked to In Progress
  • Reason for Blocked deleted (Waiting for feedback)

#28 Updated by Vladimir Vinogradenko over 3 years ago

  • Status changed from In Progress to Done
  • Assignee changed from William Grzybowski to Vladimir Vinogradenko

#29 Updated by T F over 3 years ago

does this mean the middleware crashes will continue and the fix will be to suppress the email notification?

#30 Updated by Vladimir Vinogradenko over 3 years ago

You are right. We are working on these crashes in https://redmine.ixsystems.com/issues/29785

#31 Updated by T F over 3 years ago

Gotcha. Can I get access to that bug (currently get a 403)?

#32 Updated by William Grzybowski over 3 years ago

T F wrote:

Gotcha. Can I get access to that bug (currently get a 403)?

Try now

#33 Updated by Dru Lavigne over 3 years ago

  • File deleted (python3.6.core.gz)

#34 Updated by Dru Lavigne over 3 years ago

  • File deleted (t01-python.core.gz)

#35 Updated by Dru Lavigne over 3 years ago

  • File deleted (t02-python.core.gz)

#36 Updated by Dru Lavigne over 3 years ago

  • Private changed from Yes to No

Also available in: Atom PDF