Project

General

Profile

Bug #46821

Disable loop debug

Added by Jonathan Caron almost 2 years ago. Updated almost 2 years ago.

Status:
Done
Priority:
No priority
Assignee:
William Grzybowski
Category:
Middleware
Target version:
TrueNAS - TrueNAS-11.1-U6.1
Seen in:
Severity:
Med High
Reason for Closing:
Reason for Blocked:
Needs QA:
No
Needs Doc:
No
Needs Merging:
No
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

Getting the following 2 messages after server re-start, usually within 10-15 minutes. Server responsiveness slowly degrades to the point sbm shares are no longer accessible. Side effect, even ssh to the server is taking longer for authentication and running commands like tail.

Sep 16 09:17:01 nas /alert.py: [system.alert:393] Alert module '<samba4.Samba4Alert object at 0x814d50588>' failed: Failed connection handshake

and

Sep 16 09:15:29 nas /alert.py: [system.alert:393] Alert module '<samba4.Samba4Alert object at 0x814d50588>' failed: timed out

Hardware is

FreeNAS 11.1u6: Supermicro X11SSM, Xeon E3-V5, 16G ram, Seasonic PSU, 4 WD Reds, 1 Pool Geli Encrypted
Running Plex and Syncthing Jails, 1 Linux VM for unifi controller
Setup running since May 2017

I have a debug file for the U6 and the U5 version, both with the problem.

Found 2 related posts on the forums:
https://forums.freenas.org/index.php?threads/samba4-samba4alert.63499/
https://forums.freenas.org/index.php?threads/freenas-sluggish-webui-not-responding-samba4alert-failed-messages.69677/

Willing to help figuring out.

2018.09.27.fstat (11.8 MB) 2018.09.27.fstat Jonathan Caron, 09/28/2018 04:27 AM

Related issues

Has duplicate FreeNAS - Bug #44229: System alert 393 samba4.samba4alert failed handshakeClosed

Associated revisions

Revision a57041c5 (diff)
Added by William Grzybowski almost 2 years ago

fix(middlewared): disable loop debug for 11.1 Ticket: #46821

History

#1 Updated by Dru Lavigne almost 2 years ago

  • Private changed from No to Yes
  • Reason for Blocked set to Need additional information from Author

Jonathan: go ahead and attach the debugs to this ticket. It is marked private until the dev has a chance to review them and determine the cause.

#2 Updated by Jonathan Caron almost 2 years ago

  • File 11.1U6-debug-nas-20180914002443.tgz added

#3 Updated by Jonathan Caron almost 2 years ago

  • File 11.1U5-debug-nas-20180915213122.tgz added

#4 Updated by Jonathan Caron almost 2 years ago

Now also receiving new type of alerts

Sep 18 07:15:58 nas /alert.py: [system.alert:393] Alert module '<update_check.UpdateCheckAlert object at 0x81405f240>' failed: Failed connection handshake
Sep 18 08:54:57 nas uwsgi: [freeadmin.navtree:562] Couldn't retrieve https://10.99.0.15/plugins/syncthing/1/_s/treemenu: The read operation timed out
Sep 18 08:54:57 nas uwsgi: [freeadmin.navtree:562] Couldn't retrieve https://10.99.0.15/plugins/plexmediaserver/2/_s/treemenu: The read operation timed out

Will upload a debug as soon as possible then restart the system, no longer usable with smb shares.

#5 Updated by Jonathan Caron almost 2 years ago

  • File 11.1U5-nonfunctional-debug-nas-20180918113327.tgz added

#6 Updated by Dru Lavigne almost 2 years ago

  • Assignee changed from Release Council to William Grzybowski

#8 Updated by Dru Lavigne almost 2 years ago

  • Reason for Blocked deleted (Need additional information from Author)

#9 Updated by William Grzybowski almost 2 years ago

  • Status changed from Unscreened to Blocked
  • Reason for Blocked set to Need additional information from Author

Hello,

Looking at your debug it seems it ultimately goes down to the boot media devices.

It seems you have two Sandisk and either one or both of them are performing very poorly.

Any change you can try swapping them with something new to rule out USB flash drives going bad?

#10 Updated by Jonathan Caron almost 2 years ago

I've split my mirror, dropped a drive, will test one by one, if no fix, i'll go out and get a new one to test out.
Can you tell me how you've identified the potential mis-behaving drive? i could track the issue easier.

#11 Updated by William Grzybowski almost 2 years ago

Jonathan Caron wrote:

I've split my mirror, dropped a drive, will test one by one, if no fix, i'll go out and get a new one to test out.
Can you tell me how you've identified the potential mis-behaving drive? i could track the issue easier.

Unfortunately we still don't have a way to track this. This is more a hunch and experience seeing this kind of problem. There are some smoking guns in middlewared.log about code taking too much time everytime something in the disk needs to be accessed.

#12 Updated by Jonathan Caron almost 2 years ago

Thank you for the information, I've tried breaking the mirror, it took more time but the errors were back. I've ordered SSDs to replace the keys, ETA wednesday next week, will updatet once replacmeent is done and problem goes away.

Jonathan

#13 Updated by Jonathan Caron almost 2 years ago

SSD installed, starting to monitor. next update in 48 hours.

thank you

Jonathan

#14 Updated by Jonathan Caron almost 2 years ago

  • File debug-nas-20180925111025.tgz added

Hello,

Sadly bad news, messages are back

Sep 25 03:20:18 nas /alert.py: [system.alert:393] Alert module '<samba4.Samba4Alert object at 0x814ccd278>' failed: timed out
Sep 25 03:25:12 nas /alert.py: [system.alert:393] Alert module '<samba4.Samba4Alert object at 0x814ccd278>' failed: Failed connection handshake

I've added the new debug file.

Thank you

Jonathan

#15 Updated by William Grzybowski almost 2 years ago

Thats very strange, it started happening during the night, its when the periodic nightly scripts run.

It seems the boot media is under high load during that period, making some calls timeout.

I see the following process running in the debug:
find -sx / /dev/null -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} +

Is that not going away?

#16 Updated by Jonathan Caron almost 2 years ago

  • File debug-nas-20180925205436.tgz added

Here's my current ps output

root@nas:/mnt/sol/home/jonathan # ps -a
  PID TT  STAT    TIME COMMAND
 4109 v0  Is+  0:01.90 /usr/local/bin/python /etc/netcli (python3.6)
 4110 v1  Is+  0:00.00 /usr/libexec/getty Pc ttyv1
 4111 v2  Is+  0:00.00 /usr/libexec/getty Pc ttyv2
 4112 v3  Is+  0:00.00 /usr/libexec/getty Pc ttyv3
 4113 v4  Is+  0:00.00 /usr/libexec/getty Pc ttyv4
 4114 v5  Is+  0:00.00 /usr/libexec/getty Pc ttyv5
 4115 v6  Is+  0:00.00 /usr/libexec/getty Pc ttyv6
 4116 v7  Is+  0:00.00 /usr/libexec/getty Pc ttyv7
45240  0  Is+  0:00.01 -csh (csh)
 1503  1  Ss   0:00.01 -csh (csh)
 1696  1  S    0:00.01 sudo su
 1716  1  S    0:00.00 su
 1717  1  S    0:00.01 _su (csh)
 1724  1  R+   0:00.00 ps -a

My last error happen at noon, so if it's running every night i'll take a snap of ps tomorrow am.

While i took a debug, the errors came back, here's another debug file if you can compare other things.

As a side note, i had to do a re-install and restore backup to move to my new ssd.

Jonathan

#17 Updated by Jonathan Caron almost 2 years ago

Here's the output of this morning ps-a

root@nas:/mnt/sol/home/jonathan # ps -a
  PID TT  STAT    TIME COMMAND
 4109 v0  Is+  0:01.90 /usr/local/bin/python /etc/netcli (python3.6)
 4110 v1  Is+  0:00.00 /usr/libexec/getty Pc ttyv1
 4111 v2  Is+  0:00.00 /usr/libexec/getty Pc ttyv2
 4112 v3  Is+  0:00.00 /usr/libexec/getty Pc ttyv3
 4113 v4  Is+  0:00.00 /usr/libexec/getty Pc ttyv4
 4114 v5  Is+  0:00.00 /usr/libexec/getty Pc ttyv5
 4115 v6  Is+  0:00.00 /usr/libexec/getty Pc ttyv6
 4116 v7  Is+  0:00.00 /usr/libexec/getty Pc ttyv7
45240  0  Ss   0:00.01 -csh (csh)
59973  0  S    0:00.01 sudo su
59974  0  S    0:00.00 su
59975  0  S    0:00.01 _su (csh)
59977  0  R+   0:00.00 ps -a

let me know if i can provide anything else.

Jonathan

#18 Updated by William Grzybowski almost 2 years ago

Hi,

gonna need "ps axuww", ps -a does not have all the running processes.

#19 Updated by Jonathan Caron almost 2 years ago

% sudo ps axuww
Password:
USER         PID  %CPU %MEM     VSZ     RSS TT  STAT STARTED       TIME COMMAND
root          11 385.0  0.0       0      64  -  RL   Mon20   8495:28.89 [idle]
root        9081  13.2  8.1 2159748 1356156  -  S    Mon20    320:07.60 bhyve: 2_Unifi (bhyve)
root          24   6.4  0.0       0      16  -  DL   Mon20    134:01.57 [vnlru]
root         233   0.1  0.8  197404  136808  -  S    Mon20     10:09.02 python3.6: middlewared (python3.6)
root           0   0.0  0.1       0   17168  -  DLs  Mon20      6:16.68 [kernel]
root           1   0.0  0.0    5388     236  -  ILs  Mon20      0:00.04 /sbin/init --
root           2   0.0  0.0       0      16  -  DL   Mon20      0:00.00 [crypto]
root           3   0.0  0.0       0      16  -  DL   Mon20      0:00.00 [crypto returns]
root           4   0.0  0.0       0      32  -  DL   Mon20      0:18.61 [cam]
root           5   0.0  0.0       0      48  -  DL   Mon20      0:00.05 [ctl]
root           6   0.0  0.0       0     192  -  DL   Mon20      0:33.40 [zfskern]
root           7   0.0  0.0       0      16  -  DL   Mon20      0:00.02 [soaiod1]
root           8   0.0  0.0       0      16  -  DL   Mon20      0:00.02 [soaiod2]
root           9   0.0  0.0       0      16  -  DL   Mon20      0:00.02 [soaiod3]
root          10   0.0  0.0       0      16  -  DL   Mon20      0:00.00 [audit]
root          12   0.0  0.0       0     352  -  WL   Mon20      7:44.91 [intr]
root          13   0.0  0.0       0      64  -  DL   Mon20      0:00.00 [ng_queue]
root          14   0.0  0.0       0      48  -  DL   Mon20      1:13.11 [geom]
root          15   0.0  0.0       0      80  -  DL   Mon20      0:08.75 [usb]
root          16   0.0  0.0       0      16  -  DL   Mon20      0:00.02 [soaiod4]
root          17   0.0  0.0       0      16  -  DL   Mon20      0:42.98 [rand_harvestq]
root          18   0.0  0.0       0      16  -  DL   Mon20      0:01.28 [acpi_thermal]
root          19   0.0  0.0       0      48  -  DL   Mon20      0:13.36 [pagedaemon]
root          20   0.0  0.0       0      16  -  DL   Mon20      0:00.04 [vmdaemon]
root          21   0.0  0.0       0      16  -  DL   Mon20      0:00.00 [pagezero]
root          22   0.0  0.0       0      16  -  DL   Mon20      0:00.56 [bufspacedaemon]
root          23   0.0  0.0       0      16  -  DL   Mon20      0:00.57 [bufdaemon]
root          25   0.0  0.0       0      16  -  DL   Mon20      1:48.19 [syncer]
root         231   0.0  0.0    6344    1044  -  Is   Mon20      0:00.00 daemon: /usr/local/bin/middlewared[233] (daemon)
root         320   0.0  0.0       0      16  -  DL   Mon20      0:03.11 [ipmi0: kcs]
root         360   0.0  0.0       0      16  -  DL   Mon20      0:00.00 [g_mp_kt]
root         421   0.0  0.0       0      16  -  DL   Mon20      0:00.04 [g_mirror swap0]
root         423   0.0  0.0       0      16  -  DL   Mon20      0:00.05 [g_mirror swap1]
root         425   0.0  0.0       0      16  -  DL   Mon20      0:00.09 [g_eli[0] mirror/swa]
root         426   0.0  0.0       0      16  -  DL   Mon20      0:00.08 [g_eli[1] mirror/swa]
root         427   0.0  0.0       0      16  -  DL   Mon20      0:00.08 [g_eli[2] mirror/swa]
root         428   0.0  0.0       0      16  -  DL   Mon20      0:00.07 [g_eli[3] mirror/swa]
root         431   0.0  0.0       0      16  -  DL   Mon20      0:00.09 [g_eli[0] mirror/swa]
root         432   0.0  0.0       0      16  -  DL   Mon20      0:00.07 [g_eli[1] mirror/swa]
root         433   0.0  0.0       0      16  -  DL   Mon20      0:00.09 [g_eli[2] mirror/swa]
root         434   0.0  0.0       0      16  -  DL   Mon20      0:00.08 [g_eli[3] mirror/swa]
root        1030   0.0  0.0       0      16  -  DL   Mon20      0:00.00 [ftcleanup]
root        1577   0.0  0.0    6556    1384  -  Is   Mon20      0:00.00 /usr/sbin/moused -p /dev/ums0 -t auto -I /var/run/moused.ums0.pid
root        1590   0.0  0.0    9172     844  -  Is   Mon20      0:00.12 /sbin/devd -q
root        1785   0.0  0.0   13532      16  -  I    Mon20      0:00.00 /usr/local/sbin/syslog-ng -p /var/run/syslog.pid
root        1786   0.0  0.0   18968    4808  -  Ss   Mon20      0:12.72 /usr/local/sbin/syslog-ng -p /var/run/syslog.pid
root        1914   0.0  0.0    3520    3584  -  Ss   Mon20      0:22.51 /usr/sbin/watchdogd --pretimeout 5 --pretimeout-action log,printf
root        2210   0.0  0.1   12512   12620  -  Ss   Mon20      0:05.51 /usr/sbin/ntpd -g -c /etc/ntp.conf -p /var/run/ntpd.pid -f /var/db/ntpd.drift
uucp        2312   0.0  0.0    6784    2256  -  Ss   Mon20      0:13.27 /usr/local/libexec/nut/usbhid-ups -a ups
uucp        2314   0.0  0.0   18896    2300  -  Ss   Mon20      0:03.04 /usr/local/sbin/upsd
uucp        2317   0.0  0.0    6580    1240  -  Is   Mon20      0:00.14 /usr/local/bin/upslog -s ups -l /var/log/ups.log -i 300
root        2320   0.0  0.0    6608    2188  -  Is   Mon20      0:00.00 /usr/local/sbin/upsmon localhost
uucp        2322   0.0  0.0    6608     808  -  S    Mon20      0:02.31 /usr/local/sbin/upsmon localhost
root        2632   0.0  0.0   12904    4964  -  Is   Mon20      0:00.00 /usr/local/sbin/sshd
root        2786   0.0  0.0    8656    1416  -  Is   Mon20      0:00.20 /usr/local/sbin/smartd -n -i 1800 -c /usr/local/etc/smartd.conf -p /var/run/smartd-daemon.pid
root        2811   0.0  0.0    6344    1840  -  Is   Mon20      0:00.00 daemon: /usr/local/sbin/pbid[2812] (daemon)
root        2812   0.0  0.0    7096    1020  -  I    Mon20      0:00.28 /bin/sh /usr/local/sbin/pbid
root        3007   0.0  0.0   29304    4508  -  Is   Mon20      0:00.01 nginx: master process /usr/local/sbin/nginx
messagebus  3017   0.0  0.0    7604    2032  -  Is   Mon20      0:00.00 /usr/local/bin/dbus-daemon --system
root        3033   0.0  0.0   11992     944  -  Ss   Mon20      0:01.23 ladvd: parent [priv] (ladvd)
ladvd       3034   0.0  0.0    9944    1128  -  S    Mon20      0:01.08 ladvd: child (ladvd)
root        3070   0.0  0.5  150324   89784  -  S    Mon20      2:00.12 python: alertd (python3.6)
root        3081   0.0  0.5  150752   88236  -  S    Mon20      0:09.98 /usr/local/bin/uwsgi --chdir=/usr/local/www --module=freenasUI.wsgi:application --master --pidfile=/var/run/django.pid --fastcgi-socket=127.0.0.1:9042 -T --threads=15 -L -b 20480 --daemonize=/var/log/uwsgi.log
root        3126   0.0  0.0    6344    1840  -  Is   Mon20      0:00.00 daemon: /usr/bin/env[3127] (daemon)
root        3127   0.0  0.1   33336    9572  -  S    Mon20      0:10.76 /usr/local/bin/consul-alerts start --alert-addr=localhost:8542 --watch-events --watch-checks --log-level=warn
root        3133   0.0  0.0    6344    1972  -  Is   Mon20      0:00.01 daemon: /usr/bin/env[3135] (daemon)
root        3135   0.0  0.1   45332   18984  -  S    Mon20      3:54.03 /usr/local/bin/consul agent -data-dir=/var/tmp/consul -config-dir=/usr/local/etc/consul.d -server -bootstrap-expect=1 -ui -bind=127.0.0.1 -enable-script-checks=true -log-level=warn
root        3482   0.0  0.9  217500  149752  -  I    Mon20     15:23.59 /usr/local/bin/uwsgi --chdir=/usr/local/www --module=freenasUI.wsgi:application --master --pidfile=/var/run/django.pid --fastcgi-socket=127.0.0.1:9042 -T --threads=15 -L -b 20480 --daemonize=/var/log/uwsgi.log
root        3883   0.0  0.0    9000    3048  -  Is   Mon20      0:00.03 /usr/sbin/zfsd
root        3893   0.0  0.0    6496     716  -  Ss   Mon20      0:00.29 /usr/sbin/cron -s
root        4317   0.0  0.1   32912    8496  -  I    Mon20      0:02.06 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /usr/local/bin/consul-alerts watch checks --alert-addr localhost:8542 --log-level warn
root        4318   0.0  0.1   32912    9204  -  I    Mon20      0:00.16 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /usr/local/bin/consul-alerts watch event --alert-addr localhost:8542 --log-level warn
root        4771   0.0  0.0       0      16  -  DL   Mon20      0:19.31 [g_eli[0] gptid/3da6]
root        4772   0.0  0.0       0      16  -  DL   Mon20      0:14.37 [g_eli[1] gptid/3da6]
root        4773   0.0  0.0       0      16  -  DL   Mon20      0:15.58 [g_eli[2] gptid/3da6]
root        4774   0.0  0.0       0      16  -  DL   Mon20      0:13.82 [g_eli[3] gptid/3da6]
root        4776   0.0  0.0       0      16  -  DL   Mon20      0:18.78 [g_eli[0] gptid/3ea7]
root        4777   0.0  0.0       0      16  -  DL   Mon20      0:14.27 [g_eli[1] gptid/3ea7]
root        4778   0.0  0.0       0      16  -  DL   Mon20      0:15.28 [g_eli[2] gptid/3ea7]
root        4779   0.0  0.0       0      16  -  DL   Mon20      0:13.79 [g_eli[3] gptid/3ea7]
root        4781   0.0  0.0       0      16  -  DL   Mon20      0:23.79 [g_eli[0] gptid/aab9]
root        4782   0.0  0.0       0      16  -  DL   Mon20      0:17.49 [g_eli[1] gptid/aab9]
root        4783   0.0  0.0       0      16  -  DL   Mon20      0:18.65 [g_eli[2] gptid/aab9]
root        4784   0.0  0.0       0      16  -  DL   Mon20      0:17.35 [g_eli[3] gptid/aab9]
root        4803   0.0  0.0       0      16  -  DL   Mon20      0:22.04 [g_eli[0] gptid/5b27]
root        4804   0.0  0.0       0      16  -  DL   Mon20      0:16.91 [g_eli[1] gptid/5b27]
root        4805   0.0  0.0       0      16  -  DL   Mon20      0:18.23 [g_eli[2] gptid/5b27]
root        4806   0.0  0.0       0      16  -  DL   Mon20      0:16.83 [g_eli[3] gptid/5b27]
root        6785   0.0  0.2  132748   34852  -  SJ   Mon20      0:10.06 /usr/pbi/syncthing-amd64/bin/python2.7 /usr/pbi/syncthing-amd64/control.py start 10.99.0.75 12346
root        6841   0.0  0.0   37092    8228  -  Ss   Mon20      0:01.29 /usr/local/sbin/nmbd --daemon --configfile=/usr/local/etc/smb4.conf
root        6846   0.0  0.3  174044   54020  -  Is   Mon20      0:01.27 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
root        6852   0.0  0.3   85324   49504  -  Ss   Mon20      0:01.67 /usr/local/sbin/winbindd --daemon --configfile=/usr/local/etc/smb4.conf
root        6853   0.0  0.1   46684    9648  -  I    Mon20      0:00.26 /usr/local/sbin/winbindd --daemon --configfile=/usr/local/etc/smb4.conf
root        6868   0.0  0.1  130604   12280  -  S    Mon20      0:01.09 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
root        6874   0.0  0.1  130956   12612  -  S    Mon20      0:00.11 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
nobody      6905   0.0  0.0    7144    2744  -  Ss   Mon20      0:06.69 /usr/local/sbin/mdnsd
root        6907   0.0  0.1   45324    8800  -  S    Mon20      0:00.15 /usr/local/sbin/winbindd --daemon --configfile=/usr/local/etc/smb4.conf
root        6908   0.0  0.3   87108   49612  -  S    Mon20      0:00.18 /usr/local/sbin/winbindd --daemon --configfile=/usr/local/etc/smb4.conf
root        7495   0.0  0.0   10464     912  -  SsJ  Mon20      0:00.52 /usr/sbin/syslogd -s
syncthing   8057   0.0  0.0   10428      16  -  IsJ  Mon20      0:00.00 daemon: /usr/pbi/syncthing-amd64/bin/ST/syncthing[8058] (daemon)
syncthing   8058   0.0  0.0  119068    5916  -  IJ   Mon20      0:02.02 /usr/pbi/syncthing-amd64/bin/ST/syncthing -home=/var/db/syncthing -no-browser
root        8066   0.0  0.0   12564     436  -  SsJ  Mon20      0:00.26 /usr/sbin/cron -s
syncthing   8091   0.0  2.5  463796  422872  -  INJ  Mon20      5:33.63 /usr/pbi/syncthing-amd64/bin/ST/syncthing -home=/var/db/syncthing -no-browser
root        8613   0.0  0.2  132748   35496  -  SJ   Mon20      0:10.02 /usr/pbi/plexmediaserver-amd64/bin/python2.7 /usr/pbi/plexmediaserver-amd64/control.py start 10.99.0.76 12347
root        8694   0.0  0.0   10464    1500  -  SsJ  Mon20      0:00.35 /usr/sbin/syslogd -s
972         8786   0.0  0.9  292136  146360  -  IsJ  Mon20      4:39.54 /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Plex_Media_Server (Plex Media Server)
root        8794   0.0  0.0   12564     532  -  SsJ  Mon20      0:00.29 /usr/sbin/cron -s
root        8852   0.0  0.1  122288   23092  -  Ss   Mon20      2:57.13 /usr/local/sbin/collectd
root        8854   0.0  0.3  183888   57784  -  I    Mon20      1:32.81 /usr/local/sbin/smbd --daemon --configfile=/usr/local/etc/smb4.conf
972         8858   0.0  0.5  203368   77220  -  INJ  Mon20      4:27.24 Plex Plug-in [com.plexapp.system] /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/Framework.bundle/Contents/Resources/Versions/2/Python/bootstrap.py --server-version 1.12.2.4929-29f6d1796 /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/System.bundle (Plex Script Host)972         9063   0.0  0.3  165832   42096  -  IJ   Mon20      0:41.49 /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Plex DLNA Server
root        9088   0.0  0.1   26756   19104  -  S    Mon20      0:12.07 python /usr/local/libexec/novnc/utils/websockify/run --web /usr/local/libexec/novnc/ --wrap-mode=ignore 10.99.0.15:9891 10.99.0.15:9991 (python3.6)
www        33956   0.0  0.0   31352    6976  -  I    00:00      0:04.88 nginx: worker process (nginx)
root       45233   0.0  0.0   13216    6820  -  Is   Tue06      0:00.19 sshd: jonathan [priv] (sshd)
jonathan   45239   0.0  0.0   13216    6716  -  S    Tue06      0:01.15 sshd: jonathan@pts/0 (sshd)
root       45682   0.0  0.3   85324   49372  -  I    Tue06      0:00.12 /usr/local/sbin/winbindd --daemon --configfile=/usr/local/etc/smb4.conf
root       50690   0.0  0.0   12564    1408  -  IJ   03:01      0:00.00 cron: running job (cron)
root       50699   0.0  0.0   13144    1904  -  IsJ  03:01      0:00.00 /bin/sh - /usr/sbin/periodic daily
root       50738   0.0  0.0    6244    1452  -  IJ   03:01      0:00.00 lockf -t 0 /var/run/periodic.daily.lock /bin/sh /usr/sbin/periodic LOCKED daily
root       50740   0.0  0.0   13144    1904  -  IJ   03:01      0:00.00 /bin/sh /usr/sbin/periodic LOCKED daily
root       50760   0.0  0.0   13144    1912  -  IJ   03:01      0:00.00 /bin/sh /usr/sbin/periodic LOCKED daily
root       50761   0.0  0.0    8392    1748  -  IJ   03:01      0:00.00 mail -E -s plexmediaserver_1 daily run output root
root       51339   0.0  0.0   13144    1920  -  IJ   03:07      0:00.00 /bin/sh /etc/periodic/daily/450.status-security
root       51340   0.0  0.0   13144    1916  -  IJ   03:07      0:00.00 /bin/sh - /usr/sbin/periodic security
root       51342   0.0  0.0    6244    1464  -  IJ   03:07      0:00.00 lockf -t 0 /var/run/periodic.security.lock /bin/sh /usr/sbin/periodic LOCKED security
root       51343   0.0  0.0   13144    1916  -  IJ   03:07      0:00.00 /bin/sh /usr/sbin/periodic LOCKED security
root       51350   0.0  0.0   13144    1924  -  IJ   03:07      0:00.00 /bin/sh /usr/sbin/periodic LOCKED security
root       51351   0.0  0.0    8392    1636  -  IJ   03:07      0:00.00 mail -E -s plexmediaserver_1 daily security run output root
root       58630   0.0  0.0   13144    1896  -  IJ   06:13      0:00.00 /bin/sh /etc/periodic/daily/480.leapfile-ntpd
root       58633   0.0  0.0    6244    1436  -  IJ   06:13      0:00.00 sleep 51501
root       63723   0.0  0.0   13144    1936  -  IJ   08:23      0:00.00 /bin/sh - /etc/periodic/security/110.neggrpperm
root       63808   0.0  0.0   13144    1936  -  IJ   08:23      0:00.00 /bin/sh - /etc/periodic/security/110.neggrpperm
root       63809   0.0  0.0   10388    3640  -  DJ   08:23      0:02.86 find -sx / /dev/null -type f ( ( ! -perm +010 -and -perm +001 ) -or ( ! -perm +020 -and -perm +002 ) -or ( ! -perm +040 -and -perm +004 ) ) -exec ls -liTd {} +
root       63810   0.0  0.0    8292    1512  -  IJ   08:23      0:00.00 tee /dev/stderr
root       63811   0.0  0.0   12592    1680  -  IJ   08:23      0:00.00 wc -l
972        68151   0.0  0.3  172496   48984  -  IJ   10:17      0:07.47 Plex Plug-in [com.plexapp.agents.lastfm] /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/Framework.bundle/Contents/Resources/Versions/2/Python/bootstrap.py --server-version 1.12.2.4929-29f6d1796 /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/LastFM.bundle (Plex Script Host)
972        68156   0.0  0.3  176592   56640  -  IJ   10:18      0:03.03 Plex Plug-in [com.plexapp.agents.localmedia] /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/Framework.bundle/Contents/Resources/Versions/2/Python/bootstrap.py --server-version 1.12.2.4929-29f6d1796 /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/LocalMedia.bundle (Plex Script Host)
972        68218   0.0  0.3  166864   45084  -  IJ   10:20      0:02.80 Plex Plug-in [com.plexapp.agents.plexmusic] /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/Framework.bundle/Contents/Resources/Versions/2/Python/bootstrap.py --server-version 1.12.2.4929-29f6d1796 /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/PlexMusic.bundle (Plex Script Host)
972        68438   0.0  0.2  166608   40572  -  IJ   10:22      0:01.74 Plex Plug-in [com.plexapp.agents.htbackdrops] /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/Framework.bundle/Contents/Resources/Versions/2/Python/bootstrap.py --server-version 1.12.2.4929-29f6d1796 /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/HTbackdrops.bundle (Plex Script Host)
972        68542   0.0  0.3  166480   42748  -  IJ   10:26      0:02.18 Plex Plug-in [com.plexapp.agents.lyricfind] /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/Framework.bundle/Contents/Resources/Versions/2/Python/bootstrap.py --server-version 1.12.2.4929-29f6d1796 /usr/pbi/plexmediaserver-amd64/share/plexmediaserver/Resources/Plug-ins-29f6d1796/LyricFind.bundle (Plex Script Host)
root       69265   0.0  0.0   13216    7660  -  Is   10:44      0:00.01 sshd: jonathan [priv] (sshd)
jonathan   69270   0.0  0.0   13216    7680  -  S    10:44      0:00.01 sshd: jonathan@pts/1 (sshd)
root       69282   0.0  0.0    4192    1996  -  I    10:44      0:00.00 sleep 300
root       69316   0.0  0.0     468     376  -  D    10:46      0:00.00 /bin/sh -c /sbin/zpool status -x freenas-boot
root        4109   0.0  0.3   72724   55372 v0  Is+  Mon20      0:01.90 /usr/local/bin/python /etc/netcli (python3.6)
root        4110   0.0  0.0    6364    1820 v1  Is+  Mon20      0:00.00 /usr/libexec/getty Pc ttyv1
root        4111   0.0  0.0    6364    1820 v2  Is+  Mon20      0:00.00 /usr/libexec/getty Pc ttyv2
root        4112   0.0  0.0    6364    1820 v3  Is+  Mon20      0:00.00 /usr/libexec/getty Pc ttyv3
root        4113   0.0  0.0    6364    1820 v4  Is+  Mon20      0:00.00 /usr/libexec/getty Pc ttyv4
root        4114   0.0  0.0    6364    1820 v5  Is+  Mon20      0:00.00 /usr/libexec/getty Pc ttyv5
root        4115   0.0  0.0    6364    1820 v6  Is+  Mon20      0:00.00 /usr/libexec/getty Pc ttyv6
root        4116   0.0  0.0    6364    1820 v7  Is+  Mon20      0:00.00 /usr/libexec/getty Pc ttyv7
jonathan   45240   0.0  0.0    7480    3440  0  Is   Tue06      0:00.01 -csh (csh)
root       59973   0.0  0.0   12008    7296  0  I    06:48      0:00.01 sudo su
root       59974   0.0  0.0    6964    2868  0  I    06:48      0:00.00 su
root       59975   0.0  0.0    7480    3664  0  I+   06:48      0:00.01 _su (csh)
jonathan   69271   0.0  0.0    7480    3748  1  Ss   10:44      0:00.01 -csh (csh)
root       69314   0.0  0.0   12008    7308  1  S+   10:45      0:00.01 sudo ps axuww
root       69322   0.0  0.0    6948    2924  1  R+   10:46      0:00.00 ps axuww
% 

#20 Updated by William Grzybowski almost 2 years ago

Thats very strange you still have the "find" process running.

Can you run it manually so we can figure out how long it takes and why it takes so long?

  1. time find -sx / /dev/null -type f > /tmp/find

And then attach /tmp/find as well with the time results?

#21 Updated by William Grzybowski almost 2 years ago

  • Target version changed from Backlog to TrueNAS 11.1-U6.2
  • Severity changed from New to Med High

#22 Updated by Bug Clerk almost 2 years ago

  • Status changed from Blocked to In Progress

#23 Updated by Jonathan Caron almost 2 years ago

  • File find added

Here's the command output:

root@nas:/mnt/sol/home/jonathan # time find -sx / /dev/null -type f > /tmp/find 
0.299u 1.478s 2:10:52.70 0.0%   75+197k 4990+0io 11pf+0w

I've attached the file "find".

Note that the system was super slow while this was running, but only in freenas function, my VM was working properly.
After the command, the system is slower than it should (i've checked that i the find -sx was not running.

thank you

Jonathan

#24 Updated by William Grzybowski almost 2 years ago

Jonathan Caron wrote:

Here's the command output:

[...]

I've attached the file "find".

Note that the system was super slow while this was running, but only in freenas function, my VM was working properly.
After the command, the system is slower than it should (i've checked that i the find -sx was not running.

thank you

Jonathan

Over two hours to list all files in root is way too much (it only takes 7 seconds on my machine), there is definitely something going on here hardware wise.

KINGSTON SA400S37120G SBFK71E0, how old is that device? Is it really healthy? To what SATA port is it connected to?

#25 Updated by Jonathan Caron almost 2 years ago

William Grzybowski wrote:

Jonathan Caron wrote:

Here's the command output:

[...]

I've attached the file "find".

Note that the system was super slow while this was running, but only in freenas function, my VM was working properly.
After the command, the system is slower than it should (i've checked that i the find -sx was not running.

thank you

Jonathan

Over two hours to list all files in root is way too much (it only takes 7 seconds on my machine), there is definitely something going on here hardware wise.

KINGSTON SA400S37120G SBFK71E0, how old is that device? Is it really healthy? To what SATA port is it connected to?

Device is brand new, received on sunday, connected on monday. It's connected to the motherboard SATA port. I didn't test it. That port was used before to connect one of my storage drive (that failed or was being inconsistent)
Considering i had the same issue on the usb key, that's odd it transferred. And I agree, when I looked at the file, it's basically a directory tree list, that should not take 2 hours, not on modern ssd.

Let me know what test you'd like me to run, i can pull the disk out, wipe it, test it, re-install freenas and restore the backup.
In the meanwhile, I'll move the drive to a different SATA port, just to be sure it's not defective.

thank you

Jonathan

#26 Updated by Jonathan Caron almost 2 years ago

Before moving hardware, tried a reboot first

here's the command output

root@nas:/mnt/sol/home/jonathan # time find -sx / /dev/null -type f > /tmp/find
0.164u 0.651s 0:02.79 29.0%    66+174k 12680+0io 0pf+0w

So less than 7 sec, that's in line with what you got. No other changes.

Jonathan

#27 Updated by William Grzybowski almost 2 years ago

Jonathan Caron wrote:

Before moving hardware, tried a reboot first

here's the command output
[...]

So less than 7 sec, that's in line with what you got. No other changes.

Jonathan

Thats very strange, maybe something else is making high usage of the disk, or something else is triggering that behavior.

Once that happens again lets try monitor which processes are doing most IO (top -m io)

#28 Updated by Jonathan Caron almost 2 years ago

Strange is not over...

I think i found the culprit but it make no sense to me...
Looks like it's a plex jail, when set to index the library based on time, just goes stupid. I totally get the IO required, but that's in a jail and jail are hosted on my main volume, not on the boot device.
As soon as i start it with that parameter, the time find takes forever to complete, stop the jail, problem goes away.

When running a jail, is there any writes on the boot device? I do have writes from the bhyve hypervisor for a very small vm and it's not degrading it. Something odd is going on.

Let me know if you have ideas, i will most likely just move plex from a jail to a vm.

Jonathan

#29 Updated by William Grzybowski almost 2 years ago

Interesting finding.

I have consulted with our OS manager and he thinks this is directly related to the number of files open by some process.
There are 129395 open files and that seems to be causing vnode recycling issues.
My guess its the plex process.

Can you get us the output of "fstat" while experiencing that behavior?

#30 Updated by Jonathan Caron almost 2 years ago

  • File 2018.09.27.fstat added

Hello,

Got the following

fstat: kinfo_getfile(): Device busy

in the messages, got a bunch of new things

Sep 27 16:30:21 nas /alert.py: [freenasOS.Manifest:452] Could not get CRL file http://update-master.ixsystems.com/updates/ix_crl.pem
Sep 27 16:30:22 nas /alert.py: [freenasOS.Configuration:772] Unable to load http://update-master.ixsystems.com/updates/ix_crl.pem: <urlopen error [Errno 8] hostname nor servname provided, or not known>
Sep 27 16:30:22 nas /alert.py: [freenasOS.Configuration:786] Unable to load ['http://update-master.ixsystems.com/updates/ix_crl.pem']: <urlopen error [Errno 8] hostname nor servname provided, or not known>

output of fstat attached

#31 Updated by Dru Lavigne almost 2 years ago

  • File deleted (11.1U6-debug-nas-20180914002443.tgz)

#32 Updated by Dru Lavigne almost 2 years ago

  • File deleted (11.1U5-debug-nas-20180915213122.tgz)

#33 Updated by Dru Lavigne almost 2 years ago

  • File deleted (11.1U5-nonfunctional-debug-nas-20180918113327.tgz)

#34 Updated by Dru Lavigne almost 2 years ago

  • File deleted (debug-nas-20180925111025.tgz)

#35 Updated by Dru Lavigne almost 2 years ago

  • File deleted (debug-nas-20180925205436.tgz)

#36 Updated by Dru Lavigne almost 2 years ago

  • File deleted (find)

#37 Updated by Dru Lavigne almost 2 years ago

  • File deleted (2018.09.27.fstat)

#38 Updated by Dru Lavigne almost 2 years ago

  • Subject changed from samba4.Samba4Alert after upgrade to 11.1U6, still there after roll back to U5 to Disable loop debug
  • Status changed from In Progress to Ready for Testing
  • Private changed from Yes to No
  • Needs Doc changed from Yes to No
  • Needs Merging changed from Yes to No

#39 Updated by Dru Lavigne almost 2 years ago

#40 Updated by Dru Lavigne almost 2 years ago

  • Target version changed from TrueNAS 11.1-U6.2 to TrueNAS-11.1-U6.1

#41 Updated by Dru Lavigne almost 2 years ago

  • Reason for Blocked deleted (Need additional information from Author)

#42 Updated by Jonathan Caron almost 2 years ago

Dru Lavigne wrote:

11.1-stable PR: https://github.com/freenas/freenas/pull/1831

Hello,

I'm not familiar with github/pull request, how do i go about testing it?
I also saw this was flagged for truenas instead of freenas?

thx

Jonathan

#43 Updated by William Grzybowski almost 2 years ago

Jonathan Caron wrote:

Dru Lavigne wrote:

11.1-stable PR: https://github.com/freenas/freenas/pull/1831

Hello,

I'm not familiar with github/pull request, how do i go about testing it?
I also saw this was flagged for truenas instead of freenas?

thx

Jonathan

This is only for our internal release process, dont worry about it.
We believe we have fixed time out and connection handshake messages however the system will still experience the filesystem weirdness behavior.

#44 Updated by William Grzybowski almost 2 years ago

Can you please add the fstat again? The file was deleted by mistake.

#46 Updated by Jonathan Caron almost 2 years ago

as requested, this is the original file, if you need a new one let me know, i'll do it tonight.

#47 Updated by William Grzybowski almost 2 years ago

Jonathan Caron wrote:

as requested, this is the original file, if you need a new one let me know, i'll do it tonight.

It seems like syncthing is the culprit here, with 128703 open files!
So moving Plex to a VM wouldnt help anything.

#48 Updated by Jonathan Caron almost 2 years ago

Ok i'll test it this weekend, i guess plex is pushing it over. I'll try only with plex with syncthing off and i'll see what happen.
I'll most likely end up moving both to VMs. i'll report back.

thx!

Let me know if you need anything else.

#49 Updated by William Grzybowski almost 2 years ago

  • Status changed from Ready for Testing to Done
  • Needs QA changed from Yes to No

#51 Updated by Jonathan Caron almost 2 years ago

Hello, confirmed, syncthing is the bad jail...

to compare, here's the result of the find

root@nas:/mnt/sol/home/jonathan # time find -sx / /dev/null -type f > /tmp/find
0.141u 0.464s 0:00.60 100.0%    66+174k 0+0io 0pf+0w

Under a second compared to 4-6 in normal mode.

I'll go look in synchting forums, but i'll probably end up moving both jails to vms.

thank you a lot for your help in figuring this out, should the syncthing bug/behavior be reported somewhere in freenas? or because it's a plugin it's not managed?j

Jonathan

#52 Updated by William Grzybowski almost 2 years ago

Jonathan Caron wrote:

Hello, confirmed, syncthing is the bad jail...

to compare, here's the result of the find

[...]

Under a second compared to 4-6 in normal mode.

I'll go look in synchting forums, but i'll probably end up moving both jails to vms.

thank you a lot for your help in figuring this out, should the syncthing bug/behavior be reported somewhere in freenas? or because it's a plugin it's not managed?j

Jonathan

I suppose we will raise the issue to syncthing upstream, there isnt much else we can do for this particular problem.

Thanks for the follow up.

#53 Updated by Jonathan Caron almost 2 years ago

Hi, this will most likely be my last update, found out that syncthing has a feature to watch sync'd folders. I guess it keep a file pointer to watch for changes. I tried disabling it, the number of entries in fstat went from 128000 to 208.

Thank you again for your help, i've learnt quite a few things and managed to figure out the root cause of this.

#54 Updated by William Grzybowski over 1 year ago

  • Has duplicate Bug #44229: System alert 393 samba4.samba4alert failed handshake added

Also available in: Atom PDF