Samba sometimes goes in an infinite loop
Here are the facts: sometimes Samba goes into an infinite loop (it happened when a client tried to reach a directory on the CIFS share from a Windows client). It starts using all the CPU resources (one process seems to use half of them, and sometimes two processes do this at the same time), and even stopping the CIFS service via the web interface doesn't work.
Here are some screenshots:
The only thing that gets my server back to normal is to kill -9 the Samba process(es) and then start them again. This can happen multiple times a day, which is very frustrating, and I wonder whether my low amount of RAM (1 GB) could be the cause. But even with little memory, this shouldn't happen. An older computer with even less RAM, running Ubuntu with several Samba shares, doesn't have as many issues as I get now using Samba under FreeNAS...
In fact, I have been experiencing many lag issues when accessing the shares from a network client ever since version 9.1.1 (the first one I used).
I'm currently ordering more RAM for my server, but still... it shouldn't be this buggy.
Thank you for your understanding, and thank you for the great work already accomplished.
#2 Updated by Ben Ringham over 5 years ago
I've also been experiencing this behaviour in 18.104.22.168 and in previous releases; the only way I've been able to resolve it is also using kill -9 on the smbd processes.
I can't work out what is causing it, and it happens at least once a day. I don't think it's related to a low amount of memory, as I am using 32 GB.
Below is a sample top output.
22270 root 2 103 0 367M 33916K CPU1 1 665:21 100.00% smbd
18751 root 2 102 0 363M 30568K CPU3 3 645:10 100.00% smbd
I would be grateful if anyone has any insight as to what might be causing this, or how to find out.
#3 Updated by Anonymous over 5 years ago
- Subject changed from Samba seomtimes goes in an infinite loop to Samba sometimes goes in an infinite loop
We need to set up Samba for coredumps, then hit it with SIGABRT so it drops a core. We can then open the core and see where it stopped when the signal came in.
You will want to update to 22.214.171.124 so the crash reporter is disabled, and change the core dump path to the pool so there's enough space to write the core (the default location is on /data, which is far too small).
To set the coredump location somewhere with enough space, run this on the CLI:
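The command itself appears to be missing from this comment. On FreeBSD the core dump path is controlled by the kern.corefile sysctl, so it was presumably something along these lines; the path shown here is a guess based on the /var/db/samba4/smbd.core location mentioned in the next step:

```shell
# Plausible reconstruction -- the original command is missing from
# this comment. kern.corefile sets FreeBSD's core dump path template;
# %N expands to the name of the crashing process (e.g. smbd.core).
sysctl kern.corefile=/var/db/samba4/%N.core
```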
Once set, provoke the 100% CPU problem, then run from the CLI:
killall -ABRT smbd
The smbd processes should exit and write /var/db/samba4/smbd.core. Zip up this file and attach it to the ticket, and we can inspect it to see what Samba was up to.
Added link to #4285 since pegged CPU mentioned in that bug.
#4 Updated by Kwok Yeung over 5 years ago
I have also been experiencing the same problem with smbd getting stuck at 100% CPU ever since the move to Samba 4.
I don't have a corefile at the moment, but the output of dtruss showed this repeating infinitely:
kevent(0x22, 0x8167C3000, 0x52D) = -1 Err#9
I think this might be an instance of bug 9879 (https://bugzilla.samba.org/show_bug.cgi?id=9879) on the Samba bug tracker?
#5 Updated by Anonymous over 5 years ago
That's a valuable clue. Errno 9 is EBADF ("Bad file descriptor"), so either a connection disconnected, a file went away or had an I/O issue, or the event management lost track of a file descriptor and fed a closed/invalid one to kevent. If the error keeps recurring, then either the descriptor is stuck in the kqueue and the event management should remove it and run cleanup, or it's a bug elsewhere that needs to be fixed.
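The failure mode can be illustrated with a small sketch. kevent()/kqueue is BSD-specific, so this portable Python example (an illustration only, not Samba's actual code) uses select(), which fails with the same errno when handed a descriptor that has already been closed:

```python
import errno
import os
import select

# Simulate an event loop holding on to a descriptor that has already
# gone away: open a pipe, then close the read end behind the loop's back.
r, w = os.pipe()
os.close(r)

caught = None
try:
    # Polling the dead descriptor fails immediately with EBADF (errno 9),
    # the same Err#9 seen repeating in the dtruss output. An event loop
    # that never removes the descriptor hits this error on every iteration.
    select.select([r], [], [], 0)
except OSError as e:
    caught = e.errno

print("errno", caught, "->", errno.errorcode[caught])  # errno 9 -> EBADF
os.close(w)
```

If the event management cleaned up properly, it would drop the descriptor from its set on the first EBADF instead of retrying it forever, which is what makes the loop spin at 100% CPU.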
#11 Updated by Jérémy Lecocq over 5 years ago
Sorry, I couldn't really update this ticket over the last few days. I see that some other people have uploaded their smbd.core files. Is mine still required? I'm going to set up the variable in case it happens again.
I noticed something else: Samba is far less buggy when no torrents are active in my Transmission jail. I'm still running an old version of Transmission (2.77_1), and I accidentally left all the torrents paused for the last 18 hours. The result was that Samba answered any request to the share really quickly and never went back into an infinite loop... I don't know if the two facts are related.
Thank you very much for the help.
#12 Updated by Anonymous over 5 years ago
No problem. I think we are on to a potential fix. If we get a fix developed and committed, we may want you to apply a nightly and then try to provoke the problem again. I doubt Transmission is directly involved, but it could be causing I/O latency/pressure or moving files around on disk behind Samba, which might increase the chances of something bad happening.
#14 Updated by Jérémy Lecocq over 5 years ago
Okidoki! Just tell me when the nightly is done and I'll try it. ;D I can't really provoke the bug, but I can create some I/O pressure and see if anything bad happens. For example, I had no problem in the last 24 hours... and sometimes I get it multiple times a day. It's a bit random.
Thank you for the great job. ;D
#15 Updated by Kwok Yeung over 5 years ago
I think the key factor is churn: files coming and going. I have two FreeNAS boxes; the one serving my home directories and redirected Windows folders exhibits the problem nearly constantly (to the point of being unusable), whereas I have never seen it on the other one, which only serves large media files that are rarely updated.
#18 Updated by Kenneth Langga over 5 years ago
I'm getting this error when trying to upgrade via the GUI.
Environment:

Software Version: FreeNAS-126.96.36.199-RELEASE-x64 (ce022f0)
Request Method: POST
Request URL: https://freenas.dream.upd.edu.ph/system/firmwizard/?X-Progress-ID=1a83dd6f-78bb-47ac-833f-72757588178c

Traceback:
File "/usr/local/lib/python2.7/site-packages/django/core/handlers/base.py" in get_response
  107. response = middleware_method(request, callback, callback_args, callback_kwargs)
File "/usr/local/www/freenasUI/../freenasUI/freeadmin/middleware.py" in process_view
  158. return login_required(view_func)(request, *view_args, **view_kwargs)
File "/usr/local/lib/python2.7/site-packages/django/contrib/auth/decorators.py" in _wrapped_view
  22. return view_func(request, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/django/views/generic/base.py" in view
  69. return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/django/contrib/formtools/wizard/views.py" in dispatch
  236. response = super(WizardView, self).dispatch(request, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/django/views/generic/base.py" in dispatch
  87. return handler(request, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/django/contrib/formtools/wizard/views.py" in post
  291. self.storage.set_step_data(self.steps.current, self.process_step(form))
File "/usr/local/www/freenasUI/../freenasUI/system/forms.py" in process_step
  125. wizard=self)

Exception Type: TypeError at /system/firmwizard/
Exception Value: done() got an unexpected keyword argument 'form_list'
#19 Updated by Philip Robar over 5 years ago
I had the same problem as reported in this bug.
I updated to FreeNAS-9.2.2-ALPHA-202bef9-x64 (2014-03-13 nightly) and the problem seems to have gone away. I tested with Helios LanTest on both OS X 10.9.2 (using SMB2) and Windows 8.1 clients, individually and at the same time. Using the Gigabit setting, LanTest does the following: create, open/close, and remove 300 files; write/read a 300 MB file; lock/unlock 16,000 times; and read a directory with 640 files. I ran the test for 5 loops on each client.
The number of smbd processes I could see in the Running Processes window remained constant at 3 to 4. As expected, only one or two smbd processes started to use the CPU. CPU usage ramped up as expected while the test was running and then fell back to zero soon after the test finished. Memory usage remained constant throughout the test (2.3 GB free out of 4 GB total). Load peaked at around 1 when testing with both clients at the same time, using a 2.2 GHz Celeron 450, a Realtek RTL8111D NIC, and a WD My Book USB 3 disk (with USB 3 enabled) as the target disk.
Let me know if you would like me to do some other type of testing.
#20 Updated by Philip Robar over 5 years ago
- File smbd.core.zip added
I have to withdraw my statement that this is fixed.
After watching several TV shows and a movie on both my Mac and Windows 8.1 machines, I noticed that the tens-of-seconds delay when deleting a file via the Windows client was back. I then watched another TV show and got the same long delay when I deleted it. Then I noticed that the first show hadn't actually been deleted. I sshed into my server and saw that neither show had been deleted.
I opened a browser window to my server and opened the Running Processes window and found 15 smbd processes all using 5 to 6 percent of the CPU and a load of 15. Stopping the CIFS service didn't kill the smbd processes, so I did a killall -9 which seemed to work, but when I started the service back up and copied a file over from my Mac two smbd processes started using 50% of the CPU each. After a couple more file copies I had 4 smbd processes, each using 25% of the CPU.
So I detached my USB 3 drive (because FreeNAS/FreeBSD hangs during shutdown when detaching it), rebooted, and imported the drive again. The next day I copied a couple of TV shows over to the server from my Mac, and an smbd process went to 100% of the CPU and stayed there. I shut down CIFS and killed the smbd process, restarted CIFS, and copied/deleted a couple of gigabyte-sized files over to the server from Windows. A little while later I had 7 smbd processes with 100% of the CPU divided up between them. I've attached the core dump from a killall -ABRT of smbd.
The only thing interesting I've noticed in the console window are these entries from the day before I noticed the problem again.
Mar 14 17:53:07 freenas kernel: pid 20160 (smbd), uid 1001: exited on signal 11
Mar 14 19:29:52 freenas kernel: pid 23630 (smbd), uid 1001: exited on signal 11
Mar 14 19:30:02 freenas kernel: pid 23770 (smbd), uid 1001: exited on signal 11
I'm going to upgrade to the 2014-03-15 FreeNAS-9.2.2-ALPHA-776c3f8-x64 nightly. Let me know if there's something else I can do.
In case it matters here's my server:
2.2 GHz Celeron 450 (Conroe-L)
4 GB DDR2 non-ECC RAM
Intel DG41TY motherboard
Realtek RTL8111D NIC
Bytec BT-PEU310 USB 3 PCIe card
WD My Book 3 TB USB 3 disk
#21 Updated by Philip Robar over 5 years ago
Update using the 2014-03-15 FreeNAS-9.2.2-ALPHA-776c3f8-x64 nightly: if I've just mounted an SMB share from my server, everything is OK. I can also run the Helios LanTest I mentioned above repeatedly without a problem, but as soon as I do something as simple as browsing a directory or two from either Windows 8.1 or OS X 10.9.2, smbd pegs at 100% and just stays there.
#22 Updated by Anthony Takata over 5 years ago
- File smbd.7z added
I'm using FreeBSD 9.2-RELEASE-p3 (FREENAS.i386) #0 r262572+38751c8: Thu Mar 20 21:12:09 PDT 2014
This issue occurs on my Windows 7 x64 SP1 machine, but not on my Windows 8.1 x64 machine.
All I need to do is browse around (effectively listing directories on a mounted share) and I can get a smbd process to lock up.
Interesting line related to the stuck process from smbstatus:
4383 65534 DENY_NONE 0x100081 RDONLY NONE /mnt/external/DISK1 . Fri Mar 21 21:47:57 2014
Also, the computer that's having issues is using a known user; all of the other (working) machines are using failed logins (i.e., guest access).
I am unable to obtain a core file by sending the signal to just the process that's taking up cycles.
However, using the killall command did result in a core file being generated, which I have attached.
I'm going to try and force guest access for the affected PC and see if that's a factor or not.
#23 Updated by Anthony Takata over 5 years ago
- File log.smbd.txt added
Also, attaching some interesting lines from log.smbd
It seems the issue happens whenever Windows requests change notifications.
With this in mind, I set
change notify = no
in the service config under Auxiliary parameters, and the issue appears fixed!
Further testing will be needed, but this discovery is making me kinda excited!
#24 Updated by Mike L over 5 years ago
- File smbd.core.7z added
I'm also getting random 100% CPU usage that requires killing the process; core attached (hopefully I did this right). Running FreeNAS-188.8.131.52-RELEASE-x64 (dc0c46b).
Harry's change notify = no suggestion did not appear to work for me; I had another instance of 100% CPU pop up after I tried it.
#26 Updated by Anonymous over 5 years ago
For people who can trigger the 100% CPU use problem, please try adding this to your service Additional Parameters section, restart CIFS, then try to reproduce the issue:
kernel change notify = no
This is different from "change notify", which controls the client side of this feature but may not entirely disable the offending code.
Note that this change may have a negative performance impact; if it's particularly noticeable, please let us know.
If you test changes that we make to address the underlying issue, please be sure to remove this option so the code we want to test is actually tested.
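For reference, the auxiliary parameter corresponds to a global-section setting in the generated smb.conf along these lines (a sketch only; on FreeNAS you enter it through the GUI's auxiliary parameters field rather than editing smb.conf directly):

```
[global]
    kernel change notify = no
```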
#28 Updated by Chris Lewis over 5 years ago
Yes, that parameter has stopped the 100% CPU use. It only happened when browsing FreeNAS shares using Explorer (on Windows 7); copying files from the command line using robocopy did not trigger the issue. Anyway, I have not been able to reproduce the problem since making the change. I'm running FreeNAS-184.108.40.206-RELEASE-x86. Thanks for all your work.