X8DT3 with two external JBOD chassis.
We are running an 11 nightly build on one of our production systems. For the second time, the FreeNAS box hung: all disk I/O stalled, and we could not even log in to the server via SSH. The network stack was still running; it answered pings and TCP connections could be established (TCP port 22), but the login process never completed. On the console we could scroll back through the console buffer but could not log in; even the FreeBSD login prompt did not reappear after pressing Enter. The only way to recover was to power cycle the server. I have not encountered this kind of lockup before, and I would like to ask for ideas on how to move forward in catching this issue.
The server is a Supermicro X8DT3 board with 192G of RAM, 16 disks inside the chassis forming one ZFS pool, and two JBOD chassis attached through an external SAS HBA with two SAS2008 controllers on it.
However, the console gives no indication that the external SAS connection has any problems, as nothing is logged. If I merely remove a disk I get the relevant messages, but during these hangs nothing is printed.
In the external chassis there are 24 disks forming a 4x6 raidz2 ZFS pool, with two SSDs partitioned as log+cache. The logs are mirrored, the cache is striped. (I know this is not recommended, but it should only affect performance, not cause a lockup.) The server serves NFS for Xen hosts, serves some iSCSI shares, receives ZFS replications from other boxes, and also sends replications of its own datasets.
Unfortunately that is all I can provide right now. I could not even force the server to panic and produce a crash dump, as it did not give me a console.
We had issues with this box earlier, as the ARC shrank each day to nearly half of its size; according to the kernel code and bug reports, that was normal and was caused by kernel memory fragmentation. Could this problem be related to memory fragmentation? I have now set vfs.zfs.arc_free_target to 16G to keep more memory free, in case this is related.
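For reference, a small sketch of the arithmetic behind that setting. As far as I know, vfs.zfs.arc_free_target is expressed in pages rather than bytes on FreeBSD, so a size like 16G has to be converted first. The 4 KiB page size and the helper name bytes_to_pages are assumptions for illustration; the real page size can be checked with `sysctl hw.pagesize` on the box.

```python
# Assumption: 4 KiB pages; verify with `sysctl hw.pagesize` on the server.
PAGE_SIZE = 4096
GIB = 1024 ** 3


def bytes_to_pages(size_bytes, page_size=PAGE_SIZE):
    """Convert a byte count to a whole number of pages, rounding up."""
    return -(-size_bytes // page_size)


# Hypothetical helper usage: the page count corresponding to 16 GiB.
target_pages = bytes_to_pages(16 * GIB)
print(f"vfs.zfs.arc_free_target={target_pages}")
```

The printed line is only a candidate value to review before applying it as a tunable; it is not taken from the affected system.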
What can I try next time?
#2 Updated by Richard Kojedzinszky almost 2 years ago
The system ran for weeks without being used for production, and then we decided to migrate our services to it. Unfortunately, we cannot afford to mirror all our production services in order to simulate them in a test environment, so the effects of real load only show up after the migration.
So, we have investigated further, and it seems that for some reason FreeNAS started using its swap partitions heavily; see the attachment. You will see that on Monday at 12:00 pm very heavy swap activity began and lasted until the box locked up. The same happened again two days later. In between there was also minimal swap activity, but since the last lockup there has been none, because of the ZFS ARC free target setting. The box is just serving NFS and iSCSI; there are no jails, no Samba, no FTP, and no other activities. It just creates snapshots, sends them, and also receives them from other boxes. So at 12:00 pm on Monday nothing but normal service load should have been happening. The ARC free target is now at 8GB, and we still see no swap activity so far. But I assume such behaviour should never occur.
We are monitoring the box with these settings and will report back later.
#3 Updated by Richard Kojedzinszky almost 2 years ago
We are still investigating the problem. The arc_free_target setting seems to help, but meanwhile I noticed that the g_bio UMA zone is using around 65G of memory, and its usage keeps increasing. I suspect there may be a leak somewhere, and that may be the explanation for our low ARC usage.
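A rough sketch of how the per-zone figure can be tracked over time, assuming the usual `vmstat -z` layout (zone name, then comma-separated SIZE, LIMIT, USED, FREE, ...). Memory held by a zone is approximated here as SIZE * (USED + FREE), which ignores slab overhead. The function name and the sample numbers are made up for illustration and are not taken from the affected box.

```python
def parse_vmstat_z(text):
    """Return {zone_name: approximate bytes held} from `vmstat -z` style text."""
    zones = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # header or blank line, no zone name present
        fields = [f.strip() for f in rest.split(",")]
        if len(fields) < 4:
            continue
        try:
            size, _limit, used, free = (int(f) for f in fields[:4])
        except ValueError:
            continue  # skip lines that do not carry numeric columns
        zones[name.strip()] = size * (used + free)
    return zones


# Made-up sample in the assumed format, for illustration only.
SAMPLE = """\
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP

UMA Kegs:               384,      0,     210,      10,     210,   0,   0
g_bio:                  376,      0, 1000000,     500, 9999999,   0,   0
"""

usage = parse_vmstat_z(SAMPLE)
for zone, held in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{zone}: {held / 1024 ** 2:.1f} MiB")
```

Running this periodically against real `vmstat -z` output and logging the g_bio value would show whether the zone grows steadily or only under particular load.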
I have already attached a debug; please help investigate this leak.