
Bug #27235

Unknown lockup

Added by Richard Kojedzinszky almost 2 years ago. Updated almost 2 years ago.

Status:
Closed: Duplicate
Priority:
No priority
Assignee:
Alexander Motin
Category:
OS
Target version:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

X8DT3 with two external JBOD chassis.

ChangeLog Required:
No

Description

We are using an 11 nightly build on one of our production systems. For the second time now, the FreeNAS box hung: all disk IO stalled, and we could not even log into the server via SSH. The network stack was still running; it answered pings and TCP connections could be established (port 22), but the login process never completed. On the console we could scroll back through the console buffer, but could not log in; even the FreeBSD login prompt did not appear after hitting Enter. The only way to recover was to power cycle the server. I have not seen this kind of lockup before, so I would just like to ask for some ideas on how to move forward with catching this issue.

The server is an X8DT3 Supermicro board with 192G of RAM and 16 disks inside the chassis forming one ZFS pool, plus two JBOD chassis attached through an external SAS HBA with two SAS2008 controllers on it.

However, the console does not suggest that the external SAS connection has any problems, as nothing is logged. If I simply remove a disk I get the relevant messages, but during these hangs nothing is printed.

In the external chassis there are 24 disks forming a 4x6 raidz2 ZFS pool, with two SSDs partitioned as log+cache; the logs are mirrored, the cache is striped. (I know this is not recommended, but it should only affect performance, not cause a lockup.) The server serves NFS for Xen hosts, serves some iSCSI shares, receives ZFS replications from other boxes, and also sends replications of its own datasets.
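
For illustration, the external pool layout corresponds roughly to the following zpool command (pool and device names are placeholders, not the real ones on this system):

zpool create extpool \
    raidz2 da0  da1  da2  da3  da4  da5 \
    raidz2 da6  da7  da8  da9  da10 da11 \
    raidz2 da12 da13 da14 da15 da16 da17 \
    raidz2 da18 da19 da20 da21 da22 da23 \
    log mirror ada0p1 ada1p1 \
    cache ada0p2 ada1p2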

Unfortunately that is all I can provide right now. I could not even force the server into a panic and a crash dump, as it did not give me a console.

We had issues with this box earlier where the ARC shrank each day to nearly half of its size; according to the kernel code and bug reports, that was normal and was caused by kernel memory fragmentation. Could this problem be related to memory fragmentation? For now I've set vfs.zfs.arc_free_target to 16G to keep more free memory in case that is related.
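
For reference, a minimal sketch of applying the tunable from a shell, assuming the value is counted in 4 KiB pages:

# 16 GiB expressed in 4 KiB pages: 16 * 1024 * 1024 * 1024 / 4096 = 4194304
sysctl vfs.zfs.arc_free_target=4194304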

What can I try next time?
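
The only idea I have so far for next time is to pre-arm a kernel crash dump and force a panic over IPMI when the console is dead; a rough sketch (the dump device and BMC details are placeholders, and it assumes the NMI actually gets delivered while the box is wedged):

# point kernel dumps at a swap/dump partition (placeholder device)
dumpon /dev/ada0p3
# make an NMI panic the box instead of only being logged
sysctl machdep.panic_on_nmi=1
# next time it wedges, send a diagnostic interrupt (NMI) from another host
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power diag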

nas-b-swap-io.svg (250 KB) - swap IO usage - Richard Kojedzinszky, 12/15/2017 04:38 AM

Related issues

Is duplicate of FreeNAS - Bug #27270: Fix memory leak (Resolved, 2017-12-15)

History

#1 Updated by Dru Lavigne almost 2 years ago

  • Status changed from Unscreened to 15

Ouch: why a nightly on a production system?

We'll need you to attach a debug from that system in order to start investigating. Create it after the system is back up after a hang.

#2 Updated by Richard Kojedzinszky almost 2 years ago

The system ran for weeks without being used for production, and then we decided to migrate our services onto it. Also, unfortunately we cannot afford to mirror all our production services in a test environment, so the effects of the real load only show up once the services have been migrated.

So, we have investigated further, and it seems that for some reason FreeNAS started using the swap partitions heavily; see the attached graph. You will see that on Monday at 12:00 pm very heavy swap activity began and lasted until the box locked up. The same thing repeated two days later. In between there is also some minimal swapping activity, but since the last lockup there has been none, thanks to the ZFS ARC free target setting. The box is only serving NFS and iSCSI; there are no jails, no Samba, no FTP or other activities. It just creates snapshots, sends them, and also receives them from other boxes. So at 12:00 pm on Monday nothing should have happened except the normal service load. Now the ARC free target is at 8GB and we still see no swap activity so far. But I assume such behaviour should not occur at all.
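
The graph itself comes from our monitoring; from a shell the same thing can be checked with the stock FreeBSD tools, roughly:

# swap space currently allocated per device
swapinfo -h
# paging activity every 5 seconds; the pi/po columns are page-ins/page-outs
vmstat -w 5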

We are monitoring with these settings and will report back later.

#3 Updated by Richard Kojedzinszky almost 2 years ago

We are investigating the problem. The arc_free_target setting seems to help, but meanwhile I noticed that the g_bio UMA zone is using around 65G of memory, and the usage keeps increasing. I suspect there is a leak somewhere, and maybe that is the answer to our low ARC usage.

I've already attached a debug; please help investigate this leak.

#4 Updated by Richard Kojedzinszky almost 2 years ago

  • File debug-nas-b-20171218134637.tgz added

This may relate to #27270.

g_bio shows suspiciously huge numbers:

# vmstat -z -H | grep g_bio
g_bio: 376, 0, 187861606, 2934, 3330974699, 0, 0
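
That is where the ~65G figure from my previous comment comes from: 376 bytes per item times roughly 188 million live items is about 65.8 GiB. A quick way to compute it from the output above, assuming the columns after the zone name are SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP:

# rough g_bio zone footprint in GiB: size * (used + free)
vmstat -z -H | awk -F '[,:] *' '/^g_bio/ { printf "%.1f GiB\n", $2 * ($4 + $5) / 2^30 }'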

#5 Updated by Dru Lavigne almost 2 years ago

  • Status changed from 15 to Unscreened
  • Assignee changed from Release Council to Alexander Motin
  • Private changed from No to Yes
  • Seen in changed from 11.1-BETA1 to Master - FreeNAS Nightlies

#6 Updated by Richard Kojedzinszky almost 2 years ago

Bisecting between 11.0-stable and 11.1-stable for the leak, I found that g_bio has been leaking since commit a0dddc24c905013363838bb04e79443d81a2d765. Hope that helps.
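
For anyone wanting to repeat this, it was the standard git bisect workflow against the stable branches; the endpoints below are placeholders rather than the exact revisions used:

git bisect start
git bisect bad  <first-bad-tip-on-11.1-stable>
git bisect good <last-known-good-11.0-stable-revision>
# at each step: build, boot, put IO load on the box, then check whether the
# g_bio USED counter in `vmstat -z` keeps climbing, and mark the revision
git bisect good    # or: git bisect bad
git bisect reset   # once the first bad commit is reported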

#7 Updated by Alexander Motin almost 2 years ago

  • Is duplicate of Bug #27270: Fix memory leak added

#8 Updated by Alexander Motin almost 2 years ago

  • Status changed from Unscreened to Closed: Duplicate

This looks like a consequence of the memory leak reported in #27270.

#9 Updated by Dru Lavigne almost 2 years ago

  • File deleted (debug-nas-b-20171218134637.tgz)

#10 Updated by Dru Lavigne almost 2 years ago

  • Target version set to N/A
  • Private changed from Yes to No
