Bug #38105

Do not count ARC against available RAM

Added by Disk Didler 10 months ago. Updated 9 months ago.

Status: Done
Priority: No priority
Assignee: Andrew Walker
Category: Services
Target version:
Seen in:
Severity: Low
Reason for Closing:
Reason for Blocked:
Needs QA: No
Needs Doc: No
Needs Merging: No
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

I am getting a large number of alert emails for things that have not been a problem previously.
I have 6x5TB disks (30TB) in RAIDZ2 (20TB) and 16GB of memory (note: this system worked for 18 months with only 8GB).

Example from last night:

freenas.local needs attention
system.ram
CHART
ram available = 7.8%


system.ram
CHART
ram available = 3.25% 
estimated amount of RAM available for userspace processes, without causing swapping
ALARM
ram
FAMILY
Escalated to CRITICAL
SEVERITY
system.swapio
CHART
30min ram swapped out (alarm was raised for 36 minutes and 1 second)
the amount of memory swapped in the last 30 minutes, as a percentage of the system RAM
ALARM
swap
FAMILY
Recovered from CRITICAL
SEVERITY

Etc etc

I don't really have a problem with this if it's a genuine issue, but this system hasn't been unstable in over 14 months (and that was a power supply fault). In reality, the system seems to be recovering from this memory issue, which (I assume) has always been there?

Maybe the thresholds should be adjusted. In fact, I thought ZFS uses all the RAM it can get, which implies it will always trigger these alerts.
Any thoughts?
(I'm also getting one for "inodes" too)


Related issues

Related to FreeNAS - Bug #52195: Emails regarding the amount of memory swapped in the last 30 minutes, as a percentage of the system RAM (Closed)
Related to FreeNAS - Bug #64716: Suppress default Netdata RAM usage warning (Done)

History

#1 Updated by Dru Lavigne 9 months ago

  • Private changed from No to Yes
  • Reason for Blocked set to Need additional information from Author

Please attach a debug to assist the dev in diagnosing the cause.

#2 Updated by Disk Didler 9 months ago

  • Private changed from Yes to No

#3 Updated by Rick Connor 9 months ago

  • File debug.tgz added

Having same issue. Here is my Debug file. Hope this helps

#4 Updated by Dru Lavigne 9 months ago

  • Category changed from OS to Services
  • Assignee changed from Release Council to John Hixson
  • Target version changed from Backlog to 11.2-RC2

#5 Updated by Dru Lavigne 9 months ago

  • Reason for Blocked deleted (Need additional information from Author)

#6 Updated by Disk Didler 9 months ago

  • File debug.tgz added

I've got one too now.
Ex 09:26am, my time:


system.swapio
CHART
30min ram swapped out (alarm was raised for 40 minutes and 1 second)
the amount of memory swapped in the last 30 minutes, as a percentage of the system RAM
ALARM
swap
FAMILY
Recovered from CRITICAL
SEVERITY
Tue Jul 17 09:26:49 AEST 2018 
(alarm was raised for 40 minutes and 1 second)
TIME

09:35am this morning

freenas.local is critical
system.swapio
CHART
30min ram swapped out = 116.6% of RAM 
the amount of memory swapped in the last 30 minutes, as a percentage of the system RAM
ALARM
swap
FAMILY
CRITICAL
SEVERITY
Tue Jul 17 09

Etc

I don't know whether I'm low on RAM, but she worked fine for 18 months with 8GB, and for another 12 to 18 months with the memory doubled.

#7 Updated by Dru Lavigne 9 months ago

  • Target version changed from 11.2-RC2 to 11.2-BETA3

#8 Updated by John Hixson 9 months ago

  • Assignee changed from John Hixson to Andrew Walker

#10 Updated by Andrew Walker 9 months ago

FreeBSD got swapio metrics in netdata version 1.10
https://github.com/firehol/netdata/commit/3af09cbe54cd28b105cf2f19954434b11409bb86#diff-4f2c78f788edcb99de4d2922b386dce5

We updated to version 1.10 per redmine ticket here: https://redmine.ixsystems.com/issues/30864

Your systems appear to be swapping heavily:

First system ---- Swap: 10G Total, 2707M Used, 7532M Free, 26% Inuse
Second system ---- Swap: 6144M Total, 1114M Used, 5030M Free, 18% Inuse

So this is an expected notification in this situation. You can silence them by editing /etc/local/netdata/health.d/swap.conf and setting the 'to' line to: "silent" then restarting netdata.
Once you have verified that the behavior is what you want, you can make the same changes to /conf/base/etc/local/netdata/health.d/swap.conf.
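The edit described above can be sketched in Python as a small text transformation. This is only an illustration: the function name `silence_alarms` and the abbreviated `SAMPLE` block are hypothetical, and the sample is not the full swap.conf shipped with FreeNAS. It works on a string, so nothing on disk is touched until you choose to write the result back and restart netdata.

```python
import re


def silence_alarms(config_text: str) -> str:
    """Return config_text with the recipient on every 'to:' line set to 'silent'."""
    # Only lines whose key is 'to:' are rewritten; 'info:' text that happens
    # to contain the word "to" is left alone because the match is anchored.
    return re.sub(r"(?m)^([ \t]*to:[ \t]*).*$", r"\1silent", config_text)


# Abbreviated, illustrative alarm block (assumption, not the shipped file).
SAMPLE = """\
   alarm: 30min_ram_swapped_out
      on: system.swapio
    info: the amount of memory swapped in the last 30 minutes
      to: sysadmin
"""

if __name__ == "__main__":
    print(silence_alarms(SAMPLE))
```

Keeping a copy of the original file before applying a change like this makes it easy to restore the default recipients later.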

#11 Updated by Disk Didler 9 months ago

So do you mean my debug or the other guy's?

This machine has no VMs; it ran fine for nearly two years with only 8GB.

I doubled the memory to try running a VM, which I ended up not doing.

Knowing it works fine with 8, one would think 16 is enough?

#12 Updated by Andrew Walker 9 months ago

Regarding RAM usage:
Similar story. FreeBSD got metrics in Netdata 1.10:
https://github.com/firehol/netdata/commit/0791c0d790be35fdc43d2fc74a70ad2891ca87f2

Algorithm for generating ram.available:

calc: ($free + $inactive + $used_ram_to_ignore) * 100 / ($free + $active + $inactive + $wired + $cache + $buffers)

used_ram_to_ignore is calculated in the following way:

   alarm: used_ram_to_ignore
      on: system.ram
      os: linux
   hosts: *
    calc: ($zfs.arc_size.arcsz = nan)?(0):($zfs.arc_size.arcsz)
   every: 10s
    info: the amount of memory that is reported as used, but it is actually capable for resizing itself based on the system needs (eg. ZFS ARC)

Note the "os: linux" line: it is possible that this value is not being used in the FreeBSD calculation (i.e. it is always 0). Can you provide the following rrd file: /var/db/collectd/rrd/localhost/zfs_arc/cache_size-arc.rrd
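As a rough illustration of the formula above, the following Python sketch evaluates ram.available with and without the ARC term. The memory figures are invented for a 16 GiB system and are not taken from either debug file. It also assumes that the ARC is accounted under wired memory on FreeBSD, and the helper name `ram_available_pct` is hypothetical.

```python
def ram_available_pct(free, inactive, active, wired, cache, buffers,
                      used_ram_to_ignore=0.0):
    """Mirror the netdata calc line quoted above; all inputs share one unit."""
    total = free + active + inactive + wired + cache + buffers
    return (free + inactive + used_ram_to_ignore) * 100 / total


# Hypothetical figures in GiB (assumption, not measured values).
mem = dict(free=0.5, inactive=0.5, active=1.0, wired=14.0, cache=0.0, buffers=0.0)
arc_size = 11.0  # assumed ARC size, counted inside "wired"

# If the used_ram_to_ignore alarm never runs on this OS, the term stays 0.
without_arc = ram_available_pct(**mem)
# If it does run, the ARC size is added back as reclaimable memory.
with_arc = ram_available_pct(**mem, used_ram_to_ignore=arc_size)

print(f"ARC treated as reclaimable: {with_arc:.2f}% available")
print(f"ARC counted as used:        {without_arc:.2f}% available")
```

With these made-up numbers the same machine reports 75% available when the ARC is ignored and only 6.25% when it is not, which is the kind of gap that would explain a low ram.available alert on a healthy ZFS system.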

#13 Updated by Rick Connor 9 months ago

Andrew Walker wrote:

FreeBSD got swapio metrics in netdata version 1.10
https://github.com/firehol/netdata/commit/3af09cbe54cd28b105cf2f19954434b11409bb86#diff-4f2c78f788edcb99de4d2922b386dce5

We updated to version 1.10 per redmine ticket here: https://redmine.ixsystems.com/issues/30864

Your systems appear to be swapping heavily:
[...]

So this is an expected notification in this situation. You can silence them by editing /etc/local/netdata/health.d/swap.conf and setting the 'to' line to: "silent" then restarting netdata.
Once you have verified that the behavior is what you want, you can make the same changes to /conf/base/etc/local/netdata/health.d/swap.conf.

This makes no sense. I had no issues with it before, but OK, I'll just turn Netdata off. I really like the feature, but I shouldn't have to edit .conf files to get it to work or to stop it sending me notifications every 5 minutes. In my eyes something is wrong with Netdata, so I can't depend on it anymore if I have to edit files to silence it, since I never had to before.

I have no issues with the CLI or with editing files; I use it every day. But other people are not comfortable with it, so they are going to end up breaking things. You might want to make a sticky on the forum about this and how to fix it.

#14 Updated by Andrew Walker 9 months ago

Rick Connor wrote:

Andrew Walker wrote:

FreeBSD got swapio metrics in netdata version 1.10
https://github.com/firehol/netdata/commit/3af09cbe54cd28b105cf2f19954434b11409bb86#diff-4f2c78f788edcb99de4d2922b386dce5

We updated to version 1.10 per redmine ticket here: https://redmine.ixsystems.com/issues/30864

Your systems appear to be swapping heavily:
[...]

So this is an expected notification in this situation. You can silence them by editing /etc/local/netdata/health.d/swap.conf and setting the 'to' line to: "silent" then restarting netdata.
Once you have verified that the behavior is what you want, you can make the same changes to /conf/base/etc/local/netdata/health.d/swap.conf.

This makes no sense. I had no issues with it before, but OK, I'll just turn Netdata off. I really like the feature, but I shouldn't have to edit .conf files to get it to work or to stop it sending me notifications every 5 minutes. In my eyes something is wrong with Netdata, so I can't depend on it anymore if I have to edit files to silence it, since I never had to before.

I have no issues with the CLI or with editing files; I use it every day. But other people are not comfortable with it, so they are going to end up breaking things. You might want to make a sticky on the forum about this and how to fix it.

Apologies for the terse response previously. I'm still investigating the issue. I believe this is new alerting behavior introduced in the new version of netdata which was put into 11.2. The swap alert is correct behavior, but I think we need to make netdata alerting configurable through the webui (unless I'm missing something on the netdata page). This would be a new feature and not immediately fix your problem. The CLI changes were a suggestion to get you through another day (although it may be worthwhile for you to investigate the high swap utilization on your server).

#15 Updated by Rick Connor 9 months ago

No problem. I understand there are sometimes miscommunications between devs and end users.

I'm not sure which way to go on this. As an end user, I feel it's a Netdata issue, since I've been running Netdata without any issues since day 1 of it being added to FreeNAS. Now all of a sudden I have issues with my FreeNAS box? Or memory issues? I searched the forums for "high swap utilization" and found not one mention of it, so I'm lost as to what to do.

See, from the end user side, we just see that after the upgrade we get messages we never got before. So we brush it off as a Netdata software issue, because we can't imagine our FreeNAS box having had this issue since day 1 of building it when we never got an error or warning about high swap utilization before.

So I'm not sure if I have an issue or not. There is no mention of high swap utilization in the forums, but I get emails. See how it's confusing for an end user?

Thank you for all the work you put into FreeNAS. I know it can be frustrating to deal with us sometimes. lol.

#16 Updated by Andrew Walker 9 months ago

Okay. Fixed the RAM alert on my test system:

root@catherder:/usr/local/etc/netdata/health.d # diff -u ram.conf.orig ram.conf
--- ram.conf.orig    2018-07-19 16:30:05.469086542 -0400
+++ ram.conf    2018-07-19 16:27:21.613835425 -0400
@@ -3,7 +3,7 @@

    alarm: used_ram_to_ignore
       on: system.ram
-      os: linux
+      os: linux freebsd
    hosts: *
     calc: ($zfs.arc_size.arcsz = nan)?(0):($zfs.arc_size.arcsz)
    every: 10s
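To see why the one-word change matters, here is a deliberately simplified Python model of the `os:` filter. It assumes the field is a space-separated list of OS names, with `*` meaning any OS. Netdata's real matching supports richer patterns, and the function name `alarm_applies` is hypothetical.

```python
def alarm_applies(os_field: str, host_os: str) -> bool:
    """True when host_os is listed in the space-separated os_field ('*' matches all)."""
    names = os_field.split()
    return "*" in names or host_os in names


# Before the patch the helper alarm is skipped on FreeBSD, so the ARC
# term in ram.available is never filled in.
print(alarm_applies("linux", "freebsd"))
# After the patch the helper alarm is evaluated on FreeBSD as well.
print(alarm_applies("linux freebsd", "freebsd"))
```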

#17 Updated by Andrew Walker 9 months ago

#18 Updated by Andrew Walker 9 months ago

  • Status changed from Unscreened to In Progress

#19 Updated by Disk Didler 9 months ago

Hi Andrew,

Not worried about the terse response, you guys are programmers, not PR :)

My primary concern is the posts on the forums and reddit from people complaining. At the end of the day, people are whining a lot about FreeNAS lately, which worries me. So the smoother it is for everyone, the better.

I would imagine either being able to change the threshold or disable the alert would solve this issue.
I'm glad we have this monitoring available, but if it's hassling users on perfectly stable systems, that's not ideal.

(I've had 0 crashes on my system in over 18 months, even on BETA1. After the last crash I replaced the entire system, as I suspected the PSU or motherboard.)

#20 Updated by Andrew Walker 9 months ago

Disk Didler wrote:

Hi Andrew,

Not worried about the terse response, you guys are programmers, not PR :)

My primary concern is the posts on the forums and reddit from people complaining. At the end of the day, people are whining a lot about FreeNAS lately, which worries me. So the smoother it is for everyone, the better.

I would imagine either being able to change the threshold or disable the alert would solve this issue.
I'm glad we have this monitoring available but if it's hassling users on perfectly stable systems, that's not ideal.

Try the patch in response #16 above to fix the erroneous alerts regarding RAM usage. There are instructions in response #10 above to silence alerts about swapping. I believe there are plans to make netdata configurable, but they're not going to be in 11.2. So unfortunately, you may have to manually edit the config files to silence the alerts about swapping.

#21 Updated by Rick Connor 9 months ago

Maybe this has something to do with Netdata. Read post 14; I think he might be onto something. If there is a VM memory allocation problem (and I do have VMs), this could be what's triggering this alert from Netdata. I'm not a programmer, but this seems tied together somehow. Maybe there is nothing wrong with Netdata?

https://forums.freenas.org/index.php?threads/11-2-beta1-cant-get-vms-to-start.68491/

Can a dev look into this? And if it is an issue, can a ticket be created?

#22 Updated by Rick Connor 9 months ago

Maybe this has something to do with Netdata. Read post #14; I think he might be onto something. If there is a VM memory allocation problem (and I do have VMs), this could be what's triggering this alert from Netdata. I'm not a programmer, but this seems tied together somehow. Maybe there is nothing wrong with Netdata?

https://forums.freenas.org/index.php?threads/11-2-beta1-cant-get-vms-to-start.68491/

Can a dev look into this? And if it is an issue, create a ticket?

#24 Updated by John Hixson 9 months ago

  • Status changed from In Progress to Ready for Testing

Andrew Walker wrote:

PR against Master: https://github.com/freenas/ports/pull/124

merged.

#25 Updated by Dru Lavigne 9 months ago

  • File deleted (debug.tgz)

#26 Updated by Dru Lavigne 9 months ago

  • File deleted (debug.tgz)

#27 Updated by Dru Lavigne 9 months ago

  • Subject changed from FreeNAS 11.2 BETA1, many netdata alerts to Do not count ARC against available RAM
  • Target version changed from 11.2-BETA3 to 11.2-BETA2
  • Needs Doc changed from Yes to No
  • Needs Merging changed from Yes to No

#28 Updated by Disk Didler 9 months ago

Perhaps I should've made this clear in the initial post, but this is more than just a memory issue.

Should I log a new job?

In the past 3 days, I have 30 alerts.
Such as:

10min disk utilization - disk_util.ada4
ipv4 tcphandshake last collected secs - ipv4.tcphandshake
ipv4 udperrors last collected secs - ipv4.udperrors
30min ram swapped out - system.swapio
10min cpu usage - system.cpu

To be clear, I don't mind the alerts if they're valid.
I just want to be sure. The system itself (the base OS?) seems very robust, because she continues to recover from these apparent errors and, as I stated previously, my overall stability with FreeNAS over the past 18 months has been exceptional.

#29 Updated by Disk Didler 9 months ago

A new one today.

"10s received packets storm - net_packets.bge0"
Haven't seen that one before.

#30 Updated by Dru Lavigne 9 months ago

Please create a separate ticket for the remaining messages.

#32 Updated by Bonnie Follweiler 9 months ago

  • Status changed from Ready for Testing to Passed Testing
  • Needs QA changed from Yes to No

Test Passed in 11.2-MASTER-201807300838

#33 Updated by Dru Lavigne 9 months ago

  • Status changed from Passed Testing to Done

#34 Updated by Dru Lavigne 6 months ago

  • Related to Bug #52195: Emails regarding the amount of memory swapped in the last 30 minutes, as a percentage of the system RAM added

#35 Updated by Dru Lavigne 4 months ago

  • Related to Bug #64716: Suppress default Netdata RAM usage warning added
