Strange ARC size info
Is it normal for the ARC size to be reported like this (see attachment)?
It started reporting like this after the update from 9.10.1-U4 to 9.10.2.
#8 Updated by Jan Brońka over 4 years ago
To clarify: I currently have 9.10.1-U4 in production (4 ESXi hosts working against FreeNAS via iSCSI, 20 LUNs of 600 GB each), and I set up a test installation for a few days to check whether 9.10.2 provides interesting answers for my needs. I observed some strange behaviours, like the one described in this thread, but also significantly higher CPU load, mainly in user space (the UI consumes it). There were also problems with the update (the UI update failed, so I had to use the command line), and after the update the boot hung twice (very early); the next reboot then passed without any change on my side.
Let me conclude this thread: I have a 200 GB L2ARC SSD, and from the time I connected it, the L2ARC/ARC view always reported at most 200 GB, which is logical to me. With 9.10.2 I quickly saw 800 GB, then 1.5 TB after a few more hours, and it keeps slowly increasing.
Nice to know my SSD keeps getting bigger and bigger :) but I do not want an invoice for it later :)))
#9 Updated by Suraj Ravichandran over 4 years ago
- Status changed from 15 to Unscreened
- Assignee changed from Suraj Ravichandran to Marcelo Araujo
- Priority changed from No priority to Important
- Target version set to 9.10.3
@Marcelo, do you mind taking a look at this? (I do not have much experience with ARC and L2ARC.)
If not, you can hand it back to me.
Also, let me know if it's just a collectd issue; I can help you with that.
#12 Updated by Lutz Reinegger over 4 years ago
Lutz Reinegger wrote:
I can confirm: I see the same issue on my FreeNAS 9.10.2-U1 box. As far as I can tell, the issue did not exist in 9.10.1-U4.
After some amount of uptime, the reporting around ARC/L2ARC memory shows ARC and L2ARC sizes that actually exceed the physical RAM and/or the L2ARC SSDs in my box.
I observed this behaviour both in the FreeNAS admin web interface and in command-line tools like "top" and "arc_summary.py".
A reboot does not seem to change this outright strange behaviour.
As requested before, I will upload my debug data. For the time being I will not fall back to 9.10.1-U4 but will keep my box running on 9.10.2-U1 so I can reproduce the situation (although I am a bit nervous).
#14 Updated by Lutz Reinegger over 4 years ago
- File arc.png arc.png added
- File debug-freenas-20170127211307.tar added
- File memory.png memory.png added
- File top.png top.png added
- File zpool-iostat.png zpool-iostat.png added
@Marcelo: Great. Thanks for the update.
Meanwhile I was able to reproduce the error following the steps listed here:
1.) Perform a reboot; the debug file I attached earlier (yesterday) was created just after the reboot.
2.) Perform a read-intensive I/O operation. In my case I used s3cmd to sync a directory tree containing tens of thousands of files to the Amazon S3 service.
3.) At this point the behaviour of the machine turns weird; here are the clues I could collect:
- ARC size is reported as 16 GB on a machine with 16 GB of RAM in the FreeNAS web interface, while the memory graph reports 4 GB of "free" memory; this does not add up (see screenshots arc.png and memory.png)
- L2ARC size is reported as 129.5 GB on a 128 GB cache device (SSD) (see attached screenshot arc.png)
- Using the command "top" I see 16 GB of ARC allocated and 4 GB free, again on a 16 GB RAM machine (see top.png)
- Using the command "zpool iostat -v 1" I see the size of the cache device as "16.0E" (16 exabytes?) (see attached file zpool-iostat.png)
- Using the command "arc_summary.py" I see the L2ARC reported as "DEGRADED", and the general stats do not add up, as described above (see arc_summary.txt)
- Something seems to be simply wrong here concerning the ARC and L2ARC stats
- The general behaviour of the machine is strange (I/O patterns, performance, processes running)
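The inconsistent sizes above can be cross-checked against the raw kernel counters that the GUI graphs are built from. A minimal sketch, assuming a FreeBSD/FreeNAS shell (the sysctl OIDs are the standard FreeBSD ZFS arcstats names; the `to_gib` helper is mine, added just for readability):

```shell
# Raw ARC/L2ARC counters behind the graphs (FreeBSD ZFS kstat OIDs).
# Errors are silenced so this is a no-op on a non-FreeBSD box.
sysctl kstat.zfs.misc.arcstats.size \
       kstat.zfs.misc.arcstats.c_max \
       kstat.zfs.misc.arcstats.l2_size \
       kstat.zfs.misc.arcstats.l2_asize 2>/dev/null || true

# Helper to turn the raw byte counts into GiB for comparison with the GUI.
to_gib() { awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'; }
to_gib 17179869184   # a healthy 16 GiB ARC prints 16.0
```

On an affected box, `l2_size` can then be compared directly against the SSD's real capacity, independently of collectd and the web UI.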
I created a new debug file as well, see attachment.
Meanwhile, I am seriously considering rolling my system back to 9.10.1-U4.
#16 Updated by Richard Kojedzinszky over 4 years ago
- File arc-hit.png arc-hit.png added
- File arc-reqs.png arc-reqs.png added
- File arc-size.png arc-size.png added
- Priority changed from Important to Critical
It seems to be a critical bug: because of the accounting mismatch the reported ARC size increases, the adaptive algorithm tries to shrink the cache lower and lower, and the ARC hit ratio has dropped to nearly 0.
#21 Updated by Richard Kojedzinszky over 4 years ago
Unfortunately, I must say that this is more than a reporting issue, as it confuses the adaptive ARC sizing algorithm as well, ending in poor performance. A short-term workaround is to remove the L2ARC cache devices and reboot the affected box.
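Concretely, that workaround boils down to detaching the cache vdev and rebooting. A sketch with made-up names (`tank` and `gpt/l2arc0` are examples; check `zpool status` for the real names on your box):

```shell
POOL=tank              # example pool name, substitute your own
CACHEDEV=gpt/l2arc0    # example cache vdev name, see 'zpool status'

# Guarded so this is a no-op on a machine without ZFS tools.
if command -v zpool >/dev/null 2>&1; then
    zpool status "$POOL"               # cache vdevs appear under the "cache" section
    zpool remove "$POOL" "$CACHEDEV"   # detach the L2ARC device
fi
# ...then reboot so the ARC accounting starts from a clean state.
```

Removing a cache vdev only drops cached copies of pool data, so it is non-destructive; the device can be re-added later with `zpool add <pool> cache <dev>`.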
#23 Updated by Kris Moore over 4 years ago
- Status changed from Closed: Third party to resolve to Unscreened
- Assignee changed from Marcelo Araujo to Alexander Motin
- Seen in changed from Unspecified to 9.10.2-U1
Alexander, we have several of these tickets now about weird ARC usage/reporting. It looks like you brought in the patch. Something to investigate here?
#25 Updated by Richard Kojedzinszky over 4 years ago
The bug is easy to reproduce:
1. You need a pool with an L2ARC cache device.
2. You need a data set that does not fit in memory.
3. Read that data endlessly.
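A minimal sketch of those three steps as a shell script; the path, sizes, and the capped loop are my assumptions for illustration (on a real box the file must be larger than RAM, and the read loop would run indefinitely):

```shell
#!/bin/sh
# Hypothetical path and size; on the real box use a file inside the pool
# that is bigger than RAM, e.g. MB=32768 (32 GB) on a 16 GB machine.
FILE=${FILE:-/tmp/arcstress.dat}
MB=${MB:-4}

# Step 2: create the data set (bs=1048576 is 1 MiB, portable across dd flavors).
dd if=/dev/zero of="$FILE" bs=1048576 count="$MB" 2>/dev/null

# Step 3: read it repeatedly; on an affected build, watch the total ARC size in
# 'top' stay nearly constant while MFU/MRU shrink. Capped at 3 passes here.
i=0
while [ "$i" -lt 3 ]; do
    dd if="$FILE" of=/dev/null bs=1048576 2>/dev/null
    i=$((i + 1))
done
```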
I actually read it in a random pattern, and the first noticeable symptom was that the overall ARC size shown in 'top' stayed nearly constant while the MFU and MRU sizes began to shrink. With the mentioned commit reverted, they sum up to the total ARC size, and over time that does not change.
#26 Updated by Chris Torek over 4 years ago
Possibly related (though I haven't even looked through the bug report): https://www.listbox.com/member/archive/274414/2017/02/sort/time_rev/page/1/entry/0:80/20170212150158:19DB0668-F15E-11E6-947A-E981910FB0E3/
#31 Updated by Richard Kojedzinszky over 4 years ago
The proposed patch does not fix everything; the L2ARC accounting mismatch can still be reproduced very easily:
I did it in a VM with 2 GB of RAM and a pool with a 512 MB L2ARC cache, then just created a 2 GB file in the pool. As it was a VM, the L2ARC cache was relatively fast, and within 10-20 seconds it reached the 16.0E size.
With illumos commit 6950 reverted this symptom disappeared, and everything seems to work fine.
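For reference, the "16.0E" figure itself is consistent with an unsigned 64-bit wrap-around: a space counter that goes slightly negative displays as roughly 2^64 bytes, which zpool renders as 16 exbibytes. A sketch of the VM repro plus that arithmetic (pool and device names are examples, and the zpool/dd commands are left commented out since they need a real pool):

```shell
# Example repro in a small VM (2 GB RAM, 512 MB cache disk vtbd1):
# zpool add tank cache vtbd1
# dd if=/dev/urandom of=/mnt/tank/big.dat bs=1m count=2048   # 2 GB file
# zpool iostat -v tank 1    # cache FREE column jumps to "16.0E" when it wraps

# The arithmetic behind the display: 2^64 bytes expressed in exbibytes.
awk 'BEGIN { printf "%.1fE\n", 2^64 / 1024^6 }'   # prints 16.0E
```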
#33 Updated by Josh Wisman over 4 years ago
- Hardware Configuration updated (diff)
Is there any proposed workaround besides removing the L2ARC or rebooting when it fills? My system is pretty severely impacted without the L2ARC.
Is there any data or testing I can provide to add value or help with the resolution? I am not a developer, but I am pretty adept otherwise.
#36 Updated by Remon V. over 4 years ago
Richard Kojedzinszky wrote:
Is there any work in progress regarding this issue?
I'm also waiting for this fix. It's been a month, this is a critical bug in FreeNAS, and there is still no word about any progress. Too bad my knowledge comes up short for digging into this myself, so I hope someone will find a solution.
#38 Updated by Kris Moore over 4 years ago
- Status changed from Fix In Progress to Resolved
Looks like this made it in:
#44 Updated by Jan Brońka over 4 years ago
Are you sure the target version is 11?
Well, I think everyone on 9.10.2-U1 or lower (especially in production) would rather not update to U2, U3 and above.
So, IMHO, 9.10.2 will rather be a dead branch: there is no later version (above 9.10.2) stable enough to put into production, and there likely will not be one soon.
#45 Updated by Alexander Motin over 4 years ago
FreeNAS 11.0 is just a new name for 9.10.3, exposing the switch to the FreeBSD 11 OS. While FreeBSD 10, used by FreeNAS 9.10.x, is technically still supported, it receives less and less attention from developers since no new releases are planned there, so there is no significant reason to stick with it.
#48 Updated by Lutz Reinegger about 4 years ago
- % Done changed from 0 to 100
I would like to confirm: This issue has been resolved as of FreeNAS 11.0-RC.
After I upgraded my box to 11.0-RC I am no longer able to reproduce this error.
A big "THANK YOU!!!" to all the people involved in fixing this issue.
:-) :-) :-)