Bug #19953

Strange ARC size info

Added by Jan Brońka over 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Critical
Assignee: Alexander Motin
Category: OS
Target version: 11.0-RC
Seen in: 9.10.2-U1
Severity: New
Reason for Closing:
Reason for Blocked:
Needs QA: Yes
Needs Doc: Yes
Needs Merging: Yes
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

Is it normal for the ARC size to be reported like this (see attachment)?
It started reporting like this after the update from 9.10.1-U4 to 9.10.2.

snip_20170102111832.png (20.1 KB) Jan Brońka, 01/02/2017 02:22 AM
arc.png (10.1 KB) Lutz Reinegger, 01/27/2017 01:32 PM
memory.png (14.8 KB) Lutz Reinegger, 01/27/2017 01:32 PM
top.png (211 KB) Lutz Reinegger, 01/27/2017 01:32 PM
zpool-iostat.png (25 KB) Lutz Reinegger, 01/27/2017 01:32 PM
arc-hit.png (14.8 KB) Richard Kojedzinszky, 02/06/2017 03:19 PM
arc-reqs.png (18.1 KB) Richard Kojedzinszky, 02/06/2017 03:19 PM
arc-size.png (11.6 KB) Richard Kojedzinszky, 02/06/2017 03:19 PM

Related issues

Related to FreeNAS - Bug #21098: l2arc growing and causing ctl_datamove errors (Closed: Duplicate, 2017-02-11)
Has duplicate FreeNAS - Bug #21319: Strange statistics (Closed: Duplicate, 2017-02-20)

History

#1 Updated by Jan Brońka over 4 years ago

(screenshot attached: snip_20170102111832.png)

#2 Updated by Bonnie Follweiler over 4 years ago

  • Assignee set to Kris Moore

#3 Updated by Kris Moore over 4 years ago

  • Status changed from Unscreened to Closed: Behaves correctly

Graph looks right. Unused l2arc is wasted l2arc, and it appears to be using it in this case.

#4 Updated by Jan Brońka over 4 years ago

800 GB? After a few days it was 1.5 TB... my L2ARC SSD disk is 200 GB.
FN 9.10.1-U4 and below have reported a maximum of 200 GB for as long as I've been using FreeNAS (a few years).

#5 Updated by Kris Moore over 4 years ago

  • Status changed from Closed: Behaves correctly to Unscreened
  • Assignee changed from Kris Moore to Suraj Ravichandran

Ahh, we didn't have that detail before...

Can you provide a full debug? System -> Advanced -> Save Debug

#6 Updated by Jan Brońka over 4 years ago

Cannot as I back my system to FN9.10.1 U4. FN9.10.2 does not pass validation.

#7 Updated by Suraj Ravichandran over 4 years ago

  • Status changed from Unscreened to 15

@Jan what do you mean by this statement "Cannot as I back my system to FN9.10.1 U4. FN9.10.2 does not pass validation."

#8 Updated by Jan Brońka over 4 years ago

It means that I now have 9.10.1-U4 in production (4 ESXi hosts working against FreeNAS via SCSI, 20 LUNs of 600 GB each) and I wanted to check whether 9.10.2 offered anything interesting for my needs. So I set up a test installation for a few days to evaluate the new version. I observed some strange behaviours, like the one described in this thread, but also significantly higher CPU load, mainly in user space (the UI consumed it); there were problems with the update (the UI update failed and I had to use the command line); and after a reboot the boot hung twice (very early), then the next reboot passed without any change on my side.

Let me conclude this thread: I have a 200 GB L2ARC SSD, and from the time I connected it the L2ARC/ARC view has always reported a maximum of 200 GB, which is logical to me... With 9.10.2 I quickly saw 800 GB, and after a few more hours 1.5 TB, still slowly increasing.
Nice to know my SSD keeps getting bigger and bigger :) but I don't want to be invoiced for it later :)))

#9 Updated by Suraj Ravichandran over 4 years ago

  • Status changed from 15 to Unscreened
  • Assignee changed from Suraj Ravichandran to Marcelo Araujo
  • Priority changed from No priority to Important
  • Target version set to 9.10.3

@Marcelo, do you mind taking a look at this? (I do not have much experience with ARC and L2ARC.)

If not, you can hand it back to me.

Also, let me know if it's just a collectd issue; I can help you with that.

Thanks

#10 Updated by Marcelo Araujo over 4 years ago

  • Status changed from Unscreened to Screened

#11 Updated by Lutz Reinegger over 4 years ago

  • File debug-freenas-20170126232403.tar added

#12 Updated by Lutz Reinegger over 4 years ago

Lutz Reinegger wrote:

I can confirm: I see the same issue on my FreeNAS 9.10.2-U1 box. As far as I can tell, the issue did not exist in 9.10.1-U4.
The reporting around ARC and L2ARC seems to show ARC and L2ARC sizes that actually exceed the physical RAM and/or the L2ARC SSD disks in my box after some amount of uptime.
I observed this behaviour both in the FreeNAS admin web interface and in command-line tools such as "top" and "arc_summary.py".
A reboot does not seem to change this outright strange behaviour.
As requested before, I will upload my debug data. For the time being I will not fall back to 9.10.1-U4 but keep my box running on 9.10.2-U1 so I can reproduce the situation (although I am a bit nervous).

#13 Updated by Marcelo Araujo over 4 years ago

I'm aware of this PR; I will check it next week!
Thanks for all the reports.

Best,

#14 Updated by Lutz Reinegger over 4 years ago

(screenshots attached: arc.png, memory.png, top.png, zpool-iostat.png)

@Marcelo: Great. Thanks for the update.

Meanwhile I was able to reproduce the error following these steps:
1.) Perform a reboot; the debug file I attached earlier (yesterday) was created just after that reboot.
2.) Perform a read-intensive I/O operation. In my case I used s3cmd to sync a directory tree containing tens of thousands of files to the Amazon S3 service.
3.) At this point the behaviour of the machine turns weird; here are the clues I could collect:
- ARC size reported as 16 GB in the FreeNAS web interface on a machine with 16 GB of RAM, while memory is reported with 4 GB of "free" memory; this does not add up (see screenshots arc.png and memory.png)
- L2ARC size reported as 129.5 GB on a 128 GB cache device (SSD) (see attached screenshot arc.png)
- Using the command "top" I see 16 GB of ARC allocated and 4 GB free, again on a 16 GB RAM machine (see top.png)
- Using the command "zpool iostat -v 1" I see the size of the cache device as "16.0 E", 16 exabytes? (see attached file zpool-iostat.png)
- Using the command "arc_summary.py" I see the L2ARC reported as "DEGRADED", and the general stats do not add up, as described above (see arc_summary.txt)
- Something seems to be simply wrong here concerning the ARC and L2ARC stats
- The general behaviour of the machine is strange (I/O patterns, performance, processes running)
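
For reference, the same numbers can also be read straight from the ZFS kstat sysctls; a minimal sketch, assuming the usual FreeBSD arcstats OID names:

    # ARC target/current size and L2ARC (logical/allocated) size from the kernel counters
    sysctl kstat.zfs.misc.arcstats.c_max \
           kstat.zfs.misc.arcstats.size \
           kstat.zfs.misc.arcstats.l2_size \
           kstat.zfs.misc.arcstats.l2_asize
    # per-device view of the pool, including the cache device, at a one-second interval
    zpool iostat -v 1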

I created a new debug file as well, see attachment.

Meanwhile I am seriously considering rolling my system back to 9.10.1-U4.

#15 Updated by Lutz Reinegger over 4 years ago

  • File arc_summary.txt added

#16 Updated by Richard Kojedzinszky over 4 years ago

(screenshots attached: arc-hit.png, arc-reqs.png, arc-size.png)

This seems to be a critical bug: because of the accounting mismatch the reported ARC size keeps increasing, the adaptive algorithm tries to shrink the cache lower and lower, and the ARC hit ratio has dropped to nearly 0.
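
For reference, the hit ratio can be derived from the cumulative arcstats counters; a minimal sketch, assuming the standard FreeBSD sysctl names:

    # overall ARC hit ratio computed from the cumulative hit/miss counters
    hits=$(sysctl -n kstat.zfs.misc.arcstats.hits)
    misses=$(sysctl -n kstat.zfs.misc.arcstats.misses)
    echo "scale=4; $hits / ($hits + $misses)" | bc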

#17 Updated by Richard Kojedzinszky over 4 years ago

It seems to only happen if an L2 cache is involved.

#18 Updated by Richard Kojedzinszky over 4 years ago

Meanwhile, to be able to disable compressed ARC,

https://github.com/freenas/os/pull/21

should be applied.
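
Assuming that PR exposes the usual loader tunable for this, turning compressed ARC off would then look roughly like the following (tunable name assumed, takes effect on the next boot):

    # /boot/loader.conf -- assumed tunable made available by the PR above
    vfs.zfs.compressed_arc_enabled="0"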

#19 Updated by Richard Kojedzinszky over 4 years ago

  • Category changed from 55 to 200

Unfortunately, disabling compressed ARC seems to have no effect on the accounting when L2 cache devices are used.

#20 Updated by Marcelo Araujo over 4 years ago

  • Status changed from Screened to Closed: Third party to resolve

FreeBSD is incorrectly reporting some stats for the compressed ARC. This will need to be fixed upstream.

#21 Updated by Richard Kojedzinszky over 4 years ago

Unfortunately, I must say this is more than a reporting issue, as it confuses the ARC sizing adaptive algorithm as well, resulting in poor performance. A short-term workaround is to remove the L2 cache devices and reboot the affected box.

#22 Updated by Richard Kojedzinszky over 4 years ago

We have reverted the '6950 ARC should cache compressed data' commit on top of freebsd10, and since then the cache usage is as it was before: L2 caches don't overflow, and in-memory ARC accounting is sane as well.

#23 Updated by Kris Moore over 4 years ago

  • Status changed from Closed: Third party to resolve to Unscreened
  • Assignee changed from Marcelo Araujo to Alexander Motin
  • Seen in changed from Unspecified to 9.10.2-U1

Alexander, we have several of these tickets now about weird ARC usage / reporting. It looks like you brought in the patch. Something to investigate here?

#24 Updated by Kris Moore over 4 years ago

  • Related to Bug #21098: l2arc growing and causing ctl_datamove errors added

#25 Updated by Richard Kojedzinszky over 4 years ago

The bug is easy to reproduce:

1. you need a pool with an L2 cache
2. you need data which does not fit in memory
3. read that data endlessly

I actually read it in a random pattern, and the first noticeable symptom was that the overall ARC size shown in 'top' stayed nearly constant while the MFU and MRU sizes began to shrink. With the mentioned commit reverted, they sum up to the total ARC size and that does not change over time.
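
A minimal sketch of those three steps as shell commands (device names, the pool name, and the file size are only placeholders):

    # 1. a pool with an L2ARC cache device (da1/da2 are placeholders)
    zpool create tank da1 cache da2
    # 2. data that does not fit in memory (adjust the size to exceed RAM)
    dd if=/dev/urandom of=/tank/bigfile bs=1m count=32768
    # 3. read it endlessly; watch "zpool iostat -v tank 1" in another shell
    while true; do dd if=/tank/bigfile of=/dev/null bs=1m; done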

#26 Updated by Chris Torek over 4 years ago

#27 Updated by Richard Kojedzinszky over 4 years ago

I've applied that patch, and since then everything seems normal in the test lab: the L2 cache has a sane size, the in-memory ARC statistics sum up correctly, and there has been no accounting mismatch so far. Can someone else confirm?

#28 Updated by Richard Kojedzinszky over 4 years ago

Are there any updates regarding the issue? Is it safe to use Andriy's patch?

#29 Updated by Alexander Motin over 4 years ago

  • Status changed from Unscreened to Fix In Progress

The patch is now under review in OpenZFS: https://github.com/openzfs/openzfs/pull/300
I'll merge it as soon as it is accepted.

#30 Updated by Richard Kojedzinszky over 4 years ago

I applied the proposed patch, and in our test lab the L1 cache seems to be working fine, but the L2 device statistics again show 16.0E of free space. I don't know whether that is related.

#31 Updated by Richard Kojedzinszky over 4 years ago

The proposed patch does not fix everything; the L2 accounting mismatch can still be reproduced very easily:

I did it in a VM with 2 GB of RAM and a pool with a 512 MB L2 cache, and then simply created a 2 GB file in the pool. As it was a VM, the L2 cache was relatively fast, and within 10-20 seconds it reached the 16.0E size.
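
For reference, that whole setup fits in a few commands using md-backed devices (file paths, md unit numbers, and the pool name are placeholders; the sizes are taken from the description above):

    # file-backed "disks" for a throwaway test pool
    truncate -s 4g /root/disk.img && mdconfig -a -t vnode -f /root/disk.img -u 0
    truncate -s 512m /root/cache.img && mdconfig -a -t vnode -f /root/cache.img -u 1
    zpool create testpool md0 cache md1
    # a 2G file is enough to overflow the 512M cache device
    dd if=/dev/urandom of=/testpool/file2g bs=1m count=2048
    # watch the cache device's FREE column climb toward 16.0E
    zpool iostat -v testpool 1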

With illumos commit 6950 reverted, this symptom disappears and everything seems to work fine.

#32 Updated by Kris Moore over 4 years ago

  • Has duplicate Bug #21319: Strange statistics added

#33 Updated by Josh Wisman over 4 years ago

  • Hardware Configuration updated (diff)

Is there any proposed workaround besides removing the L2ARC or rebooting when it fills? My system is pretty severely impacted without the L2ARC.

Is there any data or testing I can provide to add value or help with the resolution? I am not a developer, but pretty adept otherwise.

#34 Updated by Josh Wisman over 4 years ago

  • Hardware Configuration updated (diff)

#35 Updated by Richard Kojedzinszky over 4 years ago

Is there any work in progress regarding this issue?

#36 Updated by Remon V. over 4 years ago

Richard Kojedzinszky wrote:

Is there any work in progress regarding this issue?

I'm also waiting for this update. It's been a month, this is a critical bug in FreeNAS, and there is still no word about any progress. Too bad my knowledge falls short for digging into this myself, so I hope someone will find a solution.

#37 Updated by Kris Moore over 4 years ago

Mav: has this gone into head/stable yet?

#38 Updated by Kris Moore over 4 years ago

  • Status changed from Fix In Progress to Resolved

#39 Updated by Richard Kojedzinszky over 4 years ago

I wrote in #31 that this simple patch does not fix everything. I will test it later today. Will it be enough if I simply apply it on the freebsd10 branch?

#40 Updated by Alexander Motin over 4 years ago

Richard, it would be better if you could test recent nightly builds, since they are based on recent FreeBSD 11, which is the target for the next release.

#41 Updated by Richard Kojedzinszky over 4 years ago

It seems to be working fine.

Also, on freebsd10, applying
MFC r314274
MFC r314913
together fixes the problem; at least an hour of testing does not reproduce the issue.
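
For anyone wanting to try the same, a sketch of how those two head revisions could be pulled into a stable/10 source checkout (paths are assumptions; this is not an official build step):

    # inside a stable/10 working copy of the FreeBSD source tree
    cd /usr/src
    svn merge -c 314274,314913 ^/head .
    # then rebuild and install the kernel as usual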

Regards,

#42 Updated by Alexander Motin over 4 years ago

Thank you for the feedback.

#43 Updated by Kris Moore over 4 years ago

  • Target version changed from 9.10.3 to 11.0

#44 Updated by Jan Brońka over 4 years ago

Are you sure the target version is 11?
Well, I think everyone running 9.10.2-U1 or lower (especially in production) would rather not update to U2, U3 and higher.
So, IMHO, 9.10.2 will become a dead branch... there is no newer version (above 9.10.2) stable enough to put into production, and there likely won't be one soon.

#45 Updated by Alexander Motin over 4 years ago

FreeNAS 11.0 is just a new name for 9.10.3, reflecting the switch to the FreeBSD 11 OS. While FreeBSD 10, used by FreeNAS 9.10.x, is technically still supported, it receives less and less attention from developers since no new releases are planned for it, so there is no significant reason to stick with it.

#47 Updated by Vaibhav Chauhan about 4 years ago

  • Target version changed from 11.0 to 11.0-RC

#48 Updated by Lutz Reinegger about 4 years ago

  • % Done changed from 0 to 100

I would like to confirm: This issue has been resolved as of FreeNAS 11.0-RC.
After I upgraded my box to 11.0-RC I am no longer able to reproduce this error.

A big "THANK YOU!!!" to all the people involved in fixing this issue.

:-) :-) :-)

#49 Updated by Dru Lavigne over 3 years ago

  • File deleted (debug-freenas-20170126232403.tar)

#50 Updated by Dru Lavigne over 3 years ago

  • File deleted (debug-freenas-20170127211307.tar)

#51 Updated by Dru Lavigne over 3 years ago

  • File deleted (arc_summary.txt)
