Feature #23785

FreeNAS sending SNMP traps/notifications

Added by Ricardo Larranaga almost 3 years ago. Updated about 1 year ago.

Status:
Closed
Priority:
No priority
Assignee:
Vladimir Vinogradenko
Category:
Middleware
Target version:
Estimated time:
(Total: 0.00 h)
Severity:
Low
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

Description

It would be great if FreeNAS could send SNMP traps/notifications the same way it sends emails. I would be willing to help start work on this.


Subtasks

Feature #26679: Add System -> Alerts for configuring alert frequency (Done, Vladimir Vinogradenko)
Feature #26693: Choice of alert systems not ideal for many users - add GSM and other options. (Closed, Vladimir Vinogradenko)

History

#1 Updated by William Grzybowski almost 3 years ago

  • Status changed from Unscreened to Screened
  • Target version set to 11.2-BETA1

Interesting.

Do you have any idea how to make that work?
What do you need to know to start working on this? ;)

#2 Updated by Kris Moore almost 3 years ago

+1. Let me know if we can assist in any way.

#3 Updated by Ricardo Larranaga almost 3 years ago

Yeah, sure.

I downloaded a copy of FreeNAS to spin up in a VM. I see that alerts right now are handled with emails. Would you guys be able to point me to where that is done in the code?

Thanks.

#4 Updated by William Grzybowski almost 3 years ago

Ricardo Larranaga wrote:

Yeah, sure.

I downloaded a copy of FreeNAS to spin up in a VM. I see that alerts right now are handled with emails. Would you guys be able to point me to where that is done in the code?

Thanks.

Right now it is at https://github.com/freenas/freenas/blob/master/gui/system/alert.py#L337

We have been meaning to refactor that and put it in the middleware, but this is where it currently lives.

#5 Updated by Ricardo Larranaga almost 3 years ago

Sounds good, I'll take a look at it.
Regards

#6 Updated by Ricardo Larranaga almost 3 years ago

Hi guys: I have been looking into this, and I wanted to share the overall idea with you and ask a couple of questions if possible.

The first part of making this work would simply be adding a couple of new entries to the SQLite database under the SNMP section. The entries would be:
traps_enabled
trap_destinations (All the hosts that would receive the traps)
trap_type (v1 or 2)

These would come with the corresponding entries in the SNMP configuration GUI. I believe the best way to send traps would be to import pysnmp into the project; that way the middleware itself can send a trap whenever it's needed.

Example of sending a trap with pysnmp:

http://pysnmp.sourceforge.net/examples/current/v3arch/oneliner/agent/ntforg/trap-v2c-with-mib-lookup.html
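
Along those lines, a minimal sketch of what sending a trap could look like with pysnmp's high-level API (the host, community string and var-bind are placeholders; a FREENAS-MIB notification would eventually be referenced instead of the generic coldStart):

from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    NotificationType, ObjectIdentity, OctetString, sendNotification,
)

def send_test_trap(host='192.0.2.10', community='public'):
    # Build and send a single SNMPv2c trap to the configured destination.
    error_indication, error_status, error_index, var_binds = next(
        sendNotification(
            SnmpEngine(),
            CommunityData(community, mpModel=1),   # SNMPv2c
            UdpTransportTarget((host, 162)),       # default trap port
            ContextData(),
            'trap',
            NotificationType(
                ObjectIdentity('SNMPv2-MIB', 'coldStart')
            ).addVarBinds(
                ('1.3.6.1.2.1.1.1.0', OctetString('FreeNAS test alert')),
            ),
        )
    )
    if error_indication:
        print(error_indication)

if __name__ == '__main__':
    send_test_trap()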

Now, we would define the traps in FREENAS-MIB. (Any reason the enterprise name used for FreeNAS in SNMP is not associated with either iXsystems or FreeNAS? Just curious.)

To define the traps, we would have to decide what to alarm on. For traps coming from the FreeNAS alerting system, it would be a matter of browsing the code and choosing which alerts we also want to send as trap notifications (things like failover come to mind). I am already looking, and I'll ask for input a little bit later.

There is a second place I would love to be able to send traps from, and that is ZFS events. FreeBSD ships with a configuration that reacts to ZFS events through devd. ZFS defines events like "device removed" and "checksum error" that would be perfect for this. On an event, you either call the middleware to send the trap (probably for the best, since that way you have a centralized entity that manages your notifications) or call a script to send a trap.
While I would initially piggyback on all events from ZFS, on second thought, things like checksum errors and read/write errors would be better alerted on by parsing the zpool status output periodically, in order not to flood an SNMP manager with traps when there are bursts of errors. But devices being removed and things like that I would prefer to get straight from the events.
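
For the periodic part, a rough sketch of the idea (the polling interval and the send_trap hook are placeholders; the middleware would presumably own both):

import subprocess
import time

def check_pools(send_trap, interval=300):
    # Poll `zpool status -x` and emit one trap per unhealthy report instead
    # of one trap per checksum/read/write error event.
    while True:
        out = subprocess.run(
            ['zpool', 'status', '-x'],
            capture_output=True, text=True,
        ).stdout.strip()
        # `zpool status -x` prints exactly this line when nothing is wrong.
        if out and out != 'all pools are healthy':
            send_trap('zpool-unhealthy', out)
        time.sleep(interval)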

The problem with this last approach is that I have been testing a little bit and I don't get logs for all the events. I am still working on it, but I meant to ask: do you guys know anyone who would be knowledgeable about that part of ZFS? That would save time. I have tried the ZFS and FreeBSD IRC channels, and I'll try the lists a little bit later.
I am also aware that FreeBSD 11 comes with a new daemon (zfsd) that is supposed to react to ZFS events. Do you know anything about this daemon and why it was developed, as it looks like it overlaps with devd functionality?

Let me know what you think.
Cheers

#7 Updated by Ricardo Larranaga almost 3 years ago

Also, from the latest description of ZFS events (not conclusive; that part of ZFS is not very well documented), these are the ones I find interesting (initially) for the system to send traps for (comments/recommendations welcome):

config.sync - Issued every time a vdev change has been made to the pool.
zpool - Issued when a pool cannot be imported.
zpool.destroy - Issued when a pool is destroyed.
vdev.unknown - Issued when the vdev is unknown, such as when trying to clear device errors on a vdev that has failed/been kicked from the system/pool and is no longer available.
vdev.corrupt_data - Issued when corrupt data has been detected on a vdev.
vdev.no_replicas - Issued when there are no more replicas to sustain the pool. This would lead to the pool being DEGRADED.
vdev.bad_guid_sum - Issued when a missing device in the pool has been detected.
vdev.remove - Issued when a vdev is detached from a mirror (or a spare is detached from a vdev where it has been used to replace a failed drive - only works if the original drive has been re-added).
vdev.clear - Issued when clearing device errors in a pool, such as running zpool clear on a device in the pool.
vdev.spare - Issued when a spare has kicked in to replace a failed device.
log_replay - Issued when the intent log cannot be replayed. This can occur in the case of a missing or damaged log device.
resilver.start - Issued when a resilver is started.
resilver.finish - Issued when the running resilver has finished.
scrub.start - Issued when a scrub is started on a pool.
scrub.finish - Issued when a pool has finished scrubbing.

Depending on the number of events generated for these, we could also include the following (a rough split between the two groups is sketched after this list):

io - Issued when there is an I/O error in a vdev in the pool.
data - Issued when there have been data errors in the pool.
delay - Issued when an I/O was slow to complete as defined by the zio_delay_max module option.
io_failure - Issued when there is an I/O failure in a vdev in the pool.
checksum - Issued when a checksum error has been detected.
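
Purely for illustration, the split between "forward the event as a trap right away" and "fold into the periodic zpool status check" could be expressed as a simple policy; the grouping below is just a guess based on the lists above:

# Hypothetical grouping of the event classes listed above.
IMMEDIATE_EVENTS = {
    'config.sync', 'zpool', 'zpool.destroy', 'vdev.unknown',
    'vdev.corrupt_data', 'vdev.no_replicas', 'vdev.bad_guid_sum',
    'vdev.remove', 'vdev.clear', 'vdev.spare', 'log_replay',
    'resilver.start', 'resilver.finish', 'scrub.start', 'scrub.finish',
}
POLLED_EVENTS = {'io', 'data', 'delay', 'io_failure', 'checksum'}

def handle_zfs_event(event_class, send_trap, queue_for_poll):
    # Forward rare structural events immediately; batch the noisy error
    # counters into the periodic zpool status check.
    if event_class in IMMEDIATE_EVENTS:
        send_trap(event_class)
    elif event_class in POLLED_EVENTS:
        queue_for_poll(event_class)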

#8 Updated by William Grzybowski almost 3 years ago

Hi,

This all looks good.

Now, we would define the traps in FREENAS-MIB. (Any reason the enterprise name used for FreeNAS in SNMP is not associated with either iXsystems or FreeNAS? Just curious.)

No reason I can remember, it was probably just overlooked at the time it was created.
I am also aware that FreeBSD 11 comes with a new daemon (zfsd) that is supposed to react to ZFS events. Do you know anything about this daemon and why it was developed, as it looks like it overlaps with devd functionality?

zfsd does not overlap with devd; it's mostly a userland process because of its complexity, and it will act in a number of situations, like kicking in hot spares, detaching disks, resilver completion, etc.
The problem with this last approach is that I have been testing a little bit and I don't get logs for all the events.

Which events, and how are you checking for them? It could be that it's just not configured in devd.

---
As far as implementation details go, I think we could use a middlewared plugin listening to these events via a Unix socket.
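
A bare-bones sketch of such a consumer, assuming devd's default stream socket path and a plain line-oriented read (in middlewared this would be wrapped in a proper plugin rather than a blocking loop):

import socket

DEVD_SOCKET = '/var/run/devd.pipe'  # devd's default stream socket on FreeBSD

def watch_devd(handle_zfs_event):
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(DEVD_SOCKET)
    buf = b''
    while True:
        buf += sock.recv(8192)
        # devd delivers one newline-terminated line per event.
        while b'\n' in buf:
            line, buf = buf.split(b'\n', 1)
            event = line.decode(errors='replace')
            if 'system=ZFS' in event:
                handle_zfs_event(event)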

#9 Updated by Ricardo Larranaga almost 3 years ago

"Which events and how are you checking for them? It could be its just not configured in devd."

I set out to test the events that were defined. The first one I tried was the checksum event, and the way I did it was using the corruption test from the FreeBSD handbook.
So I spun up a FreeBSD 11 machine, created a ZFS mirror pool, exported the pool, dd'd random data over one of the pool's devices, then re-imported the pool. Since zpool status shows checksum errors when you re-import the pool, I was expecting to see events of that sort logged by devd too, but I got no checksum events. I found a more generic way to match ZFS events in devd:

notify 10 {
    match "system" "ZFS";
    action "logger -p kern.err 'ZFS notice: type=$type version=$version class=$class pool_guid=$pool_guid vdev_guid=$vdev_guid'";
    action "echo 'ZFS notice: type=$type version=$version class=$class pool_guid=$pool_guid vdev_guid=$vdev_guid' | mail -s 'ZFS Event' zfs";
};

So I am going to test that instead of the zfs.conf configuration. Maybe some event names changed and that is why it did not match.

zfsd does not overlap with devd; it's mostly a userland process because of its complexity, and it will act in a number of situations, like kicking in hot spares, detaching disks, resilver completion, etc.

Yeah, I didn't explain myself well enough. I didn't mean they overlapped, but it looks to me that for this specific application they do have overlapping behaviours, as they both react to ZFS events. I saw in the zfsd configuration that most actions are configured for the things you mentioned: kick off a resilver, etc. But that could be easily accomplished with devd too, right?

Regards

#10 Updated by William Grzybowski almost 3 years ago

You're right about both reacting to ZFS events, but to me it's simply a consumer specialized for a few cases. I am not sure how it is implemented, but I think it just uses the devdctl interface.

#11 Updated by William Grzybowski over 2 years ago

  • Status changed from Screened to Unscreened
  • Assignee changed from William Grzybowski to Vladimir Vinogradenko

Vladimir, is this something you can take a look at?

#12 Updated by Vladimir Vinogradenko over 2 years ago

  • Status changed from Unscreened to 15

I've thought of:
  • Decoupling issuing alerts from sending alerts
  • Sending alerts through a bunch of AlertService implementations (currently: e-mail and SNMP traps; I initially proposed having Twilio/Slack/etc. in the future, but now I see that we already have Consul handling that)
  • Having some configurable «matrix» like
    +-------+------+--------+
    |   _   | SNMP | E-Mail |
    +-------+------+--------+
    | io    | all  |  all   |
    | nfs   | all  |  warn  |
    | smb   | all  |  warn  |
    | zpool | all  |  crit  |
    +-------+------+--------+
    

This can also be combined with moving alertd to middleware.
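
One possible shape for that decoupling, purely as an illustration of the matrix above (class and method names are made up, not the actual middlewared API):

# Illustrative sketch: decouple alert creation from delivery, with
# per-service level thresholds corresponding to the matrix above.
LEVELS = {'info': 0, 'warn': 1, 'crit': 2}

class AlertService:
    # source -> minimum level this service cares about;
    # a missing entry means "all" (everything is delivered).
    thresholds = {}

    def accepts(self, alert):
        minimum = self.thresholds.get(alert['source'], 'info')
        return LEVELS[alert['level']] >= LEVELS[minimum]

    def send(self, alert):
        raise NotImplementedError

class MailAlertService(AlertService):
    thresholds = {'io': 'info', 'nfs': 'warn', 'smb': 'warn', 'zpool': 'crit'}

    def send(self, alert):
        ...  # hand off to the existing e-mail code

class SnmpTrapAlertService(AlertService):
    thresholds = {}  # "all" in the matrix: no filtering

    def send(self, alert):
        ...  # issue a trap, e.g. via pysnmp as sketched earlier

def dispatch(alert, services):
    for service in services:
        if service.accepts(alert):
            service.send(alert)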

#13 Updated by Vladimir Vinogradenko over 2 years ago

  • Assignee changed from Vladimir Vinogradenko to William Grzybowski

#14 Updated by William Grzybowski over 2 years ago

  • Status changed from 15 to Screened
  • Assignee changed from William Grzybowski to Vladimir Vinogradenko

Yes, I like that! It's past time we did a complete refactoring of the alert system; there is a lot of room for improvement. We already have (System -> Alert Service); it needs to be made more modular and integrated into the alert system.

#15 Updated by Dru Lavigne about 2 years ago

  • Status changed from Screened to Not Started
  • Target version changed from 11.2-BETA1 to 11.2-RC2

#16 Updated by Vladimir Vinogradenko about 2 years ago

  • Status changed from Not Started to Broken
  • Reason for Blocked set to Waiting for feedback

This is partially done. SNMP traps are now sent when a new alert is created or goes away.

William, what do you think about sending devd events (I suggest all of them) as SNMP traps when the appropriate checkbox in the SNMP config is enabled? What other events should be sent?

#17 Updated by Vladimir Vinogradenko about 2 years ago

  • Status changed from Broken to Blocked

#18 Updated by William Grzybowski about 2 years ago

  • Status changed from Blocked to Not Started
  • Target version changed from 11.2-RC2 to 11.3
  • Reason for Blocked deleted (Waiting for feedback)

The ZFS events part is going to have to wait at least until 11.3; we have more important bugs to take care of for now, unfortunately.

#19 Updated by Ben Gadd almost 2 years ago

  • Target version changed from 11.3 to Backlog

#20 Updated by Vladimir Vinogradenko almost 2 years ago

  • Severity set to Low

#22 Updated by Kris Moore about 1 year ago

  • Status changed from Not Started to Closed
