The Event Management module relies on its built-in correlation engine. This article explains the configurations to apply to get the most out of its features.
The correlation engine transforms received email messages into events, associates each event with the right CI in cause, defines its categorization, automatically creates an incident if the conditions are met, and tries to determine the probable cause of the failure from the list of affected CIs.
Prerequisites
- At least one system capable of delivering mail notifications via SMTP
- At least one mailbox dedicated to receiving notifications from the surveillance system, accessible via POP3
- At least one system account capable of executing the email import task via MailIntegration
- The Configurations module set up to match the surveillance system
- At least one assignee or group assigned to event processing
Items to configure
Service range of a CI
The service range defines the periods during which the CI must be functional. The correlation engine creates an incident if an outage is reported during the service range. If the outage is reported outside the service range, the engine keeps monitoring it until the CI enters its service range (which triggers the incident creation) or until the surveillance system indicates that the CI is back to normal.
Service ranges are defined in the Tools > Reference Data Management… > CI > Time Range menu.
- They require at least a French and an English description
- You can indicate days and hours to define the range
- The "Include holidays" option determines whether the holidays defined in Tools > Options are considered or ignored
- If needed, a daily service range can extend into the next day. For example, Monday from 6:00 AM to Tuesday 1:00 AM
- Once the service ranges are configured, each monitored CI needs to be given a service range.
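As an illustration of how a service range that runs past midnight can be evaluated, here is a minimal Python sketch. It is not Octopus code: the `DailyRange` type, the `in_service` function and the interpretation of the holiday option (when `include_holidays` is false, a range that starts on a holiday is treated as not in effect) are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta


@dataclass(frozen=True)
class DailyRange:
    """Hypothetical model of one daily service range."""
    weekday: int  # 0 = Monday ... 6 = Sunday (datetime.weekday() convention)
    start: time
    end: time     # when end <= start, the range runs into the next day


def in_service(ranges, moment, holidays=frozenset(), include_holidays=True):
    """Return True if `moment` falls inside one of the daily ranges."""
    for r in ranges:
        # A range that started yesterday may still be running today.
        for days_back in (0, 1):
            day = moment.date() - timedelta(days=days_back)
            if day.weekday() != r.weekday:
                continue
            # Assumption: holidays are skipped only when the option is off.
            if not include_holidays and day in holidays:
                continue
            begin = datetime.combine(day, r.start)
            finish = datetime.combine(day, r.end)
            if r.end <= r.start:
                finish += timedelta(days=1)  # overlap into the next day
            if begin <= moment < finish:
                return True
    return False


# Monday 6:00 AM to Tuesday 1:00 AM, as in the example above.
monday_range = [DailyRange(weekday=0, start=time(6, 0), end=time(1, 0))]
```

With this model, a timestamp on Tuesday at 0:30 AM is inside the Monday range, while Tuesday at 2:00 AM is outside it.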
Identification of a person in charge of the CI maintenance
Identifies the group or the assignee responsible for the maintenance of the CI type.
The person in charge of the maintenance is configured through Tools > Reference Data Management… > CI > your CI type > Maintenance tab. The correlation engine assigns generated events and incidents to the group responsible for the CI. Each CI inherits the configuration of its type, but the person in charge can also be defined directly in the CI's own file.
Relationships between different CI types
When multiple events are generated, the correlation engine tries to determine the root cause by looking at the existing relationships between CIs.
Before establishing a relationship between two CIs, you must:
- Define the relationship types via the menu Tools > Reference Data Management > CI > Relationship types
- Configure the accepted relationships between two given CI types via the menu Tools > Reference Data Management… > CI > your CI type > Relationships tab
For the correlation engine to follow a relationship, the relationship must be identified as "Required to operate". To reverse the relationship, click on the red and green arrows icon.
Preparing CI for Event Management
The equipment and CIs configured in your surveillance system must have the same names in Octopus to be recognized. The correlation engine must be able to follow a relationship established in Octopus to determine the source of a failure. Relationships must be established in the same direction as the one your surveillance system can observe.
Three types of events exist:
- Informational events contain factual data that does not require any immediate or long-term action. Events of this type are automatically marked as Processed.
- Warning events contain status warnings that are not exceptions but could require some evaluation or monitoring by an assignee without becoming an incident. These events must be processed manually.
- Exception events require immediate processing (in the form of an incident) by the person or group responsible for the CI maintenance. When the incident is created, the system automatically links the event to it, along with any other event that could apply. The correlation engine then watches for the recovery from the outage.
Categorization groups events of the same type to make it easier to sort the work to be done or the data to analyze in a report. Once the categories are created through the menu Tools > Reference Data Management... > Event > Categories, they become available in a drop-down list when creating or modifying events.
The email address used to receive notifications from the surveillance system will be indicated in the Source field.
Rules of Event Processing
- The Event Management module processes events based on configurable rules
- A rule groups general information, the criteria to meet, the automatic detection of the CI in cause and, for an exception, the recovery detection
- Recovery detection applies only to exceptions
- Rules are evaluated in order of rank (rank 1 first)
- The first rule whose criteria are met is used
- Processing rules are created via the menu Tools > Reference Data Management… > Events > Rules
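The rank-based evaluation can be sketched as follows. This is an illustrative Python model, not the Octopus implementation: the `Rule` type, its fields and the `select_rule` function are hypothetical names, and an email is represented as a plain dictionary. Overlapping rules at the same rank are treated as ambiguous here, a case the article describes as producing an event without categorization.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """Hypothetical model of a processing rule."""
    name: str
    rank: int                        # 1 is evaluated first
    event_type: str                  # e.g. "Exception"
    matches: Callable[[dict], bool]  # criteria applied to an email


def select_rule(rules, email) -> Optional[Rule]:
    """Return the matching rule with the lowest rank.

    Returns None when no rule matches, or when several rules of the
    same (lowest) rank match, since that case is ambiguous.
    """
    matching = [r for r in rules if r.matches(email)]
    if not matching:
        return None
    best_rank = min(r.rank for r in matching)
    winners = [r for r in matching if r.rank == best_rank]
    return winners[0] if len(winners) == 1 else None


host_down = Rule("Host down", 1, "Exception", lambda e: "DOWN" in e["subject"])
catch_all = Rule("Catch-all", 2, "Information", lambda e: True)

chosen = select_rule([catch_all, host_down], {"subject": "Host SRV1 DOWN"})
print(chosen.name)
```

The order in which the rules are listed does not matter in this sketch; only the rank decides which matching rule wins.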
What you need to know:
- From the moment an event is created and associated to a CI, this CI is automatically flagged as unresponsive.
- If the same processing rule matches a new email for this CI, no new event or incident is created, and no activity is added to the event and the incident that were previously created.
- When the CI is once again functional, the event management for this CI will resume its normal course.
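The behaviour described in these three points amounts to remembering which CI is currently unresponsive for which rule. The sketch below models that in Python; the `OutageTracker` class and its method names are hypothetical and only illustrate the logic.

```python
class OutageTracker:
    """Tracks which (rule, CI) pairs are currently unresponsive."""

    def __init__(self):
        self._waiting = set()

    def on_outage(self, rule_name, ci_name):
        """Return True if a new event should be created, False if ignored."""
        key = (rule_name, ci_name)
        if key in self._waiting:
            return False  # same rule, same CI: nothing new is created
        self._waiting.add(key)
        return True

    def on_recovery(self, rule_name, ci_name):
        """The CI is functional again: normal processing resumes."""
        self._waiting.discard((rule_name, ci_name))

    def is_unresponsive(self, ci_name):
        return any(ci == ci_name for _, ci in self._waiting)


tracker = OutageTracker()
first = tracker.on_outage("Ping down", "S1")   # creates an event
second = tracker.on_outage("Ping down", "S1")  # ignored while unresponsive
tracker.on_recovery("Ping down", "S1")
third = tracker.on_outage("Ping down", "S1")   # creates an event again
```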
Processing rule parameters
A processing rule requires at least a name, a type and a criterion.
If multiple rules of the same rank have overlapping criteria, an event without categorization will be created.
- The Email section allows defining criteria based on:
  - any line contained in the message header
- The CI Detection section is used to try to associate the right CI with the event or incident:
  - Select a CI type and at least one CI field to activate it.
  - Select the field in which the name search must be done (sender, subject, content or header).
  - The system scans the content of the specified field to match the selected CI attributes.
  - Automatic detection only works if the system can uniquely identify a candidate CI. Ambiguous information in the field makes detection inactive.
- Contains or Automatic searches
  - The default search pattern automatically scans the field and tries to uniquely identify the specified information.
  - When using the default search, there is no need to specify a wildcard character.
  - If more than one match is found, making the field ambiguous, an event with neither category nor CI is created.
- Regular expression searches (regex)
  - Regular expressions allow you to specify how the system should find the information contained in the field. They are normally used to remove ambiguity in a rule.
  - A typical use of a regular expression is to find a specific line within an email with a known format.
  - Regular expressions follow the .NET specification.
  - The regular expression must be built so that it returns only the information relevant to the lookup.
  - Use RegexStorm.net, an online resource, to view an example of a regular expression search.
- The Test data section allows opening an existing email message to help with the configuration of a rule. A green checkmark appears if the criteria match the opened email message.
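To make the difference between the default search and a regular expression search concrete, here is a small Python sketch. It uses Python's `re` module only for illustration, since Octopus uses .NET regular expressions; this simple pattern is written so that it is also valid .NET syntax. The email format, the CI names and both function names are invented for the example, and returning the first capture group is an assumption about how the pertinent information is isolated.

```python
import re

# Matches a line such as "Host: SRV-DB01" and captures only the name.
# (?m) makes ^ and $ match at line boundaries.
HOST_LINE = re.compile(r"(?m)^Host:\s*(\S+)\s*$")


def extract_candidate(text):
    """Regex search: return the captured name, or None if no line matches."""
    match = HOST_LINE.search(text)
    return match.group(1) if match else None


def detect_ci(text, known_ci_names):
    """Default-style search: succeed only if exactly one known name occurs."""
    lowered = text.lower()
    hits = [name for name in known_ci_names if name.lower() in lowered]
    return hits[0] if len(hits) == 1 else None


body = "Alert from monitor\nHost: SRV-DB01\nState: DOWN\nParent: SW-CORE01\n"
cis = ["SRV-DB01", "SW-CORE01"]

print(detect_ci(body, cis))     # None: two CI names appear, so it is ambiguous
print(extract_candidate(body))  # SRV-DB01: the regex targets the Host line
```

In this example the default search fails because the message mentions two CIs, whereas the regular expression isolates the one line that identifies the CI in cause.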
For the same outage rule, Octopus can follow both the outage events and their recovery. The Recovery detection tab contains the same features as the outage detection tab.
- CI detection parameters are automatically inherited from the outage detection
- Waiting for recovery indicates that an exception event has recovery detection configured and that the recovery has not yet been reported by the surveillance system
- When new events are received for an exception that is waiting for recovery, the engine simply ignores them
Automated incident creation
When an email is received and matches the criteria of an Exception type of rule, the system creates the incident with the following information:
- Incident source is indicated as Event
- CI in cause of the incident is the CI recognized by the event
- Incident is assigned to the person responsible for the maintenance of the recognized CI
- Site is the default Octopus site
- Requester and User are the default user configured in Octopus
- Incident subject is the email subject
- Detailed description is the email content
- Incident is created without any priority
- The system creates the incident only while the service range is in effect
- If the email is received outside of the service range, the system waits for the start of the service range to create the incident, unless a recovery is detected first
- If the recovery email is received before the incident is modified or taken in charge, the incident is set as resolved. The incident must not have an assignee to be resolved automatically.
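The field mapping and the automatic resolution condition listed above can be summarized in a short Python sketch. The `Event` and `Incident` types, their field names and the status values are hypothetical; they only mirror the list and do not reflect the actual Octopus data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    ci_name: str            # CI recognized by the event
    maintenance_owner: str  # person or group responsible for the CI
    subject: str            # email subject
    body: str               # email content


@dataclass
class Incident:
    source: str
    ci_in_cause: str
    assigned_to: str
    site: str
    requester: str
    user: str
    subject: str
    description: str
    priority: Optional[str] = None  # created without any priority
    assignee: Optional[str] = None  # no individual has taken it in charge
    modified: bool = False
    status: str = "Open"            # illustrative status values


def incident_from_event(event, default_site, default_user):
    """Build the incident as described in the list above."""
    return Incident(
        source="Event",
        ci_in_cause=event.ci_name,
        assigned_to=event.maintenance_owner,
        site=default_site,
        requester=default_user,
        user=default_user,
        subject=event.subject,
        description=event.body,
    )


def apply_recovery(incident):
    """Resolve the incident only if nobody has touched it yet."""
    if incident.assignee is None and not incident.modified:
        incident.status = "Resolved"
        return True
    return False
```

An incident that has been modified or taken in charge is left untouched by `apply_recovery` in this sketch, since it then needs to be resolved manually.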
Example of how the correlation engine works
The surveillance system detects that two servers (S1, S2) and a switch (C1) have failed. It sends the appropriate outage alerts.
- Octopus receives the emails, which are then converted into events, each containing its own CI in cause
- The correlation engine determines that the three pieces of equipment are failing within their respective service ranges; they are therefore eligible for incident creation
- The correlation engine analyzes the events and determines that the probable cause of the outage is the switch failure, since the switch has required relationships with both servers
- Only one incident is created; it is assigned to the group responsible for the switch maintenance, and the three events are linked to it
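The reasoning in this example can be sketched as a small graph lookup. The Python code below is a simplified illustration, not the engine's actual algorithm: it assumes the "Required to operate" relationships are available as a dictionary mapping each CI to the CIs it needs, and it considers a failed CI a probable cause when none of the CIs it requires has also failed.

```python
def probable_causes(failed, requires):
    """Return the failed CIs that do not depend on another failed CI.

    `requires` maps a CI name to the set of CIs it needs in order to
    operate (hypothetical representation of "Required to operate").
    """
    failed = set(failed)
    return {ci for ci in failed if not (requires.get(ci, set()) & failed)}


# Both servers require the switch to operate.
requires = {"S1": {"C1"}, "S2": {"C1"}}

print(probable_causes({"S1", "S2", "C1"}, requires))  # {'C1'}
```

If the switch had not been reported as failed, the same function would return both servers, which corresponds to events that cannot be linked to a common cause.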
Incident creation delay
To be able to correlate outages, the engine waits a certain time after receiving the first outage email before creating the incident. You must configure the creation delays in the options, taking into account the time required by your surveillance system.
Recovery for incidents with multiple events
If events are still marked as Waiting for recovery after the incident is resolved, a new incident is created.
Incident creation for single events
If multiple exception events are received during the delay and it is impossible to link them, a new incident is created for each single event. If you have configured a threshold for single events and it has been reached, only one incident is created. That incident is assigned to the assignee group common to all the CIs; if there is no common group, the system uses the default responsible group configured in the options.
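A minimal Python sketch of this decision follows. It rests on several assumptions that the article does not spell out: the threshold is considered reached when the number of unlinked events is greater than or equal to it, each CI has exactly one responsible group, and the function and parameter names are invented for the example.

```python
def plan_incidents(unlinked_events, threshold, default_group):
    """Decide which incidents to create for events that cannot be linked.

    `unlinked_events` is a list of (ci_name, responsible_group) pairs.
    Returns a list of (ci_names, assigned_group) tuples, one per incident.
    """
    # Assumption: "reached" means at least `threshold` single events.
    if threshold is not None and len(unlinked_events) >= threshold:
        groups = {group for _, group in unlinked_events}
        # Use the group common to all CIs, otherwise fall back to the default.
        group = groups.pop() if len(groups) == 1 else default_group
        return [([ci for ci, _ in unlinked_events], group)]
    # Below the threshold: one incident per single event.
    return [([ci], group) for ci, group in unlinked_events]


events = [("S1", "Servers"), ("S2", "Servers"), ("P1", "Printers")]

print(plan_incidents(events, threshold=None, default_group="Service Desk"))
print(plan_incidents(events, threshold=3, default_group="Service Desk"))
```

The first call plans three separate incidents; the second plans a single incident assigned to the default group, because the three CIs do not share a common group.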