Alarm Configuration Mechanism Introduction
Overview
Currently, openUBMC supports sensor event alarms and system event alarms. Sensor event alarms follow the IPMI protocol specification and are further divided into threshold sensor alarms and discrete sensor alarms. These alarms can be parsed using the sel command in the IPMI protocol. In addition, to meet increasingly diversified and complex alarm requirements, openUBMC designs and implements a more granular alarm mechanism, that is, system event alarms. This mechanism does not depend on the IPMI protocol and supports event reporting through the Redfish interface, providing higher flexibility and comprehensiveness.
Note that running the sel clear command in IPMI clears both the sensor and system event alarms displayed on the WebUI.
For details about the sensor event alarm configuration mechanism, see the Sensor Adaptation Guide. This document describes the configuration mechanism for system event alarms developed by openUBMC.
Alarm Process
System event alarms can be classified into CSR-configured alarms and event RPC alarms based on whether the alarms depend on specific hardware. The former is triggered by hardware, and the latter is generated by software events. The following figure shows the internal logic process for monitoring, reporting, and processing the hardware and system status within openUBMC.
Alarm Configuration
Alarm configuration includes static configuration and CSR configuration.
Static Configuration
All system event alarms, whether triggered through CSR or RPC, must rely on preset static configuration information. Static configuration is performed in the vpd repository. This configuration defines the fixed attributes of alarms and cannot be modified in principle. The following two entries are included:
- Event definition: includes key attributes such as the event code, severity, and reporting channel.
- Event description: supports Chinese and English by default and includes event description templates, repair suggestion templates, event impacts, and event causes.
The following table lists the fields involved in static configuration and their meanings.
| Field | Description |
|---|---|
| EventKeyId | Event key ID, which links an event definition to its event description |
| EventName | Event name (Ensure that the name is unique in the vpd repository and does not conflict with other alarms. Otherwise, event subscription will be affected.) |
| EventType | Event type: 0 for system event, 1 for maintenance event, and 2 for running event |
| SeverityId | Severity: 0 for Normal, 1 for Minor, 2 for Major, and 3 for Critical |
| EventCode | Event code |
| OldEventCode | Old event code |
| ActionId | Event action: 0 for no action, 1 for powering off the host, 2 for restarting the host, and 3 for power cycling the host |
| LifeCycleId | Event lifecycle |
| ReportChannel | Event reporting channel mask. Bit 6 indicates whether to record the alarm (1 for yes and 0 for no). If the alarm is not recorded, the event is displayed only in the historical record and not on the current alarm page. This is used to support special scenarios. In normal cases, use 1. |
| Description | Event description (When the description is displayed externally, consecutive spaces and spaces before punctuation such as commas, semicolons, and periods are removed.) |
| Suggestion | Event suggestion |
| Influence | Event impact |
| Cause | Event cause |
| DeassertFlag | Whether a Deassert event is required (0 for no and 1 for yes). This flag controls whether a historical record is generated when the alarm is cleared. It does not determine whether the alarm can be cleared. |
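The numeric encodings in the table above can be decoded mechanically. The following is a minimal Python sketch, not openUBMC code: the function names are hypothetical, and it assumes that bits in ReportChannel are numbered from 0 starting at the least significant bit, which this document does not state.

```python
# Value tables copied from the static configuration field table.
SEVERITY = {0: "Normal", 1: "Minor", 2: "Major", 3: "Critical"}
EVENT_TYPE = {0: "system", 1: "maintenance", 2: "running"}
ACTION = {
    0: "No action",
    1: "Power off the host",
    2: "Restart the host",
    3: "Power cycle the host",
}

# Assumption: bit 0 is the least significant bit of the mask.
RECORD_ALARM_BIT = 6


def records_alarm(report_channel):
    """Return True if bit 6 of the ReportChannel mask is set."""
    return (report_channel >> RECORD_ALARM_BIT) & 1 == 1


def summarize(definition):
    """Translate the numeric fields of one EventDefinition entry into text."""
    return {
        "key": definition["EventKeyId"],
        "severity": SEVERITY[definition["SeverityId"]],
        "type": EVENT_TYPE[definition["EventType"]],
        "action": ACTION[definition["ActionId"]],
        "recorded_as_alarm": records_alarm(definition["ReportChannel"]),
        "needs_deassert_event": definition["DeassertFlag"] == 1,
    }
```

Under this bit-numbering assumption, a mask of 65535 has bit 6 set, so the event is recorded as an alarm.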
Example of Adding an Alarm Configuration
Add an alarm type, for example, memory training failure, in vpd/vendor/event_def.json.
Precautions
- `EventDefinition` and `EventDescription` must match each other by `EventKeyId`.
- Use `{BMC}` for the `openUBMC` string in the description. The system automatically replaces this string.
- After adding an alarm type, update the version number. For example, change `1.0.43` to `1.0.44`.
"Version": "1.0.44",
"EventDefinition": [
{
"EventCode": "0x01000019",
"ReportChannel": 65535,
"OldEventCode": "",
"EventType": 0,
"LifeCycleId": 0,
"DeassertFlag": 1,
"EventKeyId": "Memory.MemoryTrainFailure",
"SeverityId": 3,
"ActionId": 0,
"EventName": "MemoryTrainFailure"
}
],
"EventDescription": [
{
"Suggestion": {
"En": "1. Power off the server and check whether there is damage or poor contact between the component and its slot.@#AB;2. Replace the component and check for alarms.",
"Zh": "1、下电后检查该部件与其插槽是否存在损坏或接触不良现象。@#AB;2、更换该部件并进一步观察。"
},
"EventKeyId": "Memory.MemoryTrainFailure",
"Description": {
"En": "%1 %2 %3 triggered an uncorrectable error, %4.",
"Zh": "%1 %2 DIMM%3已触发内存训练失败,%4。"
},
"Influence": {
"En": "The system failed to start up.",
"Zh": "可能导致系统无法正常启动。"
},
"Cause": {
"En": "1. The memory module is faulty.@#AB;2. The slot of the memory module on the mainboard is faulty.",
"Zh": "1、内存故障。@#AB;2、主板内存条槽位故障。"
}
}
]

Update the alarm list for the model. For openUBMC, after the alarm type is added, update the vpd/vendor/Huawei/Server/Kunpeng/openUBMC/event/eventDefList.txt alarm list. This list specifies the alarms to be loaded from event_def.json. You only need to add EventKeyId. In this example, add Memory.MemoryTrainFailure.
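The precautions above (matching EventKeyId values, a unique EventName, and a model alarm list that only names defined events) lend themselves to an automated check. The following Python sketch is illustrative only: `check_event_def` is a hypothetical helper, not part of the vpd repository, and the caller is assumed to have already read the EventKeyId entries from eventDefList.txt, whose exact file format is not described here.

```python
import json

SAMPLE = """
{
  "Version": "1.0.44",
  "EventDefinition": [
    {"EventKeyId": "Memory.MemoryTrainFailure", "EventName": "MemoryTrainFailure"}
  ],
  "EventDescription": [
    {"EventKeyId": "Memory.MemoryTrainFailure"}
  ]
}
"""


def check_event_def(doc, enabled_keys):
    """Return a list of consistency problems (empty if none were found)."""
    definitions = doc.get("EventDefinition", [])
    def_keys = [d["EventKeyId"] for d in definitions]
    desc_keys = [d["EventKeyId"] for d in doc.get("EventDescription", [])]
    problems = []
    # Every definition needs a description with the same EventKeyId, and vice versa.
    for key in def_keys:
        if key not in desc_keys:
            problems.append(key + ": EventDefinition without EventDescription")
    for key in desc_keys:
        if key not in def_keys:
            problems.append(key + ": EventDescription without EventDefinition")
    # EventName must be unique across all definitions.
    names = [d["EventName"] for d in definitions]
    for name in sorted(set(names)):
        if names.count(name) > 1:
            problems.append(name + ": EventName is not unique")
    # The model alarm list may only name events that are defined.
    for key in enabled_keys:
        if key not in def_keys:
            problems.append(key + ": listed for the model but not defined")
    return problems


doc = json.loads(SAMPLE)
print(check_event_def(doc, ["Memory.MemoryTrainFailure"]))
```

Running a check of this kind before updating the version number helps catch mismatched keys early.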
CSR Configuration of Events
CSR configurations represent the dynamic information of an alarm and support flexible configuration and modification. For events or alarms with specific hardware forms, implement them directly through CSR configuration. In openUBMC, the CSR configuration for power events differs from that of common events. The following describes the configuration methods for both.
CSR Configuration of Common Events
The following table describes the fields and meanings for common events.
| Field | Description |
|---|---|
| EventKeyId | Event ID, which is used to match the static configuration of events. |
| Reading | Alarm reading, generally configured as the synchronization syntax of other values. For common events, the value can be used directly, for example, a temperature reading. For other types of events, such as certificate expiration, the value is represented as 0 or 1. |
| Condition | Alarm threshold |
| OperatorId | Comparison operator. The following eight comparison methods are available: 1: less than, 2: less than or equal to, 3: greater than, 4: greater than or equal to, 5: equal to, 6: not equal to, 7: rising edge (0 to 1 triggers, 1 to 0 recovers), 8: falling edge (1 to 0 triggers, 0 to 1 recovers). |
| Hysteresis | Hysteresis threshold is used when an alarm is cleared. If the value is 0, the alarm is cleared immediately. This value functions as a tolerance. |
| Enabled | Enabling status of an event, or the masking status. |
| Component | Associated component object. For details about the component definition, see FruData. |
| DescArgx/SuggArgx | (Optional) Event description/suggestion parameters, used for message formatting, supporting only string format. A maximum of 10 items can be configured (via the SR expression format). |
| AdditionalInfo | (Optional) Additional information about an event, serving as the Nth dynamic parameter. Multiple parameters can be included by listing their indices (such as '1,2'). This parameter is used for distinguishing between different events during FD reporting. For example, if the alarms are the same except for the slot, the slot is used for differentiation. Note whether this field needs to be configured for a new alarm. |
| LedFaultCode | (Optional) LED error code, which can be a fixed value or a dynamic value. The value of x is the Instance part in Component. |
| InvalidReadingIgnore | (Optional) Whether to ignore invalid values. 1: enabled; 0: disabled. If the value is 1, readings equal to InvalidReading are ignored. |
| InvalidReading | (Optional) Invalid value to be ignored. |
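To illustrate how the description parameters are used for message formatting, the following Python sketch fills a description template and then applies the whitespace cleanup described for the Description field in static configuration. It is a sketch under assumptions: it assumes that placeholder %N is filled by the Nth description parameter, the function name and sample values are hypothetical, and the real formatter in openUBMC may differ.

```python
import re


def render_description(template, args):
    """Fill %1, %2, ... from args, then tidy spaces as done for external display."""

    def substitute(match):
        index = int(match.group(1))
        # A placeholder without a matching argument renders as an empty string.
        return args[index - 1] if 1 <= index <= len(args) else ""

    text = re.sub(r"%(\d+)", substitute, template)
    text = re.sub(r" {2,}", " ", text)        # collapse consecutive spaces
    text = re.sub(r" +([,;.])", r"\1", text)  # drop spaces before punctuation
    return text.strip()


# Hypothetical argument values; the second argument is intentionally empty.
print(render_description(
    "%1 %2 %3 triggered an uncorrectable error, %4.",
    ["CPU1", "", "DIMM000", "please check"],
))
```

An empty parameter leaves a gap in the template, which is why the cleanup of consecutive spaces matters for the displayed text.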
CSR Configuration Example
- Configure CSRs by strictly following the hardware topology. For all hardware-triggered system events, configure them in the CSR of the corresponding hardware entity. For example, for alarms generated by the CLU (fan board), refer to the SR files in the `vpd/vendor/Huawei/TianChi/CLU` directory.
- Before configuring an event, check if the corresponding `Component` object exists. If not, configure the `Component` object first. Generally, you only need to configure one Event object because the `Component` class belongs to the FRU data scope and is usually configured by `Frudata`.

Note: Objects in platform.sr can be referenced across files. Therefore, for objects that should exist uniquely in theory (such as Component_ComBMC and Component_ComSystem), reuse the existing definitions to avoid code redundancy.

The following example registers a FanSpeedDeviation event on the fan board. (If a field is not configured, the default value is used. Configure fields as required.)
{
"Objects": {
"Event_Fan1FStatus": { // Event is the class name. All event classes are distributed to the event module for processing. Fan1FStatus is the name. An object name in a single file must be unique. The complete resource name is combined by the self-discovery mechanism based on the SR file, for example, Event_Fan1FStatus_00.
"EventKeyId": "Fan.FanSpeedDeviation",
"Reading": "<=/Fan_1.FrontStatus",
"Condition": 0,
"OperatorId": 6,
"Enabled": true,
"DescArg1": "#/Fan_1.FanId",
"DescArg2": "front",
"Component": "#/Component_Fan1",
"AdditionalInfo" : "1,2",
"LedFaultCode": "F01"
},
// Except for reading operations that use synchronization syntax, all other attributes use reference syntax. Because the synchronization syntax relies on a polling interval, it may cause information loss or missed updates in the alarm description.
// Configure the Event_Fan1FStatus event object based on the existing fan object. The following lists the dependent objects.
"Component_Fan1": {
"FruId": 255,
"Instance": "<=/Fan_1.FanId",
"Type": 4,
"Name": "Fan1",
"Presence": "<=/Fan_1.FrontPresence",
"Health": 0,
"PowerState": 1,
"UniqueId": "N/A",
"Manufacturer": "",
"GroupId": 1,
"Location": "<=/Component_CLU.Name",
"NodeId": "0"
},
"Fan_1": {
"FanId": 1,
"Slot": 1,
"Coefficient": 1,
"FrontPresence": "<=/Scanner_Fan1_Presence.Value",
"RearPresence": "<=/Scanner_Fan1_Presence.Value",
"FrontSpeed": "<=/Scanner_Fan1_FSpeed.Value",
"RearSpeed": "<=/Scanner_Fan1_RSpeed.Value",
"HardwarePWM": "#/Accessor_Fan1_PWM.Value",
"SystemId": 1,
"FrontStatus": 0,
"RearStatus": 0,
"MaxSupportedPWM": 255,
"IdentifySpeedLevel": 35,
"Position": "CLU",
"PowerGood": "#/Scanner_PowerGood.Value"
},
"Component_CLU": {
"FruId": 255,
"Instance": 255,
"Type": 196,
"Name": "CLU${Slot}",
"Presence": 1,
"Health": 0,
"PowerState": 1,
"BoardId": 65535,
"UniqueId": "N/A",
"Manufacturer": "",
"GroupId": 1,
"Location": "chassis"
},
"Scanner_Fan1_FSpeed": {
"Chip": "#/Smc_FanBoardSMC",
"Offset": 402657025,
"Size": 4,
"Mask": 4294901760,
"Type": 0,
"Period": 1000,
"Debounce": "None",
"Value": 0
},
"Scanner_Fan1_RSpeed": {
"Chip": "#/Smc_FanBoardSMC",
"Offset": 402657025,
"Size": 4,
"Mask": 65535,
"Type": 0,
"Period": 1000,
"Debounce": "None",
"Value": 0
},
"Accessor_Fan1_PWM": {
"Chip": "#/Smc_FanBoardSMC",
"Offset": 402657281,
"Size": 1,
"Mask": 255,
"Type": 0,
"Value": 0
}
}
}

Note:
If you configure `DescArgx`/`SuggArgx` parameters, the system appends the serial number (SN) or part number (PN) information to the event description (this information does not appear in the OMRP interface). This SN/PN data comes from the `SerialNumber` and `PartNumber` fields in the associated `Component` object. If these fields are empty, the system does not display the information.
SN/PN processing logic: Different component types use different data sources. The SN/PN fields in `Component` support open configuration, allowing components to decide what to display via SR syntax. The system does not display empty configurations. `Event` does not restrict SN/PN logic and supports addition, deletion, and modification, enabling the SR file to adapt to different product requirements.
Alarm Triggering and Clearance Mechanism
The alarm triggering and clearance logic centers on the evaluation of conditional expressions.
- Alarm triggering
The alarm triggering condition can be simplified into a core expression: Reading OperatorId Condition.
- `OperatorId`: defines the comparison operator. For example, if the value is 6, the common event CSR configuration table shows that 6 represents the "not equal to" (!=) operator.
- Expression evaluation: When the result of comparing `Reading` with `Condition` using the configured `OperatorId` is true, the reading has reached the alarm threshold and the system generates an alarm. If the result is false, `Reading` has been updated but does not meet the alarm condition.
Example:
In the FanSpeedDeviation example above, the system evaluates the logical result of the expression Reading != Condition (for example, 1 != 0). If the expression returns true, the system generates an alarm.
- Alarm clearance
Alarm clearance also relies on the evaluation of conditional expressions. In simple scenarios without complex clearance strategies, the system recalculates the conditional expression when the reading changes. If the expression returns false, the system clears the alarm.
Key point: Configuration drives the alarm conditions completely. Whether Reading comes directly from raw device scans or from derived values after logical processing, the system can use it for alarm evaluation.
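The evaluation of Reading OperatorId Condition can be sketched in a few lines of Python. This is an illustration rather than the event module's implementation: `is_asserted` is a hypothetical name, Hysteresis and InvalidReading handling are left out, and the treatment of operators 7 and 8 (comparing the previous and current reading and ignoring Condition) is an assumption based on the operator descriptions in the table.

```python
import operator

# OperatorId values 1 to 6 map directly to comparison operators.
COMPARATORS = {
    1: operator.lt,
    2: operator.le,
    3: operator.gt,
    4: operator.ge,
    5: operator.eq,
    6: operator.ne,
}


def is_asserted(operator_id, reading, condition, previous=None, asserted=False):
    """Return True if the alarm should be active after this reading."""
    if operator_id in COMPARATORS:
        return COMPARATORS[operator_id](reading, condition)
    if operator_id in (7, 8):
        # Assumption: edge operators look only at the transition of the reading.
        # 7: 0 -> 1 triggers, 1 -> 0 recovers. 8: 1 -> 0 triggers, 0 -> 1 recovers.
        trigger, recover = ((0, 1), (1, 0)) if operator_id == 7 else ((1, 0), (0, 1))
        edge = (previous, reading)
        if edge == trigger:
            return True
        if edge == recover:
            return False
        return asserted  # no edge seen: keep the current state
    raise ValueError("unsupported OperatorId: %r" % (operator_id,))


# FanSpeedDeviation example: OperatorId 6, Condition 0.
print(is_asserted(6, reading=1, condition=0))  # alarm condition met
print(is_asserted(6, reading=0, condition=0))  # alarm condition not met
```

When the function returns false for an alarm that is currently active, the simple clearance scenario described above applies and the alarm is cleared.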
Alarm Debounce Strategy Configuration
In practice, frequent fluctuations in monitored values can trigger alarms repeatedly, leading to false positives. To improve alarm accuracy and effectiveness, openUBMC includes a debounce mechanism, implemented by configuring the Debounce property of the Scanner object. The system currently supports five debounce strategy types: MidAvg, Median, Cont, ContBin, and None.
| Debounce Type | Description | Parameters | Configuration Example |
|---|---|---|---|
| MidAvg | Mean average | WindowSize: window size; DefaultValue: default value; IsSigned: whether the value is signed | "MidAvg": { "WindowSize": 6, "DefaultValue": 11, "IsSigned": true } |
| Median | Median filtering | WindowSize: window size; DefaultValue: default value | "Median": { "WindowSize": 6, "DefaultValue": 11 } |
| Cont | Continuous consistency | Num: number of debounce cycles; DefaultValue: default value | "Cont": { "Num": 6, "DefaultValue": 11 } |
| ContBin | Binary continuous consistency | NumH: number of debounce cycles for high-level inputs; NumL: number of debounce cycles for low-level inputs; DefaultValue: default value | "ContBin": { "NumH": 6, "NumL": 6, "DefaultValue": 11 } |
| None | No debounce | DefaultValue: default value | "None": { "DefaultValue": 11 } |
Application principles:
- Temperature monitoring: Use `Median` and `MidAvg` debounce.
- Status monitoring: Use `Cont` and `ContBin` debounce.
- Voltage monitoring: Use `MidAvg` debounce.
- Fault detection: Use `ContBin` debounce. Select debounce parameters based on fault severity.
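To make the difference between the strategies concrete, the following Python sketch models two of them. It is not the Scanner implementation, and the semantics are assumptions inferred from the strategy names and parameters: Cont is modeled as "the reported value changes only after Num consecutive identical samples", and Median as "the median of the last WindowSize samples, with DefaultValue reported until the window is full". The class names are hypothetical.

```python
from collections import deque
from statistics import median


class ContDebounce:
    """Assumed Cont behavior: switch only after `num` identical samples in a row."""

    def __init__(self, num, default_value):
        self.num = num
        self.stable = default_value  # value currently reported
        self.candidate = None        # value waiting to become stable
        self.count = 0

    def update(self, sample):
        if sample == self.candidate:
            self.count += 1
        else:
            self.candidate, self.count = sample, 1
        if self.count >= self.num:
            self.stable = self.candidate
        return self.stable


class MedianDebounce:
    """Assumed Median behavior: median over a sliding window of samples."""

    def __init__(self, window_size, default_value):
        self.window = deque(maxlen=window_size)
        self.default = default_value

    def update(self, sample):
        self.window.append(sample)
        if len(self.window) < self.window.maxlen:
            return self.default  # window not yet full
        return median(self.window)


cont = ContDebounce(num=3, default_value=0)
print([cont.update(s) for s in [1, 1, 0, 1, 1, 1]])

med = MedianDebounce(window_size=3, default_value=11)
print([med.update(s) for s in [40, 41, 90, 42]])
```

In this model, the single glitch (the 0 in the first sequence, the 90 in the second) never reaches the reported value, which is the effect that prevents repeated alarm triggering.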
Power Event Configuration
Power events represent a special event type in openUBMC. They provide unified management for a group of related events that share a common type but require different thresholds and corresponding LED error codes. In addition to the general attributes of common events, power events include a dedicated Mappings field to define the specific behavior of each sub-event within the event cluster.
The Mappings field includes the following key sub-items:
- Mappings.Reading
Represents the trigger threshold (Condition) for a specific sub-event. When the monitored reading reaches this value, the system generates the corresponding event.
- Mappings.LedFaultCode
Represents the LedFaultCode for an event. It specifies the LED number to display when the system generates the event.
- Mappings.DescArgs
Represents the DescArgs for an event, provided as a string array. This field supports up to 10 elements and fills variable information in the event description.
Power Event CSR Configuration Example
The following example registers a power event. For details, see vpd/vendor/Huawei/TianChi/BCU/PsEvent_BC83AMDA_0_soft.sr.
{
"Objects": {
"PowerEvent_BCUPwrFaultMntr": {
"EventKeyId": "System.SystemPowerFailure",
"Component": "#/Component_ComSystem",
"Reading": "<=/Scanner_BCUPwrSigDrop.Value",
"AdditionalInfo": "2",
"Mappings": [
{
"Reading": 136,
"LedFaultCode": "U10",
"DescArgs": [
"",
"BCU_V_VCC_12V0_1"
]
},
{
"Reading": 137,
"LedFaultCode": "U10",
"DescArgs": [
"",
"BCU_V_VCC_12V0_2"
]
},
{
"Reading": 138,
"LedFaultCode": "U10",
"DescArgs": [
"",
"BCU_V_VCC_12V0_3"
]
},
...
{
"Reading": 182,
"LedFaultCode": "U00",
"DescArgs": [
"",
"BCU_V_STBY_1V8"
]
}
]
}
}
}

Power Event Alarm Triggering and Clearance Mechanism
The alarm triggering and clearance logic of power events is basically the same as that of common events. The main difference is that power events need to compare the Reading and Mappings.Reading fields to determine whether the alarm triggering or clearance threshold conditions are met.
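The comparison between Reading and Mappings.Reading can be sketched as a lookup. The Python below is a simplified illustration, not the event module's logic: it assumes that a sub-event is active exactly when the reading equals its Mappings.Reading value, and the function names are hypothetical. The mapping entries are taken from the CSR example above.

```python
MAPPINGS = [
    {"Reading": 136, "LedFaultCode": "U10", "DescArgs": ["", "BCU_V_VCC_12V0_1"]},
    {"Reading": 137, "LedFaultCode": "U10", "DescArgs": ["", "BCU_V_VCC_12V0_2"]},
    {"Reading": 182, "LedFaultCode": "U00", "DescArgs": ["", "BCU_V_STBY_1V8"]},
]


def match_sub_event(reading, mappings):
    """Return the mapping entry whose Reading equals the reading, or None."""
    for entry in mappings:
        if entry["Reading"] == reading:
            return entry
    return None


def transition(previous_reading, reading, mappings):
    """Return (sub-event to raise, sub-event to clear) for a reading change."""
    before = match_sub_event(previous_reading, mappings)
    after = match_sub_event(reading, mappings)
    if before is after:
        return None, None  # same sub-event (or none) before and after
    return after, before


raised, cleared = transition(0, 137, MAPPINGS)
print(raised["LedFaultCode"], raised["DescArgs"])
```

Each sub-event carries its own LedFaultCode and DescArgs, so one PowerEvent object can report many power rails without one Event object per rail.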
Event RPC Alarms
Description
Software event configuration applies to system-level or software-level alarm scenarios that are difficult to describe in CSRs. Typically, components determine whether to trigger an alarm or record an event based on real-time operational data. Therefore, do not configure events or alarms with clear hardware forms as software events.
Interface Usage Constraints
- Lifecycle management responsibility: The corresponding component manages the entire lifecycle of software alarms, including their triggering and clearance. Components must perform clearance actions regardless of whether they define `Deassert` events or alarm clearance event codes.
- Component matching rule: For software alarms, the system matches the first `Component` object that meets the conditions based on `ComponentName` and `SubjectType`. For components of the same type, the value of `ComponentName` must be globally unique.
- Restriction on repeated operations: The system does not allow repeated assert or deassert operations on the same event.
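Because repeated assert or deassert operations are not allowed, a component has to remember what it last reported. The following Python sketch shows one way to keep that state; it is an illustration, not component code. `AlarmStateTracker` is a hypothetical name, the state is kept in memory only, and the key combines ComponentName, EventKeyId, and MessageArgs, which together identify a software alarm event.

```python
import json


class AlarmStateTracker:
    """Remember which software alarms are currently asserted."""

    def __init__(self):
        self._asserted = set()

    @staticmethod
    def _key(component_name, event_key_id, message_args):
        # MessageArgs is serialized so that a list can be part of a set key.
        return (component_name, event_key_id, json.dumps(message_args))

    def needs_report(self, component_name, event_key_id, message_args, asserted):
        """Return True only if the requested state differs from the stored one."""
        key = self._key(component_name, event_key_id, message_args)
        return (key in self._asserted) != bool(asserted)

    def record(self, component_name, event_key_id, message_args, asserted):
        """Store the new state after the interface call has succeeded."""
        key = self._key(component_name, event_key_id, message_args)
        if asserted:
            self._asserted.add(key)
        else:
            self._asserted.discard(key)


tracker = AlarmStateTracker()
event = ("Port1", "Port.PortOAMLostLink", ["nic0", "", "Port 1"])
if tracker.needs_report(*event, asserted=True):
    # The AddEvent call would go here; record only after it succeeds.
    tracker.record(*event, asserted=True)
print(tracker.needs_report(*event, asserted=True))
```

Recording the state only after a successful call means a failed report is retried on the next check instead of being silently lost.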
Other Features
- Unique event ID: `ComponentName`, `EventKeyId`, and `MessageArgs` uniquely identify a software alarm event.
- Persistence and status maintenance: Software alarms persist through resets. Therefore, components need to maintain the current alarm status to avoid repeated error reports.
- Interface availability window: The window period for using external interfaces for software alarms is affected by SR distribution. If a component needs to call an interface at the initial stage of service startup, implement a retry mechanism and use protection measures such as `pcall` to cope with the situation that the service is not ready.
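The retry-and-protect pattern recommended for the startup window is language independent. The following Python sketch shows its shape, with try/except playing the role that `pcall` plays in Lua. The function name, attempt count, and delay are hypothetical choices, not values taken from openUBMC.

```python
import time


def call_with_retry(call, attempts=3, delay_s=0.5, sleep=time.sleep):
    """Run `call` until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return call()  # protected call: a failure does not propagate yet
        except Exception as err:
            last_error = err
            if attempt < attempts:
                sleep(delay_s)  # give the service time to become ready
    raise RuntimeError("call failed after %d attempts" % attempts) from last_error


# Simulated interface that is not ready for the first two calls.
state = {"calls": 0}


def add_event_stub():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("service not ready")
    return "record-1"


print(call_with_retry(add_event_stub, sleep=lambda seconds: None))
```

Bounding the number of attempts keeps a component from blocking indefinitely if the event service never becomes available.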
Configuration Example
The following example shows the software alarm processing logic for link exception events in the network_adapter component. You can refer to the implementation of the check_oam_lost_link_state_alarm function to understand how to use event RPC alarms.
The sample code is stored in network_adapter/src/lualib/event/event_mgmt.lua.
-- Excerpt: client, ctx, log, json, and alarm_states are assumed to be defined elsewhere in the module.
function event_mgmt:add_event(params)
    local event_obj
    client:ForeachEventsObjects(function(o)
        event_obj = o -- This object is unique.
    end)
    if not event_obj then
        log:error('get events object failed')
        return false
    end
    -- pcall protects against the case where the event service is not ready.
    local ok, res = pcall(function ()
        return event_obj:AddEvent_PACKED(ctx.new(), params):unpack()
    end)
    if not ok then
        log:error('add events failed, %s', res)
        return false
    end
    log:notice('add event successfully, record id [%s]', res)
    return true
end

-- Link exception alarm
function event_mgmt:check_oam_lost_link_state_alarm(state, device_name, port_id)
    local args = json.encode({device_name, '', 'Port ' .. (port_id + 1)})
    local assert = state == 1 -- 0: no alarm, 1: alarm
    local alarm_state = alarm_states[args]
    -- Skip the call if the requested state equals the recorded state.
    if not assert == not alarm_state then -- The default value is nil. The value is negated here.
        return
    end
    local params = {
        {'ComponentName', 'Port'.. (port_id + 1)}, -- The port resource collaboration interface ID starts from 0, and the component ID starts from 1.
        {'State', assert and 'true' or 'false'},
        {'EventKeyId', 'Port.PortOAMLostLink'},
        {'MessageArgs', args},
        {'SystemId', ''},
        {'ManagerId', ''},
        {'ChassisId', ''},
        {'NodeId', ''}
    }
    local is_ok = self:add_event(params)
    if not is_ok then
        return false
    end
    -- Updating local alarm information
    alarm_states[args] = assert
    self:update_alarm_msg(assert, args, '') -- Dynamic parameters are unique and can be used as keys. Therefore, values are not required.
    return true
end

Interface Calling Demonstration
Software alarms can be added by calling an interface. Invoke the AddEvent method of the bmc.kepler.Systems.Events interface at the /bmc/kepler/Systems/:SystemId/Events path of the resource collaboration interface. The interface parameters are as follows:
| Parameter | Meaning | Usage (all values are strings) |
|---|---|---|
| ComponentName | Event entity name | (Mandatory) Name of the component associated with the event. |
| State | Event status | Mandatory (true/false) |
| EventKeyId | Event ID | Mandatory (same as the static configuration) |
| SubjectType | Event entity type | Optional. If no entity type is provided, the high-order bits of the event code are used for matching. |
| SuggestionArgs | Event suggestion parameters | Optional. The format needs to be converted using json.encode. |
| MessageArgs | Event description parameters | Mandatory. The format needs to be converted using json.encode. If this parameter is not involved, an empty table needs to be passed. |
| SystemId | System ID of the event | See the notes below. |
| ManagerId | Manager ID of the event | See the notes below. |
| ChassisId | Chassis ID of the event | See the notes below. |
| NodeId | Node ID of the event | See the notes below. |
| LedFaultCode | LED fault code | Optional. |
The values of SystemId, ManagerId, ChassisId, and NodeId are described as follows:
- If the event source comes from the resource collaboration interface object, `SystemId`, `ManagerId`, `ChassisId`, and `NodeId` must be synchronized.
- If there is no event source, select one of `SystemId`, `ManagerId`, and `ChassisId` based on the resource category. `NodeId` can be empty.
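The parameter rules above can be captured in a small builder. The Python sketch below is illustrative and not part of any openUBMC client library: `build_add_event_params` is a hypothetical helper, `json.dumps` stands in for `json.encode`, and encoding an empty MessageArgs as an empty JSON array is an assumption about what "an empty table" looks like on the wire.

```python
import json

OPTIONAL_KEYS = (
    "SubjectType", "SuggestionArgs", "SystemId", "ManagerId", "ChassisId", "NodeId",
)


def build_add_event_params(component_name, asserted, event_key_id,
                           message_args=(), **optional):
    """Return AddEvent parameters as a list of (key, value) string pairs."""
    unknown = set(optional) - set(OPTIONAL_KEYS)
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(sorted(unknown)))
    params = [
        ("ComponentName", component_name),
        ("State", "true" if asserted else "false"),
        ("EventKeyId", event_key_id),
        # Mandatory even when there are no arguments (then an empty list).
        ("MessageArgs", json.dumps(list(message_args))),
    ]
    for key in OPTIONAL_KEYS:
        if key in optional:
            value = optional[key]
            if key == "SuggestionArgs":
                value = json.dumps(list(value))
            params.append((key, str(value)))
    return params


for pair in build_add_event_params(
    "BMC", True, "BMC.InsecureCryptographicAlgorithm", ["test"],
    SystemId="1", ManagerId="1", ChassisId="1", NodeId="1",
):
    print(pair)
```

Keeping every value a string, including State and the ID fields, matches the string-typed key-value pairs that the interface accepts.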
Calling example
- The following example uses the `busctl` tool to call `AddEvent` to add an event.

busctl --user call bmc.kepler.event /bmc/kepler/Systems/1/Events bmc.kepler.Systems.Events AddEvent 'a{ss}a(ss)' 3 Interface cli UserName Administrator ClientAddr 127.0.0.1 8 ComponentName 'BMC' State 'true' EventKeyId 'BMC.InsecureCryptographicAlgorithm' 'MessageArgs' '["test"]' 'SystemId' '1' 'ManagerId' '1' 'ChassisId' '1' 'NodeId' '1'

Calling the Event RPC Interface
In openUBMC, triggering alarms based on actual service status is a common requirement. The following section describes the implementation method for calling event interfaces through RPC.
In service logic implementation, to call an event interface, first add a dependency on the bmc.kepler.Systems.Events interface in the mds/service.json file of the corresponding component. If the dependency is not configured, the system cannot call the resource for that interface in the code. Then, call the interface provided by client.lua. Take the bios repository as an example:
- Add the interface dependency in the `required` field.
- `client.lua` automatically generates functions for calling the corresponding resource collaboration interface methods, obtaining objects, and subscribing to signals.
- Create software alarm events in `bios/src/lualib/infrastructure/event.lua`.
local log = require 'mc.logging'
local context = require 'mc.context'
local client = require 'bios.client'
local skynet = require 'skynet'

local Event = {}
local STATE_ERROR = 'PropertyValueError: Incorrect value of property State.'

-- Call AddEvent once. Returns true on success or if the event already exists.
local function add_event(param)
    local record
    local event_obj = client:GetEventsEventsObject()
    if not event_obj then
        log:error('[bios]get events object failed')
        return false
    end
    local ok, err = pcall(function ()
        record = event_obj:AddEvent_PACKED(context.new(), param):unpack()
    end)
    if not ok then
        local err_str = string.format('%s', err)
        if err_str == STATE_ERROR then
            log:info('[bios]event exist, no need add')
            return true
        end
        -- record is still nil on this path, so only the error is logged.
        log:notice('[bios] add event fail, err: %s', err_str)
        return false
    end
    log:debug('[bios]add event(%s) successfully', record)
    return true
end

local RETRY_TIMES = 2

-- Retry to cover the window in which the event service is not ready.
local function retry_add_event(param)
    for _ = 1, RETRY_TIMES do
        local ok, ret = pcall(add_event, param)
        if ok and ret then
            return true
        end
        skynet.sleep(50)
    end
    return false
end

--- Generating a software alarm.
function Event.generate_event(msg)
    -- Convert the key-value table into the list of pairs expected by AddEvent.
    local param = {}
    for key, value in pairs(msg) do
        param[#param + 1] = { key, value }
    end
    local res = retry_add_event(param)
    if not res then
        error('[bios]generate event fail')
    end
end

return Event

- Call the software alarm interface to generate an alarm.