Alarm Configuration Mechanism Introduction
Last updated: 2025/11/21

Overview

Currently, openUBMC supports sensor event alarms and system event alarms. Sensor event alarms follow the IPMI protocol specification and are further divided into threshold sensor alarms and discrete sensor alarms. These alarms can be parsed using the sel command in the IPMI protocol. In addition, to meet increasingly diversified and complex alarm requirements, openUBMC designs and implements a more granular alarm mechanism, that is, system event alarms. This mechanism does not depend on the IPMI protocol and supports event reporting through the Redfish interface, providing higher flexibility and comprehensiveness.

Note that running the sel clear command in IPMI clears both the sensor and system event alarms displayed on the WebUI.

For details about the sensor event alarm configuration mechanism, see the Sensor Adaptation Guide. This document describes the configuration mechanism for system event alarms developed by openUBMC.

Alarm Process

System event alarms can be classified into CSR-configured alarms and event RPC alarms based on whether the alarms depend on specific hardware. The former is triggered by hardware, and the latter is generated by software events. The following figure shows the internal logic process for monitoring, reporting, and processing the hardware and system status within openUBMC.

Alarm Configuration

Alarm configuration includes static configuration and CSR configuration.

Static Configuration

All system event alarms, whether triggered through CSR or RPC, must rely on preset static configuration information. Static configuration is performed in the vpd repository. This configuration defines the fixed attributes of alarms and cannot be modified in principle. The following two entries are included:

  • Event definition: includes key attributes such as the event code, severity, and reporting channel.
  • Event description: supports Chinese and English by default and includes event description templates, repair suggestion templates, event impacts, and event causes.

The following table lists the fields involved in static configuration and their meanings.

| Field | Description |
| --- | --- |
| EventKeyId | Event definition |
| EventName | Event name (Ensure that the name is unique in the vpd repository and does not conflict with other alarms. Otherwise, event subscription will be affected.) |
| EventType | Event type: 0 for system event, 1 for maintenance event, and 2 for running event |
| SeverityId | Severity: 0 for Normal, 1 for Minor, 2 for Major, and 3 for Critical |
| EventCode | Event code |
| OldEventCode | Old event code |
| ActionId | Event action: 0: No action; 1: Power off the host; 2: Restart the host; 3: Power cycle the host |
| LifeCycleId | Event lifecycle |
| ReportChannel | Event reporting channel mask. Bit 6 indicates whether to record the alarm (1 for yes and 0 for no). If the alarm is not recorded, the event is displayed only in the historical record and not on the current alarm page. This is used to support special scenarios. In normal cases, use 1. |
| Description | Event description (in the external display, consecutive spaces and spaces before punctuation such as commas, semicolons, and periods are removed from the event description.) |
| Suggestion | Event suggestion |
| Influence | Event impact |
| Cause | Event cause |
| DeassertFlag | Whether a Deassert event is required (0 for no and 1 for yes). In other words, it controls whether a historical record is generated when the alarm is cleared. It does not mean that the alarm cannot be cleared. |
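
The numeric codes in the table can be decoded mechanically. The following Lua sketch is only an illustration: the helper names (is_alarm_recorded, describe_event_def) are hypothetical and not part of openUBMC, and it assumes that "bit 6" of ReportChannel is counted from bit 0 (value 64).

```lua
-- Lookup tables copied from the static configuration field table.
local EVENT_TYPES = { [0] = 'system event', [1] = 'maintenance event', [2] = 'running event' }
local SEVERITIES = { [0] = 'Normal', [1] = 'Minor', [2] = 'Major', [3] = 'Critical' }
local ACTIONS = {
    [0] = 'No action', [1] = 'Power off the host',
    [2] = 'Restart the host', [3] = 'Power cycle the host'
}

-- Assumption: bits are numbered from 0, so bit 6 has the value 64.
-- Plain arithmetic is used so the sketch runs on any Lua version.
local function is_alarm_recorded(report_channel)
    return math.floor(report_channel / 64) % 2 == 1
end

-- Hypothetical helper: turns one EventDefinition entry into readable text.
local function describe_event_def(def)
    return string.format('%s (%s): type=%s, severity=%s, action=%s, recorded=%s',
        def.EventKeyId, def.EventCode,
        EVENT_TYPES[def.EventType] or 'unknown',
        SEVERITIES[def.SeverityId] or 'unknown',
        ACTIONS[def.ActionId] or 'unknown',
        tostring(is_alarm_recorded(def.ReportChannel)))
end

local summary = describe_event_def({
    EventKeyId = 'Memory.MemoryTrainFailure', EventCode = '0x01000019',
    EventType = 0, SeverityId = 3, ActionId = 0, ReportChannel = 65535
})
print(summary)
```

A ReportChannel value of 65535 has every bit set, so the alarm is recorded regardless of how the bits are numbered.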

Example of Adding an Alarm Configuration

Add an alarm type, for example, memory training failure, in vpd/vendor/event_def.json.

Precautions

  • EventDefinition and EventDescription must match each other by EventKeyId.
  • Use {BMC} for the openUBMC string in the description. The system automatically replaces this string.
  • After adding an alarm type, update the version number. For example, change 1.0.43 to 1.0.44.
json
"Version": "1.0.44",
"EventDefinition": [
    {
        "EventCode": "0x01000019",
        "ReportChannel": 65535,
        "OldEventCode": "",
        "EventType": 0,
        "LifeCycleId": 0,
        "DeassertFlag": 1,
        "EventKeyId": "Memory.MemoryTrainFailure",
        "SeverityId": 3,
        "ActionId": 0,
        "EventName": "MemoryTrainFailure"
    }
],
"EventDescription": [
    {
        "Suggestion": {
            "En": "1. Power off the server and check whether there is damage or poor contact between the component and its slot.@#AB;2. Replace the component and check for alarms.",
            "Zh": "1、下电后检查该部件与其插槽是否存在损坏或接触不良现象。@#AB;2、更换该部件并进一步观察。"
        },
        "EventKeyId": "Memory.MemoryTrainFailure",
        "Description": {
            "En": "%1 %2 %3 triggered an uncorrectable error, %4.",
            "Zh": "%1 %2 DIMM%3已触发内存训练失败,%4。"
        },
        "Influence": {
            "En": "The system failed to start up.",
            "Zh": "可能导致系统无法正常启动。"
        },
        "Cause": {
            "En": "1. The memory module is faulty.@#AB;2. The slot of the memory module on the mainboard is faulty.",
            "Zh": "1、内存故障。@#AB;2、主板内存条槽位故障。"
        }
    }
]

Update the alarm list for the model. For openUBMC, after the alarm type is added, update the vpd/vendor/Huawei/Server/Kunpeng/openUBMC/event/eventDefList.txt alarm list. This list specifies the alarms to be loaded from event_def.json. You only need to add EventKeyId. In this example, add Memory.MemoryTrainFailure.
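
The precautions above (matching EventKeyId values, and listing the key in eventDefList.txt) lend themselves to an automated check. The following Lua sketch shows one possible shape of such a check. The function check_event_config is hypothetical, the tables only mirror the structure of event_def.json and eventDefList.txt, and loading the real files is deliberately left out.

```lua
-- Hypothetical consistency check. Returns a list of problem descriptions;
-- an empty list means the three inputs agree with each other.
local function check_event_config(config, def_list)
    local problems = {}
    local defined, described = {}, {}
    for _, def in ipairs(config.EventDefinition) do defined[def.EventKeyId] = true end
    for _, desc in ipairs(config.EventDescription) do described[desc.EventKeyId] = true end

    -- Every definition needs a description with the same EventKeyId.
    for _, def in ipairs(config.EventDefinition) do
        if not described[def.EventKeyId] then
            problems[#problems + 1] = def.EventKeyId .. ': no EventDescription entry'
        end
    end
    -- Every description needs a definition with the same EventKeyId.
    for _, desc in ipairs(config.EventDescription) do
        if not defined[desc.EventKeyId] then
            problems[#problems + 1] = desc.EventKeyId .. ': no EventDefinition entry'
        end
    end
    -- Every key in the model's alarm list must exist in the definitions.
    for _, key in ipairs(def_list) do
        if not defined[key] then
            problems[#problems + 1] = key .. ': listed but not defined'
        end
    end
    return problems
end

local config = {
    EventDefinition = { { EventKeyId = 'Memory.MemoryTrainFailure' } },
    EventDescription = { { EventKeyId = 'Memory.MemoryTrainFailure' } }
}
local ok_problems = check_event_config(config, { 'Memory.MemoryTrainFailure' })
local bad_problems = check_event_config(config, { 'Memory.Unknown' }) -- made-up key
print(#ok_problems, #bad_problems)
```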

CSR Configuration of Events

CSR configurations represent the dynamic information of an alarm and support flexible configuration and modification. For events or alarms with specific hardware forms, implement them directly through CSR configuration. In openUBMC, the CSR configuration for power events differs from that of common events. The following describes the configuration methods for both.

CSR Configuration of Common Events

The following table describes the fields and meanings for common events.

| Field | Description |
| --- | --- |
| EventKeyId | Event ID, which is used to match the static configuration of events. |
| Reading | Alarm reading, generally configured as the synchronization syntax of other values. For common events, the value can be used directly, for example, a temperature reading. For other types of events, such as certificate expiration, the value is represented as 0 or 1. |
| Condition | Alarm threshold |
| OperatorId | Comparison operator. The following eight comparison methods are available: 1: less than; 2: less than or equal to; 3: greater than; 4: greater than or equal to; 5: equal to; 6: not equal to; 7: rising edge (0 to 1 triggers, 1 to 0 recovers); 8: falling edge (1 to 0 triggers, 0 to 1 recovers). |
| Hysteresis | Hysteresis threshold used when an alarm is cleared; it functions as a tolerance. If the value is 0, the alarm is cleared immediately. |
| Enabled | Enabling status of an event, or the masking status. |
| Component | Associated component object. For details about the component definition, see FruData. |
| DescArgx/SuggArgx | (Optional) Event description/suggestion parameters, used for message formatting, supporting only string format. A maximum of 10 items can be configured (via the SR expression format). |
| AdditionalInfo | (Optional) Additional information about an event, serving as the Nth dynamic parameter. Multiple parameters can be included by listing their indices (such as '1,2'). This parameter is used for distinguishing between different events during FD reporting. For example, if the alarms are the same except for the slot, the slot is used for differentiation. Check whether this field needs to be configured for a new alarm. |
| LedFaultCode | (Optional) LED error code, which can be a fixed value or a dynamic value. The value of x is the Instance part in Component. |
| InvalidReadingIgnore | (Optional) Whether to ignore invalid values. 1: enabled; 0: disabled. If 1 is set, readings equal to InvalidReading are ignored. |
| InvalidReading | (Optional) Invalid value to be ignored. |

CSR Configuration Example

  • Configure CSRs by strictly following the hardware topology. For all hardware-triggered system events, configure them in the CSR of the corresponding hardware entity. For example, for alarms generated by the CLU (fan board), refer to the SR files in the vpd/vendor/Huawei/TianChi/CLU directory.
  • Before configuring an event, check if the corresponding Component object exists. If not, configure the Component object first. Generally, you only need to configure one Event object because the Component class belongs to the FRU data scope and is usually configured by Frudata.

Note: Objects in platform.sr can be referenced across files. Therefore, for objects that should exist uniquely in theory (such as Component_ComBMC and Component_ComSystem), reuse the existing definitions to avoid code redundancy.

The following example registers a FanSpeedDeviation event on the fan board. If a field is not configured, its default value is used; configure fields as required.

json
{
    "Objects": {
        // Event is the class name. All event classes are distributed to the event module for processing.
        // Fan1FStatus is the name. An object name in a single file must be unique.
        // The complete resource name is combined by the self-discovery mechanism based on the SR file,
        // for example, Event_Fan1FStatus_00.
        "Event_Fan1FStatus": {
            "EventKeyId": "Fan.FanSpeedDeviation",
            "Reading": "<=/Fan_1.FrontStatus",
            "Condition": 0,
            "OperatorId": 6,
            "Enabled": true,
            "DescArg1": "#/Fan_1.FanId",
            "DescArg2": "front",
            "Component": "#/Component_Fan1",
            "AdditionalInfo": "1,2",
            "LedFaultCode": "F01"
        },
        // Only Reading uses the synchronization syntax (<=/); all other attributes use the reference syntax (#/).
        // Because the synchronization syntax relies on a polling interval, using it elsewhere may cause
        // information loss or missed updates in the alarm description.
        // Configure the Event_Fan1FStatus event object based on the existing fan object. The following lists the dependent objects.
        "Component_Fan1": {
            "FruId": 255,
            "Instance": "<=/Fan_1.FanId",
            "Type": 4,
            "Name": "Fan1",
            "Presence": "<=/Fan_1.FrontPresence",
            "Health": 0,
            "PowerState": 1,
            "UniqueId": "N/A",
            "Manufacturer": "",
            "GroupId": 1,
            "Location": "<=/Component_CLU.Name",
            "NodeId": "0"
        },
        "Fan_1": {
            "FanId": 1,
            "Slot": 1,
            "Coefficient": 1,
            "FrontPresence": "<=/Scanner_Fan1_Presence.Value",
            "RearPresence": "<=/Scanner_Fan1_Presence.Value",
            "FrontSpeed": "<=/Scanner_Fan1_FSpeed.Value",
            "RearSpeed": "<=/Scanner_Fan1_RSpeed.Value",
            "HardwarePWM": "#/Accessor_Fan1_PWM.Value",
            "SystemId": 1,
            "FrontStatus": 0,
            "RearStatus": 0,
            "MaxSupportedPWM": 255,
            "IdentifySpeedLevel": 35,
            "Position": "CLU",
            "PowerGood": "#/Scanner_PowerGood.Value"
        },
        "Component_CLU": {
            "FruId": 255,
            "Instance": 255,
            "Type": 196,
            "Name": "CLU${Slot}",
            "Presence": 1,
            "Health": 0,
            "PowerState": 1,
            "BoardId": 65535,
            "UniqueId": "N/A",
            "Manufacturer": "",
            "GroupId": 1,
            "Location": "chassis"
        },
        "Scanner_Fan1_FSpeed": {
            "Chip": "#/Smc_FanBoardSMC",
            "Offset": 402657025,
            "Size": 4,
            "Mask": 4294901760,
            "Type": 0,
            "Period": 1000,
            "Debounce": "None",
            "Value": 0
        },
        "Scanner_Fan1_RSpeed": {
            "Chip": "#/Smc_FanBoardSMC",
            "Offset": 402657025,
            "Size": 4,
            "Mask": 65535,
            "Type": 0,
            "Period": 1000,
            "Debounce": "None",
            "Value": 0
        },
        "Accessor_Fan1_PWM": {
            "Chip": "#/Smc_FanBoardSMC",
            "Offset": 402657281,
            "Size": 1,
            "Mask": 255,
            "Type": 0,
            "Value": 0
        }
    }
}

Note:
If you configure DescArgx/SuggArgx parameters, the system appends the serial number (SN) or part number (PN) information to the event description (this information does not appear in the OMRP interface). This SN/PN data comes from the SerialNumber and PartNumber fields in the associated Component object. If these fields are empty, the system does not display the information.
SN/PN processing logic: Different component types use different data sources. The SN/PN fields in Component support open configuration, allowing components to decide what to display via SR syntax. The system does not display empty configurations. Event does not restrict SN/PN logic and supports addition, deletion, and modification, enabling the SR file to adapt to different product requirements.
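
To make the role of the DescArgx parameters concrete, the following Lua sketch fills the %1 to %N placeholders of a static Description template and applies the whitespace cleanup described for the Description field. It is a sketch under assumptions, not the openUBMC implementation: the function format_description is hypothetical, DescArgN is assumed to replace %N, the argument values are made up, and the SN/PN suffix is not modeled.

```lua
-- Hypothetical formatter for an event description template.
--   template:     static Description text containing %1..%N and optionally {BMC}
--   args:         array of strings, args[N] stands for DescArgN (assumption)
--   product_name: text substituted for the {BMC} marker
local function format_description(template, args, product_name)
    -- Replace each %N with the matching argument, or with nothing if it is missing.
    local text = template:gsub('%%(%d+)', function(index)
        return args[tonumber(index)] or ''
    end)
    -- A function is used as the replacement so that product_name is taken literally.
    text = text:gsub('{BMC}', function() return product_name end)
    text = text:gsub(' +', ' ')         -- collapse consecutive spaces
    text = text:gsub(' ([,;%.])', '%1') -- drop spaces before , ; .
    return text
end

-- The second argument is empty, which would otherwise leave a double space.
local message = format_description(
    '%1 %2 %3 triggered an uncorrectable error, %4.',
    { 'CPU1', '', 'DIMM000', 'check the memory' }, 'openUBMC')
print(message)
```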

Alarm Triggering and Clearance Mechanism

The alarm triggering and clearance logic centers on the evaluation of conditional expressions.

  1. Alarm triggering

The alarm triggering condition can be simplified into a core expression: Reading OperatorId Condition.

  • OperatorId: defines the comparison operator. For example, if the value is 6, the common event CSR configuration table shows that 6 represents the "not equal to" (!=) operator.
  • Expression evaluation: When the result of comparing Reading with Condition using the configured OperatorId is true, the system reaches the alarm threshold and generates an alarm. If the result is false, Reading has updated but does not meet the alarm condition.

Example:

In the FanSpeedDeviation example above, the system evaluates the logical result of the expression Reading != Condition (for example, 1 != 0). If the expression returns true, the system generates an alarm.

  2. Alarm clearance

Alarm clearance also relies on the evaluation of conditional expressions. In simple scenarios without complex clearance strategies, the system recalculates the conditional expression when the reading changes. If the expression returns false, the system clears the alarm.

Key point: Configuration drives the alarm conditions completely. Whether Reading comes directly from raw device scans or from derived values after logical processing, the system can use it for alarm evaluation.
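
The triggering and clearance rules can be condensed into a small evaluator. The Lua sketch below is an illustration only: the function evaluate is hypothetical, Hysteresis is ignored, and the edge operators (7 and 8) are assumed to keep the previous alarm state when no edge is observed.

```lua
-- Comparison operators 1-6 from the common event field table.
local COMPARATORS = {
    [1] = function(r, c) return r < c end,
    [2] = function(r, c) return r <= c end,
    [3] = function(r, c) return r > c end,
    [4] = function(r, c) return r >= c end,
    [5] = function(r, c) return r == c end,
    [6] = function(r, c) return r ~= c end
}

-- Returns the new alarm state (true = asserted) after one Reading update.
-- prev_reading and asserted carry the state from the previous evaluation
-- and are only needed for the edge operators.
local function evaluate(operator_id, reading, condition, prev_reading, asserted)
    local compare = COMPARATORS[operator_id]
    if compare then
        -- true: the alarm condition is met; false: the alarm is cleared.
        return compare(reading, condition)
    end
    if operator_id == 7 then -- rising edge: 0 -> 1 triggers, 1 -> 0 recovers
        if prev_reading == 0 and reading == 1 then return true end
        if prev_reading == 1 and reading == 0 then return false end
    elseif operator_id == 8 then -- falling edge: 1 -> 0 triggers, 0 -> 1 recovers
        if prev_reading == 1 and reading == 0 then return true end
        if prev_reading == 0 and reading == 1 then return false end
    end
    return asserted or false -- no edge (or unknown operator): keep the state
end

print(evaluate(6, 1, 0)) -- FanSpeedDeviation example: 1 ~= 0, alarm generated
print(evaluate(6, 0, 0)) -- reading back to 0: expression is false, alarm cleared
```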

Alarm Debounce Strategy Configuration

In practice, frequent fluctuations in monitored values can trigger alarms repeatedly, leading to false positives. To improve alarm accuracy and effectiveness, openUBMC includes a debounce mechanism, implemented by configuring the Debounce property of the Scanner object. The system currently supports five debounce strategy types: MidAvg, Median, Cont, ContBin, and None.

| Debounce Type | Description | Parameters | Configuration Example |
| --- | --- | --- | --- |
| MidAvg | Mean average | WindowSize: window size; DefaultValue: default value; IsSigned: whether the value is signed | "MidAvg": { "WindowSize": 6, "DefaultValue": 11, "IsSigned": true } |
| Median | Median filtering | WindowSize: window size; DefaultValue: default value | "Median": { "WindowSize": 6, "DefaultValue": 11 } |
| Cont | Continuous consistency | Num: number of debounce cycles; DefaultValue: default value | "Cont": { "Num": 6, "DefaultValue": 11 } |
| ContBin | Binary continuous consistency | NumH: number of debounce cycles for high-level inputs; NumL: number of debounce cycles for low-level inputs; DefaultValue: default value | "ContBin": { "NumH": 6, "NumL": 6, "DefaultValue": 11 } |
| None | No debounce | DefaultValue: default value | "None": { "DefaultValue": 11 } |

Application principles:

  • Temperature monitoring: Use Median and MidAvg debounce.
  • Status monitoring: Use Cont and ContBin debounce.
  • Voltage monitoring: Use MidAvg debounce.
  • Fault detection: Use ContBin debounce. Select debounce parameters based on fault severity.
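
To give a feel for how two of these strategies behave, the following Lua sketch implements a continuous-consistency filter and a median filter. The exact algorithms used by openUBMC are not described here, so this is one plausible reading of the strategy names: the function names are hypothetical, the filters return DefaultValue until enough samples have been seen, and the median of an even-sized window is taken as the lower of the two middle values.

```lua
-- Cont (assumed behavior): a new value replaces the output only after it
-- has been sampled `num` times in a row.
local function new_cont_filter(num, default_value)
    local output, candidate, count = default_value, nil, 0
    return function(sample)
        if sample == output then
            candidate, count = nil, 0   -- back to the current output: reset
        elseif sample == candidate then
            count = count + 1           -- same candidate seen again
        else
            candidate, count = sample, 1
        end
        if candidate ~= nil and count >= num then
            output, candidate, count = candidate, nil, 0
        end
        return output
    end
end

-- Median (assumed behavior): output is the median of the last
-- `window_size` samples, or default_value while the window is filling.
local function new_median_filter(window_size, default_value)
    local window = {}
    return function(sample)
        window[#window + 1] = sample
        if #window > window_size then table.remove(window, 1) end
        if #window < window_size then return default_value end
        local sorted = {}
        for i, v in ipairs(window) do sorted[i] = v end
        table.sort(sorted)
        return sorted[math.floor((#sorted + 1) / 2)]
    end
end

-- A single glitch (the lone 0) does not restart a confirmed value,
-- but it does restart the count towards a new one.
local cont = new_cont_filter(3, 0)
for _, sample in ipairs({ 1, 1, 0, 1, 1, 1 }) do
    print(sample, cont(sample))
end
```

The Cont filter suits status signals because it never invents intermediate values, while the median filter suits analog readings such as temperature because it discards isolated spikes.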

Power Event Configuration

Power events represent a special event type in openUBMC. They provide unified management for a group of related events that share a common type but require different thresholds and corresponding LED error codes. In addition to the general attributes of common events, power events include a dedicated Mappings field to define the specific behavior of each sub-event within the event cluster.

The Mappings field includes the following key sub-items:

  • Mappings.Reading

Represents the trigger threshold (Condition) for a specific sub-event. When the monitored reading reaches this value, the system generates the corresponding event.

  • Mappings.LedFaultCode

Represents the LedFaultCode for an event. It specifies the LED number to display when the system generates the event.

  • Mappings.DescArgs

Represents the DescArgs for an event, provided as a string array. This field supports up to 10 elements and fills variable information in the event description.

Power Event CSR Configuration Example

The following example registers a power event. For details, see vpd/vendor/Huawei/TianChi/BCU/PsEvent_BC83AMDA_0_soft.sr.

json
{
    "Objects": {
        "PowerEvent_BCUPwrFaultMntr": {
            "EventKeyId": "System.SystemPowerFailure",
            "Component": "#/Component_ComSystem",
            "Reading": "<=/Scanner_BCUPwrSigDrop.Value",
            "AdditionalInfo": "2",
            "Mappings": [
                {
                    "Reading": 136,
                    "LedFaultCode": "U10",
                    "DescArgs": [
                        "",
                        "BCU_V_VCC_12V0_1"
                    ]
                },
                {
                    "Reading": 137,
                    "LedFaultCode": "U10",
                    "DescArgs": [
                        "",
                        "BCU_V_VCC_12V0_2"
                    ]
                },
                {
                    "Reading": 138,
                    "LedFaultCode": "U10",
                    "DescArgs": [
                        "",
                        "BCU_V_VCC_12V0_3"
                    ]
                },
                ...
                {
                    "Reading": 182,
                    "LedFaultCode": "U00",
                    "DescArgs": [
                        "",
                        "BCU_V_STBY_1V8"
                    ]
                }
            ]
        }
    }
}

Power Event Alarm Triggering and Clearance Mechanism

The alarm triggering and clearance logic of power events is basically the same as that of common events. The main difference is that power events need to compare the Reading and Mappings.Reading fields to determine whether the alarm triggering or clearance threshold conditions are met.
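
The comparison between Reading and Mappings.Reading can be pictured as a table lookup. The Lua sketch below is illustrative only: find_sub_event is a hypothetical name, the mappings table is an abbreviated copy of the example configuration, and a sub-event is assumed to be active when Reading equals its Mappings.Reading value.

```lua
-- Abbreviated copy of the Mappings array from the example configuration.
local mappings = {
    { Reading = 136, LedFaultCode = 'U10', DescArgs = { '', 'BCU_V_VCC_12V0_1' } },
    { Reading = 137, LedFaultCode = 'U10', DescArgs = { '', 'BCU_V_VCC_12V0_2' } },
    { Reading = 182, LedFaultCode = 'U00', DescArgs = { '', 'BCU_V_STBY_1V8' } }
}

-- Assumption: Reading is compared for equality with each Mappings.Reading.
-- Returns the matching sub-event, or nil when no sub-event applies.
local function find_sub_event(reading)
    for _, mapping in ipairs(mappings) do
        if mapping.Reading == reading then
            return mapping
        end
    end
    return nil
end

local hit = find_sub_event(137)
print(hit.LedFaultCode, hit.DescArgs[2])
print(find_sub_event(0)) -- no mapping for this reading
```

The matched entry supplies the LED fault code and the description arguments for that sub-event, which is what lets one PowerEvent object cover a whole group of power rails.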

Event RPC Alarms

Description

Software event configuration applies to system-level or software-level alarm scenarios that are difficult to describe in CSRs. Typically, components determine whether to trigger an alarm or record an event based on real-time operational data. Therefore, do not configure events or alarms with clear hardware forms as software events.

Interface Usage Constraints

  • Lifecycle management responsibility: The corresponding component manages the entire lifecycle of software alarms, including their triggering and clearance. Components must perform clearance actions regardless of whether they define Deassert events or alarm clearance event codes.
  • Component matching rule: For software alarms, the system matches the first Component object that meets the conditions based on ComponentName and SubjectType. For components of the same type, the value of ComponentName must be globally unique.
  • Restriction on repeated operations: The system does not allow repeated assert or deassert operations on the same event.

Other Features

  • Unique event ID: ComponentName, EventKeyId, and MessageArgs uniquely identify a software alarm event.
  • Persistence and status maintenance: Software alarms persist through resets. Therefore, components need to maintain the current alarm status to avoid repeated error reports.
  • Interface availability window: The window period for using external interfaces for software alarms is affected by SR distribution. If a component needs to call an interface at the initial stage of service startup, implement a retry mechanism and use protection measures such as pcall to cope with the situation that the service is not ready.
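
The constraints above (a unique identity built from ComponentName, EventKeyId, and MessageArgs, and no repeated assert or deassert) suggest keeping a small state cache in the component. The Lua sketch below shows one possible shape. All names are hypothetical, and the cache lives in memory only, so a real component would also need to restore it after a reset.

```lua
-- Hypothetical in-memory cache of alarms this component has asserted.
local alarm_cache = {}

-- ComponentName, EventKeyId, and MessageArgs together identify one alarm.
local function alarm_key(component_name, event_key_id, message_args)
    return table.concat({ component_name, event_key_id, message_args }, '|')
end

-- True when the requested state differs from the cached one, that is,
-- when the event interface actually needs to be called.
local function needs_report(key, asserted)
    return (alarm_cache[key] or false) ~= asserted
end

-- Call this only after the event interface has accepted the request,
-- so that a failed call is retried on the next check.
local function mark_reported(key, asserted)
    alarm_cache[key] = asserted or nil -- cleared alarms are dropped from the cache
end

local key = alarm_key('Port1', 'Port.PortOAMLostLink', '["eth0","","Port 1"]')
print(needs_report(key, true))  -- first assert must be reported
mark_reported(key, true)
print(needs_report(key, true))  -- repeated assert is suppressed
```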

Configuration Example

The following example shows the software alarm processing logic for link exception events in the network_adapter component. You can refer to the implementation of the check_oam_lost_link_state_alarm function to understand how to use event RPC alarms.

The sample code is stored in network_adapter/src/lualib/event/event_mgmt.lua.

lua
function event_mgmt:add_event(params)
    local event_obj
    client:ForeachEventsObjects(function(o)
        event_obj = o -- This object is unique.
    end)
    if not event_obj then
        log:error('get events object failed')
        return false -- Return false explicitly so that every failure path yields the same value.
    end

    local ok, res = pcall(function ()
        return event_obj:AddEvent_PACKED(ctx.new(), params):unpack()
    end)
    if not ok then
        log:error('add events failed, %s', res)
        return false
    end

    log:notice('add event successfully, record id [%s]', res)
    return true
end

-- Link exception alarm
function event_mgmt:check_oam_lost_link_state_alarm(state, device_name, port_id)
    local args = json.encode({device_name, '', 'Port ' .. (port_id + 1)})
    local assert = state == 1 -- 0: no alarm, 1: alarm
    local alarm_state = alarm_states[args]
    -- alarm_state defaults to nil, so both sides are negated to compare them as booleans.
    if not assert == not alarm_state then
        return
    end

    local params = {
        -- The port ID on the resource collaboration interface starts from 0, and the component ID starts from 1.
        {'ComponentName', 'Port'.. (port_id + 1)},
        {'State', assert and 'true' or 'false'},
        {'EventKeyId', 'Port.PortOAMLostLink'},
        {'MessageArgs', args},
        {'SystemId', ''},
        {'ManagerId', ''},
        {'ChassisId', ''},
        {'NodeId', ''}
    }
    local is_ok = self:add_event(params)
    if not is_ok then
        return false
    end
    -- Update the local alarm information.
    alarm_states[args] = assert
    self:update_alarm_msg(assert, args, '') -- Dynamic parameters are unique and can be used as keys. Therefore, values are not required.
    return true
end

Interface Calling Demonstration

Software alarms can be added by calling an interface. Invoke the AddEvent method of the bmc.kepler.Systems.Events interface at the /bmc/kepler/Systems/:SystemId/Events path of the resource collaboration interface. The interface parameters are as follows:

| Parameter | Description | Details (string type) |
| --- | --- | --- |
| ComponentName | Event entity name | Mandatory. Name of the component associated with the event. |
| State | Event status | Mandatory (true/false) |
| EventKeyId | Event ID | Mandatory (same as the static configuration) |
| SubjectType | Event entity type | Optional. If no entity type is provided, the high-order bits of the event code are used for matching. |
| SuggestionArgs | Event suggestion parameters | Optional. The format needs to be converted using json.encode. |
| MessageArgs | Event description parameters | Mandatory. The format needs to be converted using json.encode. If this parameter is not involved, an empty table needs to be passed. |
| SystemId | System ID of the event | See the notes below. |
| ManagerId | Manager ID of the event | See the notes below. |
| ChassisId | Chassis ID of the event | See the notes below. |
| NodeId | Node ID of the event | See the notes below. |
| LedFaultCode | LED fault code | Optional. |

The values of SystemId, ManagerId, ChassisId, and NodeId are described as follows:

  1. If the event source comes from the resource collaboration interface object, SystemId, ManagerId, ChassisId, and NodeId must be synchronized.
  2. If there is no event source, select one of SystemId, ManagerId, and ChassisId based on the resource category. NodeId can be empty.
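
The parameter rules can be captured in a small builder that produces the list of key/value pairs expected by AddEvent. The Lua sketch below is illustrative: build_params and encode_string_array are hypothetical helpers (a component would normally use its JSON library's json.encode), the parameter names follow the code samples in this document (ComponentName, LedFaultCode), and the four ID fields default to empty strings as in the network_adapter sample.

```lua
-- Minimal stand-in for json.encode, limited to arrays of strings.
local function encode_string_array(items)
    local parts = {}
    for i, item in ipairs(items) do
        local escaped = tostring(item):gsub('\\', '\\\\'):gsub('"', '\\"')
        parts[i] = '"' .. escaped .. '"'
    end
    return '[' .. table.concat(parts, ',') .. ']'
end

local MANDATORY = { 'ComponentName', 'State', 'EventKeyId', 'MessageArgs' }
local OPTIONAL = { 'SubjectType', 'SuggestionArgs', 'LedFaultCode' }
local ID_FIELDS = { 'SystemId', 'ManagerId', 'ChassisId', 'NodeId' }

-- Returns the parameter list, or nil plus an error message when a
-- mandatory parameter is missing. All values are expected to be strings.
local function build_params(fields)
    local params = {}
    for _, name in ipairs(MANDATORY) do
        if fields[name] == nil then
            return nil, 'missing mandatory parameter: ' .. name
        end
        params[#params + 1] = { name, fields[name] }
    end
    for _, name in ipairs(OPTIONAL) do
        if fields[name] ~= nil then
            params[#params + 1] = { name, fields[name] }
        end
    end
    for _, name in ipairs(ID_FIELDS) do
        params[#params + 1] = { name, fields[name] or '' }
    end
    return params
end

local params, err = build_params({
    ComponentName = 'BMC', State = 'true',
    EventKeyId = 'BMC.InsecureCryptographicAlgorithm',
    MessageArgs = encode_string_array({ 'test' }),
    SystemId = '1', ManagerId = '1', ChassisId = '1', NodeId = '1'
})
print(#params, err)
```

Validating before the call keeps malformed requests out of the event service and makes the failure reason visible in the component's own log.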

Calling example

  • The following example uses the busctl tool to call AddEvent to add an event.
shell
busctl --user call bmc.kepler.event /bmc/kepler/Systems/1/Events bmc.kepler.Systems.Events AddEvent 'a{ss}a(ss)' 3 Interface cli UserName Administrator ClientAddr 127.0.0.1 8 ComponentName 'BMC' State 'true' EventKeyId 'BMC.InsecureCryptographicAlgorithm' 'MessageArgs' '["test"]' 'SystemId' '1' 'ManagerId' '1' 'ChassisId' '1' 'NodeId' '1'

Calling the Event RPC Interface

In openUBMC, triggering alarms based on actual service status is a common requirement. The following section describes the implementation method for calling event interfaces through RPC.

In service logic implementation, to call an event interface, first add a dependency on the bmc.kepler.Systems.Events interface in the mds/service.json file of the corresponding component. If the dependency is not configured, the system cannot call the resource for that interface in the code. Then, call the interface provided by client.lua. Take the bios repository as an example:

  • Add the interface dependency in the required field.
  • client.lua automatically generates functions for calling the corresponding resource collaboration interface methods, obtaining objects, and subscribing to signals.
  • Create software alarm events in bios/src/lualib/infrastructure/event.lua.
lua
local log = require 'mc.logging'
local context = require 'mc.context'
local client = require 'bios.client'
local skynet = require 'skynet'

local Event = {}

local STATE_ERROR = 'PropertyValueError: Incorrect value of property State.'

local function add_event(param)
    local record
    local event_obj = client:GetEventsEventsObject()
    if not event_obj then
        log:error('[bios]get events object failed')
        return false
    end

    local ok, err = pcall(function ()
        record = event_obj:AddEvent_PACKED(context.new(), param):unpack()
    end)
    if not ok then
        local err_str = string.format('%s', err)
        if err_str == STATE_ERROR then
            log:info('[bios]event exist, no need add')
            return true
        end
        log:notice('[bios] add event(%s) fail, err: %s', record, err)
        return false
    end
    log:debug('[bios]add event(%s) successfully', record)
    return true
end

local RETRY_TIMES = 2
local function retry_add_event(param)
    for _ = 1, RETRY_TIMES do
        local ok, ret = pcall(add_event, param)
        if ok and ret then
            return true
        end
        skynet.sleep(50)
    end
    return false
end

--- Generating a software alarm.
function Event.generate_event(msg)
    local param = {}
    for key, value in pairs(msg) do
        param[#param + 1] = { key, value }
    end

    local res = retry_add_event(param)
    if not res then
        error('[bios]generate event fail')
    end
end

return Event
  • Call the software alarm interface to generate an alarm.