Analysis: after long-running AC power cycling, cards are intermittently loaded but their sensors are not registered
Updated: 2026/05/28
Problem background
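- Board type: NA;
- Software version: 1230 baseline;
- Affected feature: sensor display;
- Trigger condition: build a white-label package following the community white-label packaging procedure, upgrade with it through the web page, then check app.log after the upgrade completes.
- Symptom: during long-running AC power cycling, the cards are expected to be loaded and their sensors registered; in practice, the cards are intermittently loaded while their sensors are not registered.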
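Issue link
Reproduction steps
After the system comes back up from an AC cycle, the automated script finds that PCIe/OCP-related sensors are missing. The missing sensors are:
Key log information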
The AC cycle was issued at 19:55:
```text
2026-02-25 19:55:51 CLI,Administrator@192.168.109.48:50888,fructrl,Set FRU0 to ACCycle successfully
```
At 20:01 the sr files of the cards were loaded:
```text
2026-02-25 20:01:17.712144 hwdiscovery NOTICE: hwcomponent.lua(313): [self-discovery] name: Connector_PCIE_SLOT2_01010103, position: 0101010302, current: 1, previous: 0,uptime: 269 s
2026-02-25 20:01:17.820435 hwdiscovery NOTICE: init.lua(201): position: 0101010302, get csr data from /opt/bmc/sr/14140130_100010e2_10004010.sr, format version: 3.00, data version: 3.00
2026-02-25 20:01:17.829969 hwdiscovery NOTICE: hwcomponent.lua(209): position: 0101010302, load sr data successfully, uptime: 269 s, cost: 20ms
2026-02-25 20:01:17.837994 hwdiscovery NOTICE: hwcomponent.lua(230): position: 0101010302, start to process sr data, source: /opt/bmc/sr/14140130_100010e2_10004010.sr, format version: 3.00, data version: 3.00, uptime: 269 s
2026-02-25 20:01:17.970283 hwdiscovery NOTICE: hwcomponent.lua(313): [self-discovery] name: Connector_PCIE_SLOT3_01010103, position: 0101010303, current: 1, previous: 0,uptime: 269 s
2026-02-25 20:01:18.049693 storage NOTICE: mctp_service.lua(94): [Storage] mctp prepare finished. bmc_eid = 12 bmc_phy = 768 state = true
2026-02-25 20:01:18.093276 hwdiscovery NOTICE: sdr.lua(76): position: 0101010302, get objects, count: 39
2026-02-25 20:01:18.107033 maca NOTICE: init.lua(531): bmc.kepler.bios unlocked ForceResetLocked status
2026-02-25 20:01:18.126559 hwdiscovery NOTICE: init.lua(201): position: 0101010303, get csr data from /opt/bmc/sr/14140130_808657b0_80860002.sr, format version: 3.00, data version: 3.00
2026-02-25 20:01:18.128557 hwdiscovery NOTICE: hwcomponent.lua(209): position: 0101010303, load sr data successfully, uptime: 270 s, cost: 10ms
2026-02-25 20:01:18.133372 hwdiscovery NOTICE: hwcomponent.lua(230): position: 0101010303, start to process sr data, source: /opt/bmc/sr/14140130_808657b0_80860002.sr, format version: 3.00, data version: 3.00, uptime: 270 s
2026-02-25 20:01:18.181420 hwdiscovery NOTICE: hwcomponent.lua(313): [self-discovery] name: Connector_OCP_1_0101, position: , current: 1, previous: 0,uptime: 270 s
2026-02-25 20:01:18.219446 hwdiscovery NOTICE: parser_work.lua(76): position: 0101010302, process sr data successfully, uptime: 270 s, cost: 350ms
2026-02-25 20:01:18.308729 hwdiscovery NOTICE: init.lua(201): position: , get csr data from /opt/bmc/sr/14220247_15b3101f_1f242011.sr, format version: 3.00, data version: 3.00
2026-02-25 20:01:18.315986 hwdiscovery NOTICE: hwcomponent.lua(209): position: 010107, load sr data successfully, uptime: 270 s, cost: 20ms
2026-02-25 20:01:18.318517 hwdiscovery NOTICE: hwcomponent.lua(230): position: 010107, start to process sr data, source: /opt/bmc/sr/14220247_15b3101f_1f242011.sr, format version: 3.00, data version: 3.00, uptime: 270 s
2026-02-25 20:01:18.337640 hwdiscovery NOTICE: hwcomponent.lua(313): [self-discovery] name: Connector_OCP_2_0101, position: , current: 1, previous: 0,uptime: 270 s
2026-02-25 20:01:18.466059 hwdiscovery NOTICE: sdr.lua(76): position: 0101010303, get objects, count: 26
2026-02-25 20:01:18.480409 hwdiscovery NOTICE: init.lua(201): position: , get csr data from /opt/bmc/sr/14220247_14e416d7_1f242013.sr, format version: 3.00, data version: 3.00
2026-02-25 20:01:18.499224 hwdiscovery NOTICE: hwcomponent.lua(209): position: 010108, load sr data successfully, uptime: 270 s, cost: 30ms
2026-02-25 20:01:18.508022 hwdiscovery NOTICE: hwcomponent.lua(230): position: 010108, start to process sr data, source: /opt/bmc/sr/14220247_14e416d7_1f242013.sr, format version: 3.00, data version: 3.00, uptime: 270 s
2026-02-25 20:01:18.529628 hwdiscovery NOTICE: parser_work.lua(76): position: 0101010303, process sr data successfully, uptime: 270 s, cost: 200ms
```
At this point the sensor module also received the objects distributed by the framework:
```text
2026-02-25 20:01:21.336397 sensor NOTICE: object_manage.lua(332): add objects callback, path: /bmc/kepler/ObjectGroup/0101010302, life cycle id: 1, count: 7, took 1060ms, uptime: 273 s
2026-02-25 20:01:21.344539 sensor NOTICE: object_manage.lua(342): add objects completely, path: /bmc/kepler/ObjectGroup/0101010302, life cycle id: 1, took 0ms, uptime: 273 s
2026-02-25 20:01:21.373723 sensor NOTICE: object_manage.lua(312): start to add objects, path: /bmc/kepler/ObjectGroup/0101010303, life cycle id: 1, count: 2, uptime: 273 s
2026-02-25 20:01:21.763484 sensor NOTICE: object_manage.lua(332): add objects callback, path: /bmc/kepler/ObjectGroup/0101010303, life cycle id: 1, count: 2, took 390ms, uptime: 273 s
2026-02-25 20:01:21.770718 sensor NOTICE: object_manage.lua(342): add objects completely, path: /bmc/kepler/ObjectGroup/0101010303, life cycle id: 1, took 0ms, uptime: 273 s
2026-02-25 20:01:21.841223 sensor NOTICE: object_manage.lua(312): start to add objects, path: /bmc/kepler/ObjectGroup/010107, life cycle id: 1, count: 3, uptime: 273 s
2026-02-25 20:01:22.366241 sensor NOTICE: object_manage.lua(332): add objects callback, path: /bmc/kepler/ObjectGroup/010107, life cycle id: 1, count: 3, took 530ms, uptime: 274 s
2026-02-25 20:01:22.367582 sensor NOTICE: object_manage.lua(342): add objects completely, path: /bmc/kepler/ObjectGroup/010107, life cycle id: 1, took 0ms, uptime: 274 s
2026-02-25 20:01:22.395031 sensor NOTICE: object_manage.lua(312): start to add objects, path: /bmc/kepler/ObjectGroup/010108, life cycle id: 1, count: 3, uptime: 274 s
```
However, sensor.log shows no sensor registration at all during this period; only the FRUs of the affected cards were registered. A card's FRU and its sensors are defined in the same sr file, so if the FRU was registered the sensors should have been registered as well, yet they were not:
```text
[2026-02-25 20:00:15] discrete sdr DiscreteSensor_PwrCapStatus_01010A of host 0 is registered.
[2026-02-25 20:00:15] discrete event [DiscreteEvent_PSRedundancy_01010A] is registered
[2026-02-25 20:00:15] discrete event [DiscreteEvent_PwrCapFailed_01010A] is registered
[2026-02-25 20:00:16] Entity [Entity_Fan_01010305] of host 0 is registered
[2026-02-25 20:00:16] sensor ThresholdSensor_FanRspeed_01010305 of host 0 is registered, number is 152.
[2026-02-25 20:00:16] threshold sdr ThresholdSensor_FanRspeed_01010305 of host 0 is registered.
[2026-02-25 20:01:23] fru sdr XC385(9) of host 0 is registered by id.
[2026-02-25 20:01:23] fru sdr XC386(7) of host 0 is registered by id.
[2026-02-25 20:09:39] fru sdr ExpBoard1(1) of host 0 is registered by obj.
[2026-02-25 20:09:41] bmc global enables[RecvMsgIntrptEnabled] = 1
[2026-02-25 20:09:41] bmc global enables[EvtMsgBufFullIntrptEnabled] = 1
[2026-02-25 20:09:41] bmc global enables[EvtMsgBufEnabled] = 1
[2026-02-25 20:09:41] bmc global enables[SELEnabled] = 1
```
In a normal run the following entries are printed; in the failing run they are absent:
```text
[2026-02-25 20:12:52] Entity [Entity_RaidCard_0101010302] of host 0 is registered
[2026-02-25 20:12:52] sensor ThresholdSensor_PCIeBBUTemp_0101010302 of host 0 is registered, number is 153.
[2026-02-25 20:12:52] threshold sdr ThresholdSensor_PCIeBBUTemp_0101010302 of host 0 is registered.
[2026-02-25 20:12:55] fru sdr XC385(9) of host 0 is registered by id.
[2026-02-25 20:12:55] sensor ThresholdSensor_RaidCardTemp_0101010302 of host 0 is registered, number is 154.
[2026-02-25 20:12:55] threshold sdr ThresholdSensor_RaidCardTemp_0101010302 of host 0 is registered.
[2026-02-25 20:12:55] sensor DiscreteSensor_BBUStatus_0101010302 of host 0 is registered, number is 155.
[2026-02-25 20:12:55] discrete sdr DiscreteSensor_BBUStatus_0101010302 of host 0 is registered.
[2026-02-25 20:12:56] fru sdr XC386(7) of host 0 is registered by id.
[2026-02-25 20:12:56] discrete event [DiscreteEvent_BBUStatus_0_0101010302] is registered
[2026-02-25 20:12:56] discrete event [DiscreteEvent_BBUStatus_2_0101010302] is registered
[2026-02-25 20:12:56] discrete event [DiscreteEvent_BBUStatus_1_0101010302] is registered
[2026-02-25 20:12:57] Entity [Entity_PCIeCard_0101010303] of host 0 is registered
[2026-02-25 20:12:57] sensor ThresholdSensor_Temp_0101010303 of host 0 is registered, number is 156.
[2026-02-25 20:12:57] threshold sdr ThresholdSensor_Temp_0101010303 of host 0 is registered.
[2026-02-25 20:12:58] Entity [Entity_OCPCard_010108] of host 0 is registered
[2026-02-25 20:12:58] sensor ThresholdSensor_OCPTemp_010108 of host 0 is registered, number is 157.
[2026-02-25 20:12:58] threshold sdr ThresholdSensor_OCPTemp_010108 of host 0 is registered.
[2026-02-25 20:12:58] sensor ThresholdSensor_OpticalModuleTemp_010108 of host 0 is registered, number is 158.
[2026-02-25 20:12:58] threshold sdr ThresholdSensor_OpticalModuleTemp_010108 of host 0 is registered.
[2026-02-25 20:12:59] Entity [Entity_OCPCard_010107] of host 0 is registered
[2026-02-25 20:12:59] sensor ThresholdSensor_OCPTemp_010107 of host 0 is registered, number is 159.
[2026-02-25 20:12:59] threshold sdr ThresholdSensor_OCPTemp_010107 of host 0 is registered.
[2026-02-25 20:13:00] sensor ThresholdSensor_OpticalModuleTemp_010107 of host 0 is registered, number is 160.
[2026-02-25 20:13:00] threshold sdr ThresholdSensor_OpticalModuleTemp_010107 of host 0 is registered.
```
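The manual comparison between app.log and sensor.log can be scripted. The following is a minimal sketch in plain Lua, not part of the BMC code base: the helper names (`added_positions`, `registered_positions`, `missing_sensor_positions`) are hypothetical, and it assumes that every sensor object name ends in `_<position>`, as the names in the logs above do. Not every object group necessarily carries sensors, so the result is only a list of candidates to inspect.

```lua
-- Hypothetical triage helpers; none of these functions exist in the BMC code base.

-- Positions whose object group the sensor module finished adding (from app.log).
local function added_positions(app_log_lines)
    local set = {}
    for _, line in ipairs(app_log_lines) do
        local pos = line:match("add objects completely, path: /bmc/kepler/ObjectGroup/(%x+)")
        if pos then set[pos] = true end
    end
    return set
end

-- Positions with at least one "sensor ... is registered" line (from sensor.log).
-- Assumption: the sensor object name ends with "_<position>".
local function registered_positions(sensor_log_lines)
    local set = {}
    for _, line in ipairs(sensor_log_lines) do
        local pos = line:match("sensor %S+_(%x+) of host %d+ is registered")
        if pos then set[pos] = true end
    end
    return set
end

-- Positions that were added but never produced a sensor registration line.
local function missing_sensor_positions(app_log_lines, sensor_log_lines)
    local added = added_positions(app_log_lines)
    local registered = registered_positions(sensor_log_lines)
    local missing = {}
    for pos in pairs(added) do
        if not registered[pos] then missing[#missing + 1] = pos end
    end
    table.sort(missing)
    return missing
end

-- Usage with excerpts from the failing run shown above.
local app_log = {
    "2026-02-25 20:01:21.344539 sensor NOTICE: object_manage.lua(342): add objects completely, path: /bmc/kepler/ObjectGroup/0101010302, life cycle id: 1, took 0ms, uptime: 273 s",
    "2026-02-25 20:01:21.770718 sensor NOTICE: object_manage.lua(342): add objects completely, path: /bmc/kepler/ObjectGroup/0101010303, life cycle id: 1, took 0ms, uptime: 273 s",
    "2026-02-25 20:01:22.367582 sensor NOTICE: object_manage.lua(342): add objects completely, path: /bmc/kepler/ObjectGroup/010107, life cycle id: 1, took 0ms, uptime: 274 s",
}
local sensor_log = {
    "[2026-02-25 20:00:16] sensor ThresholdSensor_FanRspeed_01010305 of host 0 is registered, number is 152.",
    "[2026-02-25 20:01:23] fru sdr XC385(9) of host 0 is registered by id.",
}
local missing = missing_sensor_positions(app_log, sensor_log)
print(table.concat(missing, ", "))
```

On these excerpts all three object groups are reported, which matches the observation that the cards were added but none of their sensors appeared in sensor.log.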
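Investigation
An intermittent timing issue was suspected; the sensor module logged no obvious errors during this period.
Current code logic
on_add_object_complete callback (lines 415-420):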
```lua
mdb_manager.on_add_object_complete(self.bus, function (position)
    if not cache_task_alive then
        skynet.fork_once(self.register_cache_objs_task, self)
    end
    unregistered_positions[position] = true
end)
```
register_cache_objs_task task (lines 513-538):
```lua
function sensor_service:register_cache_objs_task()
    local ticks = 0
    if not cache_task_alive then
        cache_task_alive = true
    end
    local position
    while ticks < 60 do
        -- After 60 waits with no object pending registration, all objects are considered registered
        position, _ = next(unregistered_positions)
        if not position then
            ticks = ticks + 1
            goto continue
        end

        register_cache_entities(self, position)
        register_cache_sensors(self, position)
        register_cache_discrete_events(self, position)
        unregistered_positions[position] = nil
        ::continue::
        skynet.sleep(100)
    end
    if cache_task_alive then
        cache_task_alive = false
    end
end
```
Bug scenario walkthrough
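(1). T0: on_add_object_complete(position1) is called:
- cache_task_alive = false, so Task1 is started
- unregistered_positions = {position1 = true}
- Task1 starts running and sets cache_task_alive = true
(2). T1: Task1 reaches its last idle iteration (ticks = 59):
- unregistered_positions is empty at this point
- ticks is incremented to 60
- skynet.sleep(100) is executed and Task1 goes to sleep
(3). T2: while Task1 is asleep, on_add_object_complete(position2) is called:
- cache_task_alive = true (Task1 is still running)
- no new task is started
- unregistered_positions = {position2 = true}
(4). T3: Task1 wakes up:
- checks ticks < 60: false (ticks = 60)
- exits the loop without looking at unregistered_positions again
- cache_task_alive = false
(5). Result: position2 stays in unregistered_positions with no task left to process it, so it is never registered and its sensors are lost.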
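The timeline above can be replayed deterministically outside the BMC. The sketch below is an illustration in plain Lua, not the service code: `skynet.fork_once` and `skynet.sleep` are replaced by a single coroutine and a hand-driven `step()` function (both stand-ins are assumptions), and the three `register_cache_*` calls are reduced to marking a position in a `registered` table.

```lua
-- Stand-ins for the skynet scheduling (assumptions for illustration only):
-- the task is a coroutine, sleep() yields, and step() plays the role of one timer wake-up.
local cache_task_alive = false
local unregistered_positions = {}
local registered = {}
local task

local function sleep() coroutine.yield() end

local function step()
    if task and coroutine.status(task) == "suspended" then
        assert(coroutine.resume(task))
    end
end

-- Same control flow as register_cache_objs_task above.
local function register_cache_objs_task()
    local ticks = 0
    cache_task_alive = true
    while ticks < 60 do
        local position = next(unregistered_positions)
        if position then
            registered[position] = true          -- stands in for register_cache_*()
            unregistered_positions[position] = nil
        else
            ticks = ticks + 1
        end
        sleep()
    end
    cache_task_alive = false
end

-- Same control flow as the on_add_object_complete callback above.
local function on_add_object_complete(position)
    if not cache_task_alive then
        task = coroutine.create(register_cache_objs_task)
    end
    unregistered_positions[position] = true
end

-- T0: position1 arrives, Task1 starts and registers it.
on_add_object_complete("position1")
step()
-- T1: 60 idle wake-ups; after the last one ticks == 60 and Task1 is asleep.
for _ = 1, 60 do step() end
-- T2: position2 arrives while Task1 still holds cache_task_alive == true.
on_add_object_complete("position2")
-- T3: Task1 wakes, sees ticks == 60, and exits without checking the pending set.
step()

print(registered.position1, registered.position2,
      unregistered_positions.position2, cache_task_alive)
```

After the replay, position1 is registered, position2 is still pending, and cache_task_alive is false, which is the stuck state described in step (5).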
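Root cause
The design contains a race condition:
- on_add_object_complete starts a new task only when cache_task_alive = false
- register_cache_objs_task runs for a fixed budget of 60 idle iterations, regardless of whether new positions have been added in the meantime
- When the task is about to finish (ticks close to 60), a newly added position is missed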
Solution
Handling of the sensor registration stage has been optimized; see "Optimize sensor object registration performance".
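For illustration only, one way to close this kind of window is to make the task's exit decision depend on the pending set rather than on the counter alone. The sketch below is not the change referenced above, whose details are not given here. It reuses hypothetical coroutine stand-ins for `skynet.fork_once` and `skynet.sleep`, and it assumes that callbacks can only run while the task is suspended in `sleep()` (cooperative scheduling within one service).

```lua
-- Stand-ins for the skynet scheduling (assumptions for illustration only).
local cache_task_alive = false
local unregistered_positions = {}
local registered = {}
local task

local function sleep() coroutine.yield() end

local function step()
    if task and coroutine.status(task) == "suspended" then
        assert(coroutine.resume(task))
    end
end

-- Sketch of a race-free variant: the pending set is re-checked after every wake-up,
-- and there is no yield between the final "nothing pending" check and clearing the flag.
local function register_cache_objs_task()
    cache_task_alive = true
    local idle_ticks = 0
    while true do
        local position = next(unregistered_positions)
        if position then
            registered[position] = true          -- stands in for register_cache_*()
            unregistered_positions[position] = nil
            idle_ticks = 0                       -- new work restarts the idle budget
        else
            idle_ticks = idle_ticks + 1
            if idle_ticks >= 60 then break end   -- exit decided before sleeping again
        end
        sleep()
    end
    cache_task_alive = false
end

local function on_add_object_complete(position)
    if not cache_task_alive then
        task = coroutine.create(register_cache_objs_task)
    end
    unregistered_positions[position] = true
end

-- Replay the problematic timing: position2 arrives during the last sleep before exit.
on_add_object_complete("position1")
step()                         -- registers position1
for _ = 1, 59 do step() end    -- idle_ticks == 59, task asleep
on_add_object_complete("position2")
step()                         -- wake-up sees position2 and registers it
for _ = 1, 60 do step() end    -- idle budget runs out, task exits
on_add_object_complete("position3")
step()                         -- flag was cleared, so a fresh task handles position3

print(registered.position2, registered.position3)
```

The key design point is that the loop condition is evaluated against `unregistered_positions` after each sleep, so a position added while the task sleeps is always seen before the task can exit; a position added after the exit finds cache_task_alive already false and starts a new task.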