Analysis: during long-running AC power cycling, boards are intermittently loaded without their sensors being registered
Updated: 2026/05/28

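Background

- Board type: N/A
- Software version: 1230 baseline
- Affected feature: sensor display
- Trigger condition: build a white-label package following the community white-label package procedure, upgrade it through the web page, and check app.log after the upgrade completes.
- Observed behavior: during long-running AC power cycling, the boards are expected to be loaded and their sensors registered; in practice, the boards are intermittently loaded without their sensors being registered.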

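Issue link

Community forum

Reproduction steps

After the system comes back up from an AC cycle, the automated check script finds that the PCIe/OCP-related sensors are missing. The sensors that disappeared are: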

Key log information

  1. At 19:55, an AC cycle is performed:

    2026-02-25 19:55:51 CLI,Administrator@192.168.109.48:50888,fructrl,Set FRU0 to ACCycle successfully
  2. At 20:01, the sr files of the cards have already been loaded:

    2026-02-25 20:01:17.712144 hwdiscovery NOTICE: hwcomponent.lua(313): [self-discovery] name: Connector_PCIE_SLOT2_01010103, position: 0101010302, current: 1, previous: 0,uptime: 269 s
    2026-02-25 20:01:17.820435 hwdiscovery NOTICE: init.lua(201): position: 0101010302, get csr data from /opt/bmc/sr/14140130_100010e2_10004010.sr, format version: 3.00, data version: 3.00
    2026-02-25 20:01:17.829969 hwdiscovery NOTICE: hwcomponent.lua(209): position: 0101010302, load sr data successfully, uptime: 269 s, cost: 20ms
    2026-02-25 20:01:17.837994 hwdiscovery NOTICE: hwcomponent.lua(230): position: 0101010302, start to process sr data, source: /opt/bmc/sr/14140130_100010e2_10004010.sr, format version: 3.00, data version: 3.00, uptime: 269 s
    2026-02-25 20:01:17.970283 hwdiscovery NOTICE: hwcomponent.lua(313): [self-discovery] name: Connector_PCIE_SLOT3_01010103, position: 0101010303, current: 1, previous: 0,uptime: 269 s
    2026-02-25 20:01:18.049693 storage NOTICE: mctp_service.lua(94): [Storage] mctp prepare finished. bmc_eid = 12 bmc_phy = 768 state = true
    2026-02-25 20:01:18.093276 hwdiscovery NOTICE: sdr.lua(76): position: 0101010302, get objects, count: 39
    2026-02-25 20:01:18.107033 maca NOTICE: init.lua(531): bmc.kepler.bios unlocked ForceResetLocked status
    2026-02-25 20:01:18.126559 hwdiscovery NOTICE: init.lua(201): position: 0101010303, get csr data from /opt/bmc/sr/14140130_808657b0_80860002.sr, format version: 3.00, data version: 3.00
    2026-02-25 20:01:18.128557 hwdiscovery NOTICE: hwcomponent.lua(209): position: 0101010303, load sr data successfully, uptime: 270 s, cost: 10ms
    2026-02-25 20:01:18.133372 hwdiscovery NOTICE: hwcomponent.lua(230): position: 0101010303, start to process sr data, source: /opt/bmc/sr/14140130_808657b0_80860002.sr, format version: 3.00, data version: 3.00, uptime: 270 s
    2026-02-25 20:01:18.181420 hwdiscovery NOTICE: hwcomponent.lua(313): [self-discovery] name: Connector_OCP_1_0101, position: , current: 1, previous: 0,uptime: 270 s
    2026-02-25 20:01:18.219446 hwdiscovery NOTICE: parser_work.lua(76): position: 0101010302, process sr data successfully, uptime: 270 s, cost: 350ms
    2026-02-25 20:01:18.308729 hwdiscovery NOTICE: init.lua(201): position: , get csr data from /opt/bmc/sr/14220247_15b3101f_1f242011.sr, format version: 3.00, data version: 3.00
    2026-02-25 20:01:18.315986 hwdiscovery NOTICE: hwcomponent.lua(209): position: 010107, load sr data successfully, uptime: 270 s, cost: 20ms
    2026-02-25 20:01:18.318517 hwdiscovery NOTICE: hwcomponent.lua(230): position: 010107, start to process sr data, source: /opt/bmc/sr/14220247_15b3101f_1f242011.sr, format version: 3.00, data version: 3.00, uptime: 270 s
    2026-02-25 20:01:18.337640 hwdiscovery NOTICE: hwcomponent.lua(313): [self-discovery] name: Connector_OCP_2_0101, position: , current: 1, previous: 0,uptime: 270 s
    2026-02-25 20:01:18.466059 hwdiscovery NOTICE: sdr.lua(76): position: 0101010303, get objects, count: 26
    2026-02-25 20:01:18.480409 hwdiscovery NOTICE: init.lua(201): position: , get csr data from /opt/bmc/sr/14220247_14e416d7_1f242013.sr, format version: 3.00, data version: 3.00
    2026-02-25 20:01:18.499224 hwdiscovery NOTICE: hwcomponent.lua(209): position: 010108, load sr data successfully, uptime: 270 s, cost: 30ms
    2026-02-25 20:01:18.508022 hwdiscovery NOTICE: hwcomponent.lua(230): position: 010108, start to process sr data, source: /opt/bmc/sr/14220247_14e416d7_1f242013.sr, format version: 3.00, data version: 3.00, uptime: 270 s
    2026-02-25 20:01:18.529628 hwdiscovery NOTICE: parser_work.lua(76): position: 0101010303, process sr data successfully, uptime: 270 s, cost: 200ms
    
    At this point the sensor module has also received the objects distributed by the framework:
    
    2026-02-25 20:01:21.336397 sensor NOTICE: object_manage.lua(332): add objects callback, path: /bmc/kepler/ObjectGroup/0101010302, life cycle id: 1, count: 7, took 1060ms, uptime: 273 s
    2026-02-25 20:01:21.344539 sensor NOTICE: object_manage.lua(342): add objects completely, path: /bmc/kepler/ObjectGroup/0101010302, life cycle id: 1, took 0ms, uptime: 273 s
    2026-02-25 20:01:21.373723 sensor NOTICE: object_manage.lua(312): start to add objects, path: /bmc/kepler/ObjectGroup/0101010303, life cycle id: 1, count: 2, uptime: 273 s
    
    2026-02-25 20:01:21.763484 sensor NOTICE: object_manage.lua(332): add objects callback, path: /bmc/kepler/ObjectGroup/0101010303, life cycle id: 1, count: 2, took 390ms, uptime: 273 s
    2026-02-25 20:01:21.770718 sensor NOTICE: object_manage.lua(342): add objects completely, path: /bmc/kepler/ObjectGroup/0101010303, life cycle id: 1, took 0ms, uptime: 273 s
    2026-02-25 20:01:21.841223 sensor NOTICE: object_manage.lua(312): start to add objects, path: /bmc/kepler/ObjectGroup/010107, life cycle id: 1, count: 3, uptime: 273 s
    
    2026-02-25 20:01:22.366241 sensor NOTICE: object_manage.lua(332): add objects callback, path: /bmc/kepler/ObjectGroup/010107, life cycle id: 1, count: 3, took 530ms, uptime: 274 s
    2026-02-25 20:01:22.367582 sensor NOTICE: object_manage.lua(342): add objects completely, path: /bmc/kepler/ObjectGroup/010107, life cycle id: 1, took 0ms, uptime: 274 s
    2026-02-25 20:01:22.395031 sensor NOTICE: object_manage.lua(312): start to add objects, path: /bmc/kepler/ObjectGroup/010108, life cycle id: 1, count: 3, uptime: 274 s
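  3. However, sensor.log shows no sensor registration during this period. Only the FRUs of the affected cards are registered. A card's FRU and its sensors are defined in the same sr file, so in theory the sensors should be registered whenever the FRU is, yet they are not: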

    [2026-02-25 20:00:15] discrete sdr DiscreteSensor_PwrCapStatus_01010A of host 0 is registered.
    [2026-02-25 20:00:15] discrete event [DiscreteEvent_PSRedundancy_01010A] is registered
    [2026-02-25 20:00:15] discrete event [DiscreteEvent_PwrCapFailed_01010A] is registered
    [2026-02-25 20:00:16] Entity [Entity_Fan_01010305] of host 0 is registered
    [2026-02-25 20:00:16] sensor ThresholdSensor_FanRspeed_01010305 of host 0 is registered, number is 152.
    [2026-02-25 20:00:16] threshold sdr ThresholdSensor_FanRspeed_01010305 of host 0 is registered.
    [2026-02-25 20:01:23] fru sdr XC385(9) of host 0 is registered by id.
    [2026-02-25 20:01:23] fru sdr XC386(7) of host 0 is registered by id.
    [2026-02-25 20:09:39] fru sdr ExpBoard1(1) of host 0 is registered by obj.
    [2026-02-25 20:09:41] bmc global enables[RecvMsgIntrptEnabled] = 1
    [2026-02-25 20:09:41] bmc global enables[EvtMsgBufFullIntrptEnabled] = 1
    [2026-02-25 20:09:41] bmc global enables[EvtMsgBufEnabled] = 1
    [2026-02-25 20:09:41] bmc global enables[SELEnabled] = 1
  4. In the normal case the following registration messages are printed; in the failure case they are absent:

    [2026-02-25 20:12:52] Entity [Entity_RaidCard_0101010302] of host 0 is registered
    [2026-02-25 20:12:52] sensor ThresholdSensor_PCIeBBUTemp_0101010302 of host 0 is registered, number is 153.
    [2026-02-25 20:12:52] threshold sdr ThresholdSensor_PCIeBBUTemp_0101010302 of host 0 is registered.
    [2026-02-25 20:12:55] fru sdr XC385(9) of host 0 is registered by id.
    [2026-02-25 20:12:55] sensor ThresholdSensor_RaidCardTemp_0101010302 of host 0 is registered, number is 154.
    [2026-02-25 20:12:55] threshold sdr ThresholdSensor_RaidCardTemp_0101010302 of host 0 is registered.
    [2026-02-25 20:12:55] sensor DiscreteSensor_BBUStatus_0101010302 of host 0 is registered, number is 155.
    [2026-02-25 20:12:55] discrete sdr DiscreteSensor_BBUStatus_0101010302 of host 0 is registered.
    [2026-02-25 20:12:56] fru sdr XC386(7) of host 0 is registered by id.
    [2026-02-25 20:12:56] discrete event [DiscreteEvent_BBUStatus_0_0101010302] is registered
    [2026-02-25 20:12:56] discrete event [DiscreteEvent_BBUStatus_2_0101010302] is registered
    [2026-02-25 20:12:56] discrete event [DiscreteEvent_BBUStatus_1_0101010302] is registered
    [2026-02-25 20:12:57] Entity [Entity_PCIeCard_0101010303] of host 0 is registered
    [2026-02-25 20:12:57] sensor ThresholdSensor_Temp_0101010303 of host 0 is registered, number is 156.
    [2026-02-25 20:12:57] threshold sdr ThresholdSensor_Temp_0101010303 of host 0 is registered.
    [2026-02-25 20:12:58] Entity [Entity_OCPCard_010108] of host 0 is registered
    [2026-02-25 20:12:58] sensor ThresholdSensor_OCPTemp_010108 of host 0 is registered, number is 157.
    [2026-02-25 20:12:58] threshold sdr ThresholdSensor_OCPTemp_010108 of host 0 is registered.
    [2026-02-25 20:12:58] sensor ThresholdSensor_OpticalModuleTemp_010108 of host 0 is registered, number is 158.
    [2026-02-25 20:12:58] threshold sdr ThresholdSensor_OpticalModuleTemp_010108 of host 0 is registered.
    [2026-02-25 20:12:59] Entity [Entity_OCPCard_010107] of host 0 is registered
    [2026-02-25 20:12:59] sensor ThresholdSensor_OCPTemp_010107 of host 0 is registered, number is 159.
    [2026-02-25 20:12:59] threshold sdr ThresholdSensor_OCPTemp_010107 of host 0 is registered.
    [2026-02-25 20:13:00] sensor ThresholdSensor_OpticalModuleTemp_010107 of host 0 is registered, number is 160.
    [2026-02-25 20:13:00] threshold sdr ThresholdSensor_OpticalModuleTemp_010107 of host 0 is registered.
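
The comparison between the failing and the normal sensor.log can be automated. The following is a minimal sketch in plain Lua, not the project's actual check script: the function name `positions_without_sensors` is hypothetical, and it assumes that the trailing suffix of a sensor name (for example `0101010302` in `ThresholdSensor_PCIeBBUTemp_0101010302`) is the position of the card, as the logs above suggest.

```lua
-- Hypothetical helper: given sensor.log lines and the positions whose sr data
-- was loaded, return the positions that have no "sensor ... is registered" line.
function positions_without_sensors(log_lines, expected_positions)
    local seen = {}
    for _, line in ipairs(log_lines) do
        -- Matches e.g. "] sensor ThresholdSensor_Temp_0101010303 of host 0 is registered"
        local name = line:match("%] sensor (%S+) of host %d+ is registered")
        -- Assumption: the position is the trailing "_<hex digits>" suffix of the name.
        local position = name and name:match("_(%x+)$")
        if position then
            seen[position] = true
        end
    end

    local missing = {}
    for _, position in ipairs(expected_positions) do
        if not seen[position] then
            missing[#missing + 1] = position
        end
    end
    return missing
end

-- Usage with two lines taken from the normal log above; the other two
-- positions have no sensor line in this small sample and are reported.
local sample = {
    "[2026-02-25 20:12:52] sensor ThresholdSensor_PCIeBBUTemp_0101010302 of host 0 is registered, number is 153.",
    "[2026-02-25 20:12:57] sensor ThresholdSensor_Temp_0101010303 of host 0 is registered, number is 156.",
}
local missing = positions_without_sensors(sample, { "0101010302", "0101010303", "010107", "010108" })
print(table.concat(missing, ", "))
```

Feeding the full failing sensor.log into such a helper would flag all four positions, since none of their sensors appear in it.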

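Troubleshooting process

An intermittent timing issue is suspected; the sensor module logged no obvious errors during this period.

  1. Current code logic

    The on_add_object_complete callback (lines 415-420):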

    mdb_manager.on_add_object_complete(self.bus, function (position)
        if not cache_task_alive then
            skynet.fork_once(self.register_cache_objs_task, self)
        end
        unregistered_positions[position] = true
    end)

    The register_cache_objs_task task (lines 513-538):

    function sensor_service:register_cache_objs_task()
        local ticks = 0
        if not cache_task_alive then
            cache_task_alive = true
        end
        local position
        while ticks < 60 do  -- after 60 consecutive waits with no pending objects, all objects are considered registered
            position, _ = next(unregistered_positions)
            if not position then
                ticks = ticks + 1
                goto continue
            end
            
            register_cache_entities(self, position)
            register_cache_sensors(self, position)
            register_cache_discrete_events(self, position)
            unregistered_positions[position] = nil
            
            ::continue::
            skynet.sleep(100)
        end
        if cache_task_alive then
            cache_task_alive = false
        end
    end
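  2. Reproducing the bug scenario

    (1). T0: on_add_object_complete(position1) is called:

     cache_task_alive = false, so Task1 is started
     unregistered_positions = {position1 = true}
     Task1 starts running and sets cache_task_alive = true

    (2). T1: Task1 reaches the loop iteration where ticks = 59:

     unregistered_positions is empty at this point
     ticks is incremented to 60
     skynet.sleep(100) is executed and Task1 goes to sleep

    (3). T2: while Task1 is asleep, on_add_object_complete(position2) is called:

     cache_task_alive = true (Task1 has not exited yet)
     No new task is started!
     unregistered_positions = {position2 = true}

    (4). T3: Task1 wakes up:

     It checks ticks < 60: false (ticks = 60)
     It exits the loop
     cache_task_alive = false

    (5). Result: position2 is never registered (unless a later on_add_object_complete call happens to start a new task), so its sensors are lost!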
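
The T0 to T3 timeline above can be replayed deterministically outside the BMC. The following is a minimal sketch in plain Lua, under stated assumptions: a small coroutine scheduler stands in for skynet (`fork`, `sleep` and `run_one_tick` are hypothetical stand-ins, not skynet APIs), one scheduler tick stands in for one `skynet.sleep(100)`, and setting `registered[position] = true` stands in for the three `register_cache_*` calls. The callback and the task otherwise follow the logic shown earlier.

```lua
-- Replays the race with a tiny coroutine scheduler instead of skynet.
function simulate_race()
    local tasks, registered, unregistered_positions = {}, {}, {}
    local cache_task_alive = false

    local function fork(fn) tasks[#tasks + 1] = coroutine.create(fn) end
    local function sleep() coroutine.yield() end
    local function run_one_tick()
        -- Resume every sleeping task once; drop the ones that have finished.
        local current = tasks
        tasks = {}
        for _, co in ipairs(current) do
            assert(coroutine.resume(co))
            if coroutine.status(co) ~= "dead" then tasks[#tasks + 1] = co end
        end
    end

    -- Same control flow as the task in the document.
    local function register_cache_objs_task()
        local ticks = 0
        cache_task_alive = true
        while ticks < 60 do
            local position = next(unregistered_positions)
            if position then
                registered[position] = true
                unregistered_positions[position] = nil
            else
                ticks = ticks + 1
            end
            sleep()
        end
        cache_task_alive = false
    end

    -- Same control flow as the callback in the document.
    local function on_add_object_complete(position)
        if not cache_task_alive then
            fork(register_cache_objs_task)
        end
        unregistered_positions[position] = true
    end

    on_add_object_complete("position1")    -- T0: Task1 is started
    for _ = 1, 61 do run_one_tick() end    -- T1: 1 registering pass + 60 idle passes, ticks is now 60
    on_add_object_complete("position2")    -- T2: Task1 is still asleep, no new task is started
    run_one_tick()                         -- T3: Task1 wakes up, leaves the loop and exits

    return {
        registered = registered,
        pending = unregistered_positions,
        task_alive = cache_task_alive,
    }
end

local result = simulate_race()
print("position1 registered:", result.registered.position1 == true)
print("position2 registered:", result.registered.position2 == true)
print("position2 still pending:", result.pending.position2 == true)
```

In this replay position1 is registered, while position2 stays in the pending set with no task left to pick it up, which matches step (5).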

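Root cause

The design contains a race condition:

- on_add_object_complete starts a new task only when cache_task_alive = false
- register_cache_objs_task counts up to a fixed 60 idle iterations and does not take newly added positions into account
- When the task is about to finish (ticks close to 60), a newly added position is missed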

Solution

The handling of the sensor registration stage has been optimized; see "Optimize sensor object registration performance".
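
The referenced PR is not reproduced here, so the following is only an illustration of one possible way to close this kind of window, not necessarily what the PR does. The sketch assumes cooperative scheduling, where code between two yields cannot be interrupted: the task decides to exit immediately after observing an empty pending set and clears `cache_task_alive` before yielding again, and the callback sets the flag itself before starting a task. As in a plain-Lua replay, `fork`, `sleep` and `run_one_tick` are hypothetical stand-ins for skynet, and `registered[position] = true` stands in for the `register_cache_*` calls.

```lua
-- Sketch of a task/callback pair without the exit window (assumptions above).
function simulate_fixed(ticks_before_second_add)
    local tasks, registered, unregistered_positions = {}, {}, {}
    local cache_task_alive = false

    local function fork(fn) tasks[#tasks + 1] = coroutine.create(fn) end
    local function sleep() coroutine.yield() end
    local function run_one_tick()
        local current = tasks
        tasks = {}
        for _, co in ipairs(current) do
            assert(coroutine.resume(co))
            if coroutine.status(co) ~= "dead" then tasks[#tasks + 1] = co end
        end
    end

    local function register_cache_objs_task()
        local idle_ticks = 0
        while true do
            local position = next(unregistered_positions)
            if position then
                registered[position] = true
                unregistered_positions[position] = nil
                idle_ticks = 0              -- new work restarts the idle countdown
            else
                idle_ticks = idle_ticks + 1
                if idle_ticks >= 60 then break end
            end
            sleep()
        end
        -- No yield happens between the last emptiness check and this line,
        -- so a callback cannot slip in between the check and the flag reset.
        cache_task_alive = false
    end

    local function on_add_object_complete(position)
        unregistered_positions[position] = true
        if not cache_task_alive then
            cache_task_alive = true         -- set before forking so only one task starts
            fork(register_cache_objs_task)
        end
    end

    on_add_object_complete("position1")
    for _ = 1, ticks_before_second_add do run_one_tick() end
    on_add_object_complete("position2")
    run_one_tick()
    run_one_tick()
    return registered
end

-- Try the second add just before, exactly at, and after the point where the task exits.
for _, n in ipairs({ 59, 60, 61, 62 }) do
    print(n, simulate_fixed(n).position2 == true)
end
```

Whether the second position arrives while the task is still sleeping or after it has exited, it is picked up either by the running task or by a freshly started one. In real code the registration calls may themselves yield, which this sketch tolerates because positions added in the meantime simply remain in the pending set.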

PR link

Optimize sensor object registration performance