Analysis of Abnormal Estimated Lifespan Display for SSDs Behind a RAID Controller
Updated: 2026/05/28

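Background

  • Board type: NA
  • Software version: NA
  • Affected feature: RAID management
  • Trigger condition: an SSD that does not support estimated lifespan information is in use.
  • Symptom: the estimated lifespan of the SSD is expected to display normally; in practice it is sometimes shown as -- and sometimes the attribute is not shown at all.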

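Reproduction Steps

  1. Install an SSD that does not support estimated lifespan information, then view the estimated lifespan of the SSD under the RAID controller on the web page.
  2. Install a "new" SSD, then view the estimated lifespan of the SSD under the RAID controller on the web page.
  3. Install an "old" SSD, then view the estimated lifespan of the SSD under the RAID controller on the web page.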

Key Log Information

app.log reports the following:

1456  2026-03-09 18:36:12.689786 storage NOTICE: nvme_object.lua(1290): support nvme mi over mctp, Slot: 6, support: 1
  1457: 2026-03-09 18:36:12.695137 storage NOTICE: nvme_mi_mctp.lua(137): DeviceName: Disk6, DevBus: 39, DevDevice: 0, DevFunction: 0, nvme position: 010103090701
  1458  2026-03-09 18:36:12.697590 storage NOTICE: nvme_mi_mctp.lua(70): get nvme phy_addr: 39
  1459  2026-03-09 18:36:12.707150 storage NOTICE: init.lua(164): endpoint for phyaddr: 39, msg_type: 4 exist, no need to wait for new endpoint creation
  1460: 2026-03-09 18:36:12.708877 storage NOTICE: nvme_mi_mctp.lua(180): creat disk6 endpoint successfully
  1461  2026-03-09 18:36:12.715184 storage NOTICE: nvme_object.lua(1290): support nvme mi over mctp, Slot: 7, support: 1
  1468: 2026-03-09 18:36:12.761512 storage NOTICE: nvme_mi_command.lua(187): get Disk6 controller id:0 successfully
  1472: 2026-03-09 18:36:12.856630 storage NOTICE: nvme_object.lua(1173): Disk6 not support uuid by id_ctrl err:false
  1476: 2026-03-09 18:36:12.898951 storage NOTICE: nvme_object.lua(1207): Disk6 not support hw defined smart log without uuid index
  1640  2026-03-09 18:48:50.643107 storage NOTICE: rpc_service_subhealth.lua(191): start get drives estimatedlifespan diag info
  1641: 2026-03-09 18:48:50.721624 storage NOTICE: rpc_service_subhealth.lua(252): set drive:Disk6 estimatedremaininglifespan:4294967294
  1642: 2026-03-09 18:48:50.724905 metric_analyzer NOTICE: diagnose.lua(969): Disk6 set EstimatedRemainingLifespan 4294967294 successfully
  1643  2026-03-09 18:48:50.733784 storage NOTICE: rpc_service_subhealth.lua(252): set drive:Disk5 estimatedremaininglifespan:4294967294

  4724  2026-03-09 23:27:27.650941 storage ERROR: tasks.lua(78): task [Drive.update_nvme_by_mctp.Drive_8_01010309] error: ...e/lualib/nvme/nvme_mi_protocol/nvme_mi_admin_command.lua:173: attempt to perform arithmetic on field 'data_units_read_h' (a nil value)
  4725: 2026-03-09 23:28:03.720153 metric_analyzer NOTICE: lifespan_diagnose.lua(66): Rest db lfsp disk data of Disk6
  4726: 2026-03-09 23:28:03.722664 metric_analyzer NOTICE: diagnose.lua(1568): Disk6 is not present or changed, clear db
  4745: 2026-03-09 23:28:03.865686 storage NOTICE: pd_identify_service.lua(128): Disk6 del
  4746: 2026-03-09 23:28:03.869342 storage NOTICE: drive_object.lua(614): Disk6 is_nvme turn false
  4756  2026-03-09 23:28:04.005062 thermal_mgmt NOTICE: object_manage.lua(252): remove objects completely, path: /bmc/kepler/ObjectGroup/010103090701
  4757: 2026-03-09 23:28:04.025769 storage NOTICE: rpc_service_subhealth.lua(252): set drive:Disk6 estimatedremaininglifespan:4294967295
  4758: 2026-03-09 23:28:04.059143 metric_analyzer NOTICE: diagnose.lua(969): Disk6 set EstimatedRemainingLifespan 4294967295 successfully
  4853  2026-03-09 23:28:32.152155 storage NOTICE: drive_object.lua(620): Start update NVME info, id:6 identify_pd:false.
  4854: 2026-03-09 23:28:32.153126 storage NOTICE: drive_object.lua(621): Disk6 is_nvme turn true
  4855  2026-03-09 23:28:32.154253 thermal_mgmt WARNING: dev_object_manage.lua(163): cannot find adapter, object_name: CoolingRequirement_1_60_010103090701

  4885  2026-03-09 23:28:33.360360 storage NOTICE: nvme_object.lua(272): NVMe6 load nvme-mi
  4886: 2026-03-09 23:28:33.750804 storage NOTICE: nvme_object.lua(494): get disk6 NVMe MI 1.0
  4887: 2026-03-09 23:28:33.878589 storage NOTICE: nvme_object.lua(494): get disk6 NVMe MI 1.0
  4888  2026-03-09 23:28:37.153377 storage NOTICE: tasks.lua(72): task [Drive.update_nvme_by_mctp.Drive_7_01010309] start

  4891  2026-03-09 23:28:37.464766 storage NOTICE: nvme_object.lua(1290): support nvme mi over mctp, Slot: 6, support: 1
  4892: 2026-03-09 23:28:37.471883 storage NOTICE: nvme_mi_mctp.lua(137): DeviceName: Disk6, DevBus: 39, DevDevice: 0, DevFunction: 0, nvme position: 010103090701
  4893  2026-03-09 23:28:37.475477 storage NOTICE: nvme_mi_mctp.lua(70): get nvme phy_addr: 39
  4894  2026-03-09 23:28:37.485537 storage NOTICE: init.lua(164): endpoint for phyaddr: 39, msg_type: 4 exist, no need to wait for new endpoint creation
  4895: 2026-03-09 23:28:37.487690 storage NOTICE: nvme_mi_mctp.lua(180): creat disk6 endpoint successfully
  4896  2026-03-09 23:28:52.493425 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
  4897: 2026-03-09 23:28:53.523331 storage NOTICE: nvme_mi_command.lua(187): get Disk6 controller id:0 successfully
  4898: 2026-03-09 23:28:53.585557 storage NOTICE: nvme_object.lua(1173): Disk6 not support uuid by id_ctrl err:false
  4899: 2026-03-09 23:28:53.625102 storage NOTICE: nvme_object.lua(1207): Disk6 not support hw defined smart log without uuid index
  4900  2026-03-09 23:28:53.655012 storage ERROR: app_preloader.lua(215): ...ps/storage/lualib/nvme/nvme_mi_protocol/nvme_mi_mctp.lua:75: app(storage/service/main) count(1) pcall failed(...e/lualib/nvme/nvme_mi_protocol/nvme_mi_admin_command.lua:173: attempt to perform arithmetic on field 'data_units_read_h' (a nil value))

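Fault Locating

For a drive whose SMART information cannot be obtained, the estimated lifespan is shown as "--"; after the drive is removed and reinserted, the attribute is no longer shown. According to the log, before removal metric_analyzer had written the initial value 0xFFFFFFFE (4294967294 in the log). On removal the database was reset and the value became 0xFFFFFFFF (4294967295). After reinsertion no value was written again, which is why the display differs before and after hot-plugging.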
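The sequence of values can be checked directly against app.log. The following is a minimal sketch, assuming log lines have the same layout as the rpc_service_subhealth.lua entries quoted above; the function name `lifespan_timeline` and the regular expression are illustrative and not part of the BMC code.

```python
import re

# Matches entries such as:
# "... rpc_service_subhealth.lua(252): set drive:Disk6 estimatedremaininglifespan:4294967294"
# The pattern is an assumption based only on the log lines quoted in this article.
PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ .*"
    r"set drive:(\w+) estimatedremaininglifespan:(\d+)"
)


def lifespan_timeline(log_text, drive):
    """Return (timestamp, hex value) pairs for one drive, in log order."""
    timeline = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match and match.group(2) == drive:
            # Render the decimal log value as 32-bit hex for comparison
            # with the sentinel values 0xFFFFFFFE and 0xFFFFFFFF.
            timeline.append((match.group(1), f"0x{int(match.group(3)):08X}"))
    return timeline


sample = """\
  1641: 2026-03-09 18:48:50.721624 storage NOTICE: rpc_service_subhealth.lua(252): set drive:Disk6 estimatedremaininglifespan:4294967294
  1643  2026-03-09 18:48:50.733784 storage NOTICE: rpc_service_subhealth.lua(252): set drive:Disk5 estimatedremaininglifespan:4294967294
  4757: 2026-03-09 23:28:04.025769 storage NOTICE: rpc_service_subhealth.lua(252): set drive:Disk6 estimatedremaininglifespan:4294967295
"""

for timestamp, value in lifespan_timeline(sample, "Disk6"):
    print(timestamp, value)
```

For Disk6 this lists 0xFFFFFFFE at 18:48:50 and 0xFFFFFFFF at 23:28:04, with no later entry after the reinsertion at 23:28:32.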

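Root Cause

  1. The default value of EstimatedRemainingLifespan is 0xFFFFFFFF.
    • For an unsupported drive the value stays at 0xFFFFFFFF, and the web page does not show the attribute.
    • For a supported drive the value is first set to 0xFFFFFFFE within 30 minutes, and the web page shows --.
  2. When the SMART information of the drive can be obtained normally:
    • A "new drive" shows an estimated lifespan after 24 hours, calculated by the new-drive algorithm.
    • An "old drive" shows an estimated lifespan only after 48 hours.
  3. The estimated lifespan is calculated periodically (once every 24 hours, counted from BMC startup), and hot-plugging a drive clears its historical estimated lifespan.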
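The relationship between the stored value and what the web page shows can be summarized in a small sketch. The constant names and the function `web_display` are hypothetical and only restate the rules listed above. How a real estimate is formatted (unit, rounding) is not described here, so returning the number as plain text is an assumption.

```python
# Sentinel values named in this analysis; the constant names are illustrative.
UNSET = 0xFFFFFFFF    # default value, also written when the database is reset
PENDING = 0xFFFFFFFE  # initial value written before an estimate is available


def web_display(value):
    """Return the text shown on the web page, or None if the attribute is hidden."""
    if value == UNSET:
        return None   # attribute is not shown at all
    if value == PENDING:
        return "--"   # attribute is shown, but no estimate yet
    # Assumption: any other value is a real estimate, rendered as-is.
    return str(value)


print(web_display(4294967294))  # value logged before the drive was removed
print(web_display(4294967295))  # value logged after the database reset
```

With the two values from the log, the first call yields "--" and the second yields None, which matches the two states observed before and after hot-plugging.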

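Solution

  • If the drive does not support estimated lifespan information, replace it with a drive that does.
  • A "new drive" shows an estimated lifespan after 24 hours, calculated by the new-drive algorithm.
  • An "old drive" shows an estimated lifespan only after 48 hours.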
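Because the calculation runs every 24 hours counted from BMC startup, the time of the next calculation can be estimated from the BMC start time. The sketch below assumes the calculation fires exactly on 24-hour boundaries; the function name `next_calculation` and the BMC start time in the example are hypothetical. It does not model how the 24-hour and 48-hour waits interact with a hot-plug, since that is not specified here.

```python
from datetime import datetime, timedelta

# Calculation period stated in the root cause: once every 24 hours.
PERIOD = timedelta(hours=24)


def next_calculation(bmc_start, now):
    """Return the first 24-hour boundary after `now`, counted from BMC startup."""
    # Number of whole periods that have already elapsed since startup.
    elapsed_periods = (now - bmc_start) // PERIOD
    return bmc_start + (elapsed_periods + 1) * PERIOD


# Hypothetical BMC start time; the reinsertion time is taken from the log.
bmc_start = datetime(2026, 3, 9, 18, 0, 0)
reinserted = datetime(2026, 3, 9, 23, 28, 32)
print(next_calculation(bmc_start, reinserted))
```

Under these assumptions the next calculation after the reinsertion would fall on 2026-03-10 18:00:00.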