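06h - Query Device Status (Get-Device-Status)
Revision History
| openUBMC Version | Revision Date | Description |
|---|---|---|
| 25.06 | 2025/06/26 | Initial draft; added command details |
Basic Information
Function
Queries the device status.
Permissions
Command Information
Request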
| Parameter (byte) | Field value description |
|---|---|
| NetFn | 30h |
| CMD | 92h |
| 1:3 | Manufacturer ID, LS byte first. Fixed length of 3 bytes. For example, a manufacturer ID of 2011 is 0x0007DB in hexadecimal, so byte 1 is DBh, byte 2 is 07h, and byte 3 is 00h. |
| 4 | Sub command = 06h |
| 5 | Device ID (see Table 1) |
| 6 | Reserved |
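
The byte layout in the request table can be sketched in Python as follows. This is a minimal illustration, not part of openUBMC: the function names `encode_manufacturer_id` and `build_request_data` are hypothetical, and the value 00h for the reserved byte is an assumption, since the table only marks byte 6 as reserved.

```python
NETFN = 0x30        # NetFn from the request table
CMD = 0x92          # CMD from the request table
SUB_COMMAND = 0x06  # Sub command for Get-Device-Status


def encode_manufacturer_id(manufacturer_id: int) -> bytes:
    """Encode a manufacturer ID as 3 bytes, least significant byte first."""
    if not 0 <= manufacturer_id <= 0xFFFFFF:
        raise ValueError("manufacturer ID must fit in 3 bytes")
    return manufacturer_id.to_bytes(3, byteorder="little")


def build_request_data(manufacturer_id: int, device_id: int, reserved: int = 0x00) -> bytes:
    """Build request data bytes 1 to 6 (NetFn and CMD travel separately).

    The default of 0x00 for the reserved byte is an assumption.
    """
    for name, value in (("device_id", device_id), ("reserved", reserved)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} must fit in 1 byte")
    return encode_manufacturer_id(manufacturer_id) + bytes([SUB_COMMAND, device_id, reserved])


if __name__ == "__main__":
    # Manufacturer ID 2011 (0x0007DB) and Device ID 01h (CPU in Table 1).
    print(build_request_data(2011, 0x01).hex(" "))  # db 07 00 06 01 00
```

With manufacturer ID 2011, the first three bytes come out as DBh, 07h, 00h, matching the example in the table.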
Response
| Byte position | Field value description |
|---|---|
| 1 | Completion Code |
| 2:4 | Manufacturer ID, LS byte first. Fixed length of 3 bytes. For example, a manufacturer ID of 2011 is 0x0007DB in hexadecimal, so the first byte is DBh, the second is 07h, and the third is 00h. |
| 5 | Device Status1 |
| 6 | Device Status2 |
| 7 | Device Status3 |
| 8 | Device Status4 |
| 9 | Device Status5 |
| 10 | Device Status6 |
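
A response laid out as in the table above could be unpacked as sketched below. The class `DeviceStatusResponse` and the function `parse_response` are hypothetical names used only for illustration. The sketch assumes the full 10-byte response is present; how a response with a non-zero completion code is shaped is not described here, so that case is not handled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DeviceStatusResponse:
    completion_code: int
    manufacturer_id: int
    device_status: Tuple[int, ...]  # Device Status1 to Device Status6, in order


def parse_response(data: bytes) -> DeviceStatusResponse:
    """Split a 10-byte response into the fields listed in the response table."""
    if len(data) < 10:
        raise ValueError(f"expected at least 10 response bytes, got {len(data)}")
    return DeviceStatusResponse(
        completion_code=data[0],
        # Bytes 2:4 hold the manufacturer ID, least significant byte first.
        manufacturer_id=int.from_bytes(data[1:4], byteorder="little"),
        # Bytes 5 to 10 hold Device Status1 to Device Status6.
        device_status=tuple(data[4:10]),
    )
```

For example, `parse_response(bytes([0x00, 0xDB, 0x07, 0x00, 1, 2, 3, 4, 5, 6])).manufacturer_id` evaluates to 2011.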
Command Example
Query device status
Request:
ipmicmd -k "0f 00 MM NN" smi 0
Response:
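0f MM NN
Table 1 BIOS-reported event definitions (Device Status Code)
The Device ID, Device Status1, and Event columns make up the BIOS -> BMC event format.
| Component | Device ID | Device Status1 | Event | Alarm description (BIOS) |
|---|---|---|---|---|
| CPU | 01h | 00h | Firmware Mismatch | At system power-on, the BIOS checks whether the CPU microcode is loaded. If an exception is found, it is reported to the BMC through a BT message. |
| CPU | 01h | 01h | CPU Mismatch<br>Device Status2: CPU ID | Checks whether the CPU is within the supported range. If it is not, the exception is reported to the BMC through a BT message. |
| CPU | 01h | 02h | CPU BIST Failure<br>Device Status2: CPU ID<br>Device Status3: Cache Way Number, low 8 bits<br>Device Status4: Cache Way Number, high 8 bits<br>Device Status5: Isolation core number, low 8 bits<br>Device Status6: Isolation core number, high 8 bits | At system power-on, the CPU runs BIST. If an exception is found and the system can still boot, it is reported to the BMC through a BT message.<br>Device Status3 to Device Status6 are parsed only by the BMC of storage products. The BMC of computing and TCE products does not parse them; the interface stays compatible. |
| CPU | 01h | 04h | Cpu Offline Succeed<br>Device Status2: CPU ID | CPU offline succeeded, that is, a CPU was successfully isolated. This is an informational event that the BMC can use to monitor in real time whether the working CPUs are online. [EX] |
| CPU | 01h | 05h | Unsupported opcode<br>Device Status2: CPU ID | At run time, the CPU executed an unsupported opcode, which caused an error. This is a fatal error. [EX] |
| CPU | 01h | 07h | VMSE Link Failure In Mirror Mode<br>Device Status2: CPU ID<br>Device Status3: VMSE Channel<br>Device Status4: Node ID | In memory mirror mode, the SMI2 link failed. This is an uncorrectable but recoverable error. [EX] |
| CPU | 01h | 08h | VMSE Link Failure<br>Device Status2: CPU ID<br>Device Status3: VMSE Channel<br>Device Status4: Node ID | In non-mirror mode, the SMI2 link failed. This is a fatal error. [EX] |
| CPU | 01h | 09h | VMSE Err Observed<br>Device Status2: CPU ID<br>Device Status3: VMSE Channel<br>Device Status4: Node ID | An error was detected inside Jordan Creek. This is an uncorrectable, non-fatal error. [EX] |
| CPU | 01h | 0ah | VMSE DIMM Register Parity Error<br>Device Status2: CPU ID<br>Device Status3: VMSE Channel<br>Device Status4: Node ID | A parity error occurred on the VMSE link. This is a fatal error. [EX] |
| CPU | 01h | 0bh | VMSE Nb Persistent Counter Reached<br>Device Status2: CPU ID<br>Device Status3: VMSE Channel<br>Device Status4: Node ID | Errors on data sent from Jordan Creek to the memory controller reached the threshold. This is a non-fatal error. [EX] |
| CPU | 01h | 0ch | VMSE Sb Persistent Counter Reached<br>Device Status2: CPU ID<br>Device Status3: VMSE Channel<br>Device Status4: Node ID | Errors on data sent from the memory controller to Jordan Creek reached the threshold. This is a non-fatal error. [EX] |
| CPU | 01h | 0dh | Qpi Error Detected<br>Device Status2: CPU ID<br>Device Status3: QPI ID<br>Device Status4: MSCOD<br>Device Status5: Current Tx Lane (start with 1)<br>Device Status6: Current Rx Lane (start with 1) | A QPI link fault occurred.<br>MSCOD: helps locate the specific cause of the fault.<br>Event codes:<br>0x00: QPI fault detected at power-on (Config error)<br>0x01: QPI Link Failure detected by the power-on self-test<br>0x02: QPI Degrade Failure detected by the power-on self-test<br>Current Tx Lane: 0x14 by default; any other value indicates a problem on the TX side of the link.<br>Current Rx Lane: 0x14 by default; any other value indicates a problem on the RX side of the link.<br>[EX] |
| CPU | 01h | 0eh | Socket Boot Isolation<br>Device Status2: CPU ID | During boot, a CPU that fails its self-test is isolated, and the isolated socket information is reported to the BMC.<br>[EX] |
| CPU | 01h | 0fh | Core Boot Isolation<br>Device Status2: CPU ID<br>Device Status3: Core ID Bitmap, bit0 to bit7 correspond to Core0 to Core7<br>Device Status4: Core ID Bitmap, bit0 to bit7 correspond to Core8 to Core15<br>Device Status5: Core ID Bitmap, bit0 to bit7 correspond to Core16 to Core23<br>……<br>For future extensibility, the number of Device Status bytes is not limited. | During boot, a core that fails its self-test is isolated, and the isolated core information is reported to the BMC.<br>[EX] |
| CPU | 01h | 10h | PCI resource configuration<br>Device Status2: CPU ID | During boot, if a card that requires I/O resources is installed in a restricted slot, the BIOS stops it.<br>[EX] |
| CPU | 01h | 11h | Vmse Config Error<br>Device Status2: CPU ID<br>Device Status3: Vmse channel id<br>Device Status4: Node ID | Mainly records JC errors that occur during boot, or SMI2 channel errors between the CPU and the JC.<br>[EX] |
| Pcie | 02h | 00h | Pcie error:<br>Device Status2: Bus<br>Device Status3: Device<br>Device Status4: Function | PCIe device fatal error |
| Pcie | 02h | F7h | Pcie error:<br>Device Status2: Bus<br>Device Status3: Device<br>Device Status4: Function | MMIO resource not enough |
| Pcie | 02h | F8h | Pcie error:<br>Device Status2: Bus<br>Device Status3: Device<br>Device Status4: Function | Legacy IO resource not enough |
| Pcie | 02h | F9h | Pcie error:<br>Device Status2: Bus<br>Device Status3: Device<br>Device Status4: Function | Legacy oprom resource not enough |
| Pcie | 02h | FAh | Pcie error:<br>Device Status2: Bus<br>Device Status3: Device<br>Device Status4: Function | PCIe Bandwidth Error |
| Pcie | 02h | F5h | Pcie error:<br>Device Status2: Bus<br>Device Status3: Device<br>Device Status4: Function<br>Device Status5: MaxSpeed<br>Device Status6: NegoSpeed | PCIe Link Speed Error |
| Video | 03h | 00h | Video device error | When the graphics chip is abnormal, the exception is reported to the BMC through a BT message. |
| Memory | 04h | 00h | Memory config error:<br>Device Status2: NODE ID<br>Device Status3: channel id<br>Device Status4: dimm id<br>Device Status5: severity (0: invalid, 1: ok, 2: minor, 3: major, 4: critical) | 1. Checks whether memory is installed correctly. If an exception is found and the system can still boot, it is reported to the BMC through a BT message.<br>2. If no memory is detected, a no-memory alarm is reported to the BMC through a BT message.<br>3. DIMM exception detection. If memory is detected during boot but is masked out after training completes, the exception is reported to the BMC through a BT message, and the screen shows a message that some memory cannot be used.<br>4. During boot, the BIOS reports all MRC faults through this event. MRC faults are detected and raised directly by code provided by Intel. Intel currently provides only the code and no documentation, so the BIOS cannot know which scenarios raise which MRC faults, nor determine the severity of each MRC fault.<br>5. The severity of an MRC error is reported by the BIOS as a reference for the alarm severity that the BMC generates. Currently all MRC faults are reported as critical, consistent with the existing Configuration error severity. If other severities need to be reported later, the BIOS must be adjusted. |
| Memory | 04h | 01h | DDDC Sparing<br>Device Status2: NODE ID<br>Device Status3: channel id<br>Device Status4: dimm id<br>Device Status5: rank id | For a rank that meets the DDDC requirements (x4 devices with lock-step enabled), when correctable errors on the rank exceed the threshold, the BIOS replaces the faulty device with the redundant device in the same rank. This is a correctable error. |
| Memory | 04h | 02h | RANK Sparing<br>Device Status2: NODE ID<br>Device Status3: channel id<br>Device Status4: dimm id<br>Device Status5: error rank id | With the memory Rank Sparing feature enabled, when correctable errors on a rank exceed the threshold, the BIOS replaces the faulty rank with the spare rank. |
| Memory | 04h | 03h | Device Tagging<br>Device Status2: NODE ID<br>Device Status3: channel id<br>Device Status4: dimm id<br>Device Status5: rank id | For a rank that meets the Device Tagging requirements (x4 devices, or x8 devices with lock-step enabled), when correctable errors on the rank exceed the threshold, the BIOS replaces the faulty device with the parity device. This is a correctable error. |
| Memory | 04h | 04h | Memory migration<br>Device Status2: ErrorNode<br>Device Status3: SpareNode<br>Device Status4: EventID | Memory migration:<br>ErrorNode: the faulty memory board<br>SpareNode: the spare board<br>EventID: the migration event code<br>EventID values:<br>0x0: MIGRATION_SUCCESS<br>0x1: MIGRATION_CFG_ERR<br>0x2: MIGRATION_NON_NUMA<br>0x3: MIGRATION_SMALL_MEMSIZE<br>0x4: MIGRATION_ONLINE_FAIL<br>0x5: MIGRATION_COPY_FAIL<br>0x6: MIGRATION_ONLINE_INPROGRESS<br>0x7: MIGRATION_ONLINE_SUCCESS<br>0x8: MIGRATION_INPROGRESS<br>0x9: MIGRATION_BEGIN<br>0xa: MIGRATION_CANCEL<br>0xb: MIGRATION_PATROLSCRUB_START<br>0xc: MIGRATION_PATROLSCRUB_END |
| Memory | 04h | 05h | Mirror FailOver<br>Device Status2: NODE ID<br>Device Status3: channel id | With the memory mirroring feature enabled, an uncorrectable error on a memory channel causes the mirror to fail. This falls within the correctable error category. [EX] |
| Memory | 04h | 07h | Memory Board Add Fail<br>Device Status2: NODE ID | An error occurred during memory board hot insertion. [EX] |
| Memory | 04h | 08h | CE overflow (correctable errors exceed the threshold)<br>Device Status2: NODE ID<br>Device Status3: channel id<br>Device Status4: dimm id<br>Device Status5: rank id | Correctable errors after the RAS feature has taken effect. |
| Memory | 04h | 0ah | Spare Rank Info<br>Device Status2: NODE ID<br>Device Status3: channel id<br>Device Status4: dimm id<br>Device Status5: rank id | If rank sparing is enabled, reports the spare rank information. |
| Memory | 04h | 0bh | Spare Memory Board Info<br>Device Status2: NODE ID | If memory migration is enabled, reports the spare memory board information. [EX] |
| Memory | 04h | 0ch | Memory Board Online Succeed<br>Device Status2: NODE ID | Memory board online succeeded.<br>[EX] |
| Memory | 04h | 0eh | Memory Board Offline Succeed<br>Device Status2: NODE ID | Memory board offline succeeded. [EX] |
| Memory | 04h | 11h | Mirror Node<br>Device Status2: LowBitmapOfMirrorNode (node0-7)<br>Device Status3: HighBitmapOfMirrorNode (node8-15)<br>Device Status4: LowBitmapOfMirrorNode (node16-23)<br>Device Status5: HighBitmapOfMirrorNode (node24-31)<br>Device Status6: LowBitmapOfMirrorNode (node32-39)<br>Device Status7: HighBitmapOfMirrorNode (node40-47)<br>Device Status8: LowBitmapOfMirrorNode (node48-55)<br>Device Status9: HighBitmapOfMirrorNode (node56-63)<br>(reserved) | The memory board entered mirror mode.<br>[EX] |
| Memory | 04h | 12h | PFAE<br>Device Status2: NODE ID<br>Device Status3: channel id<br>Device Status4: dimm id | Memory pre-failure event. |
| Memory | 04h | 20h | FailDimmInfo<br>Device Status2: NODE ID<br>Device Status3: channel id<br>Device Status4: dimm id<br>Device Status5: rank id<br>Device Status6: bank group id<br>Device Status7: bank id | Memory failure event. |
| KBC | 05h | 00h | Unrecoverable PS/2 or USB keyboard failure | The KBC controller runs a self-test during boot. If a problem is found, it must be reported to the BMC through a BT message. |
| Memory Board | 06h | 00h | Memory Board Memory Controller Fault<br>Device Status2: NODE ID<br>Device Status3: channel id<br>Device Status4: dimm id<br>Device Status5: memory control id | [EX] |
| Memory Board | 06h | 01h | Memory Board Power Chip Fault<br>Device Status2: NODE ID<br>Device Status3: channel id<br>Device Status4: dimm id<br>Device Status5: power chip id | [EX] |
| Memory Board | 06h | 02h | Memory Board Hot Remove<br>Device Status2: Operate Code<br>Device Status3: NODE ID 1<br>Device Status4: NODE ID 2 | Operate Code: always 00h; used to issue the hot-remove start command.<br>[EX] |
| Memory Board | 06h | 02h | Memory Board Hot Remove<br>Device Status2: Operate Code<br>Device Status3: NODE ID 1<br>Device Status4: NODE ID 2 | Operate Code: always 01h; used by the BIOS to report the hot-remove start marker.<br>[EX] |
| Memory Board | 06h | 02h | Memory Board Hot Remove<br>Device Status2: Operate Code<br>Device Status3: NODE ID 1<br>Device Status4: NODE ID 2<br>Device Status5: Progress Code | Operate Code: always 02h; used by the BIOS to report hot-remove progress.<br>Progress Code:<br>0x00: memory board offline<br>0x01: OS migration complete<br>0x02: write NC<br>0x03: BE directory cleared<br>0x04: memory board powered off<br>[EX] |
| Memory Board | 06h | 02h | Memory Board Hot Remove<br>Device Status2: Operate Code<br>Device Status3: NODE ID 1<br>Device Status4: NODE ID 2<br>Device Status5: Event Code | Operate Code: always 03h; used by the BIOS to report a hot-remove failure.<br>Event Code:<br>0x00: unknown error<br>0x01: insufficient remaining memory<br>[EX] |
| Memory Board | 06h | 02h | Memory Board Hot Remove<br>Device Status2: Operate Code<br>Device Status3: NODE ID 1<br>Device Status4: NODE ID 2<br>Device Status5: Result Code | Operate Code: always 03h; used by the BIOS to report a hot-remove failure.<br>Event Code:<br>0x00: unknown error<br>0x01: insufficient remaining memory<br>[EX] |
| Memory Board | 06h | 02h | Memory Board Hot Remove<br>Device Status2: Operate Code<br>Device Status3: NODE ID 1<br>Device Status4: NODE ID 2<br>Device Status5: Result Code | Operate Code: always 04h; used by the BIOS to report the hot-remove end marker.<br>Result Code:<br>0x00: success<br>0x01: failure<br>[EX] |
| Memory Board | 06h | 03h | Memory Board Hot Plug<br>Device Status2: Operate Code<br>Device Status3: NODE ID 1<br>Device Status4: NODE ID 2 | Operate Code: always 00h; used to issue the hot-add start command.<br>[EX] |
| Memory Board | 06h | 03h | Memory Board Hot Plug<br>Device Status2: Operate Code<br>Device Status3: NODE ID 1<br>Device Status4: NODE ID 2 | Operate Code: always 01h; used by the BIOS to report the hot-add start marker.<br>[EX] |
| Memory Board | 06h | 03h | Memory Board Hot Plug<br>Device Status2: Operate Code<br>Device Status3: NODE ID 1<br>Device Status4: NODE ID 2<br>Device Status5: Progress Code | Operate Code: always 02h; used by the BIOS to report hot-add progress.<br>Progress Code:<br>0x00: memory board powered off<br>0x01: memory board power-on initialization<br>0x02: SAD/TAD information modified<br>0x03: OS brings the added memory online<br>[EX] |
| Memory Board | 06h | 03h | Memory Board Hot Plug<br>Device Status2: Operate Code<br>Device Status3: NODE ID 1<br>Device Status4: NODE ID 2<br>Device Status5: Event Code | Operate Code: always 03h; used by the BIOS to report a hot-add failure.<br>Event Code:<br>0x00: unknown error<br>0x01: the two memory boards have different capacities<br>0x02: the two memory boards are not both present (from the service side; the node ID of the absent memory board in Device Status3 or Device Status4 is set to 0xff)<br>[EX] |
| Memory Board | 06h | 03h | Memory Board Hot Plug<br>Device Status2: Operate Code<br>Device Status3: NODE ID 1<br>Device Status4: NODE ID 2<br>Device Status5: Result Code | Operate Code: always 04h; used by the BIOS to report the hot-add end marker.<br>Result Code:<br>0x00: success<br>0x01: failure<br>[EX] |
| IOH | 07h | 00h | IOH Fault<br>Device Status2: IOH ID | - |
| Chassis | 09h | 00h | General Chassis Intrusion | Chassis intrusion (cover opened) indication. |
| Cpu Board | 0ah | 00h | Cpu Board Hot Remove<br>Device Status2: Operate Code<br>Device Status3: CPU ID 1<br>Device Status4: CPU ID 2 | Operate Code: always 00h; used to issue the hot-remove start command.<br>[EX] |
| Cpu Board | 0ah | 00h | Cpu Board Hot Remove<br>Device Status2: Operate Code<br>Device Status3: CPU ID 1<br>Device Status4: CPU ID 2 | Operate Code: always 01h; used by the BIOS to report the hot-remove start marker.<br>[EX] |
| Cpu Board | 0ah | - | Cpu Board Hot Remove<br>Device Status2: Operate Code<br>Device Status3: CPU ID 1<br>Device Status4: CPU ID 2<br>Device Status5: Progress Code | Operate Code: always 02h; used by the BIOS to report hot-remove progress.<br>Progress Code:<br>0x00: IIO offline started<br>0x01: memory offline started<br>0x02: CPU offline started<br>0x03: CPU offline complete<br>0x04: processor board power-off started<br>[EX] |
| Cpu Board | 0ah | 00h | Cpu Board Hot Remove<br>Device Status2: Operate Code<br>Device Status3: CPU ID 1<br>Device Status4: CPU ID 2<br>Device Status5: Event Code | Operate Code: always 03h; used by the BIOS to report a hot-remove failure.<br>Error Code:<br>0x00: unknown error<br>[EX] |
| Cpu Board | 0ah | 00h | Cpu Board Hot Remove<br>Device Status2: Operate Code<br>Device Status3: CPU ID 1<br>Device Status4: CPU ID 2<br>Device Status5: Result Code | Operate Code: always 04h; used by the BIOS to report the hot-remove end marker.<br>Result Code:<br>0x00: success<br>0x01: failure<br>[EX] |
| Cpu Board | 0ah | 01h | Cpu Board Hot Plug<br>Device Status2: Operate Code<br>Device Status3: CPU ID 1<br>Device Status4: CPU ID 2 | Operate Code: always 00h; used to issue the hot-add start command. |
| Cpu Board | 0ah | 01h | Cpu Board Hot Plug<br>Device Status2: Operate Code<br>Device Status3: CPU ID 1<br>Device Status4: CPU ID 2 | Operate Code: always 01h; used by the BIOS to report the hot-add start marker.<br>[EX] |
| Cpu Board | 0ah | 01h | Cpu Board Hot Plug<br>Device Status2: Operate Code<br>Device Status3: CPU ID 1<br>Device Status4: CPU ID 2<br>Device Status5: Progress Code | Operate Code: always 02h; used by the BIOS to report hot-add progress.<br>Progress Code:<br>0x00: power-on OK message sent from the BMC<br>0x01: Cpu Online complete<br>[EX] |
| Cpu Board | 0ah | 01h | Cpu Board Hot Plug<br>Device Status2: Operate Code<br>Device Status3: CPU ID 1<br>Device Status4: CPU ID 2<br>Device Status5: Event Code | Operate Code: always 03h; used by the BIOS to report a hot-add failure.<br>Error Code:<br>0x00: unknown error<br>[EX] |
| Cpu Board | 0ah | 01h | Cpu Board Hot Plug<br>Device Status2: Operate Code<br>Device Status3: CPU ID 1<br>Device Status4: CPU ID 2<br>Device Status5: Result Code | Operate Code: always 04h; used by the BIOS to report the hot-add end marker.<br>Result Code:<br>0x00: success<br>0x01: failure<br>[EX] |
| Computing Unit | 0Bh | 00h | Computing Unit Hot Plug<br>Device Status2: Operate Code<br>Device Status3: Cluster ID 1<br>Device Status4: Cluster ID 2 | Operate Code: always 00h; used to issue the hot-add start command.<br>[EX] |
| Computing Unit | 0Bh | 00h | Computing Unit Hot Plug<br>Device Status2: Operate Code<br>Device Status3: Cluster ID 1<br>Device Status4: Cluster ID 2 | Operate Code: always 01h; used by the BIOS to report the hot-add start marker.<br>[EX] |
| Computing Unit | 0Bh | 00h | Computing Unit Hot Plug<br>Device Status2: Operate Code<br>Device Status3: Cluster ID 1<br>Device Status4: Cluster ID 2<br>Device Status5: Progress Code | Operate Code: always 02h; used by the BIOS to report hot-add progress.<br>Progress Code:<br>0x00: power-on OK message<br>0x01: Cpu Online progress<br>[EX] |
| Computing Unit | 0Bh | 00h | Computing Unit Hot Plug<br>Device Status2: Operate Code<br>Device Status3: Cluster ID 1<br>Device Status4: Cluster ID 2<br>Device Status5: Event Code | Operate Code: always 03h; used by the BIOS to report a hot-add failure.<br>Error Code:<br>0x00: unknown error<br>[EX] |
| Computing Unit | 0Bh | 00h | Computing Unit Hot Plug<br>Device Status2: Operate Code<br>Device Status3: Cluster ID 1<br>Device Status4: Cluster ID 2<br>Device Status5: Result Code | Operate Code: always 04h; used by the BIOS to report the hot-add end marker.<br>Result Code:<br>0x00: success<br>0x01: failure<br>[EX] |
| Computing Unit | 0Bh | 01h | System Merging error detected:<br>Device Status2: Error Code<br>Device Status3: Cluster ID | Error Code:<br>0x01: firmware post error (reported by the primary BIOS)<br>0x02: no hard partition configuration (reported by each BIOS)<br>0x03: setup configuration synchronization failed (reported by each BIOS)<br>0x04: memory frequency differs between clusters (reported by the primary BIOS)<br>[EX] |
| Computing Unit | 0Bh | 02h | Configuration Error:<br>Device Status2: Error Code<br>Device Status3: Cluster ID<br>Device Status4: CPU ID | Error Code:<br>0x1: Memory module and DIMMs must be configured in pairs for each (memory boards within the same node must be installed in pairs)<br>[EX] |
| RAID | 0Eh | 00h | RAID hardware fault.<br>Device Status2: pcie bus number (root port)<br>Device Status3: pcie device number (root port)<br>Device Status4: pcie function number (root port)<br>Device Status5: error code | RAID card hardware fault. The BMC cannot detect faults in components around the RAID controller chip (such as memory) through the RAID card's error pin, or the card has no error pin (standard PCIe RAID cards), so the BIOS detects and reports them. Detection at run time is not supported.<br>Error code:<br>00h: normal, no fault<br>Other values: fault |
| Boot Error | 0Fh | 00h | No Boot Device Error | No boot device. |
| Boot Error | 0Fh | 01h | Non-bootable diskette left in drive | No bootable disk. |
| Boot Error | 0Fh | 02h | PXE Server not found | The PXE server cannot be found. |
| Boot Error | 0Fh | 03h | Invalid boot sector | Invalid boot entry. |
| Boot Error | 0Fh | 04h | Timeout waiting for selection | Timed out while selecting a boot entry. |
| Boot End Flag | 10h | 01h | Post end flag | BIOS boot complete (added for 1620). |
| Certificate | 11h | 00h | NA | The BIOS certificate is normal. |
| Certificate | 11h | 01h | BIOS certificate is about to expire or has expired. | The BIOS certificate has expired or is about to expire. |
| Memory | 12h | 00h-20h | NA | Same capability as Memory 04h. The only difference in this interface is the channel id, which is explicitly the logical channel id. |
| Memory | 12h | 24h | PPR<br>Device Status2: system id (global system id)<br>Device Status3: local system id<br>Device Status4: socket id (compatible with CPU and NPU)<br>Device Status5: channel id (logical channel)<br>Device Status6: dimm id (same as Rank Group ID)<br>Device Status7: subchannel id<br>Device Status8: device (DRAM chip)<br>Device Status9: rank id<br>Device Status10: bank group id<br>Device Status11: bank id<br>Device Status12: row addr high<br>Device Status13: row addr mid<br>Device Status14: row addr low<br>Device Status15: repair type (0: hPPR; 1: sPPR; 2: mPPR) | Memory PPR event. |
| Memory | 12h | 25h | ACLS<br>Device Status2: system id (global system id)<br>Device Status3: local system id<br>Device Status4: socket id (compatible with CPU and NPU)<br>Device Status5: channel id (logical channel)<br>Device Status6: dimm id (same as Rank Group ID)<br>Device Status7: subchannel id<br>Device Status8: device (DRAM chip)<br>Device Status9: rank id<br>Device Status10: bank group id<br>Device Status11: bank id<br>Device Status12: row addr high<br>Device Status13: row addr mid<br>Device Status14: row addr low<br>Device Status15: col addr high<br>Device Status16: col addr low<br>Device Status17: transfer num (burst id)<br>Device Status18: spare index | Memory ACLS event. |
| Link Error | 13h | 00h | PSULinkAbnormal<br>Device Status2: CPU ID<br>Device Status3: PSU Type<br>Device Status4: Die<br>Device Status5: Channel | Power supply PMBUS/AVSBUS access exception alarm.<br>PSU Type: power supply type<br>0x00: CORE<br>0x01: UNCORE<br>0x02: DDR VDD<br>0x03: Nimbus<br>0x04: DDR VDDQ<br>Die: powered Die region<br>0x00: NA<br>0x01: TA<br>0x02: NB<br>0x03: TB<br>Channel: power supply path<br>0x00: PMBUS<br>0x01: AVSBUS<br>Note: Parameters differ between CPUs. Confirm the specific parameter mapping with O&M personnel. |
| Link Error | 13h | 01h | Memory I2C/I3C Link Abnormal<br>Device Status2: CPU ID<br>Device Status3: Channel Index<br>Device Status4: DIMM Index | |
| VR | 14h | 00h | VR Power Phase Redundancy<br>Device Status2: CPU ID<br>Device Status3: PSU Type<br>Device Status4: Die | VR power phase redundancy.<br>PSU Type: power supply type<br>0x00: CORE<br>0x01: UNCORE<br>0x02: DDR VDD<br>0x03: Nimbus<br>0x04: DDR VDDQ<br>0x05: CORE_MEM<br>0x06: UNCORE_MEM<br>0x07: SIOE<br>Die: powered Die region<br>0x00: NA<br>0x01: TA<br>0x02: NB<br>0x03: TB<br>Note: Parameters differ between CPUs. Confirm the specific parameter mapping with O&M personnel. |
| Base OS Boot | 15h | 06h | boot completed - boot device not specified | |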
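
For the Core Boot Isolation event in Table 1 (Device ID 01h, Device Status1 0fh), each Device Status byte from Device Status3 onward carries an 8-core bitmap. The sketch below shows one way to expand those bytes into core numbers. The function name `isolated_cores` is hypothetical, and the sketch assumes the caller has already extracted the bitmap bytes (Device Status3 and later) from the event.

```python
from typing import List


def isolated_cores(bitmap_bytes: bytes) -> List[int]:
    """Return the core numbers whose bits are set in the Core ID Bitmap bytes.

    bitmap_bytes[0] is Device Status3 (Core0 to Core7), bitmap_bytes[1] is
    Device Status4 (Core8 to Core15), and so on. bit0 is the lowest core
    number in each byte.
    """
    cores = []
    for byte_index, value in enumerate(bitmap_bytes):
        for bit in range(8):
            if value & (1 << bit):
                cores.append(byte_index * 8 + bit)
    return cores


if __name__ == "__main__":
    # Device Status3 = 05h, Device Status4 = 80h, Device Status5 = 01h
    print(isolated_cores(bytes([0x05, 0x80, 0x01])))  # [0, 2, 15, 16]
```

Because the table does not limit the number of Device Status bytes for this event, the function accepts a bitmap of any length.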