GPU Adaptation Guide for Specification 1.0
Last updated: 2025/12/09

This document describes how to adapt a GPU of specification 1.0 to openUBMC. Specification 1.0 refers to common PCIe card configurations, including CSR configuration and code protocol adaptation.

Introduction to GPUs

A graphics processing unit (GPU) is a hardware device specialized for graphics rendering and parallel computing. Originally designed for accelerating computer graphics processing, GPUs have evolved into essential tools for general-purpose parallel computing. They are widely used in scientific computing, artificial intelligence (AI), and data analysis.

Compared with a central processing unit (CPU), a GPU has more cores and higher memory bandwidth, making it particularly suitable for processing large-scale parallel computing tasks.

GPUs have evolved from simple graphics processing devices into core components for general-purpose parallel computing. They play an irreplaceable role, especially in the fields of artificial intelligence, scientific computing, and data centers. With continuous technological progress, GPUs will continue to drive improvements in computing power and provide powerful support for innovative applications across various industries.

GPU Adaptation Process

Like other components, every GPU has a CSR file. The CSR file tells openUBMC how to identify the GPU, manage its information, and monitor its status.

CSR Information to Include When Adapting a New GPU

  • Fixed information such as the GPU name, model, manufacturer, slot number, serial number (SN), quadruple, processor type, part number, and root BDF
  • Status of the GPU to be monitored, such as the temperature, status, and power consumption
  • Required sensors for the GPU and how to obtain their readings
  • Alarms or events and their reporting conditions, such as temperature, GPU pre-failure, replacement records, PCIe bandwidth, and PCIe rate
  • Sources for obtaining GPU status and alarm information
  • Whether the GPU requires a separate cooling policy
  • Whether the GPU requires adaptation for a new protocol

GPU Identification

  • Configure the self-description record (SR) file name using the format Bom + Id + AuxId. In the SR of the upstream component (such as the carrier board) connected to the GPU, configure the connector for the GPU, and then load the GPU SR file based on the presence signal feedback from the hardware.
  • You are advised to set the SR loading mode IdentifyMode to 2. After the BIOS reports the GPU BDF information, the BMC queries the quadruple information based on the BDF and loads the GPU.

The following connector configuration shows how the GPU is loaded in this mode:

json
{
    "Connector_PCIE_1": {
        "Bom": "14140130",
        "Slot": 1,
        "Position": 1,
        "Presence": 0,  // After the BIOS reports the presence information, the BMC queries the quadruple information based on the reported BDF, finds the corresponding SR, and updates the presence status.
        "Id": "",
        "AuxId": "",
        "Buses": [
            "I2cMux_IEUChan1"
        ],
        "SystemId": "${SystemId}",
        "ManagerId": "${ManagerId}",
        "ChassisId": "${ChassisId}",
        "IdentifyMode": 2,  // Downstream component identification mode. 2 corresponds to components with unreadable (reported) BoardId.
        "Container": "Component_RiserCard",
        "Type": "PCIe"
    }
}

Management Topology Object

  • Configure all bus information for the GPU.
  • Configure all channel information.
  • Configure all chip information.
json
{
    "ManagementTopology": {
        "Anchor": {
            "Buses": ["I2cMux_Chan"]
        },
        "I2cMux_Chan": {
            "Chips": ["Chip_TempChip"]
        }
    }
}

Component Object

  • Configure FruId of the GPU.
  • Configure the component type of the GPU.
  • Configure the presence information of the GPU.
  • Configure the SN source of the GPU.
  • Configure the name of the GPU.
json
{
    "Component_PCIeCard": {
        "Instance": "<=/PCIeDevice_1.SlotID",
        "Type": 8,  // Component type, corresponding to COMPONENT_TYPE* in the code.
        "Name": "<=/PCIeDevice_1.DeviceName", // Name.
        "FruId": 255,  // The default value is 255 if there is no FRU. Otherwise, FruId of the FRU object is referenced.
        "Presence": 1, // Presence information.
        "Health": 0, // Health status.
        "PowerState": 1, // Power status.
        "BoardId": 65535,  // Board ID.
        "UniqueId": "N/A", // Unique ID.
        "Manufacturer": "Nvidia",  // Manufacturer.
        "GroupId": 1,
        "ReplaceFlag": 0,    // Replacement flag.
        "PreviousSN": "",  // Previous SN.
        "SerialNumber": "<=/PCIeCard_1.SerialNumber"  // SN.
    }
}

Entity Object

json
{
    "Entity_GPUCard": {
        "Id": 11,
        "Name": "GPUCard",
        "PowerState": 1,
        "Presence": 1,
        "Instance": 101
    }
}

GPU Object

Define the basic information and properties of the GPU.

  • Name
  • ID
  • Position information
  • Manufacturer information
  • Slot number of the device
  • Model information
  • GPU firmware version
  • GPU part number
  • Quadruple information of the GPU
  • BDF information of the GPU
json
{
    "GPU_1": {
        "SystemId": 1,
        "Id": 1,  // GPU ID
        "Name": "RTX A6000",  // GPU name
        "Presence": 1,  // GPU presence state. 0 indicates absence, 1 indicates presence, and 255 indicates an invalid value.
        "Manufacturer": "Nvidia",  // GPU manufacturer
        "Model": "RTX A6000",  // GPU model
        "SN": "",  // GPU SN, generally obtained through the GPU out-of-band management protocol.
        "SocketDesignation": "1",
        "Position": "<=/PCIeDevice_1.Position",  // GPU position information
        "ProcessorType": "2",  // Processor type. For GPUs, this is configured as "2" in the CSR.
        "Health": "#/Component_PCIeCard.Health",  // Health status
        "Slot": "<=/PCIeDevice_1.SlotID",  // Slot number of the device
        "VendorID": 4318,  // Vendor ID of the device
        "DeviceID": 8752,  // Device ID of the device
        "SubVendorID": 4318, // Sub-vendor ID of the device
        "SubDeviceID": 5209,  // Sub-device ID of the device
        "DevBus": "<=/PCIeDevice_1.DevBus",  // Device Bus
        "DevDevice": "<=/PCIeDevice_1.DevDevice",  // Device in device BDF
        "DevFunction": "<=/PCIeDevice_1.DevFunction",  // Device function
        "RefChip": "#/Chip_TempChip",  // Referenced chip object
        "SerialNumber": "#/PCIeCard_1.SerialNumber",  // GPU SN
        "CardFirmwareVersion": "#/PCIeCard_1.FirmwareVersion",  // GPU firmware version
        "CardPartNumber": "0632Y014"  // GPU part number
    }
}
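
The quadruple values in the example above (VendorID, DeviceID, SubVendorID, and SubDeviceID) are written in decimal, whereas tools such as lspci print PCI IDs in hexadecimal. The following is a minimal sketch for converting between the two forms when filling in the CSR. The helper names `quadruple_to_hex` and `hex_to_decimal` are hypothetical and are not part of openUBMC.

```python
def quadruple_to_hex(vendor_id, device_id, sub_vendor_id, sub_device_id):
    """Format a decimal PCI quadruple as 'vvvv:dddd ssss:ssss' in lowercase hex."""
    ids = (vendor_id, device_id, sub_vendor_id, sub_device_id)
    for value in ids:
        # Each PCI ID is a 16-bit field.
        if not 0 <= value <= 0xFFFF:
            raise ValueError(f"PCI ID out of 16-bit range: {value}")
    return "{:04x}:{:04x} {:04x}:{:04x}".format(*ids)


def hex_to_decimal(pci_id):
    """Convert one hexadecimal PCI ID string (for example '10de') to a decimal CSR value."""
    return int(pci_id, 16)


# Values taken from the GPU_1 example above.
print(quadruple_to_hex(4318, 8752, 4318, 5209))  # 10de:2230 10de:1459
print(hex_to_decimal("10de"))  # 4318
```

Comparing the hexadecimal form with the IDs reported by the OS is a quick way to confirm that the quadruple in the CSR matches the installed card.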

PCIeDevice Object

json
{
    "PCIeDevice_1": {
        "Segment": 1,  // Segment number, applied in multi-PCI bridge scenarios. Each segment corresponds to a PCI bus space.
        "DeviceName": "PCIe Card $ (RTX A6000)",  // Device resource name
        "DiagnosticFault": 0,  // Critical failure diagnostic alarm status
        "PredictiveFault": 0,  // Pre-failure alarm status
        "FunctionClass": 3,  // Function class: 0 for unknown, 1 for RAID, 2 for NIC, 3 for GPU, 4 for storage card (SSD card/M.2 card), 5 for SDI card, 6 for acceleration card, 7 for expansion card (PCIe riser), 8 for FPGA card, 9 for NPU.
        "Position": "",  // Device position (container name)
        "Container": "${Container}",  // PCIe device container reference
        "GroupPosition": "PCIeDevice_${GroupPosition}",  // Primary key of the PCIe device object. It cannot be duplicated in the same CSR.
        "BandwidthReduction": 0,  // Bandwidth reduction alarm reporting event property, reported by the BIOS
        "CorrectableError": 0,  // Correctable error, reported by the BIOS
        "UncorrectableError": 0,  // Uncorrectable error, reported by the BIOS
        "FatalError": 0,  // Fatal error, reported by the BIOS
        "LinkSpeedReduced": 0,  // Event property for the reporting of link speed reduction alarms, reported by the BIOS
        "DeviceType": 8,  // Device type, corresponding to DEVICE_TYPE* in the code
        "PCIeDeviceType": "SingleFunction",  // Device type
        "SlotType": "FullLength",  // Slot type
        "FunctionProtocol": "PCIe",  // Protocol type
        "FunctionType": "Physical",  // Function type
        "SlotID": 1,  // Slot number
        "Bus": 0,  // Root port bus
        "Device": 0,  // Root port device
        "Function": 0,  // Root port function
        "DevBus": 1,  // Device bus
        "DevDevice": 0,  // Device in device BDF
        "DevFunction": 0,   // Device function
        "SocketID": 0  // CPU ID
    }
}

PCIeCard Object

json
{
    "PCIeCard_1": {
        "SlotID": "<=/PCIeDevice_1.SlotID",  // Slot number of the device
        "NodeID": "<=/PCIeDevice_1.SlotID |> string.format('PCIeCard%s',$1)",  // Board node ID
        "Name": "RTX A6000",  // Board name
        "DeviceName": "<=/PCIeDevice_1.DeviceName",  // Device resource name
        "BoardName": "RTX A6000",  // Board name
        "Model": "RTX A6000",  // Device model
        "Description": "RTX A6000",  // Device description
        "FunctionClass": 3,  // Function class: 0 for unknown, 1 for RAID, 2 for NIC, 3 for GPU, 4 for storage card (SSD card/M.2 card), 5 for SDI card, 6 for acceleration card, 7 for expansion card (PCIe riser), 8 for FPGA card, 9 for NPU
        "VendorID": 4318,  // Vendor ID of the device
        "DeviceID": 8752,  // Device ID of the device
        "SubVendorID": 4318, // Sub-vendor ID of the device
        "SubDeviceID": 5209,  // Sub-device ID of the device
        "Position": "<=/PCIeDevice_1.Position",  // Position of the downstream component
        "LaneOwner": "<=/PCIeDevice_1.SocketID",  // Resource ownership. The start value is 1, indicating the CPU to which the current card is connected.
        "FirmwareVersion": "<=/GPU_1.FirmwareVersion",  // Firmware version
        "Manufacturer": "Nvidia",  // Manufacturer
        "PartNumber": "0632Y014",  // Part number
        "Protocol": "",  // Protocol type
        "MaxFrameLen": 64,  // Maximum frame length
        "LinkSpeed": "N/A",  // Current link speed
        "LinkSpeedCapability": "N/A",  // Maximum link speed
        "PcbVersion": "N/A",  // PCB version
        "DevBus": "<=/PCIeDevice_1.DevBus",  // Device Bus
        "DevDevice": "<=/PCIeDevice_1.DevDevice",  // Device in device BDF
        "SerialNumber": "<=/GPU_1.SN",  // SN
        "DevFunction": "<=/PCIeDevice_1.DevFunction"   // Device function
    }
}

Alarm and Event Configuration

For configuration, refer to Sensor Customization and Development. The following example focuses solely on the monitoring configuration of GPU sensors.

  • Set temperature monitoring thresholds for alarms or cooling.

Threshold Sensor Object (ThresholdSensor)

  • Six thresholds
  • Alarm deassertion hysteresis
  • Reading data source
  • Reasonable M and RBExp values to keep calculated sensor readings within the valid one-byte range
json
{
    "ThresholdSensor_GPUTemp": {
        "OwnerId": 32,  // Owner ID of the sensor
        "OwnerLun": 0,  // Owner LUN of the sensor
        "EntityId": "<=/Entity_GPUCard.Id",  // Entity identifier, associated with Entity.Id
        "EntityInstance": "<=/Entity_GPUCard.Instance",  // Entity instance, associated with Entity.Instance
        "Initialization": 127,  // Sensor initialization option. For threshold sensors, set it to 127.
        "Capabilities": 104,  // Sensor effective conditions
        "SensorType": 1, // Sensor type. For details, see chapter 42 in IPMI Specification.
        "ReadingType": 1,  // Sensor reading type. For threshold sensors, set it to 1.
        "SensorName": "<=/PCIeDevice_1.SlotID |> string.format('GPU%s Temp',$1)",  // Sensor name
        "Unit": 128,  // Unit. 128 indicates signed, 0 indicates unsigned.
        "BaseUnit": 1,  // Base unit. See the IPMI specification. For temperature sensors, set it to 1.
        "ModifierUnit": 0,  // Sensor unit descriptor
        "Analog": 1,
        "NominalReading": 25,
        "NormalMaximum": 0,
        "NormalMinimum": 0,
        "MaximumReading": 127,  // Maximum sensor reading
        "MinimumReading": 128,  // Minimum sensor reading
        "Reading": "<=/Scanner_GPUTemp.Value",  // Sensor reading
        "ReadingStatus": "<=/Scanner_GPUTemp.Value;<=/Scanner_GPUTemp.Status |> expr($1 >= 255 ? 2 : ($2 == 0 ? 0 : 2))",  // Sensor reading status
        "AssertMask": 128, // Alarm capability mask. For details, see Assertion Event Mask in IPMI Specification.
        "DeassertMask": 28800, // For details, see Deassertion Event Mask in IPMI Specification.
        "ReadingMask": 2056,
        "Linearization": 0,
        "M": 100,  // Linear calculation equation parameter
        "RBExp": 224,  // [7:4] RExp (K2, signed, two's complement), 4-bit; [3:0] BExp (K1, signed, two's complement), 4-bit.
        "UpperNoncritical": 93,  // Upper non-critical threshold
        "PositiveHysteresis": 2,  // Alarm threshold hysteresis
        "NegativeHysteresis": 2  // Alarm threshold hysteresis
    }
}
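
To check that the chosen M and RBExp values keep readings within the one-byte range, you can apply the IPMI linear conversion formula, y = (M * x + B * 10^BExp) * 10^RExp, by hand or with a short script. The sketch below is illustrative only. The function names are hypothetical, and B is assumed to be 0 because the example above does not configure it.

```python
from fractions import Fraction


def signed4(nibble):
    """Interpret a 4-bit value as two's complement (-8..7)."""
    return nibble - 16 if nibble & 0x8 else nibble


def signed8(raw):
    """Interpret a one-byte raw reading as two's complement (-128..127)."""
    return raw - 256 if raw & 0x80 else raw


def split_rbexp(rbexp):
    """Return (RExp, BExp) from the packed byte: [7:4] RExp, [3:0] BExp."""
    return signed4((rbexp >> 4) & 0xF), signed4(rbexp & 0xF)


def convert_reading(raw, m, rbexp, b=0, signed=True):
    """Apply y = (M * x + B * 10^BExp) * 10^RExp using exact arithmetic."""
    r_exp, b_exp = split_rbexp(rbexp)
    x = signed8(raw) if signed else raw
    return (m * x + b * Fraction(10) ** b_exp) * Fraction(10) ** r_exp


# Values from ThresholdSensor_GPUTemp: M = 100, RBExp = 224 (0xE0).
print(split_rbexp(224))                       # (-2, 0)
print(float(convert_reading(93, 100, 224)))   # 93.0
print(float(convert_reading(128, 100, 224)))  # -128.0
```

With M = 100 and RExp = -2, the factor is 100 * 10^-2 = 1, so a raw value of 93 maps directly to 93 degrees, and the signed raw value 128 (MinimumReading) maps to -128.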

Event Alarm Objects

json
    "Event_PCIeCardUCE": {  // Uncorrectable fault event
        "Reading": "<=/PCIeDevice_1.DiagnosticFault;<=/PCIeDevice_1.UCEByBIOS |> expr(($1 + $2) == 0 ? 0 : 1)",
        "OperatorId": 5,
        "Enabled": true,
        "AdditionalInfo": "2",
        "DescArg2": "#/Component_PCIeCard.Name",
        "DescArg4": "NA",
        "Component": "#/Component_PCIeCard",
        "EventKeyId": "PCIeCard.PCIeCardUncorrectableErr",
        "Condition": 1,
        "LedFaultCode": "q$"
    },
    "Event_PCIeCardCE": {  // Correctable fault event
        "Reading": "<=/PCIeDevice_1.PredictiveFault",
        "OperatorId": 5,
        "Enabled": true,
        "AdditionalInfo": "2",
        "DescArg2": "#/Component_PCIeCard.Name",
        "Component": "#/Component_PCIeCard",
        "EventKeyId": "PCIeCard.PCIeCardCEHardFailure",
        "Condition": 1,
        "LedFaultCode": "q$"
    },
    "Event_PcieCardReplaceMntr": {  // Replacement event
        "Reading": "<=/Component_PCIeCard.ReplaceFlag",
        "OperatorId": 5,
        "Enabled": true,
        "AdditionalInfo": "1",
        "DescArg1": "#/Component_PCIeCard.Name",
        "DescArg2": "#/Component_PCIeCard.PreviousSN",
        "DescArg3": "#/Component_PCIeCard.SerialNumber",
        "Component": "#/Component_PCIeCard",
        "EventKeyId": "PcieCard.PcieCardReplace",
        "Condition": 1
    },
    "Event_OverTemp": {  // Over-temperature alarm event
        "Reading": "<=/Scanner_GPUTemp.Value |> expr(($1 >= 255) ? 30 : ($1 & 255))",
        "@Default": {
            "Condition": 93
        },
        "OperatorId": 4,
        "Enabled": true,
        "Component": "#/Component_PCIeCard",
        "AdditionalInfo": "2",
        "DescArg2": "#/Component_PCIeCard.Name",
        "DescArg4": "#/Event_OverTemp.Reading |> string.format('%s', $1)",
        "DescArg5": "#/ThresholdSensor_GPUTemp.UpperNoncritical",
        "EventKeyId": "PCIeCard.PCIeCardOverTemp",
        "Condition": "<=/ThresholdSensor_GPUTemp.UpperNoncritical",
        "Hysteresis": "<=/ThresholdSensor_GPUTemp.PositiveHysteresis"
    },
    "Event_TempFail": {  // Temperature acquisition failure event
        "Reading": "<=/Scanner_GPUTemp.Status",
        "OperatorId": 5,
        "Enabled": true,
        "Component": "#/Component_PCIeCard",
        "AdditionalInfo": "2",
        "DescArg2": "#/Component_PCIeCard.Name",
        "EventKeyId": "PcieCard.PCIeCardTempFail",
        "Condition": 1
    },
    "Event_PCIeBandWidth": {  // Bandwidth alarm event
        "Reading": "<=/PCIeDevice_1.BandwidthReduction",
        "OperatorId": 5,
        "Enabled": true,
        "AdditionalInfo": "2",
        "DescArg2": "#/PCIeDevice_1.SlotID",
        "DescArg3": "(RTX A6000)",
        "Component": "#/Component_PCIeCard",
        "EventKeyId": "PCIeCard.PCIeCardBandWidthDecreased",
        "Condition": 1
    },
    "Event_PCIeLinkSpeed": {  // Link speed alarm event
        "Reading": "<=/PCIeDevice_1.LinkSpeedReduced",
        "OperatorId": 5,
        "Enabled": true,
        "AdditionalInfo": "2",
        "DescArg2": "#/PCIeDevice_1.SlotID",
        "DescArg3": "(RTX A6000)",
        "Component": "#/Component_PCIeCard",
        "EventKeyId": "PCIeCard.PCIeCardLinkSpeedReduced",
        "Condition": 1
    }
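
Event_OverTemp combines a Condition (the threshold) with a Hysteresis value. This document does not define the exact comparison that openUBMC performs for OperatorId 4, so the sketch below assumes conventional upper-threshold behavior: the event asserts when the reading reaches the threshold and deasserts only after the reading drops below the threshold minus the hysteresis. The class `OverTempEvent` is hypothetical; only `normalize_reading` mirrors an expression taken from the configuration above.

```python
def normalize_reading(value):
    """Mirror expr(($1 >= 255) ? 30 : ($1 & 255)) from Event_OverTemp.Reading."""
    return 30 if value >= 255 else value & 255


class OverTempEvent:
    """Hypothetical upper-threshold event with hysteresis (assumed semantics)."""

    def __init__(self, threshold, hysteresis):
        self.threshold = threshold
        self.hysteresis = hysteresis
        self.asserted = False

    def update(self, raw_value):
        """Feed one scanner value and return the resulting alarm state."""
        reading = normalize_reading(raw_value)
        if not self.asserted and reading >= self.threshold:
            self.asserted = True
        elif self.asserted and reading < self.threshold - self.hysteresis:
            self.asserted = False
        return self.asserted


# Threshold 93 and hysteresis 2, as in ThresholdSensor_GPUTemp.
event = OverTempEvent(threshold=93, hysteresis=2)
states = [event.update(v) for v in (90, 93, 92, 91, 90, 255)]
print(states)  # [False, True, True, True, False, False]
```

The last input shows the effect of the Reading expression: an invalid scanner value of 255 is replaced with 30, so a failed temperature read does not raise an over-temperature alarm.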

Cooling Control Objects

  • Target temperature
  • Maximum allowed temperature
json
{
    "CoolingConfig_Basic": {  // CoolingConfig is usually not configured in the board configuration file.
        "SmartCoolingState": "Enabled",  // Whether smart cooling is enabled. The value can be Enabled or Disabled.
        "SmartCoolingMode": "EnergySaving",  // Smart cooling mode. The value can be EnergySaving, HighPerformance, LowNoise, Custom, or LiquidCooling.
        "LevelPercentRange": [20, 100],  // Smart cooling speed range (20 to 100 in the example)
        "InitLevelInStartup": 100,  // Default speed level at startup. The value range is LevelPercentRange.
        "DiskRowTemperatureAvailable": false,  // Whether the drive temperature is available
        "SysHDDsMaxTemperature": 80.0,  // Max temperature threshold for HDDs
        "SysSSDsMaxTemperature": 80.0,  // Max temperature threshold for SSDs
        "SensorLocationSupported": false  // Whether the temperature ocean interface is supported
    },
    "CoolingPolicy_EnergySaving": {
        "PolicyIdx": 6,  // ID for the linear cooling policy, which must be globally unique
        "ExpCondVal": "EnergySaving", // Expected condition. CoolingPolicy takes effect only when the actual condition matches the expected condition. The value can be EnergySaving, HighPerformance, LowNoise, Custom, or LiquidCooling.
        "ActualCondVal": "<=/CoolingConfig_1.SmartCoolingMode", // Actual condition. The value can be EnergySaving, HighPerformance, LowNoise, Custom, or LiquidCooling.
        "TemperatureRangeLow": [-127, 20, 30, 40, 50], // Lower thresholds for temperature intervals in the linear cooling policy
        "TemperatureRangeHigh": [20, 30, 40, 50, 127], // Upper thresholds for temperature intervals in the linear cooling policy
        "SpeedRangeLow": [20, 32, 70, 100], // Lower thresholds for speed intervals in the linear cooling policy
        "SpeedRangeHigh": [20, 32, 70, 100], // Upper thresholds for speed intervals in the linear cooling policy
        "FanType": ["02314BLG 8038+"] // List of fan types. The fan condition is met if any fan type in this list exists.
    },
    "CoolingRequirement_1_7": {
        "RequirementId": 7, // Target cooling policy ID. The ID must be globally unique. Currently, the ID supports 16 valid bits, where the first 8 bits are the base ID and the last 8 bits are the slot ID.
        "TemperatureType": 11, // Type of the target cooling temperature point. The value can be 1 for Cpu, 2 for Outlet, 3 for Disk, 4 for Memory, 5 for PCH, 6 for VRD, 7 for VDDQ, 8 for NPUHbm, 9 for NPUAiCore, 10 for NPUBoard, 11 for Inlet, 12 for SoCBoardOutlet, or 13 for SoCBoardInlet.
        "MonitoringStatus": "<=/Scanner_Lm75_Inlet.Status",  // Temperature sensor status: 0 for normal and 1 for abnormal
        "MonitoringValue": "<=/Scanner_Lm75_Inlet.Value;<=/Scanner_Lm75_Inlet.Value |> expr((($1 + 5) > ($2 - 10)) ? ($1 + 5) : ($2 - 10))",  // Temperature value involved in cooling
        "FailedValue": 80,  // Fan speed when the temperature status is abnormal. If this field is not set, abnormal cooling is not triggered. If it is set, the set fan speed is issued during abnormal cooling when the temperature point fails to be read.
        "TargetTemperatureCelsius": 50,  // Current cooling target value
        "MaxAllowedTemperatureCelsius": 60,
        "TargetTemperatureRangeCelsius": [  // Allowed range for the custom target temperature, which is used for validity check of custom target values
            40,
            60
        ],
        "SmartCoolingTargetTemperature": [  // Target temperature values for EnergySaving, HighPerformance, and LowNoise modes. This field is optional.
            50,
            47,
            53
        ],
        "CustomSupported": true,  // Whether custom target values are supported. The value can be true or false.
        "CustomTargetTemperatureCelsius": 50, // Custom temperature. The value 255 is invalid.
        "SensorName": "#/ThresholdSensor_InletTemp.SensorName"  // Sensor name
    },
    "CoolingArea_1_25": {
        "AreaId": 25,
        "RequirementIdx": 25,
        "PolicyIdxGroup": [],
        "FanIdxGroup": [
            1,
            2,
            3,
            4
        ]
    }
}

Adding GPUs that Require Implementation of Out-of-band Management Protocol

NVIDIA GPUs:
NVIDIA GPUs use a custom out-of-band management protocol called SMBus Post Box Interface (SMBPBI), which openUBMC already supports. Theoretically, you only need to add a configuration file (path: general_hardware/src/lualib/hardware_config/) for new NVIDIA GPUs. The configuration filename must match the "Model" field of the GPU object. See Tesla_T4.lua and RTX_A6000.lua in the general_hardware repository for details.
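
As a rough illustration of the naming rule, the sketch below derives a configuration file path from a Model value. It assumes that spaces in the model name become underscores, which is inferred only from the two example file names (Tesla_T4.lua and RTX_A6000.lua). The function `config_path_for_model` is hypothetical, so check the repository for the actual convention.

```python
from pathlib import PurePosixPath

# Directory given in the text above.
CONFIG_DIR = PurePosixPath("general_hardware/src/lualib/hardware_config")


def config_path_for_model(model):
    """Guess the configuration file path for a GPU model (assumed naming rule)."""
    name = "_".join(model.split())  # Assumption: whitespace maps to underscores.
    if not name:
        raise ValueError("model must not be empty")
    return CONFIG_DIR / f"{name}.lua"


print(config_path_for_model("RTX A6000"))
# general_hardware/src/lualib/hardware_config/RTX_A6000.lua
```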

Non-NVIDIA GPUs:
To implement a new out-of-band management protocol for non-NVIDIA GPUs, you need to modify the code of the general_hardware component. You can refer to the code logic for NVIDIA GPUs. Follow the specifications below for adding new GPU drivers:
GPU Driver Specification 1.0


Common Issues

  1. The GPU cannot be identified.
    Description: Generally, the GPU is loaded with IdentifyMode set to 2. The overall process is as follows:
      • During device adaptation, configure the PcieAddrInfo object for the GPU in the BMC.
      • The BIOS transmits the BDF number of the GPU to the BMC through WritePcieCardBdfToBmc and WriteOcpCardBdfToBmc.
      • The BMC queries the IMU for the quadruple information of the corresponding PCIe slot based on the BDF number. It sets the Id, AuxId, and presence information for the connector, triggering the CSR load.

  2. The data reported by the iBMA is mismatched.
    Description: Check whether the PcieAddrInfo object configuration matches the hardware information. You can run the lspci command on the OS to view the BDF information of each slot card.

  3. Fan speed, noise, or cooling requirements are not met after the GPU is connected.
    Description: GPUs are usually high-heat components and require dedicated cooling policies. Failure to add a dedicated policy may lead to insufficient cooling or excessively high fan speed. You must consider GPU cooling during system adaptation and provide cooling policies for specific GPUs.
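
When checking BDF information for issues 1 and 2, note that lspci prints addresses in hexadecimal in the form bus:device.function, optionally prefixed with a four-digit domain, while the CSR examples in this document use decimal values. The following sketch converts an lspci address into decimal fields for comparison. The function `parse_bdf` and its dictionary keys are hypothetical and are not part of openUBMC.

```python
import re

# Matches "01:00.0" or "0000:3b:00.0" as printed by lspci.
BDF_PATTERN = re.compile(
    r"^(?:(?P<domain>[0-9a-fA-F]{4}):)?"
    r"(?P<bus>[0-9a-fA-F]{2}):(?P<device>[0-9a-fA-F]{2})\.(?P<function>[0-7])$"
)


def parse_bdf(text):
    """Parse an lspci-style address into decimal bus, device, and function numbers."""
    match = BDF_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid BDF address: {text!r}")
    return {
        "bus": int(match["bus"], 16),
        "device": int(match["device"], 16),
        "function": int(match["function"], 16),
    }


print(parse_bdf("01:00.0"))       # {'bus': 1, 'device': 0, 'function': 0}
print(parse_bdf("0000:3b:00.0"))  # {'bus': 59, 'device': 0, 'function': 0}
```

The first result corresponds to DevBus 1, DevDevice 0, and DevFunction 0 in the PCIeDevice_1 example.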