@rwgk
Collaborator

@rwgk rwgk commented Jan 22, 2026

Not a full PR, but meant as a starting point for @mdboom to pick up. (I can share the colossus machine for testing.)

Relevance: this will surface in the next QA testing cycle

The minimal fix suffers from the same problem we had elsewhere: the new xfail aborts the nested loop over devices and fans.

This change resolves the following error:

============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /wrk/forked/cuda-python/TestVenv/bin/python
cachedir: .pytest_cache
benchmark: 5.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=1233216746
rootdir: /wrk/forked/cuda-python/cuda_core
configfile: pytest.ini
plugins: benchmark-5.2.3, randomly-4.0.1
collecting ... collected 45 items

tests/system/test_system_device.py::test_device_pci_info PASSED
tests/system/test_system_device.py::test_get_p2p_status SKIPPED (Tes...)
tests/system/test_system_device.py::test_devices_are_the_same_architecture PASSED
tests/system/test_system_device.py::test_affinity PASSED
tests/system/test_system_device.py::test_device_serial PASSED
tests/system/test_system_device.py::test_device_architecture PASSED
tests/system/test_system_device.py::test_persistence_mode_enabled XFAIL
tests/system/test_system_device.py::test_c2c_mode_enabled SKIPPED (U...)
tests/system/test_system_device.py::test_get_nearest_gpus PASSED
tests/system/test_system_device.py::test_device_name PASSED
tests/system/test_system_device.py::test_clock_event_reasons PASSED
tests/system/test_system_device.py::test_device_memory PASSED
tests/system/test_system_device.py::test_get_all_devices_with_cpu_affinity PASSED
tests/system/test_system_device.py::test_device_attributes SKIPPED (...)
tests/system/test_system_device.py::test_device_cuda_compute_capability PASSED
tests/system/test_system_device.py::test_fan FAILED
tests/system/test_system_device.py::test_temperature SKIPPED (Unsupp...)
tests/system/test_system_device.py::test_to_cuda_device PASSED
tests/system/test_system_device.py::test_device_cpu_affinity PASSED
tests/system/test_system_device.py::test_unpack_bitmask[params1] PASSED
tests/system/test_system_device.py::test_unpack_bitmask[params0] PASSED
tests/system/test_system_device.py::test_device_brand PASSED
tests/system/test_system_device.py::test_clock SKIPPED (Unsupported ...)
tests/system/test_system_device.py::test_event_type_parsing PASSED
tests/system/test_system_device.py::test_unpack_bitmask[params2] PASSED
tests/system/test_system_device.py::test_index PASSED
tests/system/test_system_device.py::test_register_events PASSED
tests/system/test_system_device.py::test_unpack_bitmask[params3] PASSED
tests/system/test_system_device.py::test_device_bar1_memory PASSED
tests/system/test_system_device.py::test_numa_node_id SKIPPED (Unsup...)
tests/system/test_system_device.py::test_field_values PASSED
tests/system/test_system_device.py::test_module_id PASSED
tests/system/test_system_device.py::test_display_mode PASSED
tests/system/test_system_device.py::test_pstates PASSED
tests/system/test_system_device.py::test_device_uuid PASSED
tests/system/test_system_device.py::test_cooler PASSED
tests/system/test_system_device.py::test_repair_status PASSED
tests/system/test_system_device.py::test_get_topology_common_ancestor SKIPPED
tests/system/test_system_device.py::test_get_inforom_version PASSED
tests/system/test_system_device.py::test_addressing_mode PASSED
tests/system/test_system_device.py::test_unpack_bitmask_single_value PASSED
tests/system/test_system_device.py::test_device_count PASSED
tests/system/test_system_device.py::test_auto_boosted_clocks_enabled SKIPPED
tests/system/test_system_device.py::test_get_minor_number PASSED
tests/system/test_system_device.py::test_device_pci_bus_id PASSED

=================================== FAILURES ===================================
___________________________________ test_fan ___________________________________

    def test_fan():
        for device in system.Device.get_all_devices():
            # The fan APIs are only supported on discrete devices with fans,
            # but when they are not available `device.num_fans` returns 0.
            if device.num_fans == 0:
                pytest.skip("Device has no fans to test")
    
            for fan_idx in range(device.num_fans):
                fan_info = device.fan(fan_idx)
                assert isinstance(fan_info, system.FanInfo)
    
                try:
                    speed = fan_info.speed
                    assert isinstance(speed, int)
                    assert 0 <= speed <= 200
    
>                   fan_info.speed = 50
                    ^^^^^^^^^^^^^^

device     = <cuda.core.system._device.Device object at 0x79f60e7b9ab0>
fan_idx    = 0
fan_info   = <cuda.core.system._device.FanInfo object at 0x79f60e7b9a70>
speed      = 0

tests/system/test_system_device.py:626: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
cuda/core/system/_fan.pxi:36: in cuda.core.system._device.FanInfo.speed.__set__
    nvml.device_set_fan_speed_v2(self._handle, self._fan, speed)
cuda/bindings/_nvml.pyx:24343: in cuda.bindings._nvml.device_set_fan_speed_v2
    ???
cuda/bindings/_nvml.pyx:24355: in cuda.bindings._nvml.device_set_fan_speed_v2
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   cuda.bindings._nvml.NoPermissionError: Insufficient Permissions


cuda/bindings/_nvml.pyx:1740: NoPermissionError

During handling of the above exception, another exception occurred:

    def test_fan():
        for device in system.Device.get_all_devices():
            # The fan APIs are only supported on discrete devices with fans,
            # but when they are not available `device.num_fans` returns 0.
            if device.num_fans == 0:
                pytest.skip("Device has no fans to test")
    
            for fan_idx in range(device.num_fans):
                fan_info = device.fan(fan_idx)
                assert isinstance(fan_info, system.FanInfo)
    
                try:
                    speed = fan_info.speed
                    assert isinstance(speed, int)
                    assert 0 <= speed <= 200
    
                    fan_info.speed = 50
                    fan_info.speed = speed
    
                    speed_rpm = fan_info.speed_rpm
                    assert isinstance(speed_rpm, int)
                    assert speed_rpm >= 0
    
                    target_speed = fan_info.target_speed
                    assert isinstance(target_speed, int)
                    assert speed <= target_speed * 2
    
                    min_, max_ = fan_info.min_max_speed
                    assert isinstance(min_, int)
                    assert isinstance(max_, int)
                    assert min_ <= max_
                    if speed > 0:
                        assert min_ <= speed <= max_
    
                    control_policy = fan_info.control_policy
                    assert isinstance(control_policy, system.FanControlPolicy)
                finally:
>                   fan_info.set_default_fan_speed()

device     = <cuda.core.system._device.Device object at 0x79f60e7b9ab0>
fan_idx    = 0
fan_info   = <cuda.core.system._device.FanInfo object at 0x79f60e7b9a70>
speed      = 0

tests/system/test_system_device.py:647: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
cuda/core/system/_fan.pxi:103: in cuda.core.system._device.FanInfo.set_default_fan_speed
    nvml.device_set_default_fan_speed_v2(self._handle, self._fan)
cuda/bindings/_nvml.pyx:24256: in cuda.bindings._nvml.device_set_default_fan_speed_v2
    ???
cuda/bindings/_nvml.pyx:24267: in cuda.bindings._nvml.device_set_default_fan_speed_v2
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   cuda.bindings._nvml.NoPermissionError: Insufficient Permissions


cuda/bindings/_nvml.pyx:1740: NoPermissionError
=========================== short test summary info ============================
SKIPPED [1] tests/system/test_system_device.py:480: Test requires at least 2 GPUs
SKIPPED [6] ../cuda_bindings/cuda/bindings/_test_helpers/arch_check.py:35: Unsupported call for device architecture AMPERE on device 'NVIDIA A10G'
SKIPPED [1] tests/system/test_system_device.py:465: Test requires at least 2 GPUs
XFAIL tests/system/test_system_device.py::test_persistence_mode_enabled - nvml.NoPermissionError: Insufficient Permissions
FAILED tests/system/test_system_device.py::test_fan - cuda.bindings._nvml.NoP...
============== 1 failed, 35 passed, 8 skipped, 1 xfailed in 1.82s ==============
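
To illustrate the nested-loop problem mentioned above: an imperative xfail works by raising, so calling it inside the loop body leaves both loops at the first fan that reports a permission error. The sketch below is not the code in this PR. It uses stand-in classes and a made-up device table (`NoPermissionError`, `XFailed`, `xfail`, `FANS`, `set_fan_speed` are all hypothetical names) so it runs without a GPU or pytest, and it shows one possible alternative: record the denied fans and xfail once after the loops finish.

```python
class NoPermissionError(Exception):
    """Stand-in for cuda.bindings._nvml.NoPermissionError."""


class XFailed(Exception):
    """Stand-in for the outcome that pytest.xfail() raises."""


def xfail(reason):
    # Like pytest.xfail(), this raises immediately and unwinds the caller.
    raise XFailed(reason)


# Hypothetical machine: device name -> number of fans.
FANS = {"gpu0": 2, "gpu1": 1}


def set_fan_speed(device, fan_idx):
    # Pretend every write is rejected, as in the log above.
    raise NoPermissionError("Insufficient Permissions")


def xfail_immediately(visited):
    for device, num_fans in FANS.items():
        for fan_idx in range(num_fans):
            visited.append((device, fan_idx))
            try:
                set_fan_speed(device, fan_idx)
            except NoPermissionError as exc:
                xfail(str(exc))  # leaves both loops at the first fan


def xfail_deferred(visited):
    denied = []
    for device, num_fans in FANS.items():
        for fan_idx in range(num_fans):
            visited.append((device, fan_idx))
            try:
                set_fan_speed(device, fan_idx)
            except NoPermissionError:
                denied.append((device, fan_idx))
            # Read-only checks for this fan could still run here.
    if denied:
        # Report once, after every device and fan has been visited.
        xfail(f"Insufficient Permissions on {len(denied)} fan(s)")
```

With the immediate variant only the first fan of the first device is exercised; the deferred variant visits every fan on every device and still ends in a single xfail.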

@copy-pr-bot
Contributor

copy-pr-bot bot commented Jan 22, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk rwgk requested a review from mdboom January 22, 2026 23:51