
Refactor the CI matrix #555


Open · wants to merge 10 commits into base: main

Conversation

@cryos (Collaborator) commented Apr 10, 2025

Description

Refactor the CI matrix - closes #576.

Refactor the way the CI matrix is composed so that it is layered, add a proxy cache to improve Python package download times, and introduce separate pull-request and nightly matrices for the tests.
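
For illustration only, a minimal sketch of the layered idea; the input, job, and step names below are assumptions, not the actual workflow contents. A reusable workflow receives a test-type input and emits only the matching layer of the matrix:

    # Hypothetical sketch (not the actual workflow): select the matrix layer by test type.
    on:
      workflow_call:
        inputs:
          test-type:
            type: string        # "pull-request" or "nightly"
            required: true

    jobs:
      compute-matrix:
        runs-on: ubuntu-latest
        outputs:
          matrix: ${{ steps.compute-matrix.outputs.MATRIX }}
        steps:
          - name: Compute Python Test Matrix
            id: compute-matrix
            run: |
              set -eo pipefail
              export MATRICES="
                pull-request:
                  - { PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '1' }
                nightly:
                  - { PY_VER: '3.12', CUDA_VER: '11.8.0', LOCAL_CTK: '0' }
                  - { PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '1' }
              "
              # Pick the requested layer and expose it as a JSON job output.
              MATRIX=$(yq -o json '.["${{ inputs.test-type }}"]' <<< "$MATRICES" | jq -c)
              echo "MATRIX=$MATRIX" >> "$GITHUB_OUTPUT"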

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot (Contributor) commented Apr 10, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cryos cryos marked this pull request as draft April 10, 2025 20:29
copy-pr-bot bot (Contributor) commented Apr 10, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cryos (Collaborator, Author) commented Apr 10, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 10, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 10, 2025

/ok to test

@cryos cryos added the CI/CD CI/CD infrastructure label Apr 10, 2025
@cryos (Collaborator, Author) commented Apr 10, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 11, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 11, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 11, 2025

/ok to test a8238f2

@cryos (Collaborator, Author) commented Apr 11, 2025

/ok to test 3097f67

@cryos (Collaborator, Author) commented Apr 11, 2025

/ok to test 3fff795

@leofang leofang self-requested a review April 17, 2025 05:48
@leofang leofang added the P1 Medium priority - Should do label Apr 17, 2025
Member

I haven't checked the diff of this PR carefully, but it seems to be headed in the right direction! I've been wanting to factor out the build parts from build-and-test.yml into build-wheel.yml (the test parts were already done and are now in test-wheel.yml).

Collaborator Author

Thanks Leo, I am just working through test matrix proposals now. If you want to discuss this in more detail I am more than happy to; I am trying a few things out, scoping what fits well right now, and trying to offer better grouping/labeling.

Member

Let's rename this to build-wheel.yml. In the future we should also add build-conda.yml (#280).

@cryos (Collaborator, Author) commented Apr 17, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 17, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 17, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 23, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 23, 2025

/ok to test

2 similar comments
@cryos (Collaborator, Author) commented Apr 23, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 24, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 24, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 24, 2025

/ok to test


@cryos (Collaborator, Author) commented Apr 24, 2025

/ok to test

@cryos cryos marked this pull request as ready for review April 24, 2025 16:02
id: setup-python
uses: actions/setup-python@v5
with:
python-version: "3.12"
Collaborator

Does this need to use the Python version matrix?

Collaborator Author

It was not, and it does not seem to need to. Good call-out, as I had meant to ask and forgot: @leofang, https://github.com/NVIDIA/cuda-python/blob/main/.github/workflows/build-and-test.yml#L56-L61 is where I got this from; it used the matrix version there but then set up 3.12 every time.

@leofang (Member) commented Apr 27, 2025

@cryos Back then there was a nasty setup-python bug (see the comment right above the lines you linked to) that made the build steps fail. That was why the old build pipeline needed to lock to PY312. But when actually building the wheels (using the cibuildwheel action), we honor the Python version set in the matrix.

# Strip the dot from the matrix Python version, e.g. "3.12" -> "312".
PYTHON_VERSION_FORMATTED=$(echo '${{ matrix.python-version }}' | tr -d '.')
if [[ "${{ matrix.host-platform }}" == linux* ]]; then
  # Build the matching CPython manylinux wheel for this matrix entry.
  CIBW_BUILD="cp${PYTHON_VERSION_FORMATTED}-manylinux*"
  REPO_DIR=$(pwd)
elif [[ "${{ matrix.host-platform }}" == win* ]]; then
  CIBW_BUILD="cp${PYTHON_VERSION_FORMATTED}-win_amd64"
  # cibuildwheel on Windows needs a Windows-style path to the repository.
  PWD=$(pwd)
  REPO_DIR=$(cygpath -w $PWD)
fi

I haven't checked the latest new build pipeline carefully -- will try to get to it tomorrow.

Member

Suggested change:
-python-version: "3.12"
+# WAR: setup-python is not relocatable...
+# see https://github.com/actions/setup-python/issues/871
+python-version: "3.12"

#
# [PY_VER, CUDA_VER, LOCAL_CTK, GPU, DRIVER]
#
export MATRICES="
Collaborator

There isn't much interaction between Python version and driver version, so we don't need their cross product, just coverage of each. Similarly, there isn't much interaction between Python version and CTK install type, so we don't need their cross product either, just coverage of each.

I think the coverage matrix we want to hit is CUDA_VER x LOCAL_CTK x DRIVER, which gives us 12 jobs per ARCH today, and we can spread the PY_VER values across those jobs for coverage. When we move to 13.x we'll have 4 CUDA_VER values to test as opposed to 3 today, which would give us 24 jobs per ARCH, which I believe can still be handled reasonably easily.
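
For illustration, one possible spread; the entries below are hypothetical, not a proposal. Each CUDA_VER x LOCAL_CTK x DRIVER combination appears once, with PY_VER rotated across the entries rather than crossed with them (combinations later noted as not working, such as DRIVER: 'earliest' with CUDA 12.8.0, would simply be excluded):

    # Hypothetical spread of PY_VER over the CUDA_VER x LOCAL_CTK x DRIVER grid.
    - { ARCH: ${ARCH}, PY_VER: '3.9',  CUDA_VER: '11.8.0', LOCAL_CTK: '0', DRIVER: 'earliest' }
    - { ARCH: ${ARCH}, PY_VER: '3.10', CUDA_VER: '11.8.0', LOCAL_CTK: '0', DRIVER: 'latest' }
    - { ARCH: ${ARCH}, PY_VER: '3.11', CUDA_VER: '11.8.0', LOCAL_CTK: '1', DRIVER: 'earliest' }
    - { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '11.8.0', LOCAL_CTK: '1', DRIVER: 'latest' }
    - { ARCH: ${ARCH}, PY_VER: '3.9',  CUDA_VER: '12.0.1', LOCAL_CTK: '1', DRIVER: 'latest' }
    - { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '1', DRIVER: 'latest' }
    # ...remaining combinations continue the same rotation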

Collaborator Author

This is a great point; I wanted to get opinions. This is a literal copy of what was happening, but I thought it would be great to cut it down a fair bit, and it sounds like you are on board. I figured old and new drivers should get some coverage, as should local versus wheel installs. I would gladly reduce it; I was also thinking that nightly runs might do more and PRs could do less.

Collaborator Author

Not sure I hit exactly what you were thinking, but I got it to 12 per architecture, aside from the special one for amd64, with a mixture that covers different elements of the matrix. This could certainly be tweaked up or down, or mixed with some other variations (part of why I wanted to structure it this way).

Collaborator

I was thinking we should make sure we have coverage of each combination of CUDA_VER: {'11.8.0', '12.0.1', '12.8.0'} x LOCAL_CTK: {'0', '1'} x DRIVER: {'earliest', 'latest'}.

In your update it looks like DRIVER: 'earliest' is currently only covered with LOCAL_CTK: '0' and CUDA_VER: '11.8.0'.

Unless I'm misunderstanding and we also need to cover something like a third driver version, for example the earliest supported 12.x version.

@cryos (Collaborator, Author) commented Apr 24, 2025

Ah, there was no existing support for changing the driver; it wasn't clear what we wanted to cover there, so I figured I would add something. I proved earliest does not work with CUDA 12.8.0 at all in an earlier run, so that can't be added. I get what you mean: cover as much of CUDA_VER, LOCAL_CTK, and DRIVER as possible, spread across Python versions, since you see little to no interaction there. I can tweak it, but I won't add earliest back with CUDA 12.8.0 as it fails at runtime. Additionally CUDA 12.0.1 does not work for LOCAL_CTK set to 0.

Collaborator

> I proved earliest does not work with CUDA 12.8.0 at all in an earlier run, so that can't be added.

Yea, this is expected since earliest is a CUDA 11 driver that does not support CUDA 12. Ideally we would have a test that uses the CUDA 12.0 driver with our CUDA 12.8 build.

Doesn't need to be covered in this PR, but we should probably support something like cuda11_earliest, cuda11_latest, cuda12_earliest, cuda12_latest for the driver versions.

> Additionally CUDA 12.0.1 does not work for LOCAL_CTK set to 0.

This is a bit surprising. Any idea why?

Collaborator Author

> I proved earliest does not work with CUDA 12.8.0 at all in an earlier run, so that can't be added.
>
> Yea, this is expected since earliest is a CUDA 11 driver that does not support CUDA 12. Ideally we would have a test that uses the CUDA 12.0 driver with our CUDA 12.8 build.
>
> Doesn't need to be covered in this PR, but we should probably support something like cuda11_earliest, cuda11_latest, cuda12_earliest, cuda12_latest for the driver versions.

Yes, we need support on the runner side; this is on the list of future additions to the runners.

> Additionally CUDA 12.0.1 does not work for LOCAL_CTK set to 0.
>
> This is a bit surprising. Any idea why?

This line here is where it was excluded in the original matrix.

@leofang (Member) commented Apr 27, 2025

> Additionally CUDA 12.0.1 does not work for LOCAL_CTK set to 0.
>
> This is a bit surprising. Any idea why?
>
> This line here is where it was excluded in the original matrix.

Reason: We have no way to express "I want to install the CUDA wheels associated with a certain CTK major.minor(.patch)" (#393 (comment)), until Emma's metapackage lands (soon, @rwgk has been testing them internally).

Member

To confirm my understanding: The nightly CIs will cover the cartesian product of PY_VER, CUDA_VER, and LOCAL_CTK, except for the above case (CUDA 12.0.1 wheels).

When we enable the nightly CI in the next PR, we should also spread it out across more GPU variants (ex: RTX 2080, T4, V100, ..., which we do have in nv-gha-runners).

@@ -0,0 +1,7 @@
{
Collaborator

What is this file needed for?

Collaborator Author

This is read in at the start to share the CUDA build version. It was an initial stab at offering a central spot to write that down once; I can back it out, move it, etc. I had seen a desire to have a JSON file store some of these things rather than having them in several CI workflows. I think it might be useful, but I agree it is a tiny file with a single version in it right now.
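
For illustration, a hedged sketch of how a workflow step might consume such a shared file; the path, key names, and value below are assumptions, not the actual contents:

    # Hypothetical: a central JSON file such as ci/versions.json containing
    # something like {"cuda": {"build-version": "12.8.0"}}.
    - name: Load CUDA build version
      run: |
        CUDA_BUILD_VER=$(jq -r '.cuda."build-version"' ci/versions.json)
        echo "CUDA_VER=${CUDA_BUILD_VER}" >> "$GITHUB_ENV"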

Collaborator

Given this is currently only used for CI, it would be nice to put it in a folder away from the repository root.

Collaborator Author

Sounds good, I nearly did that. Should I just create a ci folder, or are there preferences on the path?

Member

Whoever is taking over the CI development/maintenance gets to make the call :)

Member

xref: #328. It'd be nice in the future (not for this PR, but let's keep it in mind) to encode the CI matrix/runner/etc. in a JSON file instead of embedding this information in the workflow code, if possible (sometimes it is surprisingly hard, e.g. the per-arch matrix: that we discussed before).

@cryos (Collaborator, Author) commented Apr 24, 2025

/ok to test

Add a note on combinations that are known not to work together.
runner:
- default
exclude:
# To test this combo would require nontrivial installation steps.
Collaborator Author

@kkraus14 this is the exclusion. I think it is a practical matter of packaging that I preserved: no 12.0.1, perhaps because wheels don't go that far back, or because they were in flux?


@leofang (Member) left a comment

Thanks a lot, Marcus! LGTM overall, we're heading in the right direction! I left some comments, but only one of them is important to address before merging (the CI workflow name).

@@ -0,0 +1,119 @@
name: "CI"
Member

Important: As noted in the original file, this name is referred to in the test stages to query the run ID, e.g. here:

LATEST_PRIOR_RUN_ID=$(gh run list -b ${OLD_BRANCH} -L 1 -w "CI: Build and test" -s completed -R NVIDIA/cuda-python --json databaseId | jq '.[]| .databaseId')

and here:
$runData = gh run list -b $OLD_BRANCH -L 1 -w "CI: Build and test" -s completed -R NVIDIA/cuda-python --json databaseId | ConvertFrom-Json

so either we make a global change, or we keep the original name intact:

Suggested change:
-name: "CI"
+# Note: This name is referred to in the test job, so make sure any changes are sync'd up!
+name: "CI: Build and test"

Comment on lines +3 to +5
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.event_name }}
cancel-in-progress: true
Member

btw I don't think it is in-scope for this PR, just something that came to my mind.

This PR is in fact v3 of the CI. When I refactored to v2, one thing I noticed is that it has less concurrency than v1 (and I think v3 inherits this from v2). In v1, when a build job finishes, the dependent test job starts immediately (and can overlap with other build jobs). But in v2, all build jobs have to finish and synchronize before all the test jobs start to queue up. I think this is because of the mismatched number of jobs between build and test in v2, and the use of needs: to declare the job dependency. If you have any idea for restoring the v1-like behavior, we should discuss/implement it in a future PR.
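
For illustration, a minimal sketch (hypothetical job names) of the synchronization being described: because needs: waits for every leg of the build matrix, no test leg can start until the slowest build leg finishes.

    jobs:
      build:
        strategy:
          matrix:
            python-version: ["3.11", "3.12"]
        runs-on: ubuntu-latest
        steps:
          - run: echo "build ${{ matrix.python-version }}"
      test:
        # needs: waits for ALL matrix legs of build, so test legs cannot
        # overlap with still-running build legs (the v2/v3 behavior).
        needs: build
        strategy:
          matrix:
            python-version: ["3.11", "3.12"]
        runs-on: ubuntu-latest
        steps:
          - run: echo "test ${{ matrix.python-version }}"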

Member

Let's rename this to build-wheel.yml. In the future we should also add build-conda.yml (#280).

if: ${{ github.repository_owner == 'nvidia' }}
secrets: inherit
uses:
./.github/workflows/build.yml
Member

ditto

Suggested change:
-./.github/workflows/build.yml
+./.github/workflows/build-wheel.yml

Comment on lines +39 to +43
- name: Setup proxy cache
uses: nv-gha-runners/setup-proxy-cache@main
continue-on-error: true
# Skip the cache on Windows nodes outside of our org.
if: ${{ inputs.host-platform != 'win-64' }}
Member

❤️

- name: Compute Python Test Matrix
id: compute-matrix
run: |
set -eo pipefail
Member

nit: we could move this to either the job default

    ...
    defaults:
      run:
        shell: bash --noprofile --norc -xeuo pipefail {0}
    steps:
      - name: Validate Test Type
        ...

or the workflow default (note that the other job below also sets this).
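
For reference, a minimal sketch of that workflow-level variant (same shell line, placed at the top level of the workflow file):

    # Applies to every run step in every job of this workflow.
    defaults:
      run:
        shell: bash --noprofile --norc -xeuo pipefail {0}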

#
# [PY_VER, CUDA_VER, LOCAL_CTK, GPU, DRIVER]
#
export MATRICES="
Member

To confirm my understanding: The nightly CIs will cover the cartesian product of PY_VER, CUDA_VER, and LOCAL_CTK, except for the above case (CUDA 12.0.1 wheels).

When we enable the nightly CI in the next PR, we should also spread it out across more GPU variants (ex: RTX 2080, T4, V100, ..., which we do have in nv-gha-runners).

- name: Compute Python Test Matrix
id: compute-matrix
run: |
set -eo pipefail
Member

ditto

Comment on lines +49 to +53
nightly:
- { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '11.8.0', LOCAL_CTK: '0' }
- { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '11.8.0', LOCAL_CTK: '1' }
- { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '0' }
- { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '1' }
Member

I hope we can do the same for Windows too and cover the full cartesian product PY_VER x CUDA_VER x LOCAL_CTK in nightly.


# Please keep the matrices sorted in ascending order by the following:
#
# [PY_VER, CUDA_VER, LOCAL_CTK]
Member

Note to self: This is a smaller matrix because currently we don't get to choose GPU on Windows.

However, we do get to choose the driver version on Windows (it is installed at job start time). @cryos, when we set up the nightly, perhaps we could make it configurable?

Labels
CI/CD (CI/CD infrastructure), P1 (Medium priority - Should do)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Refactor the CI matrix
3 participants