
Refactor the CI matrix #555


Open · wants to merge 10 commits into base: main

Conversation

@cryos (Collaborator) commented Apr 10, 2025

Description

Refactor the CI matrix - closes #576.

Refactor the way the CI matrix is composed so that it is layered, add a proxy cache to improve Python package download times, and introduce separate pull-request and nightly matrices for the tests.
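
For illustration only, a minimal sketch of the layered idea; the input, job, and step names below are assumptions, not the actual workflow contents. A reusable workflow receives a test-type input and emits only the matching layer of the matrix:

    # Hypothetical sketch (not the actual workflow): select the matrix layer by test type.
    on:
      workflow_call:
        inputs:
          test-type:
            type: string        # "pull-request" or "nightly"
            required: true

    jobs:
      compute-matrix:
        runs-on: ubuntu-latest
        outputs:
          matrix: ${{ steps.compute-matrix.outputs.MATRIX }}
        steps:
          - name: Compute Python Test Matrix
            id: compute-matrix
            run: |
              set -eo pipefail
              export MATRICES="
                pull-request:
                  - { PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '1' }
                nightly:
                  - { PY_VER: '3.12', CUDA_VER: '11.8.0', LOCAL_CTK: '0' }
                  - { PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '1' }
              "
              # Pick the requested layer and expose it as a JSON job output.
              MATRIX=$(yq -o json '.["${{ inputs.test-type }}"]' <<< "$MATRICES" | jq -c)
              echo "MATRIX=$MATRIX" >> "$GITHUB_OUTPUT"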

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot (Contributor) commented Apr 10, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cryos cryos marked this pull request as draft April 10, 2025 20:29
copy-pr-bot bot (Contributor) commented Apr 10, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cryos (Collaborator, Author) commented Apr 10, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 10, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 10, 2025

/ok to test

@cryos cryos added the CI/CD CI/CD infrastructure label Apr 10, 2025
@cryos (Collaborator, Author) commented Apr 10, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 11, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 11, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 11, 2025

/ok to test a8238f2

@cryos (Collaborator, Author) commented Apr 11, 2025

/ok to test 3097f67

@cryos (Collaborator, Author) commented Apr 11, 2025

/ok to test 3fff795

@leofang leofang self-requested a review April 17, 2025 05:48
@leofang leofang added the P1 Medium priority - Should do label Apr 17, 2025
Member

I haven't checked the diff of this PR carefully, but it seems to be headed in the right direction! I've been wanting to factor out the build parts from build-and-test.yml into build-wheel.yml (the test parts were already done and are now in test-wheel.yml).

Collaborator Author

Thanks Leo, I am just working through test matrix proposals now. If you want to discuss this in more detail I am more than happy to; I am trying a few things out, scoping what fits well right now, and trying to offer better grouping/labeling.

Member

Let's rename this to build-wheel.yml. In the future we should also add build-conda.yml (#280).

@cryos (Collaborator, Author) commented Apr 17, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 17, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 17, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 23, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 23, 2025

/ok to test

2 similar comments
@cryos (Collaborator, Author) commented Apr 23, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 24, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 24, 2025

/ok to test

@cryos (Collaborator, Author) commented Apr 24, 2025

/ok to test


@cryos (Collaborator, Author) commented Apr 24, 2025

/ok to test

@cryos cryos marked this pull request as ready for review April 24, 2025 16:02
id: setup-python
uses: actions/setup-python@v5
with:
python-version: "3.12"
Collaborator

Does this need to use the Python version matrix?

Collaborator Author

It was not, and it does not seem to need to. Good call-out, as I had meant to ask and forgot: @leofang, https://github.com/NVIDIA/cuda-python/blob/main/.github/workflows/build-and-test.yml#L56-L61 is where I got this from; it used the matrix version there but then set up 3.12 every time.

@leofang (Member) commented Apr 27, 2025

@cryos Back then there was a nasty setup-python bug (see the comment right above the lines you linked to) that made the build steps fail. That was why the old build pipeline needed to lock to PY312. But when actually building the wheels (using the cibuildwheel action), we honor the Python version set in the matrix.

# Strip the dot from the matrix Python version, e.g. "3.12" -> "312".
PYTHON_VERSION_FORMATTED=$(echo '${{ matrix.python-version }}' | tr -d '.')
if [[ "${{ matrix.host-platform }}" == linux* ]]; then
  # Build the matching CPython manylinux wheel for this matrix entry.
  CIBW_BUILD="cp${PYTHON_VERSION_FORMATTED}-manylinux*"
  REPO_DIR=$(pwd)
elif [[ "${{ matrix.host-platform }}" == win* ]]; then
  CIBW_BUILD="cp${PYTHON_VERSION_FORMATTED}-win_amd64"
  # cibuildwheel on Windows needs a Windows-style path to the repository.
  PWD=$(pwd)
  REPO_DIR=$(cygpath -w $PWD)
fi

I haven't checked the latest new build pipeline carefully -- will try to get to it tomorrow.

Member

Suggested change:
-python-version: "3.12"
+# WAR: setup-python is not relocatable...
+# see https://github.com/actions/setup-python/issues/871
+python-version: "3.12"

#
# [PY_VER, CUDA_VER, LOCAL_CTK, GPU, DRIVER]
#
export MATRICES="
Collaborator

There isn't much interaction between Python version and driver version, so we don't need their cross product, just coverage of each. Similarly, there isn't much interaction between Python version and CTK install type, so we don't need their cross product either, just coverage of each.

I think the coverage matrix we want to hit is CUDA_VER x LOCAL_CTK x DRIVER, which gives us 12 jobs per ARCH today, and we can spread the PY_VER values across those jobs for coverage. When we move to 13.x we'll have 4 CUDA_VER values to test as opposed to 3 today, which would give us 24 jobs per ARCH, which I believe can still be handled reasonably easily.
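
For illustration, one possible spread; the entries below are hypothetical, not a proposal. Each CUDA_VER x LOCAL_CTK x DRIVER combination appears once, with PY_VER rotated across the entries rather than crossed with them (combinations later noted as not working, such as DRIVER: 'earliest' with CUDA 12.8.0, would simply be excluded):

    # Hypothetical spread of PY_VER over the CUDA_VER x LOCAL_CTK x DRIVER grid.
    - { ARCH: ${ARCH}, PY_VER: '3.9',  CUDA_VER: '11.8.0', LOCAL_CTK: '0', DRIVER: 'earliest' }
    - { ARCH: ${ARCH}, PY_VER: '3.10', CUDA_VER: '11.8.0', LOCAL_CTK: '0', DRIVER: 'latest' }
    - { ARCH: ${ARCH}, PY_VER: '3.11', CUDA_VER: '11.8.0', LOCAL_CTK: '1', DRIVER: 'earliest' }
    - { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '11.8.0', LOCAL_CTK: '1', DRIVER: 'latest' }
    - { ARCH: ${ARCH}, PY_VER: '3.9',  CUDA_VER: '12.0.1', LOCAL_CTK: '1', DRIVER: 'latest' }
    - { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '1', DRIVER: 'latest' }
    # ...remaining combinations continue the same rotation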

Collaborator Author

This is a great point; I wanted to get opinions. This is a literal copy of what was happening, but I thought it would be great to cut it down a fair bit, and it sounds like you are on board. I figured old and new drivers should get some coverage, as should local versus wheel installs. I would gladly reduce it; I was also thinking that nightly runs might do more and PRs could do less.

Collaborator Author

Not sure I hit exactly what you were thinking, but I got it to 12 per architecture, aside from the special one for amd64, with a mixture that covers different elements of the matrix. This could certainly be tweaked up or down, or mixed with some other variations (part of why I wanted to structure it this way).

Collaborator

I was thinking we should make sure we have coverage of each combination of CUDA_VER: {'11.8.0', '12.0.1', '12.8.0'} x LOCAL_CTK: {'0', '1'} x DRIVER: {'earliest', 'latest'}.

In your update it looks like DRIVER: 'earliest' is currently only covered with LOCAL_CTK: '0' and CUDA_VER: '11.8.0'.

Unless I'm misunderstanding and we also need to cover something like a third driver version, for example the earliest supported 12.x version.

@cryos (Collaborator, Author) commented Apr 24, 2025

Ah, there was no existing support for changing the driver; it wasn't clear what we wanted to cover there, so I figured I would add something. I proved earliest does not work with CUDA 12.8.0 at all in an earlier run, so that can't be added. I get what you mean: cover as much of CUDA_VER, LOCAL_CTK, and DRIVER as possible, spread across Python versions, since you see little to no interaction there. I can tweak it, but I won't add earliest back with CUDA 12.8.0 as it fails at runtime. Additionally CUDA 12.0.1 does not work for LOCAL_CTK set to 0.

Collaborator

> I proved earliest does not work with CUDA 12.8.0 at all in an earlier run, so that can't be added.

Yea, this is expected since earliest is a CUDA 11 driver that does not support CUDA 12. Ideally we would have a test that uses the CUDA 12.0 driver with our CUDA 12.8 build.

Doesn't need to be covered in this PR, but we should probably support something like cuda11_earliest, cuda11_latest, cuda12_earliest, cuda12_latest for the driver versions.

> Additionally CUDA 12.0.1 does not work for LOCAL_CTK set to 0.

This is a bit surprising. Any idea why?

Collaborator Author

> I proved earliest does not work with CUDA 12.8.0 at all in an earlier run, so that can't be added.
>
> Yea, this is expected since earliest is a CUDA 11 driver that does not support CUDA 12. Ideally we would have a test that uses the CUDA 12.0 driver with our CUDA 12.8 build.
>
> Doesn't need to be covered in this PR, but we should probably support something like cuda11_earliest, cuda11_latest, cuda12_earliest, cuda12_latest for the driver versions.

Yes, we need support on the runner side; this is on the list of future additions to the runners.

> Additionally CUDA 12.0.1 does not work for LOCAL_CTK set to 0.
>
> This is a bit surprising. Any idea why?

This line here is where it was excluded in the original matrix.

@leofang (Member) commented Apr 27, 2025

> Additionally CUDA 12.0.1 does not work for LOCAL_CTK set to 0.
>
> This is a bit surprising. Any idea why?
>
> This line here is where it was excluded in the original matrix.

Reason: We have no way to express "I want to install the CUDA wheels associated with a certain CTK major.minor(.patch)" (#393 (comment)), until Emma's metapackage lands (soon, @rwgk has been testing them internally).

Member

To confirm my understanding: The nightly CIs will cover the cartesian product of PY_VER, CUDA_VER, and LOCAL_CTK, except for the above case (CUDA 12.0.1 wheels).

When we enable the nightly CI in the next PR, we should also spread it out across more GPU variants (ex: RTX 2080, T4, V100, ..., which we do have in nv-gha-runners).

@@ -0,0 +1,7 @@
{
Collaborator

What is this file needed for?

Collaborator Author

This is read in at the start to share the CUDA build version. It was an initial stab at offering a central spot to write that down once; I can back it out, move it, etc. I had seen a desire to have a JSON file store some of these things rather than having them in several CI workflows. I think it might be useful, but I agree it is a tiny file with a single version in it right now.
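
For illustration, a hedged sketch of how a workflow step might consume such a shared file; the path, key names, and value below are assumptions, not the actual contents:

    # Hypothetical: a central JSON file such as ci/versions.json containing
    # something like {"cuda": {"build-version": "12.8.0"}}.
    - name: Load CUDA build version
      run: |
        CUDA_BUILD_VER=$(jq -r '.cuda."build-version"' ci/versions.json)
        echo "CUDA_VER=${CUDA_BUILD_VER}" >> "$GITHUB_ENV"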

Collaborator

Given this is currently only used for CI, it would be nice to put it in a folder away from the repository root.

Collaborator Author

Sounds good, I nearly did that. Should I just create a ci folder, or are there preferences on the path?

Member

Whoever is taking over the CI development/maintenance gets to make the call :)

Member

xref: #328. It'd be nice in the future (not for this PR, but let's keep it in mind) to encode the CI matrix/runner/etc. in a JSON file instead of embedding this information in the workflow code, if possible (sometimes it is surprisingly hard, e.g. the per-arch matrix: that we discussed before).

@cryos (Collaborator, Author) commented Apr 24, 2025

/ok to test

Add a note on combinations that are known not to work together.
runner:
- default
exclude:
# To test this combo would require nontrivial installation steps.
Collaborator Author

@kkraus14 this is the exclusion. I think it is a practical matter of packaging that I preserved: no 12.0.1, perhaps because wheels don't go that far back, or because they were in flux?


@leofang (Member) left a comment

Thanks a lot, Marcus! LGTM overall, we're heading in the right direction! I left some comments, but only one of them is important to address before merging (the CI workflow name).

@@ -0,0 +1,119 @@
name: "CI"
Member

Important: As noted in the original file, this name is referred to in the test stages to query the run ID, e.g. here:

LATEST_PRIOR_RUN_ID=$(gh run list -b ${OLD_BRANCH} -L 1 -w "CI: Build and test" -s completed -R NVIDIA/cuda-python --json databaseId | jq '.[]| .databaseId')

and here:
$runData = gh run list -b $OLD_BRANCH -L 1 -w "CI: Build and test" -s completed -R NVIDIA/cuda-python --json databaseId | ConvertFrom-Json

so either we make a global change, or we keep the original name intact:

Suggested change:
-name: "CI"
+# Note: This name is referred to in the test job, so make sure any changes are sync'd up!
+name: "CI: Build and test"

Comment on lines +3 to +5
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.event_name }}
cancel-in-progress: true
Member

btw I don't think it is in-scope for this PR, just something that came to my mind.

This PR is in fact v3 of the CI. When I refactored to v2, one thing I noticed is that it has less concurrency than v1 (and I think v3 inherits this from v2). In v1, when a build job finishes, the dependent test job starts immediately (and can overlap with other build jobs). But in v2, all build jobs have to finish and synchronize before all the test jobs start to queue up. I think this is because of the mismatched number of jobs between build and test in v2, and the use of needs: to declare the job dependency. If you have any idea for restoring the v1-like behavior, we should discuss/implement it in a future PR.
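
For illustration, a minimal sketch (hypothetical job names) of the synchronization being described: because needs: waits for every leg of the build matrix, no test leg can start until the slowest build leg finishes.

    jobs:
      build:
        strategy:
          matrix:
            python-version: ["3.11", "3.12"]
        runs-on: ubuntu-latest
        steps:
          - run: echo "build ${{ matrix.python-version }}"
      test:
        # needs: waits for ALL matrix legs of build, so test legs cannot
        # overlap with still-running build legs (the v2/v3 behavior).
        needs: build
        strategy:
          matrix:
            python-version: ["3.11", "3.12"]
        runs-on: ubuntu-latest
        steps:
          - run: echo "test ${{ matrix.python-version }}"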

Member

Let's rename this to build-wheel.yml. In the future we should also add build-conda.yml (#280).

if: ${{ github.repository_owner == 'nvidia' }}
secrets: inherit
uses:
./.github/workflows/build.yml
Member

ditto

Suggested change:
-./.github/workflows/build.yml
+./.github/workflows/build-wheel.yml

Comment on lines +39 to +43
- name: Setup proxy cache
uses: nv-gha-runners/setup-proxy-cache@main
continue-on-error: true
# Skip the cache on Windows nodes outside of our org.
if: ${{ inputs.host-platform != 'win-64' }}
Member

❤️

- name: Compute Python Test Matrix
id: compute-matrix
run: |
set -eo pipefail
Member

nit: we could move this to either the job default

    ...
    defaults:
      run:
        shell: bash --noprofile --norc -xeuo pipefail {0}
    steps:
      - name: Validate Test Type
        ...

or the workflow default (note that the other job below also sets this).
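
For reference, a minimal sketch of that workflow-level variant (same shell line, placed at the top level of the workflow file):

    # Applies to every run step in every job of this workflow.
    defaults:
      run:
        shell: bash --noprofile --norc -xeuo pipefail {0}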

#
# [PY_VER, CUDA_VER, LOCAL_CTK, GPU, DRIVER]
#
export MATRICES="
Member

To confirm my understanding: The nightly CIs will cover the cartesian product of PY_VER, CUDA_VER, and LOCAL_CTK, except for the above case (CUDA 12.0.1 wheels).

When we enable the nightly CI in the next PR, we should also spread it out across more GPU variants (ex: RTX 2080, T4, V100, ..., which we do have in nv-gha-runners).

- name: Compute Python Test Matrix
id: compute-matrix
run: |
set -eo pipefail
Member

ditto

Comment on lines +49 to +53
nightly:
- { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '11.8.0', LOCAL_CTK: '0' }
- { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '11.8.0', LOCAL_CTK: '1' }
- { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '0' }
- { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '1' }
Member

I hope we can do the same for Windows too and cover the full cartesian product PY_VER x CUDA_VER x LOCAL_CTK in nightly.


# Please keep the matrices sorted in ascending order by the following:
#
# [PY_VER, CUDA_VER, LOCAL_CTK]
Member

Note to self: This is a smaller matrix because currently we don't get to choose GPU on Windows.

However, we do get to choose the driver version on Windows (it is installed at job start time). @cryos, when we set up the nightly, perhaps we could make it configurable?

Labels
CI/CD (CI/CD infrastructure), P1 (Medium priority - Should do)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Refactor the CI matrix
3 participants