Refactor the CI matrix #555
Conversation
I haven't checked the diff of this PR carefully, but this seems to be in the right direction! I've been wanting to factor out the build parts from `build-and-test.yml` to `build-wheel.yml` (the test parts were already done and are now in `test-wheel.yml`).
Thanks Leo, I am just working through test matrix proposals now. If you want to discuss this in more detail I am more than happy to; right now I am trying a few things out, scoping out what fits well, and trying to offer better grouping/labeling.
Let's rename this to `build-wheel.yml`. In the future we should also add `build-conda.yml` (#280).
```yaml
id: setup-python
uses: actions/setup-python@v5
with:
  python-version: "3.12"
```
Does this need to use the Python version matrix?
It was not, and it does not seem to need to. Good call-out, as I had meant to ask and forgotten - @leofang https://github.com/NVIDIA/cuda-python/blob/main/.github/workflows/build-and-test.yml#L56-L61 is where I got this from; it referenced the matrix version there, but then set up 3.12 every time.
@cryos Back then there was a nasty setup-python bug (see the comment right above the lines you linked to) that made the build steps fail. That was why the old build pipeline needed to lock to PY312. But when actually building the wheels (using the cibuildwheel action), we honor the Python version set in the matrix.
cuda-python/.github/workflows/build-and-test.yml, lines 69 to 77 in c71ed39:
```bash
PYTHON_VERSION_FORMATTED=$(echo '${{ matrix.python-version }}' | tr -d '.')
if [[ "${{ matrix.host-platform }}" == linux* ]]; then
  CIBW_BUILD="cp${PYTHON_VERSION_FORMATTED}-manylinux*"
  REPO_DIR=$(pwd)
elif [[ "${{ matrix.host-platform }}" == win* ]]; then
  CIBW_BUILD="cp${PYTHON_VERSION_FORMATTED}-win_amd64"
  PWD=$(pwd)
  REPO_DIR=$(cygpath -w $PWD)
fi
```
I haven't checked the latest new build pipeline carefully -- will try to get to it tomorrow.
python-version: "3.12" | |
# WAR: setup-python is not relocatable... | |
# see https://github.com/actions/setup-python/issues/871 | |
python-version: "3.12" |
```bash
#
# [PY_VER, CUDA_VER, LOCAL_CTK, GPU, DRIVER]
#
export MATRICES="
```
We don't really have much interaction between Python version and driver version, so we don't need the cross product of them, just coverage of each. Similarly, we don't really have much interaction between Python version and CTK install type, so we don't need the cross product of them either, just coverage of each.

I think the coverage matrix we want to hit is `CUDA_VER` x `LOCAL_CTK` x `DRIVER`, which would give us 12 jobs per `ARCH` today, across which we can spread the `PY_VER` values for coverage. When we move to 13.x, we'll have 4 `CUDA_VER` values we want to test as opposed to 3 today, which would give us 24 jobs per `ARCH`; I believe that can still be handled reasonably easily.
This is a great point - I wanted to get opinions. This is a literal copy of what was happening, but I thought it would be great to cut it down a fair bit, and it sounds like you are on board. I figured old and new drivers should get some coverage, along with some local versus wheel CTK installs. I would gladly reduce it; I was also thinking that nightly runs might do more, and PRs could do less.
Not sure I hit exactly what you were thinking, but I got it to 12 per architecture aside from the special one for `amd64`, with a mixture to cover different elements of the matrix. This could be tweaked up or down for sure, or mixed with some other variations (part of why I wanted to structure it this way).
I was thinking to make sure we have coverage on each combination of `CUDA_VER: {'11.8.0', '12.0.1', '12.8.0'}` x `LOCAL_CTK: {'0', '1'}` x `DRIVER: {'earliest', 'latest'}`. Your update looks like `DRIVER: 'earliest'` is currently only covered with `LOCAL_CTK: '0'` and `CUDA_VER: '11.8.0'`.

Unless I'm misunderstanding and we need to cover something like a third driver version as well, where it's the earliest supported 12.x version, for example.
Ah, there was no existing support for changing the driver - it wasn't clear what we wanted to cover there, so I figured I'd add something. I proved `earliest` does not work with CUDA `12.8.0` at all in an earlier run, so that can't be added. I get what you mean: cover as much of `CUDA_VER`, `LOCAL_CTK`, and `DRIVER` as possible, spread across Python versions, as you see little to no interaction there. I can tweak it, but I won't add `earliest` back with CUDA `12.8.0` as it fails at runtime. Additionally, CUDA `12.0.1` does not work for `LOCAL_CTK` set to `0`.
> I proved `earliest` does not work with CUDA `12.8.0` at all in an earlier run, so that can't be added.

Yea, this is expected since `earliest` is a CUDA 11 driver that does not support CUDA 12. Ideally we would have a test that uses the CUDA 12.0 driver with our CUDA 12.8 build. Doesn't need to be covered in this PR, but we should probably support something like `cuda11_earliest`, `cuda11_latest`, `cuda12_earliest`, `cuda12_latest` for the driver versions.

> Additionally CUDA 12.0.1 does not work for LOCAL_CTK set to 0.

This is a bit surprising. Any idea why?
> Doesn't need to be covered in this PR, but we should probably support something like `cuda11_earliest`, `cuda11_latest`, `cuda12_earliest`, `cuda12_latest` for the driver versions.

Yes - we need support on the runner side; this is something that is on the list for future additions to the runners.

> > Additionally CUDA 12.0.1 does not work for LOCAL_CTK set to 0.
>
> This is a bit surprising. Any idea why?

This line here is where it was excluded in the original matrix.
> > Additionally CUDA 12.0.1 does not work for LOCAL_CTK set to 0.
> > This is a bit surprising. Any idea why?
>
> This line here is where it was excluded in the original matrix.

Reason: we have no way to express "I want to install the CUDA wheels associated with a certain CTK major.minor(.patch)" (#393 (comment)), until Emma's metapackage lands (soon, @rwgk has been testing them internally).
To confirm my understanding: the `nightly` CIs will cover the cartesian product of `PY_VER`, `CUDA_VER`, and `LOCAL_CTK`, except for the above case (CUDA 12.0.1 wheels).

When we enable the `nightly` CI in the next PR, we should also spread it out across more GPU variants (ex: RTX 2080, T4, V100, ..., which we do have in nv-gha-runners).
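As a sketch only, the product-minus-exclusion could be generated in a single workflow step with `jq`; the step name, output wiring, and `PY_VER` list below are assumptions, while the `CUDA_VER`/`LOCAL_CTK` value sets and the excluded combination come from this thread:

```yaml
- name: Compute nightly matrix  # hypothetical step
  id: compute-matrix
  run: |
    # Full cartesian product of PY_VER x CUDA_VER x LOCAL_CTK, minus the
    # known-bad combination (CUDA 12.0.1 wheels, i.e. LOCAL_CTK '0').
    MATRIX=$(jq -cn '
      [ { PY_VER:    ("3.9", "3.10", "3.11", "3.12"),
          CUDA_VER:  ("11.8.0", "12.0.1", "12.8.0"),
          LOCAL_CTK: ("0", "1") }
        | select((.CUDA_VER == "12.0.1" and .LOCAL_CTK == "0") | not) ]')
    echo "matrix=${MATRIX}" >> "$GITHUB_OUTPUT"
```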
```diff
@@ -0,0 +1,7 @@
{
```
What is this file needed for?
This is read in at the start to share the CUDA build version. It was an initial stab at offering a central spot to write that down once; I can back it out, move it, etc. I had seen a desire to have a JSON file store some of these things rather than having them in several CI workflows. I think it might be useful, but I agree it is a tiny file with a single version in it right now.
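For context, a minimal sketch of the pattern being described, assuming a file like `ci/versions.json` containing `{ "cuda_build_ver": "12.8.0" }` (the file path, key name, and step wiring are all hypothetical):

```yaml
- name: Load shared versions  # hypothetical step
  id: versions
  run: |
    # Read the single pinned CUDA build version once, so later steps and
    # jobs can reference steps.versions.outputs.cuda-build-ver instead of
    # hard-coding the version in several workflows.
    echo "cuda-build-ver=$(jq -r .cuda_build_ver ci/versions.json)" >> "$GITHUB_OUTPUT"
```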
Given this is currently only used for CI, it would be nice to put it in a folder away from the root of the repository, then.
Sounds good - I nearly did that. Should I just create a `ci` folder, or are there preferences on the path?
Whoever is taking over the CI development/maintenance gets to make the call :)
xref: #328, it'd be nice (not for this PR, but let's keep this in mind) in the future to encode the CI matrix/runner/etc. in a JSON file instead of embedding this information in the workflow code, if possible (sometimes it is surprisingly hard, e.g. the per-arch `matrix:` that we discussed before).
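One possible shape for this, sketched under the assumption of a `ci/matrix.json` file and a loader job (all names here are hypothetical); GitHub Actions' `fromJSON` allows a `strategy.matrix` to be supplied from a job output:

```yaml
jobs:
  load-matrix:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.load.outputs.matrix }}
    steps:
      - uses: actions/checkout@v4
      - id: load
        # Compact the JSON so it fits on a single output line.
        run: echo "matrix=$(jq -c . ci/matrix.json)" >> "$GITHUB_OUTPUT"

  test:
    needs: load-matrix
    strategy:
      matrix: ${{ fromJSON(needs.load-matrix.outputs.matrix) }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "PY=${{ matrix.PY_VER }} CUDA=${{ matrix.CUDA_VER }}"
```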
Add a note on combinations that are known not to work together.
```yaml
runner:
  - default
exclude:
  # To test this combo would require nontrivial installation steps.
```
@kkraus14 this is the exclusion - I think it is a practical matter of packaging, and something I preserved from the original matrix: no 12.0.1 wheel testing, maybe due to not having wheels that far back, or them being in flux?
Thanks a lot, Marcus! LGTM overall, we're heading in the right direction! I left some comments, but only one of them is important to address before merging (the CI workflow name).
```diff
@@ -0,0 +1,119 @@
name: "CI"
```
Important: As noted in the original file, this name is referred to in the test stages to query the run ID, e.g. here:

```bash
LATEST_PRIOR_RUN_ID=$(gh run list -b ${OLD_BRANCH} -L 1 -w "CI: Build and test" -s completed -R NVIDIA/cuda-python --json databaseId | jq '.[]| .databaseId')
```

and here:

```powershell
$runData = gh run list -b $OLD_BRANCH -L 1 -w "CI: Build and test" -s completed -R NVIDIA/cuda-python --json databaseId | ConvertFrom-Json
```

so either we make a global change, or we keep the original name intact:

```suggestion
# Note: This name is referred to in the test job, so make sure any changes are sync'd up!
name: "CI: Build and test"
```
```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}-${{ github.event_name }}
  cancel-in-progress: true
```
btw I don't think it is in scope for this PR, just something that came to my mind.

This PR is in fact v3 of the CI. When I refactored to v2, one thing I noticed is that it has less concurrency compared to v1 (and I think v3 inherits this from v2). In v1, when a build job finishes, the dependent test job starts immediately (and potentially overlaps with other build jobs). But in v2, all build jobs have to finish and synchronize, and only then do all test jobs start to queue up. I think this is because of the mismatched number of jobs between build and test in v2, and the use of `needs:` to declare the job dependency. If you have any idea for restoring the v1-like behavior, we should discuss/implement it in a future PR.
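One idea, as a sketch only: fan out one build+test chain per matrix entry via a reusable workflow, so each test job only `needs:` its own build instead of all of them. The `build-then-test.yml` workflow, its input name, and the platform list below are all hypothetical:

```yaml
# Caller workflow: one independent pipeline per matrix entry.
jobs:
  pipeline:
    strategy:
      matrix:
        host-platform: [linux-64, linux-aarch64, win-64]
    uses: ./.github/workflows/build-then-test.yml  # hypothetical reusable workflow
    with:
      host-platform: ${{ matrix.host-platform }}
    secrets: inherit

# Inside build-then-test.yml (a workflow_call workflow), the test job would
# declare `needs: build`, so it starts as soon as its own build finishes,
# overlapping with the other platforms' builds.
```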
```yaml
if: ${{ github.repository_owner == 'nvidia' }}
secrets: inherit
uses:
  ./.github/workflows/build.yml
```
ditto

```suggestion
./.github/workflows/build-wheel.yml
```
```yaml
- name: Setup proxy cache
  uses: nv-gha-runners/setup-proxy-cache@main
  continue-on-error: true
  # Skip the cache on Windows nodes outside of our org.
  if: ${{ inputs.host-platform != 'win-64' }}
```
❤️
```yaml
- name: Compute Python Test Matrix
  id: compute-matrix
  run: |
    set -eo pipefail
```
nit: we could move this to either the job default:

```yaml
defaults:
  run:
    shell: bash --noprofile --norc -xeuo pipefail {0}
steps:
  - name: Validate Test Type
    ...
```

or the workflow default (note that the other job below also sets this).
```yaml
- name: Compute Python Test Matrix
  id: compute-matrix
  run: |
    set -eo pipefail
```
ditto
```yaml
nightly:
  - { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '11.8.0', LOCAL_CTK: '0' }
  - { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '11.8.0', LOCAL_CTK: '1' }
  - { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '0' }
  - { ARCH: ${ARCH}, PY_VER: '3.12', CUDA_VER: '12.8.0', LOCAL_CTK: '1' }
```
I hope we can do the same for Windows too and cover the full cartesian product `PY_VER` x `CUDA_VER` x `LOCAL_CTK` in `nightly`.
```yaml
# Please keep the matrices sorted in ascending order by the following:
#
# [PY_VER, CUDA_VER, LOCAL_CTK]
```
Note to self: this is a smaller matrix because currently we don't get to choose the GPU on Windows. However, we do get to choose the driver version on Windows (it is installed at job start time). @cryos, when we set up `nightly`, perhaps we could make it configurable?
Description
Refactor the CI matrix - closes #576.
Refactor the way the CI matrix is composed to be layered, add a proxy cache to improve Python package download times, and offer the concept of separate pull request and nightly matrices for the tests.