DRA: CPU Placement Bit String #5213

Open
4 tasks
johnbelamaric opened this issue Mar 20, 2025 · 6 comments
Labels: sig/node, sig/scheduling, wg/device-management

Comments

@johnbelamaric
Member

Enhancement Description

One of the key goals of DRA is to help with alignment of devices in the intra-node topology. In the current DRA incarnation, this is done via the matchAttributes constraint. This allows you to require that specified attributes for all devices satisfying the request have the same value. For example, the NIC and GPU that are selected must share the same PCIe root complex.
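As a rough illustration of what a matchAttributes-style constraint checks, here is a minimal Python sketch. It is not the scheduler's implementation: the device dictionaries, the attribute name `example.com/pcieRoot`, and the function name are all hypothetical, and the sketch assumes that a device lacking the attribute cannot satisfy the constraint.

```python
from typing import Any


def satisfies_match_attribute(devices: list[dict[str, Any]], attribute: str) -> bool:
    """Return True if every device has `attribute` and all values are equal.

    Assumption: a device that lacks the attribute fails the constraint.
    """
    values = set()
    for device in devices:
        if attribute not in device:
            return False
        values.add(device[attribute])
    # Zero or one distinct value means the selected devices agree.
    return len(values) <= 1


# Hypothetical devices; the attribute name is made up for illustration.
ATTR = "example.com/pcieRoot"
nic = {"name": "nic0", ATTR: "pci0000:00"}
gpu_near = {"name": "gpu0", ATTR: "pci0000:00"}
gpu_far = {"name": "gpu1", ATTR: "pci0000:80"}

print(satisfies_match_attribute([nic, gpu_near], ATTR))  # same root complex
print(satisfies_match_attribute([nic, gpu_far], ATTR))   # different root complex
```

Note that the user still has to know which attribute to match on, which is the knowledge burden described next.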

One issue with this is that it does not include CPU alignment. Another issue is that intra-node topologies vary widely and are often quite complex; simple attribute matching may not be sufficient in many cases. It also requires a fair bit of knowledge on the part of users as to how to align these different devices.

The approach taken by Slurm, an HPC scheduler, is a bit different. Rather than requiring an understanding of specific attributes, Slurm calculates a standardized CPU-placement bit string for every device. That is, it normalizes based on the number of CPUs in the node and publishes, for each device, a bit string that represents which CPU(s) that device is aligned with. This localizes the alignment logic to the node, rather than requiring users to understand it in depth.
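To make the idea concrete, here is a minimal sketch of turning a device's list of local CPUs into a fixed-width bit string. This is not Slurm's code. It assumes the input is a Linux cpulist-style string such as `0-3,8-11`, that the string has one character per CPU in the node, and that position i (left to right) stands for CPU i; the actual encoding and bit order are not specified in this issue, and the function names are hypothetical.

```python
def parse_cpulist(cpulist: str) -> set[int]:
    """Parse a Linux cpulist-style string such as "0-3,8" into CPU ids."""
    cpus: set[int] = set()
    for part in cpulist.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def placement_bit_string(cpulist: str, num_cpus: int) -> str:
    """Build a bit string of length num_cpus; character i is '1' if CPU i is local.

    Assumption: left-to-right order, one character per CPU in the node.
    """
    cpus = parse_cpulist(cpulist)
    if any(cpu >= num_cpus for cpu in cpus):
        raise ValueError("cpulist names a CPU outside the node")
    return "".join("1" if i in cpus else "0" for i in range(num_cpus))


print(placement_bit_string("0-3", 8))    # 11110000
print(placement_bit_string("0,4-5", 8))  # 10001100
```

Because every device on the node gets a string of the same length, two devices can be compared without knowing anything about the topology that produced the strings.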

This bit string can be calculated from data in /proc and similar places on Linux machines, and the folks building Slurm have agreed to develop the necessary code to calculate it as OSS to share with Kubernetes. DRA plugin authors could use this common library to publish the common placement bit string, and an alignment constraint option could be added to the ResourceClaim constraints. This will allow users to require alignment between their devices without having to understand the details of intra-node topology. It would also allow us to align optimally by default (that is, without the user asking, we would align if possible; with the constraint, we would fail scheduling if we cannot align).
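The difference between best-effort alignment and a required constraint can be sketched as follows. This is only an illustration under stated assumptions: "aligned" is taken to mean that the selected devices' bit strings share at least one set bit, all bit strings on a node have the same length, and the `allocate` function and its inputs are hypothetical rather than part of any existing API.

```python
from functools import reduce
from itertools import product
from operator import and_
from typing import Optional

Candidate = tuple[str, str]  # (device name, CPU placement bit string)


def allocate(requests: list[list[Candidate]],
             require_alignment: bool = False) -> Optional[list[str]]:
    """Pick one device per request, preferring combinations that are aligned.

    Assumption: devices are aligned when the AND of their bit strings is non-zero.
    Returns None when alignment is required but no combination is aligned.
    """
    if not requests:
        return []
    first_combo = None
    for combo in product(*requests):
        if first_combo is None:
            first_combo = combo
        masks = [int(bits, 2) for _, bits in combo]
        if reduce(and_, masks):
            return [name for name, _ in combo]
    if require_alignment or first_combo is None:
        return None  # scheduling would fail on this node
    # Best effort: fall back to the first combination, even though unaligned.
    return [name for name, _ in first_combo]


gpus = [("gpu0", "11110000"), ("gpu1", "00001111")]
nics = [("nic0", "00001111")]

print(allocate([gpus, nics]))                                  # ['gpu1', 'nic0']
print(allocate([gpus[:1], nics]))                              # ['gpu0', 'nic0']
print(allocate([gpus[:1], nics], require_alignment=True))      # None
```

The exhaustive search over combinations is only for clarity; a real allocator would need something cheaper.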

  • One-line enhancement description (can be used as a release note): Enable DRA drivers to publish CPU alignment data, and allow DRA users to require that alignment during scheduling
  • Kubernetes Enhancement Proposal: TBD
  • Discussion Link: Experimenting with managing CPU alignment with DRA cncf/maintainer-summit#15
  • Primary contact (assignee): @johnbelamaric @catblade
  • Responsible SIGs: Node, Scheduling
  • Enhancement target (which target equals to which milestone):
    • Alpha release target (x.y): 1.34
    • Beta release target (x.y): 1.35
    • Stable release target (x.y): 1.36
  • Alpha
    • KEP (k/enhancements) update PR(s):
    • Code (k/k) update PR(s):
    • Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 20, 2025
@johnbelamaric
Member Author

/sig node
/sig scheduling
/wg device-management

/cc @pohly @klueska @mrunalp

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 20, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Mar 20, 2025
@johnbelamaric johnbelamaric self-assigned this Mar 20, 2025
@pohly pohly moved this from 🆕 New to 📋 Backlog in SIG Node: Dynamic Resource Allocation Mar 24, 2025
@johnbelamaric
Member Author

/assign @catblade

@k8s-ci-robot
Contributor

@johnbelamaric: GitHub didn't allow me to assign the following users: catblade.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @catblade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ffromani
Contributor

/cc

@ffromani
Contributor

I'm very, very interested in this direction, so I have some random initial musings/questions. I think there will be more clarity/details at the maintainer summit session.

  1. is the bit string meant to be literally a bit string like 0x111000001110000 or so, or is it gonna be a more sparse representation of all the CPUs the resource is affine to? (e.g. CPUs the resource is affine to expressed as an allowlist/linux cpuset rather than enumerating all the physical CPUs - I'd expect the bit string to be sparse, with more 0's than 1's)
  2. more in general, is the goal here to express cpu affinity for resources, or to express it with a straightforward "out of the box" experience? or both?
  3. is this proposal meant to replace or augment alignments based on PCIe properties, e.g. the PCIe root complex example?
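
To make the two options in question 1 concrete, here is a small sketch that converts a literal bit string into a sparse Linux cpulist-style string. It assumes position i (left to right) stands for CPU i, which is not settled anywhere in this thread, and the function name is hypothetical.

```python
def bit_string_to_cpulist(bits: str) -> str:
    """Convert a literal bit string into a sparse cpulist such as "0-2,8-10".

    Assumption: character i of the string (left to right) stands for CPU i.
    """
    ranges: list[tuple[int, int]] = []
    start = prev = None
    for cpu, bit in enumerate(bits):
        if bit != "1":
            continue
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu  # extend the current run of consecutive CPUs
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(bit_string_to_cpulist("111000001110000"))  # 0-2,8-10
```

Both forms carry the same information; the literal form has a fixed width per node, while the sparse form stays short when few CPUs are set.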

thanks!

@johnbelamaric
Member Author

> I'm very, very interested in this direction, so I have some random initial musings/questions. I think there will be more clarity/details at the maintainer summit session.
>
>   1. is the bit string meant to be literally a bit string like 0x111000001110000 or so, or is it gonna be a more sparse representation of all the CPUs the resource is affine to? (e.g. CPUs the resource is affine to expressed as an allowlist/linux cpuset rather than enumerating all the physical CPUs - I'd expect the bit string to be sparse, with more 0's than 1's)

@wickberg or @catblade will have a more definitive answer, but my understanding is it is a literal bitstring. But it's up to us really.

>   2. more in general, is the goal here to express cpu affinity for resources, or to express it with a straightforward "out of the box" experience? or both?

Both.

>   3. is this proposal meant to replace or augment alignments based on PCIe properties, e.g. the PCIe root complex example?

We won't take that away; it's part of the API. But its use may decline: it still has uses for other things, but it may no longer be needed for CPU alignment.

We might be able to achieve this without any API changes at first, at least for "best effort".
