Bazel 8.2.0 remote cache timed out regression #25860
Comments
I think that you are seeing this now because of c202315, which makes it so that failures during the download of an action output are only retried once within the build rather than multiple times by restarting the entire build. I don't know enough about your situation to assess whether this is just turning a preexisting issue into a hard failure or whether this kind of retrying behavior really is needed. It does seem weird to me to retry full builds for this failure case though. @coeuvre What do you think?
Checked on S3, and that particular cache file is 100MB+, so size is likely a contributing factor to trigger the regression (but we didn't have the issue in 8.1.1).
@bazel-io fork 8.2.1
@coeuvre can you confirm if this is an actual regression we should fix?
I am curious what changed in
Previously, transient remote cache errors (not just `CacheNotFoundException`) would cause the build to rewind. In practice, I think this is useful because a remote cache cannot guarantee 100% uptime, so I do want to keep this retry behavior if possible. However, for this specific case, it seems like a remote cache server error that was previously hidden by the build rewinding is now surfaced. The correct thing to do is to fix the server.
I removed this behavior in c202315 since I thought that transient remote cache errors could just be retried directly rather than requesting the entire build to be rewound. Do you have a particular error (other than
See #23033. Transient remote cache errors might last longer than the duration of an invocation; rewinding the build makes Bazel bypass the remote cache and fall back to local execution.
Maybe. The changes between
I see, that makes sense. I did not remove the build rewinding introduced in #23033, only the retry on server errors during uploads - which I think could and should always be retried "locally" instead of rewinding the build.
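Purely as an illustration of the "retry locally" idea discussed above (this is not Bazel's code; `DownloadWithLocalRetry`, `ErrTransient`, and the backoff policy are all made up for the sketch): a transient cache error gets retried a bounded number of times within the same action, and only once those retries are exhausted does the caller need something more drastic, such as local re-execution or rewinding the whole build.

```go
// Minimal sketch of "retry locally instead of rewinding the build".
// Not Bazel's implementation; names and retry policy are illustrative.
package cacheretry

import (
	"errors"
	"fmt"
	"time"
)

// ErrTransient stands in for a transient remote cache failure
// (timeout, connection reset, HTTP 5xx, ...).
var ErrTransient = errors.New("transient remote cache error")

// DownloadWithLocalRetry retries fetch with exponential backoff. Only when
// all attempts fail does the caller need a more drastic fallback, such as
// local re-execution or (pre-c202315) rewinding the whole build.
func DownloadWithLocalRetry(fetch func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fetch(); err == nil {
			return nil
		}
		if !errors.Is(err, ErrTransient) {
			return err // permanent errors are not worth retrying
		}
		time.Sleep(time.Duration(1<<i) * 100 * time.Millisecond)
	}
	return fmt.Errorf("remote cache download failed after %d attempts: %w", attempts, err)
}
```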
One possibility (purely speculating based on the API change between
Added some logs on the Go proxy, and this is the largest download I can find:
So for a ~53MB file from S3, with
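The logging mentioned here might look roughly like the following Go sketch; the `withDownloadLogging` wrapper and the log format are assumptions, not the reporter's actual proxy code. It just records how many bytes each cache download served and how long it took, which is how a ~53MB blob streaming slowly from S3 would show up.

```go
// Rough sketch of per-download logging in a Go HTTP cache proxy.
package proxysketch

import (
	"log"
	"net/http"
	"time"
)

// countingWriter counts the bytes written to the underlying ResponseWriter.
type countingWriter struct {
	http.ResponseWriter
	bytes int64
}

func (w *countingWriter) Write(p []byte) (int, error) {
	n, err := w.ResponseWriter.Write(p)
	w.bytes += int64(n)
	return n, err
}

// withDownloadLogging logs size and duration for every GET served by next.
func withDownloadLogging(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		cw := &countingWriter{ResponseWriter: w}
		next.ServeHTTP(cw, r)
		if r.Method == http.MethodGet {
			log.Printf("GET %s: %d bytes in %s", r.URL.Path, cw.bytes, time.Since(start))
		}
	})
}
```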
@bazel-io fork 8.3.0 |
... so that bazel can correctly rewind the build. Fixes bazelbuild#25860. Closes bazelbuild#25929. PiperOrigin-RevId: 750917248 Change-Id: I8a278a36bb6565e1d7204eb1a0d8a800abd845bd
…25942) ... so that bazel can correctly rewind the build. Fixes #25860. Closes #25929. PiperOrigin-RevId: 750917248 Change-Id: I8a278a36bb6565e1d7204eb1a0d8a800abd845bd Commit ba6f6f7 Co-authored-by: Chi Wang <[email protected]>
Description of the bug:
We run our CI tests with a remote cache setup: an S3 bucket behind a local Go HTTP proxy, with Bazel args of `--remote_cache=http://localhost --remote_proxy=unix:/tmp/bazel_cache_proxy`. There is no remote execution. It worked perfectly fine with Bazel 8.1.1 and before, but when we try to upgrade to 8.2.0, with the diff of:
It suddenly starts to fail the tests with:
This would also fail the whole CI run, since there is no local fallback.
The interesting things here are:
- Upgrading the proxy from `aws-sdk-go` to `aws-sdk-go-v2` actually fixes the issue with 8.2.0, but when I try that with 8.1.1, I don't see it actually run faster than `aws-sdk-go`.
- The test that failed with a timeout is one of the tests with bigger outputs, so it's possible that the bug only triggers with the combination of 8.2.0 and the cache exceeding some size threshold.
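For context, a rough sketch of the setup described in this report, under stated assumptions: Bazel issues its HTTP cache requests (GET/PUT on `/ac/<hash>` and `/cas/<hash>`) against `--remote_cache`, and `--remote_proxy=unix:/tmp/bazel_cache_proxy` routes them over a unix domain socket to the local proxy, which serves them from S3. This is not the reporter's actual proxy; the `fetchFromS3` helper is hypothetical and uploads are omitted.

```go
// Rough sketch of an HTTP cache server on a unix socket that serves
// Bazel's /ac/<hash> and /cas/<hash> requests out of S3.
package main

import (
	"io"
	"log"
	"net"
	"net/http"
)

// fetchFromS3 is a hypothetical helper standing in for the aws-sdk-go call
// that streams the object for the given cache key out of the bucket.
func fetchFromS3(key string) (io.ReadCloser, error) {
	panic("hypothetical: implement with s3.GetObject or equivalent")
}

func main() {
	// Matches --remote_proxy=unix:/tmp/bazel_cache_proxy from the report.
	ln, err := net.Listen("unix", "/tmp/bazel_cache_proxy")
	if err != nil {
		log.Fatal(err)
	}

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			w.WriteHeader(http.StatusNotImplemented) // uploads omitted in this sketch
			return
		}
		body, err := fetchFromS3(r.URL.Path) // e.g. /cas/<sha256> or /ac/<sha256>
		if err != nil {
			http.Error(w, err.Error(), http.StatusNotFound)
			return
		}
		defer body.Close()
		io.Copy(w, body) // a slow copy here is what surfaces as a Bazel-side timeout
	})

	log.Fatal(http.Serve(ln, handler))
}
```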
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
No response
Which operating system are you running Bazel on?
linux
What is the output of `bazel info release`?
release 8.2.0
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse HEAD`? If this is a regression, please try to identify the Bazel commit where the bug was introduced with `bazelisk --bisect`.
We only have access to the S3 bucket in the CI environment, so I cannot do the bisect locally. But this is one of the commits between 8.1.1 and 8.2.0.
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response