
Bazel 8.2.0 remote cache timed out regression #25860


Closed
fishy opened this issue Apr 15, 2025 · 13 comments
Labels
team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: bug, untriaged

Comments

fishy commented Apr 15, 2025

Description of the bug:

We run our CI tests with a remote cache backed by an S3 bucket, with a local Go HTTP proxy in front of the bucket, using the Bazel args --remote_cache=http://localhost --remote_proxy=unix:/tmp/bazel_cache_proxy. There is no remote execution.
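
For context, the shape of that proxy is roughly as follows (a minimal sketch rather than our actual code; the S3 fetch is stubbed out and the upload/PUT path is omitted):

package main

import (
	"io"
	"log"
	"net"
	"net/http"
	"strings"
)

// fetchFromS3 is a hypothetical stand-in for the real S3 SDK call.
func fetchFromS3(key string) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader("")), nil
}

func main() {
	// Bazel connects to this socket via --remote_proxy=unix:/tmp/bazel_cache_proxy
	// and speaks the plain HTTP cache protocol (GET/PUT on /ac/... and /cas/...).
	ln, err := net.Listen("unix", "/tmp/bazel_cache_proxy")
	if err != nil {
		log.Fatal(err)
	}
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := fetchFromS3(strings.TrimPrefix(r.URL.Path, "/"))
		if err != nil {
			http.Error(w, err.Error(), http.StatusNotFound)
			return
		}
		defer body.Close()
		io.Copy(w, body) // relay the cached blob back to Bazel
	})
	log.Fatal(http.Serve(ln, handler))
}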

It worked perfectly fine with Bazel 8.1.1 and earlier. But when we try to upgrade to 8.2.0, with the following diff:

diff --git a/.bazelversion b/.bazelversion
index 0e79152459..fbb9ea12de 100644
--- a/.bazelversion
+++ b/.bazelversion
@@ -1 +1 @@
-8.1.1
+8.2.0
diff --git a/MODULE.bazel.lock b/MODULE.bazel.lock
index 7219cc582c..06924910b3 100644
--- a/MODULE.bazel.lock
+++ b/MODULE.bazel.lock
@@ -110,10 +110,10 @@
     "https://bcr.bazel.build/modules/rules_java/7.2.0/MODULE.bazel": "06c0334c9be61e6cef2c8c84a7800cef502063269a5af25ceb100b192453d4ab",
     "https://bcr.bazel.build/modules/rules_java/7.3.2/MODULE.bazel": "50dece891cfdf1741ea230d001aa9c14398062f2b7c066470accace78e412bc2",
     "https://bcr.bazel.build/modules/rules_java/7.6.1/MODULE.bazel": "2f14b7e8a1aa2f67ae92bc69d1ec0fa8d9f827c4e17ff5e5f02e91caa3b2d0fe",
+    "https://bcr.bazel.build/modules/rules_java/8.11.0/MODULE.bazel": "c3d280bc5ff1038dcb3bacb95d3f6b83da8dd27bba57820ec89ea4085da767ad",
+    "https://bcr.bazel.build/modules/rules_java/8.11.0/source.json": "302b52a39259a85aa06ca3addb9787864ca3e03b432a5f964ea68244397e7544",
     "https://bcr.bazel.build/modules/rules_java/8.3.2/MODULE.bazel": "7336d5511ad5af0b8615fdc7477535a2e4e723a357b6713af439fe8cf0195017",
     "https://bcr.bazel.build/modules/rules_java/8.5.1/MODULE.bazel": "d8a9e38cc5228881f7055a6079f6f7821a073df3744d441978e7a43e20226939",
-    "https://bcr.bazel.build/modules/rules_java/8.6.1/MODULE.bazel": "f4808e2ab5b0197f094cabce9f4b006a27766beb6a9975931da07099560ca9c2",
-    "https://bcr.bazel.build/modules/rules_java/8.6.1/source.json": "f18d9ad3c4c54945bf422ad584fa6c5ca5b3116ff55a5b1bc77e5c1210be5960",
     "https://bcr.bazel.build/modules/rules_jvm_external/4.4.2/MODULE.bazel": "a56b85e418c83eb1839819f0b515c431010160383306d13ec21959ac412d2fe7",
     "https://bcr.bazel.build/modules/rules_jvm_external/5.1/MODULE.bazel": "33f6f999e03183f7d088c9be518a63467dfd0be94a11d0055fe2d210f89aa909",
     "https://bcr.bazel.build/modules/rules_jvm_external/5.2/MODULE.bazel": "d9351ba35217ad0de03816ef3ed63f89d411349353077348a45348b096615036",
@@ -582,28 +582,6 @@
         ]
       }
     },
-    "@@rules_java+//java:rules_java_deps.bzl%compatibility_proxy": {
-      "general": {
-        "bzlTransitiveDigest": "84xJEZ1jnXXwo8BXMprvBm++rRt4jsTu9liBxz0ivps=",
-        "usagesDigest": "jTQDdLDxsS43zuRmg1faAjIEPWdLAbDAowI1pInQSoo=",
-        "recordedFileInputs": {},
-        "recordedDirentsInputs": {},
-        "envVariables": {},
-        "generatedRepoSpecs": {
-          "compatibility_proxy": {
-            "repoRuleId": "@@rules_java+//java:rules_java_deps.bzl%_compatibility_proxy_repo_rule",
-            "attributes": {}
-          }
-        },
-        "recordedRepoMappingEntries": [
-          [
-            "rules_java+",
-            "bazel_tools",
-            "bazel_tools"
-          ]
-        ]
-      }
-    },
     "@@rules_kotlin+//src/main/starlark/core/repositories:bzlmod_setup.bzl%rules_kotlin_extensions": {
       "general": {
         "bzlTransitiveDigest": "sFhcgPbDQehmbD1EOXzX4H1q/CD5df8zwG4kp4jbvr8=",

the tests suddenly start to fail with:

ERROR: /path/to/BUILD.bazel:157:8: Testing //path/to:to_test failed: Failed to fetch blobs because of a remote cache error.: Download of '/cas/895986e760f9feb60526662e4d39924c4ca7e5d7c59495bc446b00734f0b6f5e' timed out. Received 0 bytes.

This also fails the whole CI run, since there is no local fallback.

The interesting things here are:

  1. This fails every time, not just randomly. No matter how many times I retry the CI run with 8.2.0, it always fails with the same error and never succeeds.
  2. If I update the Go proxy from aws-sdk-go to aws-sdk-go-v2, that actually fixes the issue with 8.2.0, but when I try that with 8.1.1, I don't see it run any faster than aws-sdk-go.

The test that failed with the timeout is one of the tests with larger output, so it's possible the bug only triggers with the combination of 8.2.0 and a cache entry exceeding some size threshold.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

linux

What is the output of bazel info release?

release 8.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?


If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

We only have access to the S3 bucket in the CI environment, so I cannot run the bisect locally. But the regression was introduced by one of the commits between 8.1.1 and 8.2.0.

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

fmeum (Collaborator) commented Apr 15, 2025

I think that you are seeing this now because of c202315, which makes it so that failures during the download of an action output are only retried once within the build rather than multiple times by restarting the entire build.

I don't know enough about your situation to assess whether this is just turning a preexisting issue into a hard failure or whether this kind of retrying behavior really is needed. It does seem weird to me to retry full builds for this failure case though.

@coeuvre What do you think?

fishy (Author) commented Apr 15, 2025

Checked on S3, and that particular cache file is 100MB+, so size is likely a contributing factor in triggering the regression (but we did not have the issue with 8.1.1).

satyanandak added the team-Remote-Exec (Issues and PRs for the Execution (Remote) team) label Apr 16, 2025
meteorcloudy (Member) commented:

@bazel-io fork 8.2.1

meteorcloudy (Member) commented:

@coeuvre can you confirm if this is an actual regression we should fix?

coeuvre (Member) commented Apr 22, 2025

> If I update the Go proxy from aws-sdk-go to aws-sdk-go-v2, that actually fixes the issue with 8.2.0, but when I try that with 8.1.1, I don't see it run any faster than aws-sdk-go.

I am curious what changed in aws-sdk-go-v2 that fixed the issue. Does it handle large blobs differently?

> whether this kind of retrying behavior really is needed

Previously, transient remote cache errors (not just CacheNotFoundException) would cause the build to rewind. In practice, I think this is useful because a remote cache cannot guarantee 100% uptime. I do want to keep this retry behavior if possible.

However, for this specific case, it seems like a remote cache server error that was previously hidden by the build rewinding is now being surfaced. The correct thing to do is to fix the server.

fmeum (Collaborator) commented Apr 22, 2025

> Previously, transient remote cache errors (not just CacheNotFoundException) would cause the build to rewind. In practice, I think this is useful because a remote cache cannot guarantee 100% uptime. I do want to keep this retry behavior if possible.

I removed this behavior in c202315 since I thought that transient remote cache errors could just be retried directly rather than requesting that the entire build be rewound. Do you have a particular error (other than CacheNotFoundException) in mind that could only be fixed by build rewinding?

coeuvre (Member) commented Apr 22, 2025

See #23033.

Transient remote cache errors might last longer than the duration of an invocation; rewinding the build makes Bazel bypass the remote cache and fall back to local execution.

fishy (Author) commented Apr 22, 2025

> I am curious what changed in aws-sdk-go-v2 that fixed the issue. Does it handle large blobs differently?

Maybe. The changes between aws-sdk-go and aws-sdk-go-v2 are quite large, so it's hard to say what changed under the hood without digging into the code.

fmeum (Collaborator) commented Apr 22, 2025

> See #23033.
>
> Transient remote cache errors might last longer than the duration of an invocation; rewinding the build makes Bazel bypass the remote cache and fall back to local execution.

I see, that makes sense. I did not remove the build rewinding introduced in #23033, only the retry on server errors during uploads, which I think could and should always be retried "locally" instead of rewinding the build.

fishy (Author) commented Apr 22, 2025

> I am curious what changed in aws-sdk-go-v2 that fixed the issue. Does it handle large blobs differently?

> Maybe. The changes between aws-sdk-go and aws-sdk-go-v2 are quite large, so it's hard to say what changed under the hood without digging into the code.

One possibility (purely speculating based on the API change between aws-sdk-go and aws-sdk-go-v2) is that the timeout we hit was a first-byte timeout (the error message mentioned "Received 0 bytes"). With the aws-sdk-go API, the Go proxy has to download the whole file from the S3 bucket into memory before sending the response to Bazel; with aws-sdk-go-v2 we can stream the file from S3 to Bazel, so the total download time may be roughly the same, but the first byte can arrive much sooner.
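
To make the speculation above concrete, here is a rough sketch of the two handler shapes (purely illustrative; whether the old proxy really used the v1 s3manager.Downloader is an assumption, and the bucket name and key mapping are hypothetical):

package proxysketch // hypothetical package, not the real proxy

import (
	"io"
	"net/http"
	"strings"

	awsv1 "github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	s3v1 "github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"

	awsv2 "github.com/aws/aws-sdk-go-v2/aws"
	s3v2 "github.com/aws/aws-sdk-go-v2/service/s3"
)

const bucket = "example-bazel-cache-bucket" // hypothetical

// aws-sdk-go (v1): s3manager.Downloader writes into an io.WriterAt, so the
// whole object sits in memory before the first byte can go back to Bazel.
func serveWithV1(w http.ResponseWriter, r *http.Request, sess *session.Session) {
	buf := awsv1.NewWriteAtBuffer(nil)
	downloader := s3manager.NewDownloader(sess)
	if _, err := downloader.Download(buf, &s3v1.GetObjectInput{
		Bucket: awsv1.String(bucket),
		Key:    awsv1.String(strings.TrimPrefix(r.URL.Path, "/")),
	}); err != nil {
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	w.Write(buf.Bytes()) // first byte reaches Bazel only after the full download
}

// aws-sdk-go-v2: GetObject exposes the body as an io.ReadCloser, so the proxy
// can stream it straight through; Bazel sees the first byte as soon as S3
// starts responding.
func serveWithV2(w http.ResponseWriter, r *http.Request, client *s3v2.Client) {
	out, err := client.GetObject(r.Context(), &s3v2.GetObjectInput{
		Bucket: awsv2.String(bucket),
		Key:    awsv2.String(strings.TrimPrefix(r.URL.Path, "/")),
	})
	if err != nil {
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	defer out.Body.Close()
	io.Copy(w, out.Body)
}

Either way the total transfer time is similar; the difference is only in when Bazel's HTTP client sees the first byte, which matters if the timeout that fired was a first-byte timeout.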

fishy (Author) commented Apr 22, 2025


I added some logging to the Go proxy, and this is the largest download I can find:

size=53226408 headerTime=114.195509ms took=5.244117023s

So for a ~53MB file from S3, with aws-sdk-go-v2 it took only 114ms to get the headers back from the S3 API, but about 5s to proxy the whole file back to Bazel.
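
For reference, the headerTime/took numbers above map onto the v2 handler roughly like this (a hypothetical sketch, not the actual proxy code):

package proxysketch // hypothetical

import (
	"io"
	"log"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// timedGet shows how the logged numbers correspond to the handler phases.
func timedGet(w http.ResponseWriter, r *http.Request, client *s3.Client, input *s3.GetObjectInput) {
	start := time.Now()
	out, err := client.GetObject(r.Context(), input)
	if err != nil {
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	headerTime := time.Since(start) // ~114ms in the example above: S3 response headers received
	defer out.Body.Close()
	n, _ := io.Copy(w, out.Body) // streaming the body accounts for the rest of the ~5s
	log.Printf("size=%d headerTime=%s took=%s", n, headerTime, time.Since(start))
}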

coeuvre (Member) commented Apr 23, 2025

@fishy Thanks for the investigation!

I created #25929 in the spirit of #23033.

iancha1992 (Member) commented:

@bazel-io fork 8.3.0

iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Apr 24, 2025
... so that bazel can correctly rewind the build.

Fixes bazelbuild#25860.

Closes bazelbuild#25929.

PiperOrigin-RevId: 750917248
Change-Id: I8a278a36bb6565e1d7204eb1a0d8a800abd845bd
github-merge-queue bot pushed a commit that referenced this issue Apr 25, 2025
…25942)

... so that bazel can correctly rewind the build.

Fixes #25860.

Closes #25929.

PiperOrigin-RevId: 750917248
Change-Id: I8a278a36bb6565e1d7204eb1a0d8a800abd845bd

Commit
ba6f6f7

Co-authored-by: Chi Wang <[email protected]>
fmeum pushed a commit to fmeum/bazel that referenced this issue Apr 25, 2025
... so that bazel can correctly rewind the build.

Fixes bazelbuild#25860.

Closes bazelbuild#25929.

PiperOrigin-RevId: 750917248
Change-Id: I8a278a36bb6565e1d7204eb1a0d8a800abd845bd