
Bazel 8.2.0 remote cache timed out regression #25860


Closed
fishy opened this issue Apr 15, 2025 · 13 comments
Labels
team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: bug, untriaged

Comments

fishy commented Apr 15, 2025

Description of the bug:

We run our CI tests with a remote cache backed by an S3 bucket, with a local Go HTTP proxy in front of the bucket, using the Bazel args --remote_cache=http://localhost --remote_proxy=unix:/tmp/bazel_cache_proxy. There is no remote execution.
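
For context, the shape of that proxy is roughly as follows (a minimal sketch rather than our actual code; the S3 fetch is stubbed out and the upload/PUT path is omitted):

package main

import (
	"io"
	"log"
	"net"
	"net/http"
	"strings"
)

// fetchFromS3 is a hypothetical stand-in for the real S3 SDK call.
func fetchFromS3(key string) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader("")), nil
}

func main() {
	// Bazel connects to this socket via --remote_proxy=unix:/tmp/bazel_cache_proxy
	// and speaks the plain HTTP cache protocol (GET/PUT on /ac/... and /cas/...).
	ln, err := net.Listen("unix", "/tmp/bazel_cache_proxy")
	if err != nil {
		log.Fatal(err)
	}
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := fetchFromS3(strings.TrimPrefix(r.URL.Path, "/"))
		if err != nil {
			http.Error(w, err.Error(), http.StatusNotFound)
			return
		}
		defer body.Close()
		io.Copy(w, body) // relay the cached blob back to Bazel
	})
	log.Fatal(http.Serve(ln, handler))
}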

It worked perfectly fine with Bazel 8.1.1 and earlier. But when we try to upgrade to 8.2.0, with the following diff:

diff --git a/.bazelversion b/.bazelversion
index 0e79152459..fbb9ea12de 100644
--- a/.bazelversion
+++ b/.bazelversion
@@ -1 +1 @@
-8.1.1
+8.2.0
diff --git a/MODULE.bazel.lock b/MODULE.bazel.lock
index 7219cc582c..06924910b3 100644
--- a/MODULE.bazel.lock
+++ b/MODULE.bazel.lock
@@ -110,10 +110,10 @@
     "https://bcr.bazel.build/modules/rules_java/7.2.0/MODULE.bazel": "06c0334c9be61e6cef2c8c84a7800cef502063269a5af25ceb100b192453d4ab",
     "https://bcr.bazel.build/modules/rules_java/7.3.2/MODULE.bazel": "50dece891cfdf1741ea230d001aa9c14398062f2b7c066470accace78e412bc2",
     "https://bcr.bazel.build/modules/rules_java/7.6.1/MODULE.bazel": "2f14b7e8a1aa2f67ae92bc69d1ec0fa8d9f827c4e17ff5e5f02e91caa3b2d0fe",
+    "https://bcr.bazel.build/modules/rules_java/8.11.0/MODULE.bazel": "c3d280bc5ff1038dcb3bacb95d3f6b83da8dd27bba57820ec89ea4085da767ad",
+    "https://bcr.bazel.build/modules/rules_java/8.11.0/source.json": "302b52a39259a85aa06ca3addb9787864ca3e03b432a5f964ea68244397e7544",
     "https://bcr.bazel.build/modules/rules_java/8.3.2/MODULE.bazel": "7336d5511ad5af0b8615fdc7477535a2e4e723a357b6713af439fe8cf0195017",
     "https://bcr.bazel.build/modules/rules_java/8.5.1/MODULE.bazel": "d8a9e38cc5228881f7055a6079f6f7821a073df3744d441978e7a43e20226939",
-    "https://bcr.bazel.build/modules/rules_java/8.6.1/MODULE.bazel": "f4808e2ab5b0197f094cabce9f4b006a27766beb6a9975931da07099560ca9c2",
-    "https://bcr.bazel.build/modules/rules_java/8.6.1/source.json": "f18d9ad3c4c54945bf422ad584fa6c5ca5b3116ff55a5b1bc77e5c1210be5960",
     "https://bcr.bazel.build/modules/rules_jvm_external/4.4.2/MODULE.bazel": "a56b85e418c83eb1839819f0b515c431010160383306d13ec21959ac412d2fe7",
     "https://bcr.bazel.build/modules/rules_jvm_external/5.1/MODULE.bazel": "33f6f999e03183f7d088c9be518a63467dfd0be94a11d0055fe2d210f89aa909",
     "https://bcr.bazel.build/modules/rules_jvm_external/5.2/MODULE.bazel": "d9351ba35217ad0de03816ef3ed63f89d411349353077348a45348b096615036",
@@ -582,28 +582,6 @@
         ]
       }
     },
-    "@@rules_java+//java:rules_java_deps.bzl%compatibility_proxy": {
-      "general": {
-        "bzlTransitiveDigest": "84xJEZ1jnXXwo8BXMprvBm++rRt4jsTu9liBxz0ivps=",
-        "usagesDigest": "jTQDdLDxsS43zuRmg1faAjIEPWdLAbDAowI1pInQSoo=",
-        "recordedFileInputs": {},
-        "recordedDirentsInputs": {},
-        "envVariables": {},
-        "generatedRepoSpecs": {
-          "compatibility_proxy": {
-            "repoRuleId": "@@rules_java+//java:rules_java_deps.bzl%_compatibility_proxy_repo_rule",
-            "attributes": {}
-          }
-        },
-        "recordedRepoMappingEntries": [
-          [
-            "rules_java+",
-            "bazel_tools",
-            "bazel_tools"
-          ]
-        ]
-      }
-    },
     "@@rules_kotlin+//src/main/starlark/core/repositories:bzlmod_setup.bzl%rules_kotlin_extensions": {
       "general": {
         "bzlTransitiveDigest": "sFhcgPbDQehmbD1EOXzX4H1q/CD5df8zwG4kp4jbvr8=",

the tests suddenly start to fail with:

ERROR: /path/to/BUILD.bazel:157:8: Testing //path/to:to_test failed: Failed to fetch blobs because of a remote cache error.: Download of '/cas/895986e760f9feb60526662e4d39924c4ca7e5d7c59495bc446b00734f0b6f5e' timed out. Received 0 bytes.

This also fails the whole CI run, since there is no local fallback.

The interesting things here are:

  1. This fails every time, not just randomly. No matter how many times I retry the CI run with 8.2.0, it always fails with the same error and never succeeds.
  2. If I update the Go proxy from aws-sdk-go to aws-sdk-go-v2, that actually fixes the issue with 8.2.0, but when I try that with 8.1.1, I don't see it run any faster than aws-sdk-go.

The test that failed with the timeout is one of the tests with larger output, so it's possible the bug only triggers with the combination of 8.2.0 and a cache entry exceeding some size threshold.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

linux

What is the output of bazel info release?

release 8.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?


If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

We only have access to the S3 bucket in the CI environment, so I cannot run the bisect locally. But the regression was introduced by one of the commits between 8.1.1 and 8.2.0.

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

fmeum (Collaborator) commented Apr 15, 2025

I think that you are seeing this now because of c202315, which makes it so that failures during the download of an action output are only retried once within the build rather than multiple times by restarting the entire build.

I don't know enough about your situation to assess whether this is just turning a preexisting issue into a hard failure or whether this kind of retrying behavior really is needed. It does seem weird to me to retry full builds for this failure case though.

@coeuvre What do you think?

fishy (Author) commented Apr 15, 2025

Checked on S3, and that particular cache file is 100MB+, so size is likely a contributing factor in triggering the regression (but we did not have the issue with 8.1.1).

satyanandak added the team-Remote-Exec (Issues and PRs for the Execution (Remote) team) label Apr 16, 2025
meteorcloudy (Member) commented:

@bazel-io fork 8.2.1

meteorcloudy (Member) commented:

@coeuvre can you confirm if this is an actual regression we should fix?

coeuvre (Member) commented Apr 22, 2025

> If I update the Go proxy from aws-sdk-go to aws-sdk-go-v2, that actually fixes the issue with 8.2.0, but when I try that with 8.1.1, I don't see it run any faster than aws-sdk-go.

I am curious what changed in aws-sdk-go-v2 that fixed the issue. Does it handle large blobs differently?

> whether this kind of retrying behavior really is needed

Previously, transient remote cache errors (not just CacheNotFoundException) would cause the build to rewind. In practice, I think this is useful because a remote cache cannot guarantee 100% uptime. I do want to keep this retry behavior if possible.

However, for this specific case, it seems like a remote cache server error that was previously hidden by the build rewinding is now being surfaced. The correct thing to do is to fix the server.

fmeum (Collaborator) commented Apr 22, 2025

> Previously, transient remote cache errors (not just CacheNotFoundException) would cause the build to rewind. In practice, I think this is useful because a remote cache cannot guarantee 100% uptime. I do want to keep this retry behavior if possible.

I removed this behavior in c202315 since I thought that transient remote cache errors could just be retried directly rather than requesting that the entire build be rewound. Do you have a particular error (other than CacheNotFoundException) in mind that could only be fixed by build rewinding?

coeuvre (Member) commented Apr 22, 2025

See #23033.

Transient remote cache errors might last longer than the duration of an invocation; rewinding the build makes Bazel bypass the remote cache and fall back to local execution.

fishy (Author) commented Apr 22, 2025

> I am curious what changed in aws-sdk-go-v2 that fixed the issue. Does it handle large blobs differently?

Maybe. The changes between aws-sdk-go and aws-sdk-go-v2 are quite large, so it's hard to say what changed under the hood without digging into the code.

fmeum (Collaborator) commented Apr 22, 2025

> See #23033.
>
> Transient remote cache errors might last longer than the duration of an invocation; rewinding the build makes Bazel bypass the remote cache and fall back to local execution.

I see, that makes sense. I did not remove the build rewinding introduced in #23033, only the retry on server errors during uploads, which I think could and should always be retried "locally" instead of rewinding the build.

fishy (Author) commented Apr 22, 2025

> I am curious what changed in aws-sdk-go-v2 that fixed the issue. Does it handle large blobs differently?

> Maybe. The changes between aws-sdk-go and aws-sdk-go-v2 are quite large, so it's hard to say what changed under the hood without digging into the code.

One possibility (purely speculating based on the API change between aws-sdk-go and aws-sdk-go-v2) is that the timeout we hit was a first-byte timeout (the error message mentioned "Received 0 bytes"). With the aws-sdk-go API, the Go proxy has to download the whole file from the S3 bucket into memory before sending the response to Bazel; with aws-sdk-go-v2 we can stream the file from S3 to Bazel, so the total download time may be roughly the same, but the first byte can arrive much sooner.
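
To make the speculation above concrete, here is a rough sketch of the two handler shapes (purely illustrative; whether the old proxy really used the v1 s3manager.Downloader is an assumption, and the bucket name and key mapping are hypothetical):

package proxysketch // hypothetical package, not the real proxy

import (
	"io"
	"net/http"
	"strings"

	awsv1 "github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	s3v1 "github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"

	awsv2 "github.com/aws/aws-sdk-go-v2/aws"
	s3v2 "github.com/aws/aws-sdk-go-v2/service/s3"
)

const bucket = "example-bazel-cache-bucket" // hypothetical

// aws-sdk-go (v1): s3manager.Downloader writes into an io.WriterAt, so the
// whole object sits in memory before the first byte can go back to Bazel.
func serveWithV1(w http.ResponseWriter, r *http.Request, sess *session.Session) {
	buf := awsv1.NewWriteAtBuffer(nil)
	downloader := s3manager.NewDownloader(sess)
	if _, err := downloader.Download(buf, &s3v1.GetObjectInput{
		Bucket: awsv1.String(bucket),
		Key:    awsv1.String(strings.TrimPrefix(r.URL.Path, "/")),
	}); err != nil {
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	w.Write(buf.Bytes()) // first byte reaches Bazel only after the full download
}

// aws-sdk-go-v2: GetObject exposes the body as an io.ReadCloser, so the proxy
// can stream it straight through; Bazel sees the first byte as soon as S3
// starts responding.
func serveWithV2(w http.ResponseWriter, r *http.Request, client *s3v2.Client) {
	out, err := client.GetObject(r.Context(), &s3v2.GetObjectInput{
		Bucket: awsv2.String(bucket),
		Key:    awsv2.String(strings.TrimPrefix(r.URL.Path, "/")),
	})
	if err != nil {
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	defer out.Body.Close()
	io.Copy(w, out.Body)
}

Either way the total transfer time is similar; the difference is only in when Bazel's HTTP client sees the first byte, which matters if the timeout that fired was a first-byte timeout.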

fishy (Author) commented Apr 22, 2025


I added some logging to the Go proxy, and this is the largest download I can find:

size=53226408 headerTime=114.195509ms took=5.244117023s

So for a ~53MB file from S3, with aws-sdk-go-v2 it took only 114ms to get the headers back from the S3 API, but about 5s to proxy the whole file back to Bazel.
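
For reference, the headerTime/took numbers above map onto the v2 handler roughly like this (a hypothetical sketch, not the actual proxy code):

package proxysketch // hypothetical

import (
	"io"
	"log"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// timedGet shows how the logged numbers correspond to the handler phases.
func timedGet(w http.ResponseWriter, r *http.Request, client *s3.Client, input *s3.GetObjectInput) {
	start := time.Now()
	out, err := client.GetObject(r.Context(), input)
	if err != nil {
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	headerTime := time.Since(start) // ~114ms in the example above: S3 response headers received
	defer out.Body.Close()
	n, _ := io.Copy(w, out.Body) // streaming the body accounts for the rest of the ~5s
	log.Printf("size=%d headerTime=%s took=%s", n, headerTime, time.Since(start))
}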

coeuvre (Member) commented Apr 23, 2025

@fishy Thanks for the investigation!

I created #25929 in the spirit of #23033.

iancha1992 (Member) commented:

@bazel-io fork 8.3.0

iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Apr 24, 2025
... so that bazel can correctly rewind the build.

Fixes bazelbuild#25860.

Closes bazelbuild#25929.

PiperOrigin-RevId: 750917248
Change-Id: I8a278a36bb6565e1d7204eb1a0d8a800abd845bd
github-merge-queue bot pushed a commit that referenced this issue Apr 25, 2025
…25942)

... so that bazel can correctly rewind the build.

Fixes #25860.

Closes #25929.

PiperOrigin-RevId: 750917248
Change-Id: I8a278a36bb6565e1d7204eb1a0d8a800abd845bd

Commit
ba6f6f7

Co-authored-by: Chi Wang <[email protected]>
fmeum pushed a commit to fmeum/bazel that referenced this issue Apr 25, 2025
... so that bazel can correctly rewind the build.

Fixes bazelbuild#25860.

Closes bazelbuild#25929.

PiperOrigin-RevId: 750917248
Change-Id: I8a278a36bb6565e1d7204eb1a0d8a800abd845bd