# [SYCL][OPT] Fix reorder optimization for Q4_0 #13003

## Conversation
I think one more TODO is to remove setting `tensor->extra` in `ggml/src/ggml-sycl/ggml-sycl.cpp` (lines 340 to 344 at 2f74c35).
Thanks for the PR, we'll have a look! Please make sure to keep this PR in review until we have time to review it.
I agree, however the suggested solution of following the logic from ggml-cpu-aarch64 also sets the `extra` field (see `ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp`, lines 6323 to 6328 at b9154ec). It's not clear to me how this could be avoided at this stage.
Yes, we'll wait for your review. I have added slaren's suggestion: consider reordering the tensor when loading from GGUF. It depends on all Q4_0 cases being supported (first item).
Yes. I think this solution depends on all Q4_0 cases supporting the reorder.
Please test the PR with your Q4_0 LLMs.
```diff
@@ -2925,13 +2983,15 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
         // KQ + KQV multi-batch
         ggml_sycl_mul_mat_batched_sycl(ctx, src0, src1, dst);
     } else if (use_dequantize_mul_mat_vec) {
+        opt_for_reorder(&ctx, src0, src1, dst); // the OP function in this branch supports reorder
```
This means that `src0` will always be reordered before running the `mul_mat`, isn't that going to badly affect performance?

Testing on a B580, I get on main with `GGML_SYCL_DISABLE_OPT=0` set:
model | size | params | backend | ngl | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 6289.87 ± 12.41 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 104.89 ± 2.27 |
on main with `GGML_SYCL_DISABLE_OPT=1` set (the default):
model | size | params | backend | ngl | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 7841.72 ± 18.29 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 102.66 ± 1.36 |
and with this patch (`GGML_SYCL_DISABLE_OPT=0` set by default):
model | size | params | backend | ngl | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 6309.32 ± 30.75 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 105.90 ± 0.19 |
- There seems to be very little benefit to enabling the reorder optimization by default for text generation, and it degrades performance in the prompt processing phase.
- The fact that `reorder_qw` requires a temporary buffer and would now run during the execution of the model increases memory usage, which can be an issue.

I think the optimization still needs to be disabled by default (i.e. `GGML_SYCL_DISABLE_OPT=1`) until we can solve these issues. To me the right direction would be to move to something closer to what the ggml-cpu backend does. I'm not sure that it would require all Q4_0 cases to support the reorder optimization as you say. Supporting all types of mul_mat for all quantization layouts with the reorder format is a lot of work, so I think we should look for a solution that only enables the reorder optimization for some useful cases (what @ShanoToni is working on).
I think the previous traces are bottlenecked by the host. Rerunning on a Lunar Lake iGPU and with a larger model as well.
main with `GGML_SYCL_DISABLE_OPT=0`:
model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 890.09 ± 1.82 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 41.25 ± 0.19 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 207.97 ± 0.40 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 14.37 ± 0.05 |
main with `GGML_SYCL_DISABLE_OPT=1`:
model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 817.07 ± 0.64 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 38.98 ± 0.86 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 188.65 ± 0.67 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 12.58 ± 0.04 |
PR:
model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 890.10 ± 1.97 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 41.06 ± 0.17 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 207.98 ± 0.52 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 13.87 ± 0.06 |
The reorder does seem beneficial here. I would still suggest not enabling `GGML_SYCL_DISABLE_OPT=0` by default until we can solve the 2 issues above.
> This means that `src0` will always be reordered before running the `mul_mat`, isn't that going to badly affect performance?

It's skipped in line 2917, so it will only be run the first time. The behavior remains from the original PR, only now the reorder doesn't depend on the SYCL context.

The CPU backend uses extras for simplicity, but if there is no extra data that needs to be stored per-tensor, you can rely on the buffer type alone to determine whether the tensor data is reordered.
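A minimal sketch of that buffer-type approach, assuming the SYCL backend exposed a dedicated buffer type for the reordered layout (`ggml_backend_sycl_reordered_buffer_type` below is hypothetical, not an existing API):

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical: a dedicated buffer type for tensors stored in the reordered
// Q4_0 layout, exposed by the SYCL backend.
ggml_backend_buffer_type_t ggml_backend_sycl_reordered_buffer_type(void);

// No per-tensor extra state: the layout is implied by where the data lives.
static bool tensor_is_reordered(const struct ggml_tensor * t) {
    return t->buffer != NULL &&
           ggml_backend_buffer_get_type(t->buffer) == ggml_backend_sycl_reordered_buffer_type();
}
```

A check like this would also give the "only run the first time" behavior for free: the reorder only runs while the tensor still lives in a plain buffer.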
Apart from the performance concerns @Rbiessy mentioned already, this resolves my concerns about running reorders for all relevant cases. (Link to the discussion for convenience: #5277 (reply in thread).)
@Rbiessy
A: Yes.
A: The temporary buffer is released after the reorder finishes. Its size is the same as the current Q4 tensor.
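As an illustration of that lifetime, here is a simplified host-side sketch, under the assumption that the reorder stages through a scratch copy of the same size (`reorder_q4_0_sketch` is hypothetical; the real `reorder_qw` operates on device memory):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

#include "ggml.h"

// Hypothetical, simplified host-side version: the staging copy lives only for
// the duration of the reorder, so peak extra memory equals the Q4_0 tensor size.
static void reorder_q4_0_sketch(struct ggml_tensor * t) {
    const size_t nbytes = ggml_nbytes(t);
    std::vector<std::uint8_t> staging(nbytes);     // temporary buffer, same size as the tensor
    std::memcpy(staging.data(), t->data, nbytes);  // simplification: real code copies device memory
    // ... rewrite t->data from the staging copy in the reordered block layout ...
    // the staging buffer is freed here, at the end of the scope
}
```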
A: TG is more important than PP in customer cases. For bigger LLMs (like llama2-7B) on a dGPU (Arc, BMG/PVC), TG is increased by 20%-70% with this feature. A feature that increases both at the same time is in fact very hard to achieve.
The first reorder PR led to wrong results for some Q4_0 LLMs. This PR fixes the issue and won't impact the results of any Q4_0 LLM. Normal users won't notice this feature if they don't read the guide.
> Normal users won't notice this feature if they don't read the guide. Users like OOB features. I suggest enabling this feature by default.

I agree with this. The only reason we disabled the reorder by default was that it broke some user models. Even if we lose quite a bit of performance in prompt processing, from the user perspective it feels better when using llama-cli.
However, I don't think this implementation should be final. We are most likely regressing other metrics that we are not really measuring, like the time to first token, which also affects the user experience and relates to @Rbiessy's concerns.
@NeoZhangJianyu can you explain further what the issue was with the reordered tensor and the reorder OP not matching? A Q4_0 tensor being used in a different operator?
> This means that `src0` will always be reordered before running the `mul_mat`, isn't that going to badly affect performance?
>
> It's skipped in line 2917, so it will only be run the first time. The behavior remains from the original PR, only now the reorder doesn't depend on the SYCL context.
IMO both PP and TG are required for inference. Bad PP performance will result in a bad experience, especially with long-context LLMs. My suggestion is to disable the reorder opt by default until we find a solution to fix PP. I agree with @Rbiessy here.
The numbers I have observed are not "bad" PP, just slightly worse than what we had. Less powerful systems are going to notice it more, but only for the first prompt that is processed and in benchmarks. It would, however, have an impact between starting the application and actually starting to run things, so if @Rbiessy and @qnixsynapse disagree and notice the performance impact, I won't push for the merge.
Are we sure the performance drop in PP is due to the first call to the reorder? I agree the extra memory usage should be fine since this is just for the first run. I'd suggest that in this PR we either don't enable Q4_0 by default, or we disable the reorder optimization for the mul_mat case.
In the previous solution, the Q4_0 tensors were reordered by going through all the nodes in a model; then mul_mat_reorder() (for example) was executed conditionally inside the mul_mat() function. Because mul_mat_reorder() can't support all src0 and src1 combinations, we can't reorder all Q4_0 tensors. The condition for reordering a tensor must be the same as the condition for executing mul_mat_reorder() in mul_mat(). In this PR, I removed the tensor reordering from the other function. Currently, mul_mat_reorder() is implemented in two legacy functions.
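To make that constraint concrete, here is a minimal sketch of sharing one predicate between the reorder pass and the mul_mat dispatch (the specific conditions below are assumptions for the sketch, not the actual checks in ggml-sycl):

```cpp
#include "ggml.h"

// Illustrative predicate shared by the reorder pass and the mul_mat dispatch,
// so a tensor gets reordered if and only if the reordered kernel will run on it.
static bool can_use_reordered_mul_mat(const struct ggml_tensor * src0,
                                      const struct ggml_tensor * src1) {
    return src0->type == GGML_TYPE_Q4_0   // only Q4_0 has a reordered kernel
        && src1->type == GGML_TYPE_F32    // assume the reordered path takes f32 activations
        && ggml_is_contiguous(src0);      // views/strided tensors keep the plain layout
}
```

If the reorder pass and the kernel dispatch evaluate different conditions, a reordered tensor can reach a kernel that expects the plain Q4_0 layout, which is exactly the wrong-result bug this PR fixes.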
dequantize_mul_mat_vec() is the performance bottleneck in common LLMs, like llama2. This PR fixes the wrong mul_mat() results for Q4_0. Otherwise, users will switch to another backend, since other backends enable all their optimizations by default.
Idea: change the rule for reordering Q4_0 tensors: move it from the initial graph_compute() to the execution of the OPs.
- Fix the issue that the reordered tensor and the reorder OP don't match, which led to wrong results in some LLMs.
- Tested with pythia-1.4b-Q4_0.gguf.
- Set the reorder optimization feature as the default, since the known issues are fixed.
- Remove an unused global variable.
- Fix the bug of missing the tensor reorder in a second graph_compute() call on the same context.
- It impacts the UT results: some UT cases can't test the reorder feature.

Todo: