# [SYCL][OPT] Fix reorder optimization for Q4_0 #13003

## Conversation
I think one more TODO is to remove setting `tensor->extra` in `ggml/src/ggml-sycl/ggml-sycl.cpp` (lines 340 to 344 at 2f74c35).
Thanks for the PR, we'll have a look! Please make sure to keep this PR in review until we have time to review it.
I agree, however the suggested solution of following the logic from ggml-cpu-aarch64 also sets the `extra` field (see `ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp`, lines 6323 to 6328 at b9154ec). It's not clear to me how this could be avoided at this stage.
Yes, we'll wait for your review. I have added slaren's suggestion: consider reordering the tensor when loading from GGUF. It depends on all Q4_0 cases being supported (first item).
Yes. I think this solution depends on all Q4_0 cases supporting the reorder.
Please test the PR with your Q4_0 LLMs.
```diff
@@ -2925,13 +2983,15 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
         // KQ + KQV multi-batch
         ggml_sycl_mul_mat_batched_sycl(ctx, src0, src1, dst);
     } else if (use_dequantize_mul_mat_vec) {
+        opt_for_reorder(&ctx, src0, src1, dst); // the OP function in this branch supports reorder
```
This means that `src0` will always be reordered before running the `mul_mat`, isn't that going to badly affect performance?

Testing on a B580, I get on main with `GGML_SYCL_DISABLE_OPT=0` set:
model | size | params | backend | ngl | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 6289.87 ± 12.41 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 104.89 ± 2.27 |
on main with `GGML_SYCL_DISABLE_OPT=1` set (the default):
model | size | params | backend | ngl | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 7841.72 ± 18.29 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 102.66 ± 1.36 |
and with this patch (`GGML_SYCL_DISABLE_OPT=0` set by default):
model | size | params | backend | ngl | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 6309.32 ± 30.75 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 105.90 ± 0.19 |
- There seems to be very little benefit to enabling the reorder optimization by default for text generation, and it degrades performance in the prompt processing phase.
- The fact that `reorder_qw` requires a temporary buffer and would now run during the execution of the model increases memory usage, which can be an issue.

I think the optimization still needs to be disabled by default (i.e. `GGML_SYCL_DISABLE_OPT=1`) until we can solve these issues. To me the right direction would be to move to something closer to what the ggml-cpu backend does. I'm not sure that it would require all Q4_0 cases to support the reorder optimization as you say. Supporting all types of mul_mat for all quantization layouts with the reorder format is a lot of work, so I think we should look for a solution that only enables the reorder optimization for some useful cases (what @ShanoToni is working on).
I think the previous traces are bottlenecked by the host. Rerunning on a Lunar Lake iGPU and with a larger model as well.
main with `GGML_SYCL_DISABLE_OPT=0`:
model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 890.09 ± 1.82 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 41.25 ± 0.19 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 207.97 ± 0.40 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 14.37 ± 0.05 |
main with `GGML_SYCL_DISABLE_OPT=1`:
model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 817.07 ± 0.64 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 38.98 ± 0.86 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 188.65 ± 0.67 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 12.58 ± 0.04 |
PR:
model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | pp512 | 890.10 ± 1.97 |
qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 8 | none | 0 | tg128 | 41.06 ± 0.17 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | pp512 | 207.98 ± 0.52 |
llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 8 | none | 0 | tg128 | 13.87 ± 0.06 |
The reorder does seem beneficial here. I would still suggest not enabling `GGML_SYCL_DISABLE_OPT=0` by default until we can solve the 2 issues above.
> This means that `src0` will always be reordered before running the `mul_mat`, isn't that going to badly affect performance?

It's skipped in line 2917, so it will only be run the first time. The behavior remains from the original PR, only now the reorder doesn't depend on the SYCL context.

The CPU backend uses extras for simplicity, but if there is no extra data that needs to be stored per-tensor, you can rely on the buffer type alone to determine whether the tensor data is reordered.
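A minimal sketch of that buffer-type approach, assuming the SYCL backend exposed a dedicated buffer type for the reordered layout (`ggml_backend_sycl_reordered_buffer_type` below is hypothetical, not an existing API):

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical: a dedicated buffer type for tensors stored in the reordered
// Q4_0 layout, exposed by the SYCL backend.
ggml_backend_buffer_type_t ggml_backend_sycl_reordered_buffer_type(void);

// No per-tensor extra state: the layout is implied by where the data lives.
static bool tensor_is_reordered(const struct ggml_tensor * t) {
    return t->buffer != NULL &&
           ggml_backend_buffer_get_type(t->buffer) == ggml_backend_sycl_reordered_buffer_type();
}
```

A check like this would also give the "only run the first time" behavior for free: the reorder only runs while the tensor still lives in a plain buffer.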
Apart from the performance concerns @Rbiessy mentioned already, this resolves my concerns about running reorders for all relevant cases. (Link to the discussion for convenience: #5277 (reply in thread).)
@Rbiessy
A: Yes.
A: The temporary buffer is released after the reorder finishes. Its size is the same as the current Q4 tensor.
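As an illustration of that lifetime, here is a simplified host-side sketch, under the assumption that the reorder stages through a scratch copy of the same size (`reorder_q4_0_sketch` is hypothetical; the real `reorder_qw` operates on device memory):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

#include "ggml.h"

// Hypothetical, simplified host-side version: the staging copy lives only for
// the duration of the reorder, so peak extra memory equals the Q4_0 tensor size.
static void reorder_q4_0_sketch(struct ggml_tensor * t) {
    const size_t nbytes = ggml_nbytes(t);
    std::vector<std::uint8_t> staging(nbytes);     // temporary buffer, same size as the tensor
    std::memcpy(staging.data(), t->data, nbytes);  // simplification: real code copies device memory
    // ... rewrite t->data from the staging copy in the reordered block layout ...
    // the staging buffer is freed here, at the end of the scope
}
```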
A: TG is more important than PP in customer cases. For bigger LLMs (like llama2-7B) on a dGPU (Arc, BMG/PVC), TG is increased by 20%-70% with this feature. A feature that increases both at the same time is in fact very hard to achieve.
The first reorder PR led to wrong results for some Q4_0 LLMs. This PR fixes the issue and won't impact the results of any Q4_0 LLM. Normal users won't notice this feature if they don't read the guide.
> Normal users won't notice this feature if they don't read the guide. Users like OOB features. I suggest enabling this feature by default.

I agree with this. The only reason we disabled the reorder by default was that it broke some user models. Even if we lose quite a bit of performance in prompt processing, from the user perspective it feels better when using llama-cli.
However, I don't think this implementation should be final. We are most likely regressing other metrics that we are not really measuring, like the time to first token, which also affects the user experience and relates to @Rbiessy's concerns.
@NeoZhangJianyu can you explain further what the issue was with the reordered tensor and the reorder OP not matching? A Q4_0 tensor being used in a different operator?
> This means that `src0` will always be reordered before running the `mul_mat`, isn't that going to badly affect performance?
>
> It's skipped in line 2917, so it will only be run the first time. The behavior remains from the original PR, only now the reorder doesn't depend on the SYCL context.
IMO both PP and TG are required for inference. Bad PP performance will result in a bad experience, especially with long-context LLMs. My suggestion is to disable the reorder opt by default until we find a solution to fix PP. I agree with @Rbiessy here.
The numbers I have observed are not "bad" PP, just slightly worse than what we had. Less powerful systems are going to notice it more, but only for the first prompt that is processed and in benchmarks. It would, however, have an impact between starting the application and actually starting to run things, so if @Rbiessy and @qnixsynapse disagree and notice the performance impact, I won't push for the merge.
Are we sure the performance drop in PP is due to the first call to the reorder? I agree the extra memory usage should be fine since this is just for the first run. I'd suggest that in this PR we either don't enable Q4_0 by default, or we disable the reorder optimization for the mul_mat case.
In the previous solution, the Q4_0 tensors were reordered by going through all the nodes in a model; then mul_mat_reorder() (for example) was executed conditionally inside the mul_mat() function. Because mul_mat_reorder() can't support all src0 and src1 combinations, we can't reorder all Q4_0 tensors. The condition for reordering a tensor must be the same as the condition for executing mul_mat_reorder() in mul_mat(). In this PR, I removed the tensor reordering from the other function. Currently, mul_mat_reorder() is implemented in two legacy functions.
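To make that constraint concrete, here is a minimal sketch of sharing one predicate between the reorder pass and the mul_mat dispatch (the specific conditions below are assumptions for the sketch, not the actual checks in ggml-sycl):

```cpp
#include "ggml.h"

// Illustrative predicate shared by the reorder pass and the mul_mat dispatch,
// so a tensor gets reordered if and only if the reordered kernel will run on it.
static bool can_use_reordered_mul_mat(const struct ggml_tensor * src0,
                                      const struct ggml_tensor * src1) {
    return src0->type == GGML_TYPE_Q4_0   // only Q4_0 has a reordered kernel
        && src1->type == GGML_TYPE_F32    // assume the reordered path takes f32 activations
        && ggml_is_contiguous(src0);      // views/strided tensors keep the plain layout
}
```

If the reorder pass and the kernel dispatch evaluate different conditions, a reordered tensor can reach a kernel that expects the plain Q4_0 layout, which is exactly the wrong-result bug this PR fixes.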
dequantize_mul_mat_vec() is the performance bottleneck in common LLMs, like llama2. This PR fixes the wrong mul_mat() results for Q4_0. Otherwise, users will switch to another backend, since other backends enable all their optimizations by default.
Idea: change the rule for reordering Q4_0 tensors: move it from the initial graph_compute() to the execution of the OPs.
- Fix the issue that the reordered tensor and the reorder OP don't match, which led to wrong results in some LLMs.
- Tested with pythia-1.4b-Q4_0.gguf.
- Set the reorder optimization feature as the default, since the known issues are fixed.
- Remove an unused global variable.
- Fix the bug of missing the tensor reorder in a second graph_compute() call on the same context.
- It impacts the UT results: some UT cases can't test the reorder feature.

Todo: