
[QCOM] [Llama] the size of w4a16 quantized Llama 3.2 1B Pte is too large #10226


Open
tiger-of-shawn opened this issue Apr 16, 2025 · 9 comments
Assignees
Labels
module: llm - Issues related to LLM examples and apps, and to the extensions/llm/ code
module: qnn - Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/
partner: qualcomm - For backend delegation, kernels, demo, etc. from the 3rd-party partner, Qualcomm

Comments

@tiger-of-shawn

tiger-of-shawn commented Apr 16, 2025

Using the latest ExecuTorch codebase, I export the PTE file:

Image

the file size is:

-rw-rw-r-- 1 2.9G Apr 16 06:53 test.pte

while the float model size is:

-rw-rw-r-- 1 2.4G Oct 23 03:12 assets/models/Llama-3.2-1B/original/consolidated.00.pth

The conversion script is:

# Export Llama model
function export_llama {
    model_path="$1"
    # 16-bit activation / 4-bit weight quantization (qnn_16a4w)
    python -m examples.models.llama.export_llama \
    -t "$model_path/original/tokenizer.model" \
    --checkpoint "$model_path/original/consolidated.00.pth" \
    -p "$model_path/original/params.json" \
    --disable_dynamic_shape \
    --qnn \
    --pt2e_quantize qnn_16a4w \
    --model llama3_2 \
    -d fp32 \
    --use_kv_cache \
    --num_sharding 1 \
    --soc_model SM8650 \
    --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
    -v \
    --output_name="test.pte"
}
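
For context, export_llama takes the model directory as its only argument; a typical invocation and size check (the path here just mirrors the checkpoint path above and is only an example) would look like:

    export_llama assets/models/Llama-3.2-1B
    ls -lh test.pte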

Why is the PTE file larger than the float model?
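
As a rough sanity check (assuming ~1.24B parameters for Llama 3.2 1B; the exact count is determined by params.json), the weight storage alone should be on the order of:

    # Back-of-the-envelope weight sizes; the parameter count is an assumption
    awk 'BEGIN {
        params = 1.24e9
        printf "fp32 weights : %.2f GB\n", params * 4   / 1e9
        printf "bf16/fp16    : %.2f GB\n", params * 2   / 1e9   # matches the 2.4G checkpoint
        printf "8-bit weights: %.2f GB\n", params * 1   / 1e9   # plus scales/offsets
        printf "4-bit weights: %.2f GB\n", params * 0.5 / 1e9   # plus scales/offsets
    }'

So a 4-bit-weight export should come in well under 1 GB even after quantization parameters and KV-cache buffers are added, which makes the 2.9G test.pte look anomalous.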


When I use the v0.4 ExecuTorch codebase to export the PTE with the same configuration, the size is normal:

Image

-rw-rw-r-- 1 1.1G Apr 16 06:57 output.pte

cc @cccclai @winskuo-quic @shewu-quic @cbilgin @larryliu0820 @mergennachin @helunwencser @jackzhxng

@GregoryComer
Member

@cccclai Do you have ideas on troubleshooting this? Thanks.

@tiger-of-shawn
Author

Here is more information to help the Qualcomm partner; the model's quantization has already completed:

Image

@tiger-of-shawn
Author

tiger-of-shawn commented Apr 16, 2025

More information: when I change the quantization config from w4a16 (qnn_16a4w: 16-bit activations, 4-bit weights) to w8a8 (qnn_8a8w: 8-bit activations, 8-bit weights), I get a smaller PTE file, which is surprising since 8-bit weights should take more space than 4-bit weights:

function export_llama {
    model_path="$1"
    # 8-bit activation / 8-bit weight quantization (qnn_8a8w)
    python -m examples.models.llama.export_llama \
    -t "$model_path/original/tokenizer.model" \
    --checkpoint "$model_path/original/consolidated.00.pth" \
    -p "$model_path/original/params.json" \
    --disable_dynamic_shape \
    --qnn \
    --pt2e_quantize qnn_8a8w \
    --model llama3_2 \
    -d fp32 \
    --use_kv_cache  \
    --num_sharding 1 \
    --soc_model SM8650 \
    --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
    -v \
    --output_name="test.pte"
}

The resulting PTE file size:

Image

@tiger-of-shawn
Author

The on-device results are wrong for both the w4a16 and the w8a8 PTE files:

Image

Image

@cccclai
Contributor

cccclai commented Apr 16, 2025

Regarding the result: yeah, currently the PTQ quantization algorithm isn't good enough and we're working on it. In the meantime, https://github.com/pytorch/executorch/tree/main/examples/qualcomm/oss_scripts/llama is the actively developed version for running Llama on HTP. Can you check the model size with this version?

cccclai added a commit to cccclai/executorch-1 that referenced this issue Apr 16, 2025
Summary:
Pull Request resolved: pytorch#10231

Many users are trying to export Llama with this flow https://github.com/pytorch/executorch/tree/main/examples/models/llama and end up with non-performant models or other issues, like pytorch#10226. Instruct users to use the Qualcomm version.

Reviewed By: kirklandsign

Differential Revision: D73125467
@tiger-of-shawn
Author

Regarding the result: yeah, currently the PTQ quantization algorithm isn't good enough and we're working on it. In the meantime, https://github.com/pytorch/executorch/tree/main/examples/qualcomm/oss_scripts/llama is the actively developed version for running Llama on HTP. Can you check the model size with this version?

OK, I will try that version now, but why do you have to maintain two different codebases to support Llama on Qualcomm?

  1. https://github.com/pytorch/executorch/tree/main/examples/qualcomm/oss_scripts/llama
  2. https://github.com/pytorch/executorch/tree/main/examples/models/llama

@tiger-of-shawn
Author

tiger-of-shawn commented Apr 17, 2025

I have tried the oss_scripts/llama version to run Llama on Qualcomm, and the w4a16 PTE file size is correct:

Image

Image

However, my question still stands: why is it necessary to maintain two separate codebases to support Llama on Qualcomm?

@cccclai
Contributor

cccclai commented Apr 17, 2025

We started with the one in examples/models/llama and wanted a unified version that exports to different backends. However, for performance reasons we had to rewrite the model, and it became hard to maintain a unified model definition, so we started a fresh one and are iterating on it.

@tiger-of-shawn
Author

We started with the one in examples/models/llama and wanted a unified version that exports to different backends. However, for performance reasons we had to rewrite the model, and it became hard to maintain a unified model definition, so we started a fresh one and are iterating on it.

OK. Do you have plans to fix the PTE size bug in the examples/models/llama directory, or will that code be removed soon?
