
[QCOM] [Llama] the size of w4a16 quantized Llama 3.2 1B Pte is too large #10226


Open
tiger-of-shawn opened this issue Apr 16, 2025 · 9 comments
Assignees
Labels
module: llm - Issues related to LLM examples and apps, and to the extensions/llm/ code
module: qnn - Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/
partner: qualcomm - For backend delegation, kernels, demo, etc. from the 3rd-party partner, Qualcomm

Comments

@tiger-of-shawn

tiger-of-shawn commented Apr 16, 2025

Using the latest ExecuTorch codebase, I export the PTE file:

Image

the file size is:

-rw-rw-r-- 1 2.9G Apr 16 06:53 test.pte

while the float model size is:

-rw-rw-r-- 1 2.4G Oct 23 03:12 assets/models/Llama-3.2-1B/original/consolidated.00.pth

The conversion script is:

# Export Llama model
function export_llama {
    model_path="$1"
    # 16-bit activation / 4-bit weight quantization (qnn_16a4w)
    python -m examples.models.llama.export_llama \
    -t "$model_path/original/tokenizer.model" \
    --checkpoint "$model_path/original/consolidated.00.pth" \
    -p "$model_path/original/params.json" \
    --disable_dynamic_shape \
    --qnn \
    --pt2e_quantize qnn_16a4w \
    --model llama3_2 \
    -d fp32 \
    --use_kv_cache \
    --num_sharding 1 \
    --soc_model SM8650 \
    --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
    -v \
    --output_name="test.pte"
}
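
For context, export_llama takes the model directory as its only argument; a typical invocation and size check (the path here just mirrors the checkpoint path above and is only an example) would look like:

    export_llama assets/models/Llama-3.2-1B
    ls -lh test.pte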

Why is the PTE file larger than the float model?
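
As a rough sanity check (assuming ~1.24B parameters for Llama 3.2 1B; the exact count is determined by params.json), the weight storage alone should be on the order of:

    # Back-of-the-envelope weight sizes; the parameter count is an assumption
    awk 'BEGIN {
        params = 1.24e9
        printf "fp32 weights : %.2f GB\n", params * 4   / 1e9
        printf "bf16/fp16    : %.2f GB\n", params * 2   / 1e9   # matches the 2.4G checkpoint
        printf "8-bit weights: %.2f GB\n", params * 1   / 1e9   # plus scales/offsets
        printf "4-bit weights: %.2f GB\n", params * 0.5 / 1e9   # plus scales/offsets
    }'

So a 4-bit-weight export should come in well under 1 GB even after quantization parameters and KV-cache buffers are added, which makes the 2.9G test.pte look anomalous.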


When I use the v0.4 ExecuTorch codebase to export the PTE with the same configuration, the size is normal:

Image

-rw-rw-r-- 1 1.1G Apr 16 06:57 output.pte

cc @cccclai @winskuo-quic @shewu-quic @cbilgin @larryliu0820 @mergennachin @helunwencser @jackzhxng

@GregoryComer
Member

@cccclai Do you have ideas on troubleshooting this? Thanks.

@tiger-of-shawn
Author

Here is more information to help the Qualcomm partner; the model's quantization has already completed:

Image

@tiger-of-shawn
Author

tiger-of-shawn commented Apr 16, 2025

More information: when I change the quantization config from w4a16 (qnn_16a4w: 16-bit activations, 4-bit weights) to w8a8 (qnn_8a8w: 8-bit activations, 8-bit weights), I get a smaller PTE file, which is surprising since 8-bit weights should take more space than 4-bit weights:

function export_llama {
    model_path="$1"
    # 8-bit activation / 8-bit weight quantization (qnn_8a8w)
    python -m examples.models.llama.export_llama \
    -t "$model_path/original/tokenizer.model" \
    --checkpoint "$model_path/original/consolidated.00.pth" \
    -p "$model_path/original/params.json" \
    --disable_dynamic_shape \
    --qnn \
    --pt2e_quantize qnn_8a8w \
    --model llama3_2 \
    -d fp32 \
    --use_kv_cache  \
    --num_sharding 1 \
    --soc_model SM8650 \
    --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
    -v \
    --output_name="test.pte"
}

The resulting PTE file size:

Image

@tiger-of-shawn
Author

The on-device results are wrong for both the w4a16 and the w8a8 PTE files:

Image

Image

@cccclai
Contributor

cccclai commented Apr 16, 2025

Regarding the result: yeah, currently the PTQ quantization algorithm isn't good enough and we're working on it. In the meantime, https://github.com/pytorch/executorch/tree/main/examples/qualcomm/oss_scripts/llama is the actively developed version for running Llama on HTP. Can you check the model size with this version?

cccclai added a commit to cccclai/executorch-1 that referenced this issue Apr 16, 2025
Summary:
Pull Request resolved: pytorch#10231

Many users are trying to export Llama with this flow https://github.com/pytorch/executorch/tree/main/examples/models/llama and end up with non-performant models or other issues, like pytorch#10226. Instruct users to use the Qualcomm version.

Reviewed By: kirklandsign

Differential Revision: D73125467
@tiger-of-shawn
Author

Regarding the result: yeah, currently the PTQ quantization algorithm isn't good enough and we're working on it. In the meantime, https://github.com/pytorch/executorch/tree/main/examples/qualcomm/oss_scripts/llama is the actively developed version for running Llama on HTP. Can you check the model size with this version?

OK, I will try that version now, but why do you have to maintain two different codebases to support Llama on Qualcomm?

  1. https://github.com/pytorch/executorch/tree/main/examples/qualcomm/oss_scripts/llama
  2. https://github.com/pytorch/executorch/tree/main/examples/models/llama

@tiger-of-shawn
Author

tiger-of-shawn commented Apr 17, 2025

I have tried the oss_scripts/llama version to run Llama on Qualcomm, and the w4a16 PTE file size is correct:

Image

Image

However, my question still stands: why is it necessary to maintain two separate codebases to support Llama on Qualcomm?

@cccclai
Contributor

cccclai commented Apr 17, 2025

We started with the one in examples/models/llama and wanted a unified version that exports to different backends. However, for performance reasons we had to rewrite the model, and it became hard to maintain a unified model definition, so we started a fresh one and are iterating on it.

@tiger-of-shawn
Author

We started with the one in examples/models/llama and wanted a unified version that exports to different backends. However, for performance reasons we had to rewrite the model, and it became hard to maintain a unified model definition, so we started a fresh one and are iterating on it.

OK. Do you have plans to fix the PTE size bug in the examples/models/llama directory, or will that code be removed soon?
