[QCOM] [Llama] The size of the w4a16-quantized Llama 3.2 1B PTE is too large #10226
@cccclai Do you have ideas on troubleshooting this? Thanks.
More information: when I change the quantize config from w4a16 to w8a8, I get a smaller PTE file, which is hard to believe:

function export_llama {
    model_path="$1"
    # 8-bit activation / 8-bit weight quantization (w8a8)
    python -m examples.models.llama.export_llama \
        -t "$model_path/original/tokenizer.model" \
        --checkpoint "$model_path/original/consolidated.00.pth" \
        -p "$model_path/original/params.json" \
        --disable_dynamic_shape \
        --qnn \
        --pt2e_quantize qnn_8a8w \
        --model llama3_2 \
        -d fp32 \
        --use_kv_cache \
        --num_sharding 1 \
        --soc_model SM8650 \
        --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
        -v \
        --output_name="test.pte"
}

The PTE file size:
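For context, a back-of-envelope estimate of the weight payload alone goes the other way, which is why this result looks wrong. A minimal sketch, assuming roughly 1.2B parameters for Llama 3.2 1B (the parameter count is an assumption, not a figure from this issue):

# Rough weight-only payload estimates; ignores quantization scales/zero-points
# and serialization overhead.
PARAMS=1200000000                                               # assumed ~1.2B parameters
echo "8-bit weights: $(( PARAMS * 8 / 8 / 1024 / 1024 )) MiB"   # ~1.1 GiB
echo "4-bit weights: $(( PARAMS * 4 / 8 / 1024 / 1024 )) MiB"   # ~0.6 GiB

All else being equal, a w4a16 export should carry roughly half the weight bytes of a w8a8 export, so the observed ordering suggests the 4-bit path may not actually be storing weights at 4 bits in the serialized program.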
Regarding the result: yes, the current PTQ quantization algorithm isn't good enough yet, and we're working on it. In the meantime, https://github.com/pytorch/executorch/tree/main/examples/qualcomm/oss_scripts/llama is the actively developed flow for running Llama on HTP. Can you check the model size with that version?
Summary: Pull Request resolved: pytorch#10231. Many users are trying to export Llama with this flow https://github.com/pytorch/executorch/tree/main/examples/models/llama and end up with non-performant models or other issues, like pytorch#10226. Instruct users to use the QCOM version. Reviewed By: kirklandsign. Differential Revision: D73125467
Okay, I will try that version now, but why do you maintain two different codebases to support Llama on Qualcomm? https://github.com/pytorch/executorch/tree/main/examples/qualcomm/oss_scripts/llama
We started with the one in
OK. Do you have plans to fix the PTE size issue in the examples/models/llama directory, or will that code be removed soon?
Using the latest ExecuTorch codebase, I export the PTE file; its size is:
-rw-rw-r-- 1 2.9G Apr 16 06:53 test.pte
while the float model size is:
-rw-rw-r-- 1 2.4G Oct 23 03:12 assets/models/Llama-3.2-1B/original/consolidated.00.pth
The conversion script is:
# Export Llama model
function export_llama {
    model_path="$1"
    # 16-bit activation / 4-bit weight quantization (w4a16)
    python -m examples.models.llama.export_llama \
        -t "$model_path/original/tokenizer.model" \
        --checkpoint "$model_path/original/consolidated.00.pth" \
        -p "$model_path/original/params.json" \
        --disable_dynamic_shape \
        --qnn \
        --pt2e_quantize qnn_16a4w \
        --model llama3_2 \
        -d fp32 \
        --use_kv_cache \
        --num_sharding 1 \
        --soc_model SM8650 \
        --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
        -v \
        --output_name="test.pte"
}
Why is the PTE file larger than the float model?
When I use the v0.4 ExecuTorch codebase to generate the PTE with the same configuration, the PTE size is normal:
-rw-rw-r-- 1 1.1G Apr 16 06:57 output.pte
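For a rough sanity check on these numbers, here is a sketch of expected raw weight sizes at different widths, again assuming roughly 1.2B parameters (an assumption, not a figure from this issue):

# Approximate raw weight sizes; ignores quantization metadata and serialization overhead.
PARAMS=1200000000                                       # assumed ~1.2B parameters
echo "fp32: $(( PARAMS * 32 / 8 / 1024 / 1024 )) MiB"   # ~4.5 GiB
echo "fp16: $(( PARAMS * 16 / 8 / 1024 / 1024 )) MiB"   # ~2.2 GiB, close to the 2.4G checkpoint
echo "int8: $(( PARAMS *  8 / 8 / 1024 / 1024 )) MiB"   # ~1.1 GiB, close to the 1.1G v0.4 PTE
echo "int4: $(( PARAMS *  4 / 8 / 1024 / 1024 )) MiB"   # ~0.6 GiB

By this estimate a 16a4w export should land well below the 2.4 GB float checkpoint, so a 2.9 GB PTE is larger than even an fp16 copy of the weights would be, which again suggests the weights are not being serialized at 4 bits.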
cc @cccclai @winskuo-quic @shewu-quic @cbilgin @larryliu0820 @mergennachin @helunwencser @jackzhxng