Running the Dutch LLM fietje-2-chat at reasonable speed on your Android phone

October 10th, 2024

To run Fietje-2-Chat locally on your Android phone at a reasonable speed, you’ll need to create a special quantized version of fietje-2b-chat.

One of the best apps for running LLMs on your Android phone is ChatterUI (https://github.com/Vali-98/ChatterUI).

You can download the APK from GitHub, transfer it to your phone, and install it. It’s not yet available on F-Droid.

As most Android phones have an ARM CPU, use special `quants` that run faster on ARM because they use NEON extensions, int8mm and SVE instructions.

Note that these optimized kernels require the model to be quantized into one of the formats: Q4_0_4_4 (Arm Neon), Q4_0_4_8 (int8mm) or Q4_0_8_8 (SVE). The SVE mulmat kernel specifically requires a vector width of 256 bits. When running on devices with a different vector width, it is recommended to use the Q4_0_4_8 (int8mm) or Q4_0_4_4 (Arm Neon) formats for better performance.

https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md#arm-cpu-optimized-mulmat-kernels
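Which of the three formats to pick depends on the features your phone’s CPU reports (on a real device you can read them with `adb shell cat /proc/cpuinfo`). A minimal sketch of the decision logic; `pick_quant` and the sample feature strings are illustrative, not part of llama.cpp:

```shell
# Map a /proc/cpuinfo "Features" string to the fastest matching quant format.
# pick_quant is a hypothetical helper for illustration only.
pick_quant() {
  case " $1 " in
    *" sve "*)  echo "Q4_0_8_8" ;;  # SVE (kernel needs a 256-bit vector width)
    *" i8mm "*) echo "Q4_0_4_8" ;;  # int8 matrix multiplication
    *)          echo "Q4_0_4_4" ;;  # plain NEON fallback (asimd)
  esac
}

# On the phone: adb shell cat /proc/cpuinfo | grep -m1 Features
pick_quant "fp asimd aes i8mm"   # prints Q4_0_4_8
```

Note that even when `sve` is listed, the SVE kernel only pays off at a 256-bit vector width, so Q4_0_4_8 or Q4_0_4_4 is often the safer choice.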

How to create a special ARM-optimized version of Fietje-2-Chat

Download the f16 gguf version of Fietje-2-Chat:

wget -O fietje-2b-chat-f16.gguf "https://huggingface.co/BramVanroy/fietje-2-chat-gguf/resolve/main/fietje-2b-chat-f16.gguf?download=true"

Install a Docker version of llama.cpp to do the conversion:

mkdir -p ~/llama/models
sudo docker pull ghcr.io/ggerganov/llama.cpp:full

To convert the f32 or f16 gguf to the Q4_0_4_4 (Arm Neon) format:

docker run --rm -v /home/user/llama/models:/models ghcr.io/ggerganov/llama.cpp:full --quantize "/models/fietje-2b-chat-f16.gguf" "/models/fietje-2b-chat-Q4_0_4_4.gguf" "Q4_0_4_4"
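Before transferring, a quick sanity check helps catch a truncated download or a failed conversion: every valid GGUF file starts with the 4-byte ASCII magic `GGUF`. A minimal sketch that checks only those magic bytes, not the full header:

```shell
# check_gguf FILE: succeed if FILE starts with the GGUF magic bytes.
# This only inspects the 4-byte magic, not the rest of the header.
check_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example: check_gguf fietje-2b-chat-Q4_0_4_4.gguf && echo "looks like a GGUF file"
```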

Transfer the fietje-2b-chat-Q4_0_4_4.gguf to your Android Device.
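One way to do the transfer is over USB with `adb` (assumes the Android platform tools are installed and USB debugging is enabled on the phone; the Download folder is a convenient place ChatterUI can import from). The `push_model` wrapper is a hypothetical helper:

```shell
# push_model FILE: copy FILE to the phone's Download folder over USB.
# Assumes adb is on PATH and the device is connected with USB debugging on.
push_model() {
  adb push "$1" "/sdcard/Download/$(basename "$1")"
}

# Example: push_model fietje-2b-chat-Q4_0_4_4.gguf
```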

Open ChatterUI

Import Model:

Go to menu -> API -> Model -> import -> fietje-2b-chat-Q4_0_4_4.gguf

Load Model:

menu -> API -> Model -> select fietje-2b-chat-Q4_0_4_4.gguf -> Load

Then leave the settings screen and start typing in the prompt. The first cold run will be a little slow, but once the model is loaded you’ll get about 10 tokens/s on a Snapdragon 865 phone.

That’s not bad.

If you’re interested in an LLM that generates much better Dutch than Llama 3.2 or Phi-3 on your phone, give Fietje a try.
