To run Fietje-2-Chat locally on your Android phone at a reasonable speed, you’ll need to create a special quantized version of fietje-2b-chat.
One of the best apps for running LLMs on your Android phone is ChatterUI (https://github.com/Vali-98/ChatterUI).
You can download the APK from GitHub, transfer it to your phone, and install it. It’s not yet available on F-Droid.
As most Android phones have an ARM CPU, use special `quants` that run faster on ARM because they use the NEON, int8mm and SVE instruction set extensions.
Note that these optimized kernels require the model to be quantized into one of the formats: Q4_0_4_4 (Arm Neon), Q4_0_4_8 (int8mm) or Q4_0_8_8 (SVE). The SVE mulmat kernel specifically requires a vector width of 256 bits. When running on devices with a different vector width, it is recommended to use the Q4_0_4_8 (int8mm) or Q4_0_4_4 (Arm Neon) formats for better performance.
(https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md#arm-cpu-optimized-mulmat-kernels)
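If you are not sure which format fits your phone, the selection rule quoted above can be sketched as a small shell function. This is only an illustration: `pick_quant` is a made-up helper, the feature names `asimd`, `i8mm` and `sve` are the ones Linux reports on the `Features` line of /proc/cpuinfo on ARM64, and the plain `Q4_0` fallback is my own assumption for CPUs without any of them. The SVE vector width is not visible in /proc/cpuinfo, so it has to be passed in by hand.

```shell
#!/bin/sh
# pick_quant: hypothetical helper that maps a CPU feature list to a quant format.
# Argument 1: the "Features" line from /proc/cpuinfo (e.g. "fp asimd i8mm").
# Argument 2 (optional): SVE vector width in bits, if you know it.
pick_quant() {
    features=" $1 "
    sve_bits="${2:-0}"
    case "$features" in
        *" sve "*)
            # The SVE kernel needs a 256-bit vector width.
            if [ "$sve_bits" -eq 256 ]; then
                echo "Q4_0_8_8"
                return
            fi
            ;;
    esac
    case "$features" in
        *" i8mm "*)  echo "Q4_0_4_8" ;;   # int8 matrix multiply available
        *" asimd "*) echo "Q4_0_4_4" ;;   # plain Arm Neon
        *)           echo "Q4_0" ;;       # assumed fallback, not from the docs
    esac
}
```

On the phone itself (for example in a terminal app or an `adb shell`), something like `pick_quant "$(grep -m1 '^Features' /proc/cpuinfo | cut -d: -f2)"` should print a suggestion.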
How to create a special ARM optimized version of Fietje-2-Chat
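Download the f16 GGUF version of Fietje-2-Chat (the -O option keeps the ?download=true query string out of the saved file name):
wget -O fietje-2b-chat-f16.gguf "https://huggingface.co/BramVanroy/fietje-2-chat-gguf/resolve/main/fietje-2b-chat-f16.gguf?download=true"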
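A failed download can leave you with an HTML error page instead of a model. GGUF files start with the four ASCII bytes `GGUF`, so a quick sanity check is possible before converting. The `is_gguf` function below is a hypothetical helper, not part of llama.cpp, and it only checks the magic bytes, not the rest of the file.

```shell
#!/bin/sh
# is_gguf: hypothetical helper; succeeds if the file starts with the
# 4-byte magic "GGUF". It does not validate anything beyond that.
is_gguf() {
    [ -f "$1" ] || return 1
    magic=$(head -c 4 "$1")
    [ "$magic" = "GGUF" ]
}

# Example:
#   is_gguf fietje-2b-chat-f16.gguf && echo "looks like a GGUF file"
```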
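Install the Docker version of llama.cpp to do the conversion. Create a models directory, put the downloaded GGUF file in it, and run the image; the first run also pulls it. Replace /home/user with your own home directory:
mkdir -p ~/llama/models
sudo docker run -v /home/user/llama/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B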
To convert the f32 or f16 GGUF to the Q4_0_4_4 (Arm Neon) format:
docker run --rm -v /home/user/llama/models:/models ghcr.io/ggerganov/llama.cpp:full --quantize "/models/fietje-2b-chat-f16.gguf" "/models/fietje-2b-chat-Q4_0_4_4.gguf" "Q4_0_4_4"
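If you want to try more than one of the ARM formats, the command above can be generated per format. The sketch below only prints the commands so you can review them first; `quantize_cmd` and `MODELS_DIR` are names I made up, the input is assumed to be the f16 file, and it assumes the image you pulled accepts these format names.

```shell
#!/bin/sh
# quantize_cmd: hypothetical helper that prints the docker command for one format.
# MODELS_DIR is an assumption; it defaults to the models directory on the host.
MODELS_DIR="${MODELS_DIR:-$HOME/llama/models}"

quantize_cmd() {
    src="$1"                          # e.g. fietje-2b-chat-f16.gguf
    fmt="$2"                          # e.g. Q4_0_4_4
    out="${src%-f16.gguf}-$fmt.gguf"  # fietje-2b-chat-Q4_0_4_4.gguf
    echo "docker run --rm -v $MODELS_DIR:/models ghcr.io/ggerganov/llama.cpp:full --quantize /models/$src /models/$out $fmt"
}

# Print one command per ARM-optimized format.
for fmt in Q4_0_4_4 Q4_0_4_8 Q4_0_8_8; do
    quantize_cmd fietje-2b-chat-f16.gguf "$fmt"
done
```

Once the printed commands look right, you can run them one by one or pipe the script's output to `sh`.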
Transfer the fietje-2b-chat-Q4_0_4_4.gguf to your Android Device.
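One way to do the transfer is over USB with `adb push`. The sketch below assumes that adb is installed, that USB debugging is enabled on the phone, and that /sdcard/Download/ is a folder ChatterUI's file picker can reach; `push_model`, `ADB` and `DEST` are names I made up for this example.

```shell
#!/bin/sh
# push_model: hypothetical helper that copies a GGUF file to the phone with adb.
# ADB and DEST are assumptions; override them to match your setup.
push_model() {
    file="$1"
    dest="${DEST:-/sdcard/Download/}"
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    "${ADB:-adb}" push "$file" "$dest"
}

# Example:
#   push_model ~/llama/models/fietje-2b-chat-Q4_0_4_4.gguf
```

Any other way of copying the file (a file manager, a cloud drive, an SD card) works just as well.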
Open ChatterUI
Import Model:
Go to menu -> API -> Model -> import -> fietje-2b-chat-Q4_0_4_4.gguf
Load Model:
Go to menu -> API -> Model -> select fietje-2b-chat-Q4_0_4_4.gguf -> Load
Then leave the settings screen and start typing a prompt. The first cold run will be a little slow, but once it’s running you’ll get about 10 tokens/s on a Snapdragon 865 phone.
That’s not bad.
If you’re interested in an LLM that can generate much better Dutch than Llama 3.2 or Phi-3 on your phone, give Fietje a try.