This PWA offers a fully offline inference runtime for a chat model fine-tuned from Cerebras-GPT 111M (M for million parameters).
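For comparison, Llama 2 is available in 7B, 13B, and 70B sizes (B for billion). OpenAI has not officially disclosed the sizes of GPT-3.5 Turbo or GPT-4, but unofficial estimates put them at around 20B and 1.76T parameters respectively (T for trillion).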
This model is very small in comparison, so it tends to hallucinate a lot and get confused easily. Still, it's quite amazing that we can run it on a phone.
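To put the size difference in concrete terms, the sketch below estimates how much memory the model weights alone take at different numeric precisions (parameters × bytes per parameter). It is a rough back-of-the-envelope calculation under stated assumptions: it ignores activations, the KV cache, the tokenizer, and runtime overhead. Whether this app ships fp32 or 8-bit quantized weights is not stated here, so several precisions are shown. The names `weightMegabytes` and `BYTES_PER_PARAM` are hypothetical helpers for illustration only.

```typescript
// Hypothetical helper: bytes needed to store one parameter at each precision.
const BYTES_PER_PARAM = { fp32: 4, fp16: 2, int8: 1 } as const;
type Precision = keyof typeof BYTES_PER_PARAM;

// Parameter counts taken from the text above (Llama 2 sizes are public).
const CEREBRAS_111M = 111e6;
const LLAMA2_7B = 7e9;
const LLAMA2_70B = 70e9;

// Weight-only memory estimate in megabytes (1 MB = 1e6 bytes).
// Assumption: ignores activations, KV cache and runtime overhead.
function weightMegabytes(params: number, precision: Precision): number {
  return (params * BYTES_PER_PARAM[precision]) / 1e6;
}

const models: Array<[string, number]> = [
  ["Cerebras-GPT 111M", CEREBRAS_111M],
  ["Llama 2 7B", LLAMA2_7B],
  ["Llama 2 70B", LLAMA2_70B],
];

const precisions = Object.keys(BYTES_PER_PARAM) as Precision[];

// Print one line per model with its estimated weight size at each precision.
for (const [name, params] of models) {
  const sizes = precisions
    .map((p) => `${p}: ${weightMegabytes(params, p).toFixed(0)} MB`)
    .join(", ");
  console.log(`${name} -> ${sizes}`);
}

// How many times more parameters the smallest Llama 2 model has.
const ratio = LLAMA2_7B / CEREBRAS_111M;
console.log(`Llama 2 7B has about ${ratio.toFixed(0)}x more parameters`);
```

By this estimate the 111M model needs roughly 444 MB of weights at fp32 and about 111 MB at int8, while even the smallest Llama 2 model needs about 28 GB at fp32, which is why only a model of this size is practical to run in a phone browser.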
Model: LaMini-Cerebras-111M
Inference: transformers.js
Runtime: ONNX Runtime Web