Embedding Local LLMs in Your Mobile App

**MyrinNew** · 03-26-2026, 03:30 AM

---
title: "Ship an On-Device LLM in Your Mobile App with KMP and llama.cpp"
published: true
description: "A practical guide to embedding llama.cpp in production mobile apps using Kotlin Multiplatform — covering quantization benchmarks, GPU delegation, and a 60fps streaming architecture."
tags: kotlin, mobile, architecture, performance
canonical_url: https://blog.mvpfactory.co/on-device...-kmp-llama-cpp
---

## What We're Building

By the end of this tutorial, you'll have a working architecture for running a 7B-parameter LLM directly on a phone — no cloud calls, no connectivity requirement, no data leaving the device. We'll wire llama.cpp into a Kotlin Multiplatform project, pick the right quantization level using real benchmark data, and build a coroutine-based streaming pipeline that renders tokens without dropping frames.

Let me show you a pattern I use in every project that needs on-device inference.

## Prerequisites

- Kotlin Multiplatform project targeting iOS and Android
- llama.cpp compiled for both platforms (C interop on iOS, JNI on Android)
- A GGUF-format model (we'll use Mistral 7B)
- Familiarity with Kotlin coroutines and Flows

## Step 1: Pick Your Quantization

Most teams get this wrong. They either crush the model down to Q2_K (quality tanks) or refuse to quantize at all (won't fit on any phone). Here are the numbers that make the choice obvious.

**Mistral 7B — iPhone 15 Pro / Pixel 8 Pro:**

| Quant | Size | Peak RAM | tok/s (Metal) | tok/s (NNAPI) | Perplexity |
|-------|------|----------|---------------|---------------|------------|
| Q5_K_S | 5.1 GB | 5.8 GB | 18.4 | 14.1 | 5.86 |
| **Q4_K_M** | **4.4 GB** | **4.9 GB** | **22.7** | **17.3** | **5.92** |
| Q4_0 | 3.8 GB | 4.3 GB | 24.1 | 19.8 | 6.18 |
| Q2_K | 2.7 GB | 3.2 GB | 28.3 | 22.6 | 6.97 |

**Ship Q4_K_M.** Perplexity rises only about 1% versus Q5_K_S (5.92 vs. 5.86), inference on iOS is 23% faster (22.7 vs. 18.4 tok/s), and peak RAM stays under the 5 GB dirty memory ceiling that triggers iOS jetsam kills.
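To make the trade-off easy to re-check, here is a small sketch that recomputes the comparison from the table rows. The `Quant` class, `percentChange`, and the selection rule in `pickQuant` (lowest perplexity that fits a RAM budget) are illustrative assumptions, not part of llama.cpp.

```kotlin
// Rows copied from the benchmark table above (peak RAM, Metal tok/s, perplexity).
data class Quant(
    val name: String,
    val peakRamGb: Double,
    val metalTokPerSec: Double,
    val perplexity: Double,
)

val quants = listOf(
    Quant("Q5_K_S", 5.8, 18.4, 5.86),
    Quant("Q4_K_M", 4.9, 22.7, 5.92),
    Quant("Q4_0", 4.3, 24.1, 6.18),
    Quant("Q2_K", 3.2, 28.3, 6.97),
)

// Relative change of `to` versus `from`, in percent.
fun percentChange(from: Double, to: Double): Double = (to - from) / from * 100.0

// Hypothetical selection rule: the lowest-perplexity quant whose peak RAM
// stays strictly below the budget.
fun pickQuant(ramBudgetGb: Double): Quant? =
    quants.filter { it.peakRamGb < ramBudgetGb }.minByOrNull { it.perplexity }

fun main() {
    val q5 = quants.first { it.name == "Q5_K_S" }
    val q4 = quants.first { it.name == "Q4_K_M" }
    println("Perplexity change: ${percentChange(q5.perplexity, q4.perplexity)} %")
    println("Speed change (Metal): ${percentChange(q5.metalTokPerSec, q4.metalTokPerSec)} %")
    println("Pick under 5 GB: ${pickQuant(5.0)?.name}")
}
```

Swapping in your own measurements and budget is the point: the rule stays the same even if your target device's ceiling differs.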

## Step 2: Memory-Mapped Model Loading

Here is the gotcha that will save you hours: iOS enforces *hard* dirty memory limits. Exceed them and your app dies silently. The fix is `mmap`-based loading — memory-mapped pages count as clean memory, not dirty.

```kotlin
// commonMain
expect class LlamaModel {
    fun load(path: String, config: ModelConfig): InferenceSession
}

data class ModelConfig(
    val useMmap: Boolean = true,
    val useGpu: Boolean = true,
    val gpuLayers: Int = 99,
    val contextSize: Int = 2048
)
```

Your `actual` implementations call llama.cpp's C API with `use_mmap = true` — via cinterop on iOS, JNI on Android. Setting `gpuLayers = 99` offloads everything possible to Metal or NNAPI. In practice that's 28–32 of 32 layers on recent devices.

## Step 3: The Streaming Token Pipeline

Token generation runs at 17–25 tok/s. If you collect on the main thread or batch UI updates naively, you *will* drop frames. Here's the minimal setup to get this working:

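```kotlin
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.buffer
import kotlinx.coroutines.flow.callbackFlow
import kotlinx.coroutines.flow.runningFold
import kotlinx.coroutines.flow.update
import kotlinx.coroutines.launch

// Bridges llama.cpp's per-token callback into a cold Flow of token strings.
fun streamInference(prompt: String): Flow<String> = callbackFlow {
    val session = model.createSession()
    session.onToken { token ->
        trySend(token) // non-suspending, so it can be called from the native callback
    }
    session.infer(prompt) // assumed to block until generation finishes
    close()
    awaitClose { session.cancel() }
}

// ViewModel
viewModelScope.launch {
    streamInference(prompt)
        // Accumulate before conflating: each emission is the full text so far,
        // so a skipped emission never loses a token.
        .runningFold(_uiState.value.text) { text, token -> text + token }
        .buffer(Channel.CONFLATED)
        .collect { text ->
            _uiState.update { it.copy(text = text) }
        }
}
```

`callbackFlow` bridges the C callback into coroutine-land. `Channel.CONFLATED` keeps only the newest value when the UI can't keep up during recomposition, so the flow accumulates text with `runningFold` first and a skipped emission never loses a token. The result: no backpressure on the inference thread and no dropped frames. Compose's smart diffing keeps frame time under 12ms.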

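Run inference off the main thread on a dedicated single-thread context, for example by applying `flowOn` to the token flow with a dispatcher such as `Dispatchers.Default.limitedParallelism(1)`. llama.cpp is not thread-safe per session, so calls into one session must never overlap.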
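Here is a minimal, self-contained sketch of that threading setup feeding a deliberately slow collector. `fakeTokens` is a hypothetical stand-in for the llama.cpp session, and `inferenceDispatcher` is just one way to get serial execution: `limitedParallelism(1)` guarantees no overlap, but not a fixed thread. The flow accumulates text before conflating, so skipped intermediate values lose no tokens.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.conflate
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn
import kotlinx.coroutines.flow.runningFold
import kotlinx.coroutines.runBlocking

// Hypothetical stand-in for the native session: emits tokens at a fixed pace.
fun fakeTokens(tokens: List<String>): Flow<String> = flow {
    for (token in tokens) {
        delay(5)
        emit(token)
    }
}

// At most one coroutine runs on this view at a time, so calls never overlap.
@OptIn(ExperimentalCoroutinesApi::class)
val inferenceDispatcher = Dispatchers.Default.limitedParallelism(1)

// Returns the text the "UI" ends up showing after the stream completes.
fun renderAll(tokens: List<String>): String = runBlocking {
    var shown = ""
    fakeTokens(tokens)
        .flowOn(inferenceDispatcher)                    // producer runs off the caller's thread
        .runningFold("") { text, token -> text + token } // emit accumulated text, not single tokens
        .conflate()                                      // slow collector skips stale snapshots
        .collect { text ->
            delay(20) // simulate a frame that takes longer than token generation
            shown = text
        }
    shown
}

fun main() {
    println(renderAll(listOf("On", "-device", " inference", " works")))
}
```

The collector sees fewer updates than there are tokens, but the last snapshot it receives is always the complete text.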

## Step 4: GPU Delegation

Metal on iOS is mature — expect a consistent 1.3–1.5x speedup. NNAPI on Android is messier. Qualcomm Adreno handles it well; older Mali GPUs can regress.

The docs don't mention this, but my recommendation: default to GPU on iOS, and on Android, run a quick 10-token benchmark at first launch to decide. Cache the result in shared preferences. Adaptive initialization like this shows up everywhere in mobile — any app doing serious on-device work needs it. Even something as simple as [HealthyDesk](https://play.google.com/store/apps/d...om.healthydesk), which I keep running for break reminders during long coding sessions, adapts its scheduling based on device state. For LLM inference the stakes are higher — a bad GPU path means visibly worse performance.
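The first-launch benchmark can be sketched as follows. Everything here is an illustrative assumption: `KeyValueStore` stands in for shared preferences, `measureTokPerSec` is a hypothetical hook that would run a short generation on the given backend and return tokens per second, and the key name is made up.

```kotlin
enum class Backend { GPU, CPU }

// Hypothetical stand-in for SharedPreferences or a similar key-value store.
interface KeyValueStore {
    fun getString(key: String): String?
    fun putString(key: String, value: String)
}

class InMemoryStore : KeyValueStore {
    private val map = mutableMapOf<String, String>()
    override fun getString(key: String): String? = map[key]
    override fun putString(key: String, value: String) { map[key] = value }
}

class BackendSelector(
    private val store: KeyValueStore,
    // Hypothetical hook: generate `tokens` tokens on `backend`, return tok/s.
    private val measureTokPerSec: (backend: Backend, tokens: Int) -> Double,
) {
    fun select(): Backend {
        // Reuse the cached decision so the benchmark only runs once per install.
        store.getString(KEY)?.let { return Backend.valueOf(it) }

        val gpu = measureTokPerSec(Backend.GPU, BENCH_TOKENS)
        val cpu = measureTokPerSec(Backend.CPU, BENCH_TOKENS)
        val winner = if (gpu > cpu) Backend.GPU else Backend.CPU
        store.putString(KEY, winner.name)
        return winner
    }

    private companion object {
        const val KEY = "llm_backend"   // illustrative key name
        const val BENCH_TOKENS = 10     // the 10-token benchmark from the text
    }
}
```

On a device where the GPU path regresses, the selector falls back to CPU once and remembers it. You may want to invalidate the cached value after an OS or driver update.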

## Gotchas

1. **iOS jetsam is silent.** Your app won't crash with a stack trace — it just vanishes. Always validate dirty memory with Xcode Memory Gauge, not Instruments allocations. They measure different things.
2. **Q5_K_S will OOM on most iPhones.** It works on 12GB+ Android flagships. On iOS, it leaves you zero headroom. Stick with Q4_K_M.
3. **Don't poll for tokens.** Don't batch them either. The `callbackFlow` + `CONFLATED` pattern above is the correct answer. Let the rendering framework decide cadence.
4. **GGUF format matters.** Older GGML files won't work. Convert with `llama.cpp`'s `convert` scripts and verify with `llama-quantize --help`.
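For the last gotcha, a cheap pre-flight check before handing a file to llama.cpp is to look at its first four bytes: GGUF files begin with the ASCII magic `GGUF`. The sketch below is JVM/Android-only (it uses `java.io.File`), and `looksLikeGguf` is a hypothetical helper name. It only checks the magic, not the version or tensor data.

```kotlin
import java.io.File

// GGUF files start with the four ASCII bytes "GGUF".
private val GGUF_MAGIC = byteArrayOf(0x47, 0x47, 0x55, 0x46)

// Returns true if the file exists and begins with the GGUF magic bytes.
fun looksLikeGguf(file: File): Boolean {
    if (!file.isFile || file.length() < GGUF_MAGIC.size) return false
    val header = ByteArray(GGUF_MAGIC.size)
    file.inputStream().use { input ->
        var read = 0
        // read() may return fewer bytes than requested, so loop until full.
        while (read < header.size) {
            val n = input.read(header, read, header.size - read)
            if (n < 0) return false
            read += n
        }
    }
    return header.contentEquals(GGUF_MAGIC)
}
```

Failing fast here gives you a readable error message instead of a native load failure deep inside the JNI call.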

## Wrapping Up

On-device LLM inference works in production today. The tooling is there. What separates apps that ship from apps that crash is the boring stuff: memory management, threading discipline, and knowing where iOS and Android disagree. Get those right and you can build things your cloud-dependent competitors can't.

**Resources:**
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [GGUF format spec](https://github.com/ggerganov/ggml/bl...r/docs/gguf.md)
- [Kotlin Multiplatform docs](https://kotlinlang.org/docs/multiplatform.html)
- [Apple Memory Limits](https://developer.apple.com/documentation/metrickit)
