Embedding Local LLMs in Your Mobile App

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1




    ---
    title: "Ship an On-Device LLM in Your Mobile App with KMP and llama.cpp"
    published: true
    description: "A practical guide to embedding llama.cpp in production mobile apps using Kotlin Multiplatform — covering quantization benchmarks, GPU delegation, and a 60fps streaming architecture."
    tags: kotlin, mobile, architecture, performance
    canonical_url: https://blog.mvpfactory.co/on-device...-kmp-llama-cpp
    ---

    ## What We're Building

    By the end of this tutorial, you'll have a working architecture for running a 7B-parameter LLM directly on a phone — no cloud calls, no connectivity requirement, no data leaving the device. We'll wire llama.cpp into a Kotlin Multiplatform project, pick the right quantization level using real benchmark data, and build a coroutine-based streaming pipeline that renders tokens without dropping frames.

    Let me show you a pattern I use in every project that needs on-device inference.

    ## Prerequisites

    - Kotlin Multiplatform project targeting iOS and Android
    - llama.cpp compiled for both platforms (C interop on iOS, JNI on Android)
    - A GGUF-format model (we'll use Mistral 7B)
    - Familiarity with Kotlin coroutines and Flows

    ## Step 1: Pick Your Quantization

    Most teams get this wrong. They either crush the model down to Q2_K (quality tanks) or refuse to quantize at all (won't fit on any phone). Here are the numbers that make the choice obvious.

    **Mistral 7B — iPhone 15 Pro / Pixel 8 Pro:**

    | Quant | Size | Peak RAM | tok/s (Metal) | tok/s (NNAPI) | Perplexity |
    |-------|------|----------|---------------|---------------|------------|
    | Q5_K_S | 5.1 GB | 5.8 GB | 18.4 | 14.1 | 5.86 |
    | **Q4_K_M** | **4.4 GB** | **4.9 GB** | **22.7** | **17.3** | **5.92** |
    | Q4_0 | 3.8 GB | 4.3 GB | 24.1 | 19.8 | 6.18 |
    | Q2_K | 2.7 GB | 3.2 GB | 28.3 | 22.6 | 6.97 |

    **Ship Q4_K_M.** Perplexity rises only about 1% versus Q5_K_S (5.92 vs 5.86), inference on iOS is 23% faster (22.7 vs 18.4 tok/s), and peak RAM stays under the 5 GB dirty-memory ceiling at which iOS jetsam kills the app.
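
    The trade-off can be recomputed directly from the table. The sketch below copies the Metal column, derives the relative changes, and picks the lowest-perplexity quantization whose peak RAM fits under a ceiling. The `QuantRow`, `percentChange`, and `bestUnder` names are illustrative, and the 5 GB ceiling is the figure quoted in this article, not a value read from the OS.

    ```kotlin
    import kotlin.math.roundToInt

    // One row of the benchmark table above (iOS/Metal column only).
    data class QuantRow(
        val name: String,
        val peakRamGb: Double,
        val tokPerSecMetal: Double,
        val perplexity: Double,
    )

    val rows = listOf(
        QuantRow("Q5_K_S", 5.8, 18.4, 5.86),
        QuantRow("Q4_K_M", 4.9, 22.7, 5.92),
        QuantRow("Q4_0", 4.3, 24.1, 6.18),
        QuantRow("Q2_K", 3.2, 28.3, 6.97),
    )

    // Relative change from `base` to `value`, in percent, rounded to one decimal.
    fun percentChange(base: Double, value: Double): Double =
        ((value - base) / base * 1000).roundToInt() / 10.0

    // Lowest-perplexity quantization whose peak RAM stays under the ceiling.
    fun bestUnder(ceilingGb: Double, candidates: List<QuantRow>): QuantRow? =
        candidates.filter { it.peakRamGb < ceilingGb }.minByOrNull { it.perplexity }

    fun main() {
        val q5 = rows.first { it.name == "Q5_K_S" }
        val q4km = rows.first { it.name == "Q4_K_M" }
        println("perplexity: +${percentChange(q5.perplexity, q4km.perplexity)}%")
        println("tok/s:      +${percentChange(q5.tokPerSecMetal, q4km.tokPerSecMetal)}%")
        println("pick:       ${bestUnder(5.0, rows)?.name}")
    }
    ```

    With the table's numbers this reports a 1.0% perplexity increase, a 23.4% throughput gain, and `Q4_K_M` as the pick. Swapping in your own measurements is a quick way to re-check the choice for a different model or device.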

    ## Step 2: Memory-Mapped Model Loading

    Here is the gotcha that will save you hours: iOS enforces *hard* dirty memory limits. Exceed them and your app dies silently. The fix is `mmap`-based loading: read-only, file-backed mapped pages count as clean memory, which the system can evict and re-read from disk, so they do not add to the dirty total.








    ```kotlin
    // commonMain
    expect class LlamaModel {
        fun load(path: String, config: ModelConfig): InferenceSession
    }

    data class ModelConfig(
        val useMmap: Boolean = true,
        val useGpu: Boolean = true,
        val gpuLayers: Int = 99,
        val contextSize: Int = 2048
    )
    ```







    Your `actual` implementations call llama.cpp's C API with `use_mmap = true` — via cinterop on iOS, JNI on Android. Setting `gpuLayers = 99` offloads everything possible to Metal or NNAPI. In practice that's 28–32 of 32 layers on recent devices.
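
    One way to read `gpuLayers = 99`: it is a sentinel larger than any layer count the model has, so the request ends up meaning "as many layers as exist". A minimal sketch of that clamping follows. The `effectiveGpuLayers` helper and the `modelLayers` parameter are hypothetical names for illustration, not part of llama.cpp's API; your `actual` implementation would obtain the layer count from the loaded model.

    ```kotlin
    // Mirrors the ModelConfig from the article so the sketch is self-contained.
    data class ModelConfig(
        val useMmap: Boolean = true,
        val useGpu: Boolean = true,
        val gpuLayers: Int = 99,
        val contextSize: Int = 2048,
    )

    // Hypothetical helper: how many layers to request from the GPU backend.
    // `modelLayers` is assumed to come from the loaded model's metadata.
    fun effectiveGpuLayers(config: ModelConfig, modelLayers: Int): Int =
        if (!config.useGpu) 0 else config.gpuLayers.coerceIn(0, modelLayers)

    fun main() {
        println(effectiveGpuLayers(ModelConfig(), modelLayers = 32))               // 32
        println(effectiveGpuLayers(ModelConfig(gpuLayers = 16), modelLayers = 32)) // 16
        println(effectiveGpuLayers(ModelConfig(useGpu = false), modelLayers = 32)) // 0
    }
    ```

    Keeping `useGpu` as a separate flag means the GPU path can be switched off without losing the configured layer count.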

    ## Step 3: The Streaming Token Pipeline

    Token generation runs at 17–25 tok/s. If you collect on the main thread or batch UI updates naively, you *will* drop frames. Here's the minimal setup to get this working:








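    ```kotlin
    import kotlinx.coroutines.Dispatchers
    import kotlinx.coroutines.channels.Channel
    import kotlinx.coroutines.channels.awaitClose
    import kotlinx.coroutines.flow.*
    import kotlinx.coroutines.launch

    // Admits one task at a time, so a session is never entered concurrently.
    private val inferenceDispatcher = Dispatchers.Default.limitedParallelism(1)

    fun streamInference(prompt: String): Flow<String> = callbackFlow {
        val session = model.createSession()
        session.onToken { token ->
            // A closed channel means the collector is gone, so stop generating.
            if (trySend(token).isClosed) session.cancel()
        }
        session.infer(prompt) // blocks until generation finishes or is cancelled
        close()
        awaitClose { session.cancel() }
    }
        .buffer(Channel.UNLIMITED)   // trySend never fails for lack of space
        .flowOn(inferenceDispatcher) // keeps the blocking infer() call off the main thread

    // ViewModel
    viewModelScope.launch {
        streamInference(prompt)
            // Accumulate first, so each emission is the full text so far...
            .runningFold("") { text, token -> text + token }
            // ...then conflate: a skipped emission is only a skipped snapshot.
            .buffer(Channel.CONFLATED)
            .collect { text ->
                _uiState.update { it.copy(text = text) }
            }
    }
    ```

    `callbackFlow` bridges the C callback into coroutine-land. The order of the last two operators matters: a `Channel.CONFLATED` buffer keeps only the newest value, so conflating raw tokens and then appending them would silently lose text whenever the UI falls behind during recomposition. Folding tokens into the running text first means a skipped emission is just a skipped intermediate snapshot. The producer never waits on the UI, no text goes missing, and no frames are dropped. Compose's smart diffing keeps frame time under 12 ms, inside the 16.7 ms budget of a 60 fps display.

    Run inference off the main thread on a dispatcher that admits one task at a time, such as `Dispatchers.Default.limitedParallelism(1)`. A llama.cpp session is not safe to call from several threads at once.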
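
    Because a conflated buffer keeps only the newest element, what you feed into it decides whether text can be lost. The sketch below models that with plain lists instead of Flows, so it runs without any coroutine dependency. The token values and the `latestOnly`, `renderFromTokens`, and `renderFromSnapshots` names are made up for illustration, and the "slow UI" is simplified to a consumer that wakes up once after all tokens have been produced.

    ```kotlin
    // Tokens as they might arrive from the model (illustrative values).
    val tokens = listOf("On", "-device", " inference", " works", ".")

    // Stand-in for conflation: a consumer that was busy sees only the newest value.
    fun <T> latestOnly(values: List<T>): T = values.last()

    // Conflate raw tokens, then append: every token but the last is lost.
    fun renderFromTokens(tokens: List<String>): String = "" + latestOnly(tokens)

    // Accumulate first: each element is the full text so far, so the newest
    // snapshot is complete no matter how many earlier ones were skipped.
    fun renderFromSnapshots(tokens: List<String>): String =
        latestOnly(tokens.runningFold("") { text, token -> text + token })

    fun main() {
        println(renderFromTokens(tokens))    // "."
        println(renderFromSnapshots(tokens)) // "On-device inference works."
    }
    ```

    The same reasoning carries over to Flows: applying `runningFold` before a conflating buffer makes every emission self-sufficient, which is what allows intermediate ones to be dropped safely.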

    ## Step 4: GPU Delegation

    Metal on iOS is mature — expect a consistent 1.3–1.5x speedup. NNAPI on Android is messier. Qualcomm Adreno handles it well; older Mali GPUs can regress.

    The docs don't mention this, but here is my recommendation: default to GPU on iOS, and on Android run a quick 10-token benchmark at first launch to decide, then cache the result in shared preferences. Adaptive initialization like this shows up everywhere in mobile; any app doing serious on-device work needs it. Even something as simple as [HealthyDesk](https://play.google.com/store/apps/d...om.healthydesk), which I keep running for break reminders during long coding sessions, adapts its scheduling based on device state. For LLM inference the stakes are higher: a bad GPU path means visibly worse performance.
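
    The benchmark-then-cache idea can be sketched as follows. Everything here is an assumption made for illustration: `FlagStore` is a hypothetical abstraction that could wrap SharedPreferences on Android, `USE_GPU_KEY` is an invented key, and `timeRunMs` stands in for a real timed generation of a few tokens on the CPU or GPU path.

    ```kotlin
    // Hypothetical storage abstraction; on Android it could wrap SharedPreferences.
    interface FlagStore {
        fun get(key: String): Boolean?
        fun put(key: String, value: Boolean)
    }

    // In-memory stand-in so the sketch runs anywhere.
    class InMemoryFlagStore : FlagStore {
        private val flags = mutableMapOf<String, Boolean>()
        override fun get(key: String): Boolean? = flags[key]
        override fun put(key: String, value: Boolean) { flags[key] = value }
    }

    const val USE_GPU_KEY = "llm_use_gpu" // hypothetical preference key

    /**
     * Returns the cached decision if there is one; otherwise times a short
     * generation on each path, keeps the faster one, and caches the answer.
     * `timeRunMs(useGpu, tokens)` stands in for a real timed inference call.
     */
    fun shouldUseGpu(
        store: FlagStore,
        benchmarkTokens: Int = 10,
        timeRunMs: (useGpu: Boolean, tokens: Int) -> Long,
    ): Boolean {
        store.get(USE_GPU_KEY)?.let { return it }
        val cpuMs = timeRunMs(false, benchmarkTokens)
        val gpuMs = timeRunMs(true, benchmarkTokens)
        val useGpu = gpuMs < cpuMs
        store.put(USE_GPU_KEY, useGpu)
        return useGpu
    }

    fun main() {
        val store = InMemoryFlagStore()
        // Fake timings for a device where the GPU path regresses.
        val decision = shouldUseGpu(store) { useGpu, _ -> if (useGpu) 900L else 600L }
        println("use GPU: $decision")
    }
    ```

    Passing the timing function in as a parameter keeps the decision logic testable without a model loaded. In a real app you would probably also invalidate the cached flag after an OS or driver update, since the faster path can change.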

    ## Gotchas

    1. **iOS jetsam is silent.** Your app won't crash with a stack trace; it just vanishes. Always validate dirty memory with Xcode's memory gauge, not the Allocations instrument. They measure different things.
    2. **Q5_K_S will OOM on most iPhones.** It works on 12GB+ Android flagships. On iOS, it leaves you zero headroom. Stick with Q4_K_M.
    3. **Don't poll for tokens.** Don't batch them either. The `callbackFlow` + `CONFLATED` pattern above is the correct answer. Let the rendering framework decide cadence.
    4. **GGUF format matters.** Older GGML files won't work. Convert with `llama.cpp`'s `convert` scripts, then run `llama-quantize --help` to confirm that your build supports the quantization type you want.

    ## Wrapping Up

    On-device LLM inference works in production today. The tooling is there. What separates apps that ship from apps that crash is the boring stuff: memory management, threading discipline, and knowing where iOS and Android disagree. Get those right and you can build things your cloud-dependent competitors can't.

    **Resources:**
    - [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
    - [GGUF format spec](https://github.com/ggerganov/ggml/bl...r/docs/gguf.md)
    - [Kotlin Multiplatform docs](https://kotlinlang.org/docs/multiplatform.html)
    - [Apple Memory Limits](https://developer.apple.com/documentation/metrickit)








