# How to Run Qwen3.6-35B on Your Mac at 77 tok/s

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175



    Level: intermediate


    Estimated time: 20-40 minutes (most of it is the model download)


    Minimum requirements: Mac with Apple Silicon (M1/M2/M3/M4 family) and at least 48 GB of unified memory





    What are we setting up?

    A local, OpenAI API-compatible server that runs the Qwen3.6-35B-A3B model (quantized to 4 bits) using MLX, Apple's machine learning framework for Apple Silicon. When you're done, you'll have an endpoint at http://127.0.0.1:7979 that you can point any OpenAI-compatible client at (OpenCode, Continue, Cursor, etc.).


    Metric                        Value
    Generation throughput         ~77 tok/s
    TTFT (time to first token)    ~0.25 s
    Context window                65,536 to 131,072 tokens
    RAM required                  ~20 GB model + ~12 GB KV cache





    Prerequisites

    Hardware

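    • Mac with an Apple Silicon chip (M1, M2, M3, or M4 family). Base chips and some Pro variants top out below 48 GB, so check your configuration
    • At least 48 GB of unified memory (the quantized model takes ~20 GB and the KV cache up to another 12 GB, which leaves headroom for macOS and your other apps)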
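To see where the 48 GB figure comes from, here is a minimal sketch of the memory budget. It uses only the approximate numbers quoted in this guide; the function name is made up for illustration.

```python
# Memory budget sketch using the approximate figures from this guide.
MODEL_GB = 20      # 4-bit quantized weights (approximate)
KV_CACHE_GB = 12   # upper bound enforced later with --max-bytes


def headroom_gb(total_ram_gb: float) -> float:
    """RAM left for macOS and other apps with the model and a full KV cache resident."""
    return total_ram_gb - (MODEL_GB + KV_CACHE_GB)


for ram in (32, 48, 64):
    print(f"{ram} GB Mac -> {headroom_gb(ram):+.0f} GB headroom")
```

A 32 GB machine has nothing left over once the cache fills up, which is why this guide treats 48 GB as the practical minimum.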


    Software





    # Check Python version (you need 3.11+)
    python3 --version

    # Check that you have git
    git --version
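If you prefer to check from Python itself, something like the following sketch should do. It assumes a native Apple Silicon build of Python, where platform.machine() reports "arm64" (a Python running under Rosetta reports "x86_64" instead). The helper name is hypothetical.

```python
import platform
import sys


def meets_requirements(version=sys.version_info, machine=None) -> bool:
    """True when the interpreter is Python 3.11+ and runs natively on arm64."""
    machine = platform.machine() if machine is None else machine
    return tuple(version[:2]) >= (3, 11) and machine == "arm64"


print("Python:", sys.version.split()[0], "| arch:", platform.machine())
print("OK for this guide:", meets_requirements())
```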







    If you don't have Python 3.11, install it with Homebrew:






    brew install python@3.11










    Step 1 — Create the virtual environment

    From the folder where you want to install everything:






    mkdir mlx-server && cd mlx-server
    python3.11 -m venv .venv
    source .venv/bin/activate










    Step 2 — Install dependencies





    pip install --upgrade pip

    # MLX and the OpenAI API-compatible server
    pip install mlx-lm
    pip install mlx-openai-server







    Verify the installation:






    mlx-openai-server --help
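A slightly more thorough check, as a sketch: it looks for the importable modules and for the CLI on your PATH, using only the standard library. The import names mlx and mlx_lm are the ones the packages are normally imported under; run it inside the activated virtual environment.

```python
import importlib.util
import shutil


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_commands(names):
    """Return the command names that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]


print("missing modules: ", missing_modules(["mlx", "mlx_lm"]))
print("missing commands:", missing_commands(["mlx-openai-server"]))
```

Both lists should come back empty. If the command is missing but the modules are found, the virtual environment is probably not activated in this shell.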










    Step 3 — Download the model

    The model is automatically downloaded from Hugging Face the first time you run it. It takes approximately 20 GB of disk space.






    # Optional pre-download (recommended to track progress)
    python3 -c "
    from mlx_lm import load
    model, tokenizer = load('mlx-community/Qwen3.6-35B-A3B-4bit')
    print('Model downloaded successfully')
    "







    Note: Some Hugging Face repositories require a huggingface.co account and acceptance of the model's terms before you can download. This model does not.
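Before kicking off a ~20 GB download it is worth confirming you have the disk space. A minimal sketch, assuming the model lands somewhere under your home directory (the Hugging Face cache lives there by default) and adding an arbitrary 5 GB safety margin:

```python
import shutil
from pathlib import Path

MODEL_SIZE_GB = 20  # approximate download size quoted in this guide


def has_room(free_bytes: int, needed_gb: float = MODEL_SIZE_GB, margin_gb: float = 5) -> bool:
    """True when free space covers the model plus a safety margin (the margin is an assumption)."""
    return free_bytes >= (needed_gb + margin_gb) * 1e9


free = shutil.disk_usage(Path.home()).free
print(f"free: {free / 1e9:.1f} GB | enough for the model: {has_room(free)}")
```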





    Step 4 — Start the server

    Option A — Direct command (simpler)





    mlx-openai-server launch \
      --model-path mlx-community/Qwen3.6-35B-A3B-4bit \
      --model-type lm \
      --host 127.0.0.1 \
      --port 7979 \
      --tool-call-parser qwen3_coder \
      --reasoning-parser qwen3_5 \
      --enable-auto-tool-choice \
      --context-length 65536 \
      --temperature 0.7 \
      --top-p 0.8 \
      --top-k 20 \
      --min-p 0.0 \
      --repetition-penalty 1.05 \
      --max-bytes 12884901888 \
      --prompt-cache-size 3 \
      --log-level INFO







    Option B — Startup script (recommended)

    Save the following script as start-mlx-server.sh:






    #!/usr/bin/env bash
    set -euo pipefail

    SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
    VENV="$SCRIPT_DIR/.venv"

    # Default profile: high_context
    # Change with: MLX_PROFILE=baseline ./start-mlx-server.sh
    PROFILE="${MLX_PROFILE:-high_context}"

    MODEL_PATH="mlx-community/Qwen3.6-35B-A3B-4bit"
    HOST="127.0.0.1"
    PORT="7979"

    TOOL_CALL_PARSER="qwen3_coder"
    REASONING_PARSER="qwen3_5"

    TEMPERATURE="0.7"
    TOP_P="0.8"
    TOP_K="20"
    MIN_P="0.0"
    REPETITION_PENALTY="1.05"
    MAX_CACHE_BYTES="12884901888" # 12 GB

    DRAFT_MODEL="mlx-community/Qwen3.5-0.8B-MLX-4bit"
    NUM_DRAFT_TOKENS="${MLX_NUM_DRAFT_TOKENS:-4}"

    case "$PROFILE" in
      baseline)
        CONTEXT_LENGTH="65536"
        PROMPT_CACHE_SIZE="3"
        EXTRA_ARGS=""
        ;;
      high_context)
        CONTEXT_LENGTH="131072"
        PROMPT_CACHE_SIZE="5"
        EXTRA_ARGS=""
        ;;
      speculative)
        CONTEXT_LENGTH="65536"
        PROMPT_CACHE_SIZE="3"
        EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
        ;;
      speculative_high)
        CONTEXT_LENGTH="131072"
        PROMPT_CACHE_SIZE="5"
        EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
        ;;
      *)
        echo "Unknown PROFILE: $PROFILE"
        echo "Options: baseline, high_context, speculative, speculative_high"
        exit 1
        ;;
    esac

    # EXTRA_ARGS is left unquoted on purpose so it splits into separate arguments
    exec "$VENV/bin/mlx-openai-server" launch \
      --model-path "$MODEL_PATH" \
      --model-type lm \
      --host "$HOST" \
      --port "$PORT" \
      --tool-call-parser "$TOOL_CALL_PARSER" \
      --reasoning-parser "$REASONING_PARSER" \
      --enable-auto-tool-choice \
      --context-length "$CONTEXT_LENGTH" \
      --temperature "$TEMPERATURE" \
      --top-p "$TOP_P" \
      --top-k "$TOP_K" \
      --min-p "$MIN_P" \
      --repetition-penalty "$REPETITION_PENALTY" \
      --max-bytes "$MAX_CACHE_BYTES" \
      --prompt-cache-size "$PROMPT_CACHE_SIZE" \
      --log-level INFO \
      $EXTRA_ARGS











    chmod +x start-mlx-server.sh
    ./start-mlx-server.sh







    Usage examples:






    ./start-mlx-server.sh # high_context (default)
    MLX_PROFILE=baseline ./start-mlx-server.sh # maximum throughput
    MLX_PROFILE=speculative ./start-mlx-server.sh # speculative decoding
    MLX_PROFILE=speculative MLX_NUM_DRAFT_TOKENS=6 ./start-mlx-server.sh
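If you would rather drive the server from Python, the profile logic in the script boils down to a small lookup table. The sketch below mirrors the values and flags from start-mlx-server.sh; the PROFILES dict and profile_args helper are illustrative names, not part of mlx-openai-server.

```python
DRAFT_MODEL = "mlx-community/Qwen3.5-0.8B-MLX-4bit"

# Same four profiles as the shell script's case statement.
PROFILES = {
    "baseline":         {"context_length": 65536,  "prompt_cache_size": 3, "speculative": False},
    "high_context":     {"context_length": 131072, "prompt_cache_size": 5, "speculative": False},
    "speculative":      {"context_length": 65536,  "prompt_cache_size": 3, "speculative": True},
    "speculative_high": {"context_length": 131072, "prompt_cache_size": 5, "speculative": True},
}


def profile_args(profile: str = "high_context", num_draft_tokens: int = 4) -> list[str]:
    """Return the profile-specific CLI arguments as a list, ready for subprocess."""
    if profile not in PROFILES:
        raise ValueError(f"Unknown profile {profile!r}; options: {', '.join(PROFILES)}")
    cfg = PROFILES[profile]
    args = [
        "--context-length", str(cfg["context_length"]),
        "--prompt-cache-size", str(cfg["prompt_cache_size"]),
    ]
    if cfg["speculative"]:
        args += ["--draft-model-path", DRAFT_MODEL, "--num-draft-tokens", str(num_draft_tokens)]
    return args


print(profile_args("speculative", num_draft_tokens=6))
```

Returning a list rather than a single string avoids the word-splitting trick the shell script relies on for its extra arguments.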










    Step 5 — Verify it works

    In another terminal, send a test request:






    curl http://127.0.0.1:7979/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mlx-community/Qwen3.6-35B-A3B-4bit",
        "messages": [{"role": "user", "content": "Hello, what is 2+2?"}],
        "max_tokens": 100
      }'







    You should see a JSON response with the choices[0].message.content field.
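The same check can be done from Python with nothing but the standard library. This is a sketch: the helper names are made up, and the response is assumed to follow the usual chat-completions shape with the text in choices[0].message.content.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:7979/v1"
MODEL = "mlx-community/Qwen3.6-35B-A3B-4bit"


def build_payload(prompt: str, max_tokens: int = 100) -> dict:
    """Build the same JSON body as the curl example."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(body: dict) -> str:
    """Pull the assistant text out of a chat-completions response."""
    return body["choices"][0]["message"]["content"]


def ask(prompt: str, timeout: float = 120) -> str:
    """POST the prompt to the local server and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the server running, print(ask("Hello, what is 2+2?")) should print the model's answer.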





    Stopping the server





    pkill -f mlx-openai-server







    Or save the following as stop-mlx-server.sh and run that instead:






    #!/usr/bin/env bash
    pkill -f mlx-openai-server && echo "Server stopped."
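To confirm the server is really gone (or really up), you can probe the port. A minimal sketch with the standard library; the function name is illustrative and the defaults match the host and port used in this guide.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 7979, timeout: float = 0.5) -> bool:
    """True when something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("server reachable:", is_port_open())
```

This only tells you that something is listening on the port, not that it is the MLX server, so treat it as a quick sanity check.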










    Connect with your favorite client

    The server exposes an OpenAI-compatible API, so any client that lets you override the base URL should work. Just point base_url at your local server.


    OpenCode

    Create or edit the opencode.json file in the root of your project:






    {
      "$schema": "https://opencode.ai/config.json",
      "provider": {
        "mlx-local": {
          "npm": "@ai-sdk/openai-compatible",
          "name": "MLX Local (Qwen3.6-35B)",
          "options": {
            "baseURL": "http://127.0.0.1:7979/v1"
          },
          "models": {
            "mlx-community/Qwen3.6-35B-A3B-4bit": {
              "name": "Qwen3.6-35B-A3B-4bit (local MLX)",
              "limit": {
                "context": 65536,
                "output": 32768
              }
            }
          }
        }
      }
    }







    Continue / Cursor





    Base URL: http://127.0.0.1:7979/v1
    API Key: any-value (the server does not validate it)
    Model: mlx-community/Qwen3.6-35B-A3B-4bit







    Python (openai SDK)





    from openai import OpenAI

    # The API key can be any string; the local server does not validate it.
    client = OpenAI(
        base_url="http://127.0.0.1:7979/v1",
        api_key="local",
    )

    response = client.chat.completions.create(
        model="mlx-community/Qwen3.6-35B-A3B-4bit",
        messages=[{"role": "user", "content": "Explain what a transformer is"}],
    )
    print(response.choices[0].message.content)










    Configuration profiles

    Profile        Context (tokens)   Prompt cache   Throughput (tok/s)   Use case
    baseline       65,536             3 entries      77.4                 Maximum throughput
    high_context   131,072            5 entries      75.7                 Long documents, extended contexts (default)


    The throughput difference between the two profiles (~2%) is within the noise margin, so there is little cost to using high_context if you work with large files or very long conversations.
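If you want to reproduce numbers like these yourself, the arithmetic is simple. This is a sketch of one reasonable way to compute decode throughput and the gap between two runs; it is not necessarily how the figures above were measured.

```python
def decode_tok_per_s(completion_tokens: int, first_token_time: float, end_time: float) -> float:
    """Tokens per second during generation, measured from the first token (times in seconds)."""
    return completion_tokens / (end_time - first_token_time)


def relative_gap(a: float, b: float) -> float:
    """Difference between two throughput figures as a fraction of the larger one."""
    return abs(a - b) / max(a, b)


print(f"gap between profiles: {relative_gap(77.4, 75.7):.1%}")  # prints 2.2%
```

Measuring from the first token keeps prompt processing (the TTFT part) out of the generation figure.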





    Key parameters explained

    • --max-bytes 12884901888 (12 GB): Critical. Without this limit the model's KV cache (MoE architecture with ArraysCache) grows unchecked until it exhausts RAM on contexts >30k tokens
    • --prompt-cache-size 3 (3 LRU entries): Limits how many conversations the prefix cache keeps in memory
    • --context-length 65536 (64k tokens): Maximum context window per request
    • --temperature 0.7: Balance between creativity and coherence
    • --repetition-penalty 1.05: Reduces repetitions in long responses
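The --max-bytes value looks arbitrary but is just 12 × 1024³. If you want a different cap, a throwaway helper like this one (illustrative name) gives you the number to paste:

```python
GIB = 1024 ** 3  # bytes per GiB


def gib_to_bytes(gib: float) -> int:
    """Convert a size in GiB to the byte count expected by --max-bytes."""
    return int(gib * GIB)


print(gib_to_bytes(12))  # prints 12884901888
```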





    Troubleshooting

    The server disconnects after 30,000 tokens

    This is a known issue with the Qwen3.6-35B-A3B model, caused by how the KV cache of its hybrid MoE architecture grows when it is not capped. The fix is to make sure you pass --max-bytes 12884901888. With this parameter the server has been verified to work correctly past 60,000 tokens.





    Architecture notes (for the curious)

    Qwen3.6-35B-A3B is a hybrid MoE (Mixture of Experts) model. Instead of activating all 35B parameters for every token, it only activates a subset of "experts" (the A3B suffix indicates roughly 3B active parameters), which makes it fast for its size. The 4-bit version quantizes the weights to 4 bits, reducing RAM usage from ~70 GB to ~20 GB with minimal quality loss.
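The ~70 GB and ~20 GB figures follow from simple arithmetic, sketched below. It assumes the unquantized weights are stored at 16 bits each. The raw 4-bit result comes out a little under 20 GB; the remainder is presumably quantization metadata and any layers kept at higher precision, which is my assumption rather than something measured.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (10^9 bytes), ignoring quantization metadata."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 35e9  # total parameter count implied by the model name

print(f"16-bit: {weight_gb(PARAMS, 16):.1f} GB")  # prints 70.0 GB
print(f" 4-bit: {weight_gb(PARAMS, 4):.1f} GB")   # prints 17.5 GB
```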


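    MLX leverages Apple Silicon's unified memory: the GPU and CPU share the same RAM pool, eliminating the transfer bottleneck that exists in systems with a dedicated GPU. That's why a Mac with 48 GB can keep the ~20 GB of weights and the KV cache (about 32 GB in total) entirely in GPU-addressable memory, something that on a PC would call for a GPU with comparable VRAM. An 80 GB card would only be needed for the unquantized ~70 GB weights.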




