# How to Run Qwen3.6-35B on Your Mac at 77 tok/s

**MyrinNew** · 05-05-2026, 05:24 PM

Level: intermediate

Estimated time: 20-40 minutes (most of it is the model download)

Minimum requirements: Mac with Apple Silicon (M1/M2/M3/M4) and 48 GB of unified RAM

What are we setting up?

A local server compatible with the OpenAI API that runs the Qwen3.6-35B-A3B model (quantized to 4 bits) using MLX, Apple's machine-learning framework for Apple Silicon. When you're done, you'll have an endpoint at http://127.0.0.1:7979 that any OpenAI-compatible client (OpenCode, Continue, Cursor, etc.) can point at.

| Metric | Value |
|---|---|
| Generation throughput | ~77 tok/s |
| TTFT (time-to-first-token) | ~0.25 s |
| Context window | 65 536 – 131 072 tokens |
| RAM required | ~20 GB model + ~12 GB KV cache |

Prerequisites

Hardware

Mac with Apple Silicon chip (M1 Pro/Max/Ultra or M2/M3/M4 equivalents)
Minimum 48 GB of unified RAM (the quantized model takes ~20 GB; the KV cache needs up to 12 GB more)
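
To see why 48 GB is the suggested floor, here is a rough memory budget built from the figures above. The 8 GB reserve for macOS and other apps is an assumption for illustration, not a measured value.

```python
# Rough memory budget. Model and KV cache figures come from this guide;
# OS_RESERVE_GB is an assumed placeholder, adjust it to your own usage.
MODEL_GB = 20       # quantized weights (approximate)
KV_CACHE_GB = 12    # upper bound on the KV cache
TOTAL_RAM_GB = 48   # minimum unified memory recommended here
OS_RESERVE_GB = 8   # assumption: allowance for macOS and other apps


def headroom_gb(total, model, kv_cache, reserve):
    """Return the unified memory left after model, KV cache and OS reserve."""
    return total - (model + kv_cache + reserve)


print(headroom_gb(TOTAL_RAM_GB, MODEL_GB, KV_CACHE_GB, OS_RESERVE_GB))
```

A negative result (for example with 32 GB of RAM) means the machine would likely start swapping under long contexts.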

Software

# Check Python version (you need 3.11+)
python3 --version

# Check that you have git
git --version

If you don't have Python 3.11, install it with Homebrew:

brew install python@3.11
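
If you prefer to check the version from Python itself (handy inside scripts), a minimal sketch could look like this. The helper name `meets_minimum` is made up for this example.

```python
import sys

MIN_VERSION = (3, 11)  # minimum version required by this guide


def meets_minimum(version_info=None, minimum=MIN_VERSION):
    """Return True if the given (or running) interpreter is new enough."""
    current = tuple((version_info or sys.version_info)[:2])
    return current >= minimum


print("Python OK" if meets_minimum() else "Python too old, need 3.11+")
```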

Step 1 — Create the virtual environment

From the folder where you want to install everything:

mkdir mlx-server && cd mlx-server
python3.11 -m venv .venv
source .venv/bin/activate

Step 2 — Install dependencies

pip install --upgrade pip

# MLX and the OpenAI API-compatible server
pip install mlx-lm
pip install mlx-openai-server

Verify the installation:

mlx-openai-server --help
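
You can also confirm that both packages landed in the active virtual environment and see which versions you got. This sketch only uses the standard library; the distribution names are the ones passed to pip above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


for name in ("mlx-lm", "mlx-openai-server"):
    print(name, installed_version(name) or "NOT INSTALLED")
```

If either line says NOT INSTALLED, check that the venv is activated before running pip again.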

Step 3 — Download the model

The model is automatically downloaded from Hugging Face the first time you run it. It takes approximately 20 GB of disk space.

# Optional pre-download (recommended to track progress)
python3 -c "
from mlx_lm import load
model, tokenizer = load('mlx-community/Qwen3.6-35B-A3B-4bit')
print('Model downloaded successfully')
"

Note: This model's repository is not gated, so no huggingface.co account is required. For gated repositories you would need an account and would have to accept the model's terms before downloading.
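
Before starting a ~20 GB download it may be worth checking free disk space. The sketch below assumes the default Hugging Face cache location (`~/.cache/huggingface`, i.e. `HF_HOME` is not set); if you moved the cache, change `HF_CACHE` accordingly.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 20  # approximate download size quoted above
HF_CACHE = "~/.cache/huggingface"  # assumption: default cache, HF_HOME not set


def nearest_existing(path):
    """Walk up from path until a directory that actually exists is found."""
    p = Path(path).expanduser().resolve()
    while not p.exists():
        p = p.parent
    return p


def free_gb(path):
    """Free space (decimal GB) on the volume that holds path."""
    return shutil.disk_usage(nearest_existing(path)).free / 1e9


def has_room(path, required_gb=REQUIRED_GB):
    """Return True if the volume has at least required_gb free."""
    return free_gb(path) >= required_gb


print(f"{free_gb(HF_CACHE):.1f} GB free, enough for the model: {has_room(HF_CACHE)}")
```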

Step 4 — Start the server

Option A — Direct command (simpler)

mlx-openai-server launch \
  --model-path mlx-community/Qwen3.6-35B-A3B-4bit \
  --model-type lm \
  --host 127.0.0.1 \
  --port 7979 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3_5 \
  --enable-auto-tool-choice \
  --context-length 65536 \
  --temperature 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.0 \
  --repetition-penalty 1.05 \
  --max-bytes 12884901888 \
  --prompt-cache-size 3 \
  --log-level INFO

Option B — Startup script (recommended)

Save the following script as start-mlx-server.sh:

#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="$SCRIPT_DIR/.venv"

# Default profile: high_context
# Change with: MLX_PROFILE=baseline ./start-mlx-server.sh
PROFILE="${MLX_PROFILE:-high_context}"

MODEL_PATH="mlx-community/Qwen3.6-35B-A3B-4bit"
HOST="127.0.0.1"
PORT="7979"

TOOL_CALL_PARSER="qwen3_coder"
REASONING_PARSER="qwen3_5"

TEMPERATURE="0.7"
TOP_P="0.8"
TOP_K="20"
MIN_P="0.0"
REPETITION_PENALTY="1.05"
MAX_CACHE_BYTES="12884901888" # 12 GB

DRAFT_MODEL="mlx-community/Qwen3.5-0.8B-MLX-4bit"
NUM_DRAFT_TOKENS="${MLX_NUM_DRAFT_TOKENS:-4}"

case "$PROFILE" in
  baseline)
    CONTEXT_LENGTH="65536"
    PROMPT_CACHE_SIZE="3"
    EXTRA_ARGS=""
    ;;
  high_context)
    CONTEXT_LENGTH="131072"
    PROMPT_CACHE_SIZE="5"
    EXTRA_ARGS=""
    ;;
  speculative)
    CONTEXT_LENGTH="65536"
    PROMPT_CACHE_SIZE="3"
    EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
    ;;
  speculative_high)
    CONTEXT_LENGTH="131072"
    PROMPT_CACHE_SIZE="5"
    EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
    ;;
  *)
    echo "Unknown PROFILE: $PROFILE"
    echo "Options: baseline, high_context, speculative, speculative_high"
    exit 1
    ;;
esac

exec "$VENV/bin/mlx-openai-server" launch \
  --model-path "$MODEL_PATH" \
  --model-type lm \
  --host "$HOST" \
  --port "$PORT" \
  --tool-call-parser "$TOOL_CALL_PARSER" \
  --reasoning-parser "$REASONING_PARSER" \
  --enable-auto-tool-choice \
  --context-length "$CONTEXT_LENGTH" \
  --temperature "$TEMPERATURE" \
  --top-p "$TOP_P" \
  --top-k "$TOP_K" \
  --min-p "$MIN_P" \
  --repetition-penalty "$REPETITION_PENALTY" \
  --max-bytes "$MAX_CACHE_BYTES" \
  --prompt-cache-size "$PROMPT_CACHE_SIZE" \
  --log-level INFO \
  $EXTRA_ARGS

Then make the script executable and run it:

chmod +x start-mlx-server.sh
./start-mlx-server.sh

Usage examples:

./start-mlx-server.sh # high_context (default)
MLX_PROFILE=baseline ./start-mlx-server.sh # maximum throughput
MLX_PROFILE=speculative ./start-mlx-server.sh # speculative decoding
MLX_PROFILE=speculative MLX_NUM_DRAFT_TOKENS=6 ./start-mlx-server.sh
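
For anyone who would rather drive the server from Python, the profile logic of the script can be mirrored roughly as follows. This is only a sketch: the `PROFILES` dict and `profile_args` helper are names invented here, while the flags and values are copied from the script.

```python
# Profile table mirroring the case statement in start-mlx-server.sh.
PROFILES = {
    "baseline": {"context_length": 65536, "prompt_cache_size": 3, "speculative": False},
    "high_context": {"context_length": 131072, "prompt_cache_size": 5, "speculative": False},
    "speculative": {"context_length": 65536, "prompt_cache_size": 3, "speculative": True},
    "speculative_high": {"context_length": 131072, "prompt_cache_size": 5, "speculative": True},
}
DRAFT_MODEL = "mlx-community/Qwen3.5-0.8B-MLX-4bit"


def profile_args(name, num_draft_tokens=4):
    """Return the profile-specific CLI arguments as a list of strings."""
    if name not in PROFILES:
        raise ValueError(f"Unknown profile: {name}. Options: {', '.join(PROFILES)}")
    profile = PROFILES[name]
    args = [
        "--context-length", str(profile["context_length"]),
        "--prompt-cache-size", str(profile["prompt_cache_size"]),
    ]
    if profile["speculative"]:
        # Speculative profiles add the draft model and its token budget.
        args += ["--draft-model-path", DRAFT_MODEL,
                 "--num-draft-tokens", str(num_draft_tokens)]
    return args


print(profile_args("speculative", num_draft_tokens=6))
```

Returning a list rather than a single string avoids shell word-splitting issues if the result is later passed to `subprocess.run`.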

Step 5 — Verify it works

In another terminal, send a test request:

curl http://127.0.0.1:7979/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3.6-35B-A3B-4bit",
    "messages": [{"role": "user", "content": "Hello, what is 2+2?"}],
    "max_tokens": 100
  }'

You should see a JSON response with the choices[0].message.content field.
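
The same check can be done from Python with nothing but the standard library. This is a minimal sketch of the request above; the helper names are made up, and it assumes the response follows the usual chat-completions shape described in the previous sentence.

```python
import json
import urllib.request

URL = "http://127.0.0.1:7979/v1/chat/completions"
MODEL = "mlx-community/Qwen3.6-35B-A3B-4bit"


def build_payload(prompt, max_tokens=100):
    """Build the same JSON body as the curl example."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response):
    """Pull choices[0].message.content out of a decoded response."""
    return response["choices"][0]["message"]["content"]


def ask(prompt, timeout=120):
    """Send one prompt to the local server and return the reply text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(ask("Hello, what is 2+2?"))` should print the model's answer; a connection error usually means the server has not finished loading yet.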

Stopping the server

pkill -f mlx-openai-server

Or save the following as stop-mlx-server.sh and run it:

#!/usr/bin/env bash
pkill -f mlx-openai-server && echo "Server stopped."

Connect with your favorite client

The server exposes an OpenAI-compatible API. Just point your client's base_url at the local server.

OpenCode

Create or edit the opencode.json file in the root of your project:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "mlx-local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "MLX Local (Qwen3.6-35B)",
      "options": {
        "baseURL": "http://127.0.0.1:7979/v1"
      },
      "models": {
        "mlx-community/Qwen3.6-35B-A3B-4bit": {
          "name": "Qwen3.6-35B-A3B-4bit (local MLX)",
          "limit": {
            "context": 65536,
            "output": 32768
          }
        }
      }
    }
  }
}

Continue / Cursor

Base URL: http://127.0.0.1:7979/v1
API Key: any-value (the server does not validate it)
Model: mlx-community/Qwen3.6-35B-A3B-4bit

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:7979/v1",
    api_key="local"
)

response = client.chat.completions.create(
    model="mlx-community/Qwen3.6-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Explain what a transformer is"}]
)
print(response.choices[0].message.content)
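
To sanity-check the throughput figure on your own machine, you could time a request and divide by the number of generated tokens. This is a rough sketch: it assumes the server fills in the `usage` block the way the OpenAI API does, and because the timer includes prompt processing, the result will come out a little lower than pure generation speed.

```python
import time

MODEL = "mlx-community/Qwen3.6-35B-A3B-4bit"


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Average generation rate over the whole request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure(client, prompt, max_tokens=256):
    """Time one chat completion on an OpenAI SDK client and return tok/s."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    # Assumption: usage.completion_tokens is populated by the local server.
    return tokens_per_second(response.usage.completion_tokens, elapsed)
```

Using the `client` object from the example above, `print(measure(client, "Write a short story about a lighthouse"))` gives a single-run estimate; averaging a few runs smooths out the noise.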

Configuration profiles

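| Profile | Context length (tokens) | Prompt cache | Throughput (tok/s) | Best for |
|---|---|---|---|---|
| baseline | 65 536 | 3 entries | 77.4 | Maximum throughput |
| high_context | 131 072 | 5 entries | 75.7 | Long documents, extended contexts (default) |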

The performance difference between the two profiles (~2%) is within the noise margin. Use high_context if you work with large files or very long conversations.
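
The ~2% figure follows directly from the two throughput numbers; a quick check:

```python
def relative_drop(fast, slow):
    """Percentage by which `slow` falls short of `fast`."""
    return (fast - slow) / fast * 100


# baseline vs high_context throughput in tok/s, from the profile table
print(f"{relative_drop(77.4, 75.7):.1f}%")
```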

Key parameters explained

| Parameter | Value | What it does |
|---|---|---|
| --max-bytes | 12884901888 (12 GB) | Critical. Without this limit the model's KV cache (MoE architecture with ArraysCache) grows unchecked until it exhausts RAM on contexts >30k tokens |
| --prompt-cache-size | 3 (LRU entries) | Limits how many conversations the prefix cache keeps in memory |
| --context-length | 65536 (64k tokens) | Maximum context window per request |
| --temperature | 0.7 | Balance between creativity and coherence |
| --repetition-penalty | 1.05 | Reduces repetitions in long responses |
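
The odd-looking --max-bytes value is just 12 gigabytes expressed in binary units (12 × 1024³). If you want a different cap, a small helper like this (the name is made up for the example) produces the number to pass:

```python
def gib_to_bytes(gib):
    """Convert binary gigabytes (GiB) to bytes."""
    return gib * 1024 ** 3


print(gib_to_bytes(12))  # value used for --max-bytes in this guide
print(gib_to_bytes(8))   # a smaller cap, e.g. for tighter memory budgets
```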

Troubleshooting

The server disconnects after 30,000 tokens

This was a known bug with the Qwen3.6-35B-A3B model due to its hybrid MoE architecture. The fix is to make sure you pass --max-bytes 12884901888. With this parameter the server works correctly up to 60,000+ tokens (verified).

Architecture notes (for the curious)

Qwen3.6-35B-A3B is a hybrid MoE (Mixture of Experts) model. Instead of activating all parameters per token, it only activates a subset of "experts" (the "A3B" in the name refers to roughly 3 billion active parameters), making it efficient for its size. The 4-bit version quantizes the weights to 4 bits, reducing RAM usage from ~70 GB (16-bit weights) to ~20 GB with minimal quality loss.
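
The memory figures can be reproduced with back-of-the-envelope arithmetic. This sketch assumes 35 billion total parameters (taken from the model name) and counts raw weights only; the gap between the 17.5 GB it gives for 4 bits and the ~20 GB quoted above is presumably quantization metadata and layers kept at higher precision, though that breakdown is my assumption.

```python
TOTAL_PARAMS = 35e9  # assumption: total parameter count implied by "35B"


def weight_gb(params, bits_per_weight):
    """Raw weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


print(weight_gb(TOTAL_PARAMS, 16))  # 16-bit weights
print(weight_gb(TOTAL_PARAMS, 4))   # 4-bit weights, before any overhead
```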

MLX leverages Apple Silicon's unified memory: the GPU and CPU share the same RAM pool, eliminating the transfer bottleneck that exists in systems with a dedicated GPU. That's why a Mac with 48 GB can keep the ~20 GB of weights plus the KV cache entirely in GPU-accessible memory, whereas on a PC the same setup would need roughly 32 GB of VRAM across one or more dedicated GPUs.
