
MLX Models

Run local LLMs on Apple Silicon with Basepod.

macOS Only

MLX is only available on macOS with Apple Silicon. On a Linux VPS (Hetzner, DigitalOcean, etc.), Basepod still works for app hosting, but the LLM features are disabled.

Overview

Basepod includes built-in support for MLX, Apple's machine learning framework optimized for Apple Silicon. Run powerful language models locally with an OpenAI-compatible API.

Requirements

  • Mac with Apple Silicon (M series)
  • macOS 13 Ventura or later
  • 8GB+ RAM (16GB+ recommended for larger models)

Available Models

Chat Models

| Model | Size | RAM Required | Description |
| --- | --- | --- | --- |
| Llama 3.2 1B | 0.7GB | 4GB | Ultra-fast, great for quick tasks |
| Llama 3.2 3B | 2GB | 8GB | Good balance of speed and quality |
| Phi-4 | 8GB | 16GB | Microsoft's latest, strong reasoning |
| Qwen 2.5 7B | 4GB | 12GB | Excellent multilingual support |

Code Models

| Model | Size | RAM Required | Description |
| --- | --- | --- | --- |
| Qwen 2.5 Coder 7B | 4GB | 12GB | Specialized for code generation |
| DeepSeek Coder 6.7B | 4GB | 12GB | Strong coding capabilities |
| CodeLlama 7B | 4GB | 12GB | Meta's code-focused model |

Reasoning Models

| Model | Size | RAM Required | Description |
| --- | --- | --- | --- |
| DeepSeek R1 1.5B | 1GB | 6GB | Chain-of-thought reasoning |
| DeepSeek R1 7B | 4GB | 12GB | Advanced reasoning |
| DeepSeek R1 14B | 8GB | 20GB | Best reasoning quality |
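
The RAM Required column is noticeably larger than the download size: the model needs working memory on top of its weights, and macOS and your other apps need room of their own. As a rough preflight check, the sketch below compares your Mac's physical memory against a few of the requirements above. It assumes only standard macOS tooling (hw.memsize is a built-in sysctl key); the model identifiers other than llama-3.2-1b and llama-3.2-3b are assumed to follow the same CLI naming pattern and are illustrative.

python
import subprocess

# Total physical memory on macOS, in bytes (hw.memsize is a standard sysctl key).
mem_bytes = int(subprocess.run(
    ["sysctl", "-n", "hw.memsize"],
    capture_output=True, text=True, check=True,
).stdout.strip())
mem_gb = mem_bytes / (1024 ** 3)

# RAM requirements copied from the tables above (GB of total system memory).
requirements = {
    "llama-3.2-1b": 4,
    "llama-3.2-3b": 8,
    "qwen-2.5-coder-7b": 12,
    "phi-4": 16,
    "deepseek-r1-14b": 20,
}

for model, required_gb in requirements.items():
    status = "fits" if mem_gb >= required_gb else "too large"
    print(f"{model}: needs {required_gb}GB, have {mem_gb:.0f}GB -> {status}")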

Starting a Model

Via Web UI

  1. Open the Basepod dashboard
  2. Go to LLMs page
  3. Click Download on your chosen model
  4. Click Start once downloaded

Via CLI

bash
# List available models
bp llm list

# Download a model
bp llm download llama-3.2-3b

# Start a model
bp llm start llama-3.2-3b

Using the API

Once a model is running, access it via the OpenAI-compatible API.

Chat Completion

bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

With Python

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)

With JavaScript

javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'not-needed'
});

const response = await client.chat.completions.create({
  model: 'mlx-community/Llama-3.2-3B-Instruct-4bit',
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(response.choices[0].message.content);
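
Streaming

The built-in chat interface (see below) streams responses token by token. If the API endpoint honors the standard OpenAI stream flag, which is an assumption rather than something these docs state, you can do the same from your own code:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# stream=True yields chunks as they are generated instead of one final message.
# Assumes the Basepod endpoint implements the standard OpenAI streaming protocol.
stream = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()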

Exposing Externally

To access the LLM API from other devices:

  1. Configure Caddy to proxy the LLM endpoint
  2. Add your base domain to ~/.basepod/config/basepod.yaml:

yaml
domain:
  base: example.com

  3. The API will then be available at https://llm.example.com
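
Once the proxy is in place, other devices use the same OpenAI-compatible API over HTTPS. The only change from the local examples is the base URL; this sketch assumes Caddy forwards the /v1 path unchanged and that llm.example.com stands in for whatever subdomain your config exposes.

python
from openai import OpenAI

# Same client as the local examples, pointed at the proxied HTTPS endpoint.
# If you put authentication in front of the proxy, supply a real key here.
client = OpenAI(base_url="https://llm.example.com/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello from another device!"}],
)
print(response.choices[0].message.content)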

Chat Interface

Basepod includes a built-in chat interface at /chat in the dashboard. Features:

  • Real-time streaming responses
  • Conversation history
  • Model switching
  • Clear chat
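
Model switching in the chat UI depends on the server knowing which models are available. If the OpenAI-compatible endpoint also exposes the standard /v1/models route (an assumption; it is not documented above), you can list models programmatically:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# Standard OpenAI "list models" call; assumes Basepod's compatible server implements it.
for model in client.models.list():
    print(model.id)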

Performance Tips

Memory Management

  • Close unused applications when running large models
  • Models are automatically unloaded when stopped
  • Check Activity Monitor for memory usage

Model Selection

  • For quick tasks: Llama 3.2 1B or 3B
  • For coding: Qwen Coder or DeepSeek Coder
  • For complex reasoning: DeepSeek R1
  • For general use: Phi-4 or Qwen 2.5 7B

Quantization

All models use 4-bit quantization by default, which keeps memory usage low while preserving most of the model's output quality.
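
A quick sanity check of the download sizes above: 4-bit weights take roughly half a byte per parameter, so a model with about 3 billion parameters should come in around 1.5GB before overhead, which lines up with the 2GB figure in the chat models table. The parameter counts below are approximate and only illustrative.

python
# Rough size estimate for 4-bit quantized weights: ~0.5 bytes per parameter.
# Real downloads are somewhat larger (embeddings, tokenizer, metadata).
def approx_size_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    bytes_per_param = bits_per_weight / 8
    return params_billions * 1e9 * bytes_per_param / (1024 ** 3)

for name, params in [("Llama 3.2 1B", 1.2), ("Llama 3.2 3B", 3.2), ("Qwen 2.5 7B", 7.6)]:
    print(f"{name}: ~{approx_size_gb(params):.1f}GB of weights")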

Troubleshooting

Model Won't Start

bash
# Check available memory
vm_stat | grep "Pages free"

# Try a smaller model
bp llm start llama-3.2-1b

Slow Responses

  • Ensure no other heavy processes are running
  • Try a smaller model
  • Check for thermal throttling (Activity Monitor > CPU)

Download Fails

bash
# Clear cache and retry
rm -rf ~/.cache/huggingface
bp llm download <model>
