
MLX Models

Run local LLMs on Apple Silicon with Basepod.

macOS Only

MLX is only available on macOS with Apple Silicon. On a Linux VPS (Hetzner, DigitalOcean, etc.), Basepod still works for app hosting, but the LLM features are disabled.

Overview

Basepod includes built-in support for MLX, Apple's machine learning framework optimized for Apple Silicon. Run powerful language models locally with an OpenAI-compatible API.

Requirements

  • Mac with Apple Silicon (M series)
  • macOS 13 Ventura or later
  • 8GB+ RAM (16GB+ recommended for larger models)

Available Models

Chat Models

| Model | Size | RAM Required | Description |
| --- | --- | --- | --- |
| Llama 3.2 1B | 0.7GB | 4GB | Ultra-fast, great for quick tasks |
| Llama 3.2 3B | 2GB | 8GB | Good balance of speed and quality |
| Phi-4 | 8GB | 16GB | Microsoft's latest, strong reasoning |
| Qwen 2.5 7B | 4GB | 12GB | Excellent multilingual support |

Code Models

| Model | Size | RAM Required | Description |
| --- | --- | --- | --- |
| Qwen 2.5 Coder 7B | 4GB | 12GB | Specialized for code generation |
| DeepSeek Coder 6.7B | 4GB | 12GB | Strong coding capabilities |
| CodeLlama 7B | 4GB | 12GB | Meta's code-focused model |

Reasoning Models

| Model | Size | RAM Required | Description |
| --- | --- | --- | --- |
| DeepSeek R1 1.5B | 1GB | 6GB | Chain-of-thought reasoning |
| DeepSeek R1 7B | 4GB | 12GB | Advanced reasoning |
| DeepSeek R1 14B | 8GB | 20GB | Best reasoning quality |
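
The RAM Required column is noticeably larger than the download size: the model needs working memory on top of its weights, and macOS and your other apps need room of their own. As a rough preflight check, the sketch below compares your Mac's physical memory against a few of the requirements above. It assumes only standard macOS tooling (hw.memsize is a built-in sysctl key); the model identifiers other than llama-3.2-1b and llama-3.2-3b are assumed to follow the same CLI naming pattern and are illustrative.

python
import subprocess

# Total physical memory on macOS, in bytes (hw.memsize is a standard sysctl key).
mem_bytes = int(subprocess.run(
    ["sysctl", "-n", "hw.memsize"],
    capture_output=True, text=True, check=True,
).stdout.strip())
mem_gb = mem_bytes / (1024 ** 3)

# RAM requirements copied from the tables above (GB of total system memory).
requirements = {
    "llama-3.2-1b": 4,
    "llama-3.2-3b": 8,
    "qwen-2.5-coder-7b": 12,
    "phi-4": 16,
    "deepseek-r1-14b": 20,
}

for model, required_gb in requirements.items():
    status = "fits" if mem_gb >= required_gb else "too large"
    print(f"{model}: needs {required_gb}GB, have {mem_gb:.0f}GB -> {status}")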

Starting a Model

Via Web UI

  1. Open the Basepod dashboard
  2. Go to LLMs page
  3. Click Download on your chosen model
  4. Click Start once downloaded

Via CLI

bash
# List available models
bp llm list

# Download a model
bp llm download llama-3.2-3b

# Start a model
bp llm start llama-3.2-3b

Using the API

Once a model is running, access it via the OpenAI-compatible API.

Chat Completion

bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

With Python

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)

With JavaScript

javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'not-needed'
});

const response = await client.chat.completions.create({
  model: 'mlx-community/Llama-3.2-3B-Instruct-4bit',
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(response.choices[0].message.content);
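
Streaming

The built-in chat interface (see below) streams responses token by token. If the API endpoint honors the standard OpenAI stream flag, which is an assumption rather than something these docs state, you can do the same from your own code:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# stream=True yields chunks as they are generated instead of one final message.
# Assumes the Basepod endpoint implements the standard OpenAI streaming protocol.
stream = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()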

Exposing Externally

To access the LLM API from other devices:

  1. Configure Caddy to proxy the LLM endpoint
  2. Add your base domain to ~/.basepod/config/basepod.yaml:

yaml
domain:
  base: example.com

  3. The API will then be available at https://llm.example.com
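
Once the proxy is in place, other devices use the same OpenAI-compatible API over HTTPS. The only change from the local examples is the base URL; this sketch assumes Caddy forwards the /v1 path unchanged and that llm.example.com stands in for whatever subdomain your config exposes.

python
from openai import OpenAI

# Same client as the local examples, pointed at the proxied HTTPS endpoint.
# If you put authentication in front of the proxy, supply a real key here.
client = OpenAI(base_url="https://llm.example.com/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello from another device!"}],
)
print(response.choices[0].message.content)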

Chat Interface

Basepod includes a built-in chat interface at /chat in the dashboard. Features:

  • Real-time streaming responses
  • Conversation history
  • Model switching
  • Clear chat
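
Model switching in the chat UI depends on the server knowing which models are available. If the OpenAI-compatible endpoint also exposes the standard /v1/models route (an assumption; it is not documented above), you can list models programmatically:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# Standard OpenAI "list models" call; assumes Basepod's compatible server implements it.
for model in client.models.list():
    print(model.id)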

Performance Tips

Memory Management

  • Close unused applications when running large models
  • Models are automatically unloaded when stopped
  • Check Activity Monitor for memory usage

Model Selection

  • For quick tasks: Llama 3.2 1B or 3B
  • For coding: Qwen Coder or DeepSeek Coder
  • For complex reasoning: DeepSeek R1
  • For general use: Phi-4 or Qwen 2.5 7B

Quantization

All models use 4-bit quantization by default, which keeps memory usage low while preserving most of the model's output quality.
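
A quick sanity check of the download sizes above: 4-bit weights take roughly half a byte per parameter, so a model with about 3 billion parameters should come in around 1.5GB before overhead, which lines up with the 2GB figure in the chat models table. The parameter counts below are approximate and only illustrative.

python
# Rough size estimate for 4-bit quantized weights: ~0.5 bytes per parameter.
# Real downloads are somewhat larger (embeddings, tokenizer, metadata).
def approx_size_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    bytes_per_param = bits_per_weight / 8
    return params_billions * 1e9 * bytes_per_param / (1024 ** 3)

for name, params in [("Llama 3.2 1B", 1.2), ("Llama 3.2 3B", 3.2), ("Qwen 2.5 7B", 7.6)]:
    print(f"{name}: ~{approx_size_gb(params):.1f}GB of weights")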

Troubleshooting

Model Won't Start

bash
# Check available memory
vm_stat | grep "Pages free"

# Try a smaller model
bp llm start llama-3.2-1b

Slow Responses

  • Ensure no other heavy processes are running
  • Try a smaller model
  • Check for thermal throttling (Activity Monitor > CPU)

Download Fails

bash
# Clear cache and retry
rm -rf ~/.cache/huggingface
bp llm download <model>
