MLX Models
Run local LLMs on Apple Silicon with Basepod.
macOS Only
MLX is only available on macOS with Apple Silicon. On a Linux VPS (Hetzner, DigitalOcean, etc.), Basepod still works for app hosting, but LLM features are disabled.
Overview
Basepod includes built-in support for MLX, Apple's machine learning framework optimized for Apple Silicon. Run powerful language models locally and access them through an OpenAI-compatible API.
Requirements
- Mac with Apple Silicon (M series)
- macOS 13 Ventura or later
- 8GB+ RAM (16GB+ recommended for larger models)
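If you're unsure whether a machine qualifies, here is a quick check (a generic Python sketch, not a Basepod command):

```python
import platform

# Prints "arm64" on Apple Silicon with a native Python build;
# Intel Macs report "x86_64".
print(platform.machine())
```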
Available Models
Chat Models
| Model | Size | RAM Required | Description |
|---|---|---|---|
| Llama 3.2 1B | 0.7GB | 4GB | Ultra-fast, great for quick tasks |
| Llama 3.2 3B | 2GB | 8GB | Good balance of speed and quality |
| Phi-4 | 8GB | 16GB | Microsoft's latest, strong reasoning |
| Qwen 2.5 7B | 4GB | 12GB | Excellent multilingual support |
Code Models
| Model | Size | RAM Required | Description |
|---|---|---|---|
| Qwen 2.5 Coder 7B | 4GB | 12GB | Specialized for code generation |
| DeepSeek Coder 6.7B | 4GB | 12GB | Strong coding capabilities |
| CodeLlama 7B | 4GB | 12GB | Meta's code-focused model |
Reasoning Models
| Model | Size | RAM Required | Description |
|---|---|---|---|
| DeepSeek R1 1.5B | 1GB | 6GB | Chain-of-thought reasoning |
| DeepSeek R1 7B | 4GB | 12GB | Advanced reasoning |
| DeepSeek R1 14B | 8GB | 20GB | Best reasoning quality |
Starting a Model
Via Web UI
- Open the Basepod dashboard
- Go to the LLMs page
- Click Download on your chosen model
- Click Start once downloaded
Via CLI
```bash
# List available models
bp llm list
# Download a model
bp llm download llama-3.2-3b
# Start a model
bp llm start llama-3.2-3b
```

Using the API
Once a model is running, access it via the OpenAI-compatible API.
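Before sending requests, you can check which models the server is currently exposing. A minimal sketch with the OpenAI Python client, assuming the server implements the standard /v1/models route:

```python
from openai import OpenAI

# Point the client at the local Basepod MLX server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# List the models the server is currently serving.
for model in client.models.list():
    print(model.id)
```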
Chat Completion
```bash
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```

With Python
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="mlx-community/Llama-3.2-3B-Instruct-4bit",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```

With JavaScript
```javascript
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'not-needed'
});
const response = await client.chat.completions.create({
model: 'mlx-community/Llama-3.2-3B-Instruct-4bit',
messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);
```

Exposing Externally
To access the LLM API from other devices:
- Configure Caddy to proxy the LLM endpoint
- Add to your ~/.basepod/config/basepod.yaml:
```yaml
domain:
  base: example.com
```
- The API will be available at https://llm.example.com
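Remote clients can then point the OpenAI SDK at the public hostname instead of localhost. A sketch in Python, reusing the earlier example and assuming the proxy forwards the same /v1 paths as the local server:

```python
from openai import OpenAI

# Same client as before, but using the public hostname proxied by Caddy.
client = OpenAI(
    base_url="https://llm.example.com/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello from another machine!"}]
)
print(response.choices[0].message.content)
```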
Chat Interface
Basepod includes a built-in chat interface at /chat in the dashboard. Features:
- Real-time streaming responses
- Conversation history
- Model switching
- Clear chat
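The same streaming behavior is available over the API. A minimal sketch with the OpenAI Python client, assuming the local endpoint and model name from the examples above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

# Request a streamed response and print tokens as they arrive.
stream = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```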
Performance Tips
Memory Management
- Close unused applications when running large models
- Models are automatically unloaded when stopped
- Check Activity Monitor for memory usage
Model Selection
- For quick tasks: Llama 3.2 1B or 3B
- For coding: Qwen Coder or DeepSeek Coder
- For complex reasoning: DeepSeek R1
- For general use: Phi-4 or Qwen 2.5 7B
Quantization
All models use 4-bit quantization by default, which cuts memory use to roughly a quarter of the full-precision weights with little loss in output quality.
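As a rough rule of thumb, 4-bit weights take about half a byte per parameter, plus overhead for embeddings, the KV cache, and the runtime. A back-of-envelope sketch; the numbers are illustrative assumptions, not Basepod figures:

```python
def estimate_weights_gb(params_billion: float, bytes_per_param: float = 0.5) -> float:
    """Approximate size of 4-bit quantized weights, ignoring runtime overhead."""
    return params_billion * bytes_per_param

# Weights alone for a 3B model land around 1.5GB; the table's 2GB figure is
# higher because downloads include embeddings and other overhead (assumed).
print(f"3B model:  ~{estimate_weights_gb(3):.1f} GB of weights")
print(f"7B model:  ~{estimate_weights_gb(7):.1f} GB of weights")
```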
Troubleshooting
Model Won't Start
```bash
# Check available memory
vm_stat | grep "Pages free"
# Try a smaller model
bp llm start llama-3.2-1b
```

Slow Responses
- Ensure no other heavy processes are running
- Try a smaller model
- Check for thermal throttling (Activity Monitor > CPU)
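To put a number on it, time a completion and compute tokens per second. A small sketch assuming the local endpoint and model from the examples above, and that the server reports token usage in the response:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

start = time.time()
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Explain MLX in one paragraph."}],
)
elapsed = time.time() - start

# Some servers omit usage; fall back to a rough word count.
tokens = (response.usage.completion_tokens if response.usage
          else len(response.choices[0].message.content.split()))
print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")
```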
Download Fails
```bash
# Clear the Hugging Face cache (removes all downloaded models) and retry
rm -rf ~/.cache/huggingface
bp llm download <model>
```