Local AI Engine

About this Plugin

Local AI Engine is a powerful BinaryCMS plugin that transforms your server into a private, self-hosted AI inference platform. It enables you to download, manage, and run open-source Large Language Models (LLMs) directly on your own hardware — no cloud API keys, no per-token fees, and complete data privacy.

Built on Mozilla’s llamafile technology, the plugin automatically detects your system’s hardware capabilities (CPU, GPU, RAM, AVX2) and configures the inference engine for optimal performance. Models are sourced directly from Hugging Face, the world’s largest open-source AI model repository, with a built-in search and one-click download system.

Once a model is running, the plugin exposes an OpenAI-compatible API endpoint on localhost, which seamlessly integrates with BinaryCMS’s AI Chat plugin — giving your site a fully private AI assistant without any external dependencies.

Features

🧠 Model Management

- Hugging Face Integration: Search and browse thousands of GGUF-format models directly from the admin panel
- One-Click Download: Download models with a single click; the plugin handles the full download lifecycle
- Live Download Progress: Real-time progress bar with percentage, MB downloaded/total, and download speed (MB/s)
- Download Resume: Interrupted downloads resume automatically from where they left off via HTTP Range headers
- Cancel Downloads: Cancel an in-progress download at any time from the UI
- Model Library: View all installed models in a table with repository name, file size, parameter count, and quantization type
- Model Deletion: Remove downloaded models from disk and database with one click
- Auto-Scan: Automatically detects and registers any .gguf files manually placed in the models directory
- Native GGUF Parser: Reads model metadata (parameter count) directly from the GGUF binary file header
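
The Download Resume behavior can be illustrated with a short Go sketch. This is not the plugin's actual code: `resumeDownload` and `rangeHeader` are hypothetical names, and treating a 416 response as "already complete" is an assumption made for the example. The idea is to use the size of the partial file as the byte offset, request the remainder with a Range header, and append only if the server answers 206 Partial Content.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
)

// rangeHeader returns the Range header value that asks the server for
// every byte from offset to the end of the file.
func rangeHeader(offset int64) string {
	return fmt.Sprintf("bytes=%d-", offset)
}

// resumeDownload fetches url into path, continuing from any partial file.
// Hypothetical helper: name and signature are illustrative only.
func resumeDownload(client *http.Client, url, path string) error {
	// The size of the partial file is the offset to resume from.
	var offset int64
	if info, err := os.Stat(path); err == nil {
		offset = info.Size()
	} else if !errors.Is(err, os.ErrNotExist) {
		return err
	}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	if offset > 0 {
		req.Header.Set("Range", rangeHeader(offset))
	}

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	flags := os.O_CREATE | os.O_WRONLY
	switch resp.StatusCode {
	case http.StatusPartialContent:
		flags |= os.O_APPEND // server honored the range: append the tail
	case http.StatusOK:
		flags |= os.O_TRUNC // server ignored the range: start over
	case http.StatusRequestedRangeNotSatisfiable:
		return nil // assumption: the file on disk is already complete
	default:
		return fmt.Errorf("unexpected status %s", resp.Status)
	}

	f, err := os.OpenFile(path, flags, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Header that would be sent when 1 MiB is already on disk.
	fmt.Println(rangeHeader(1048576))
}
```

Checking the status code matters because a server is free to ignore the Range header and send the whole file; appending that response to the partial file would corrupt the model.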

🖥️ Hardware-Aware Engine

- Automatic Hardware Detection: Scans CPU model, core count, AVX2 support, total RAM, and NVIDIA GPU (via nvidia-smi)
- GPU Auto-Offloading: Automatically offloads model layers to GPU when NVIDIA hardware is detected (-1 = auto, 0 = CPU only, N = manual)
- CPU-Only Fallback: Runs entirely on CPU for systems without a dedicated GPU
- llamafile Runtime: Uses Mozilla’s llamafile (pinned to v0.8.14), a single-binary LLM inference engine that requires no compilation or driver installation
- Auto-Download Engine: The llamafile binary is automatically downloaded from GitHub on first launch with SHA256 integrity verification
- Cosmopolitan Libc Compatible: Handles APE (Actually Portable Executable) format edge cases for containerized environments
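
As a rough illustration of the SHA256 integrity verification mentioned for the engine download, the following Go sketch compares the hash of downloaded bytes against an expected hex digest. The function name `verifySHA256` is hypothetical, and how the plugin obtains or stores the expected digest is not documented here; the digest in `main` is simply the standard SHA-256 test vector for the string "abc".

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// verifySHA256 reports whether data hashes to the expected hex digest.
// Hypothetical helper, shown only to illustrate the check.
func verifySHA256(data []byte, wantHex string) bool {
	want, err := hex.DecodeString(strings.TrimSpace(wantHex))
	if err != nil || len(want) != sha256.Size {
		return false // malformed or wrong-length digest never matches
	}
	got := sha256.Sum256(data)
	return subtle.ConstantTimeCompare(got[:], want) == 1
}

func main() {
	// Standard SHA-256 test vector for the input "abc".
	const digest = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(verifySHA256([]byte("abc"), digest)) // matching content
	fmt.Println(verifySHA256([]byte("abd"), digest)) // tampered content
}
```

For a binary of realistic size, streaming the file through `sha256.New()` with `io.Copy` avoids holding the whole download in memory; the byte-slice form is used here only to keep the sketch short.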

📊 Live System Monitoring Dashboard

- CPU Usage Gauge: Real-time circular gauge showing system-wide CPU utilization with color-coded thresholds (blue → orange → red)
- RAM Usage Gauge: Live circular gauge showing system memory usage with MB used / total breakdown
- Engine Memory Gauge: Dedicated gauge showing how much RAM the LLM process is consuming (reads from /proc/[PID]/status)
- Active Model Banner: Prominent banner displaying the currently loaded model name, uptime, and running status with a pulsing indicator dot
- 2-Second Auto-Refresh: All metrics update every 2 seconds for a responsive, real-time monitoring experience
- Engine Status Panel: Shows engine status (Stopped / Loading / Running), active model filename, port, and PID
- Context-Aware Controls: Dashboard buttons dynamically switch between “Force Stop” (when running) and “Go to Model Library” (when stopped)
- System Hardware Card: Displays GPU availability, CPU model, core count, AVX2 support, and total RAM
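
The Engine Memory Gauge reads /proc/[PID]/status. A minimal Go sketch of that kind of read is shown below, assuming the value of interest is the VmRSS field (resident set size in kB); the plugin may use a different field, and `parseVmRSSKB` is a hypothetical name.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseVmRSSKB extracts the VmRSS value, in kB, from the text of a
// /proc/<pid>/status file. Assumption: VmRSS is the field being monitored.
func parseVmRSSKB(status string) (int64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "VmRSS:") {
			continue
		}
		fields := strings.Fields(line) // e.g. ["VmRSS:", "5242880", "kB"]
		if len(fields) < 2 {
			return 0, fmt.Errorf("malformed VmRSS line: %q", line)
		}
		return strconv.ParseInt(fields[1], 10, 64)
	}
	return 0, fmt.Errorf("VmRSS not found")
}

func main() {
	// Inspect this process itself; a monitor would use the engine's PID.
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", os.Getpid()))
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	kb, err := parseVmRSSKB(string(data))
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Printf("resident memory: %d MB\n", kb/1024)
}
```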

⚙️ Engine Management

- One-Click Start: Start the inference engine on any installed model from the Model Library
- Force Stop: Immediately terminate the running engine process
- Health Monitoring: Automatic health checks on the engine’s /health endpoint, with a timeout of up to 5 minutes for CPU-only systems
- Live Loading Status: Dashboard shows “loading model (30s)”, “loading model (60s)”, etc. during model warmup
- Zombie Process Protection: Anti-zombie Pdeathsig ensures the llamafile process is killed if the plugin crashes
- Error Preservation: Startup errors (e.g., timeout) remain visible on the dashboard instead of being silently overwritten
- Structured Logging: All engine lifecycle events are logged with a [LocalAI] prefix for easy debugging
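
The Pdeathsig mechanism named under Zombie Process Protection is a Linux feature exposed in Go through `syscall.SysProcAttr`. The sketch below shows one way a child process can be configured to receive SIGKILL when its parent dies. `newEngineCommand`, the binary path, the model filename, and the command-line arguments in `main` are illustrative assumptions, not the plugin's actual invocation.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// newEngineCommand builds a command whose process is sent SIGKILL by the
// kernel when the parent that started it dies (Linux only).
// Hypothetical helper: the plugin's real launch code is not shown here.
func newEngineCommand(binary string, args ...string) *exec.Cmd {
	cmd := exec.Command(binary, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Pdeathsig: syscall.SIGKILL,
	}
	return cmd
}

func main() {
	// Illustrative path and arguments; the command is built but not started.
	cmd := newEngineCommand("./llamafile", "-m", "model.gguf", "--port", "8086")
	fmt.Println(cmd.Args, cmd.SysProcAttr.Pdeathsig)
}
```

One caveat worth knowing: on Linux the parent-death signal is tied to the thread that created the child, so Go programs commonly pair this setting with `runtime.LockOSThread` in the goroutine that starts the process.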

🔌 AI Chat Integration

- One-Click Connect: The “Connect to AI Chat” button automatically configures BinaryCMS’s AI Chat plugin to use the local engine
- OpenAI-Compatible API: Exposes a standard /v1/chat/completions endpoint compatible with any OpenAI client library
- API Key Security: Auto-generated API key for secure localhost communication between plugins
- Endpoint File Export: Writes local_ai_endpoint.json for programmatic integration by other plugins

🛠️ Configuration

- Inference Port: Configurable port for the LLM server (default: 8086)
- Context Length: Adjustable context window size (default: 4096 tokens)
- GPU Layer Offloading: Fine-grained control over how many model layers are offloaded to GPU
- API Key Management: View and manage the auto-generated API key
- Engine Binary Updates: Interface for downloading/updating the llamafile binary
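
For clients other than the AI Chat plugin, a request to the OpenAI-compatible endpoint might look like the Go sketch below. It assumes the default port 8086, an `Authorization: Bearer` header carrying the auto-generated key (the usual OpenAI convention; the plugin's exact scheme is not stated here), and placeholder values for the key and model name. `newChatRequest` is a hypothetical helper.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newChatRequest builds a POST to <baseURL>/v1/chat/completions with a
// single user message. Hypothetical helper for illustration.
func newChatRequest(baseURL, apiKey, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+apiKey) // assumption: Bearer scheme
	return req, nil
}

func main() {
	// Placeholder key and model name; replace with values from the admin panel.
	req, err := newChatRequest("http://127.0.0.1:8086", "YOUR_API_KEY", "local-model", "Hello!")
	if err != nil {
		fmt.Println("could not build request:", err)
		return
	}
	client := &http.Client{Timeout: 60 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("engine not reachable:", err)
		return
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(out))
}
```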

🏋️ Benchmarking (Framework)

- Tokens/Second Measurement: Built-in benchmark runner that measures inference speed
- Performance History: Benchmark results stored in SQLite for comparison across models
- Prompt Token Tracking: Records prompt token counts for each benchmark run
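
The tokens/second figure is a simple ratio of generated tokens to wall-clock time. The Go sketch below shows that arithmetic; the function name and the choice to return 0 for a non-positive duration are assumptions for the example, not the plugin's documented behavior.

```go
package main

import (
	"fmt"
	"time"
)

// tokensPerSecond divides the number of generated tokens by the elapsed
// wall-clock time in seconds. Returns 0 when no time has elapsed.
func tokensPerSecond(tokens int, elapsed time.Duration) float64 {
	if elapsed <= 0 {
		return 0
	}
	return float64(tokens) / elapsed.Seconds()
}

func main() {
	// Example: 256 tokens generated in 8 seconds.
	fmt.Printf("%.1f tokens/s\n", tokensPerSecond(256, 8*time.Second)) // prints "32.0 tokens/s"
}
```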

🛡️ Stability & Resilience

- Non-Blocking Hugging Face API: All outbound Hugging Face requests use a 10-second HTTP timeout to prevent RPC thread deadlocks
- Download Client Timeout: HEAD requests to Hugging Face use a 30-second timeout to prevent stalled downloads
- Multi-Click Guard: Frontend prevents parallel download polling loops from rapid button clicks
- Graceful Error Handling: Download errors, network failures, and engine crashes are handled gracefully with user-visible error states
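
In Go, request timeouts like the ones listed above are typically set on the HTTP client. The sketch below is one plausible arrangement, with `newTimeoutClient` as a hypothetical helper; how the plugin actually structures its clients is not documented here.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newTimeoutClient returns an HTTP client that abandons any request,
// including reading the response body, once d has elapsed.
func newTimeoutClient(d time.Duration) *http.Client {
	return &http.Client{Timeout: d}
}

func main() {
	search := newTimeoutClient(10 * time.Second) // search and metadata calls
	head := newTimeoutClient(30 * time.Second)   // HEAD requests before a download
	fmt.Println(search.Timeout, head.Timeout)
}
```

Because `http.Client.Timeout` covers the whole exchange including the body, a short client-level timeout suits search and HEAD calls but would cut off a multi-gigabyte model transfer, which may be why the 30-second limit is described for HEAD requests only.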

Technical Specifications

| Specification | Detail |
|---|---|
| Plugin Type | BinaryCMS RPC Plugin (hashicorp/go-plugin) |
| Language | Go |
| Database | SQLite (embedded) |
| Inference Engine | llamafile v0.8.14 (Mozilla) |
| Model Format | GGUF (llama.cpp compatible) |
| API Protocol | OpenAI Chat Completions API (/v1/chat/completions) |
| Model Source | Hugging Face Hub |
| Monitoring | Linux /proc filesystem (CPU, RAM, per-process) |
| GPU Support | NVIDIA (via nvidia-smi), with CPU fallback |
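
GGUF files begin with a small fixed header, which is where any native parser has to start. The Go sketch below reads that header for GGUF version 2 and later: the 4-byte magic "GGUF", a little-endian uint32 version, then uint64 tensor and metadata key/value counts. It is an illustration only; `parseGGUFHeader` is a hypothetical name, and extracting a parameter count would require going further into the metadata or tensor descriptions, which this sketch does not attempt.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// ggufHeader holds the fixed-size fields at the start of a GGUF v2+ file.
type ggufHeader struct {
	Version         uint32
	TensorCount     uint64
	MetadataKVCount uint64
}

// parseGGUFHeader decodes the first 24 bytes of a GGUF file.
func parseGGUFHeader(b []byte) (ggufHeader, error) {
	var h ggufHeader
	if len(b) < 24 {
		return h, errors.New("short header: need 24 bytes")
	}
	if string(b[:4]) != "GGUF" {
		return h, errors.New("not a GGUF file: bad magic")
	}
	h.Version = binary.LittleEndian.Uint32(b[4:8])
	if h.Version < 2 {
		// Version 1 used 32-bit counts and is not handled by this sketch.
		return h, fmt.Errorf("unsupported GGUF version %d", h.Version)
	}
	h.TensorCount = binary.LittleEndian.Uint64(b[8:16])
	h.MetadataKVCount = binary.LittleEndian.Uint64(b[16:24])
	return h, nil
}

func main() {
	// Synthetic header: version 3, 2 tensors, 5 metadata entries.
	sample := []byte{
		'G', 'G', 'U', 'F',
		3, 0, 0, 0,
		2, 0, 0, 0, 0, 0, 0, 0,
		5, 0, 0, 0, 0, 0, 0, 0,
	}
	h, err := parseGGUFHeader(sample)
	fmt.Println(h, err)
}
```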

System Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| RAM | 4 GB | 16 GB |
| CPU | x86_64 | x86_64 with AVX2 |
| Disk | 2 GB (engine + small model) | 20 GB (for larger models) |
| OS | Linux (Docker supported) | Linux |
| GPU | None (CPU-only mode) | NVIDIA with CUDA drivers |