v3.10.0
## LocalAI 3.10.0 Release!
LocalAI 3.10.0 is big on agent capabilities, multi-modal support, and cross-platform reliability.
We've added native Anthropic API support, launched a new Video Generation UI, introduced Open Responses API compatibility, and enhanced performance with a unified GPU backend system.
For a full tour, see below!
## TL;DR
| Feature | Summary |
|--------|--------|
| Anthropic API Support | Fully compatible /v1/messages endpoint for seamless drop-in replacement of Claude. |
| Open Responses API | Native support for stateful agents with tool calling, streaming, background mode, and multi-turn conversations, passing all official acceptance tests. |
| Video & Image Generation Suite | New video gen UI + LTX-2 support for text-to-video and image-to-video. |
| Unified GPU Backends | GPU libraries (CUDA, ROCm, Vulkan) packaged inside backend containers; works out of the box on Nvidia, AMD, and ARM64 (experimental). |
| Tool Streaming & XML Parsing | Full support for streaming tool calls and XML-formatted tool outputs. |
| System-Aware Backend Gallery | Only see backends your system can run (e.g., hide MLX on Linux). |
| Crash Fixes | Prevents crashes on AVX-only CPUs (Intel Sandy/Ivy Bridge) and fixes VRAM reporting on AMD GPUs. |
| Request Tracing | Debug agents & fine-tuning with memory-based request/response logging. |
| Moonshine Backend | Ultra-fast transcription engine for low-end devices. |
| Pocket-TTS | Lightweight, high-fidelity text-to-speech with voice cloning. |
| Vulkan arm64 Builds | Backends and images are now built for Vulkan on arm64 as well. |
## New Features & Major Enhancements
### Open Responses API: Build Smarter, Autonomous Agents
LocalAI now supports the OpenAI Responses API, enabling powerful agentic workflows locally.
- Stateful conversations via `response_id`: resume and manage long-running agent sessions.
- Background mode: run agents asynchronously and fetch results later.
- Streaming support for tools, images, and audio.
- Built-in tools: Web search, file search, and computer use (via MCP integrations).
- Multi-turn interaction with dynamic context and tool use.
Ideal for developers building agents that can browse, analyze files, or interact with systems, all on your local machine.
**How to Use:**
- Set `response_id` in your request to maintain session state across calls.
- Use `background: true` to run agents asynchronously.
- Retrieve results via `GET /api/v1/responses/{response_id}`.
- Enable streaming with `stream: true` to receive partial responses and tool calls in real time.
Tip: use `response_id` to build agent orchestration systems that persist context and avoid redundant computation.
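The steps above can be sketched in Python. This is a minimal, illustrative payload builder: the base URL, model name, and helper function are assumptions, and the `response_id`/`background`/`stream` fields simply mirror the list above (check the docs for the exact payload shape):

```python
"""Sketch of building Responses API payloads for a local agent session.
The model name, helper, and localhost URL below are placeholder assumptions."""

def build_response_request(prompt, model="local-model", response_id=None,
                           background=False, stream=False):
    # Assemble a Responses API payload; the response id chains turns together.
    payload = {"model": model, "input": prompt}
    if response_id:
        payload["response_id"] = response_id
    if background:
        payload["background"] = True
    if stream:
        payload["stream"] = True
    return payload

# First turn: run the agent asynchronously in background mode.
first = build_response_request("Summarize ./report.txt", background=True)

# Follow-up turn: resume the session with the id returned by the server,
# streaming partial output and tool calls.
followup = build_response_request("Now list the top 3 risks",
                                  response_id="resp_123", stream=True)

# With a running LocalAI instance you would send these, e.g.:
#   requests.post("http://localhost:8080/v1/responses", json=first)
#   requests.get("http://localhost:8080/api/v1/responses/resp_123")
```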
Our support passes all the official acceptance tests.
### Anthropic Messages API: Clone Claude Locally
LocalAI now fully supports the Anthropic Messages API.
- Use `https://api.localai.host/v1/messages` as a drop-in replacement for Claude.
- Full tool/function calling support, just like OpenAI.
- Streaming and non-streaming responses.
- Compatible with `anthropic-sdk-go`, LangChain, and other tooling.
Perfect for teams migrating from Anthropic to local inference with full feature parity.
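As a quick sketch, here is what an Anthropic-style request body looks like when pointed at a local instance. The helper function, model name, and localhost URL are illustrative assumptions; the required fields (`model`, `max_tokens`, `messages`) follow the upstream Messages API shape:

```python
"""Minimal sketch of an Anthropic-style /v1/messages request body.
Model name and base URL are placeholder assumptions."""

def build_messages_request(user_text, model="claude-compatible-model",
                           max_tokens=256, stream=False, tools=None):
    # The Messages API requires model, max_tokens, and a messages list.
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": user_text}],
    }
    if stream:
        payload["stream"] = True
    if tools:
        payload["tools"] = tools  # tool/function calling, as with OpenAI
    return payload

req = build_messages_request("Hello from LocalAI!", stream=True)

# With a running instance you would POST it, e.g.:
#   requests.post("http://localhost:8080/v1/messages",
#                 headers={"anthropic-version": "2023-06-01"}, json=req)
```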
### Video Generation: From Text to Video in the Web UI
- New dedicated video generation page with intuitive controls.
- LTX-2 is supported.
- Supports text-to-video and image-to-video workflows.
- Built on top of `diffusers` with full compatibility.
**How to Use:**
- Go to `/video` in the web UI.
- Enter a prompt (e.g., "A cat walking on a moonlit rooftop").
- Optionally upload an image for image-to-video generation.
- Adjust parameters like `fps`, `num_frames`, and `guidance_scale`.
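If you drive generation programmatically instead of through the UI, the parameters above bundle into a request body along these lines. This is only a sketch: the helper and its default values are assumptions, not LocalAI's documented video API:

```python
"""Illustrative parameter set for a text-to-video or image-to-video job.
The helper and its defaults are assumptions, not a documented API."""

def build_video_params(prompt, image=None, fps=24, num_frames=121,
                       guidance_scale=3.0):
    # Core knobs exposed in the web UI: frame rate, clip length, and
    # how strongly generation follows the prompt.
    params = {
        "prompt": prompt,
        "fps": fps,
        "num_frames": num_frames,
        "guidance_scale": guidance_scale,
    }
    if image is not None:
        params["image"] = image  # switches to image-to-video mode
    return params

job = build_video_params("A cat walking on a moonlit rooftop",
                         image="cat.png", fps=30)
```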
### Unified GPU Backends: Acceleration Works Out of the Box
A major architectural upgrade: GPU libraries (CUDA, ROCm, Vulkan) are now packaged inside backend containers.
- Single image: you no longer need to pull a GPU-specific image. Any image works whether or not you have a GPU.
- No more manual GPU driver setup: just run the image and get acceleration.
- Works on Nvidia (CUDA), AMD (ROCm), and ARM64 (Vulkan).
- Vulkan arm64 builds enabled
- Reduced image complexity, faster builds, and consistent performance.
This means `latest`/`master` images now support GPU acceleration on all platforms, with no extra config!
Note: this is experimental; please help us by filing an issue if something doesn't work!
### Tool Streaming & Advanced Parsing
Enhance your agent workflows with richer tool interaction.
- Streaming tool calls: receive partial tool arguments in real time (e.g., `input_json_delta`).
- XML-style tool call parsing: models that return tools in XML format (`<function>...</function>`) are now properly parsed alongside text.
- Works across all backends (llama.cpp, vLLM, diffusers, etc.).
Enables more natural, real-time interaction with agents that use structured tool outputs.
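To illustrate what XML-style tool parsing has to do, here is a simplified standalone parser. It is not LocalAI's actual implementation; it just assumes a JSON body inside `<function>...</function>` tags, as in the example format above:

```python
import json
import re

# Illustrative parser that separates plain model text from XML-style tool
# calls. The <function> tag format mirrors the release notes; real model
# output may wrap the name/arguments differently.
TOOL_RE = re.compile(r"<function>(.*?)</function>", re.DOTALL)

def split_tools(text):
    tools = []
    for raw in TOOL_RE.findall(text):
        try:
            tools.append(json.loads(raw))  # JSON body inside the XML tag
        except json.JSONDecodeError:
            tools.append({"raw": raw.strip()})  # keep unparseable bodies
    plain = TOOL_RE.sub("", text).strip()  # text with tool tags removed
    return plain, tools

text = ('Checking weather. <function>{"name": "get_weather", '
        '"arguments": {"city": "Rome"}}</function>')
answer, calls = split_tools(text)
```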
### System-Aware Backend Gallery: Only Compatible Backends Show
The backend gallery now shows only backends your system can run.
- Auto-detects system capabilities (CPU, GPU, MLX, etc.).
- Hides unsupported backends (e.g., MLX on Linux, CUDA on AMD).
- Shows detected capabilities in the hero section.
### New TTS Backends: Pocket-TTS
Add expressive voice generation to your apps with Pocket-TTS.
- Real-time text-to-speech with voice cloning support (requires HF login).
- Lightweight, fast, and open-source.
- Available in the model gallery.
Perfect for voice agents, narrators, or interactive assistants. Note: voice cloning requires HF authentication and a registered voice model.
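A speech request against a Pocket-TTS model might be built like this. Treat it as a sketch: the model name, voice identifier, and endpoint in the comment are assumptions; check the model gallery for the real identifiers:

```python
"""Sketch of an OpenAI-style speech request body for a Pocket-TTS model.
Model name and voice identifier are placeholder assumptions."""

def build_tts_request(text, model="pocket-tts", voice=None):
    payload = {"model": model, "input": text}
    if voice:
        # Cloned voices must be registered first and require HF authentication.
        payload["voice"] = voice
    return payload

req = build_tts_request("Welcome to LocalAI 3.10!", voice="my-cloned-voice")

# With a running instance, POST to the speech endpoint, e.g.:
#   requests.post("http://localhost:8080/v1/audio/speech", json=req)
```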
### Request Tracing: Debug Your Agents
Trace requests and responses in memory, great for fine-tuning and agent debugging.
- Enable via runtime setting or API.
- Logs are stored in memory and dropped once a maximum size is reached.
- Fetch logs via `GET /api/v1/trace`.
- Export to JSON for analysis.
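A small helper can turn the in-memory trace log into a JSON file for later analysis. The `/api/v1/trace` path comes from the notes above, but the response shape assumed here (a list of request/response entries) and the helper itself are illustrative:

```python
import json

# Illustrative export of trace entries to JSON. The entry shape
# (dicts with "request"/"response" keys) is an assumption.
def export_traces(entries, path):
    # Keep only entries with both sides of the exchange recorded.
    complete = [e for e in entries if "request" in e and "response" in e]
    with open(path, "w") as f:
        json.dump(complete, f, indent=2)
    return len(complete)

# With a running instance you would fetch the log first, e.g.:
#   entries = requests.get("http://localhost:8080/api/v1/trace").json()
sample = [
    {"request": {"path": "/v1/chat/completions"}, "response": {"status": 200}},
    {"request": {"path": "/v1/responses"}},  # still in flight, no response yet
]
exported = export_traces(sample, "traces.json")
```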
### New 'Reasoning' Field: Extract Thinking Steps
LocalAI now automatically detects and extracts thinking tags from model output.
- Supports both SSE and non-SSE modes.
- Displays reasoning steps in the chat UI (under "Thinking" tab).
- Fixes an issue where thinking content appeared as part of the final answer.
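Conceptually, the extraction looks like the sketch below. LocalAI detects thinking tags automatically; the `<think>` tag name used here is an assumption, since different models use different markers:

```python
import re

# Illustrative split of model output into reasoning and final content.
# The <think> tag name is an assumption; models vary in their markers.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()  # final answer without the tags
    return {"reasoning": reasoning, "content": answer}

out = split_reasoning(
    "<think>The user wants a haiku.</think>Autumn moonlight..."
)
```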
### Moonshine Backend: Faster Transcription for Low-End Devices
Add Moonshine, an ONNX-based transcription engine, for fast, lightweight speech-to-text.
- Optimized for low-end devices (Raspberry Pi, older laptops).
- One of the fastest transcription engines available.
- Supports live transcription.
## Fixes & Stability Improvements
### Prevent BMI2 Crashes on AVX-Only CPUs
Fixed crashes on older Intel CPUs (Ivy Bridge, Sandy Bridge) that lack BMI2 instructions.
- Now safely falls back to `llama-cpp-fallback` (SSE2 only).
- No more `EOF` errors during model warmup.
Ensures LocalAI runs smoothly on older hardware.
### Fix Swapped VRAM Usage on AMD GPUs
`rocm-smi` output is now parsed correctly, so used and total VRAM are displayed accurately.
- Fixes misreported memory usage on dual-Radeon setups.
- Handles `HIP_VISIBLE_DEVICES` properly (e.g., when using only the discrete GPU).
## The Complete Local Stack for Privacy-First AI
| Project | Description |
|--------|--------|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
## Thank You
LocalAI is a true FOSS movement, built by contributors and powered by community.
If you believe in privacy-first AI:
- Star the repo
- Contribute code, docs, or feedback
- Share with others
Your support keeps this stack alive.
## Full Changelog
Click to expand full changelog
What's Changed
Bug fixes :bug:
- fix(ui): correctly parse import errors by @mudler in https://github.com/mudler/LocalAI/pull/7726
- fix(cli): import via CLI needs system state by @mudler in https://github.com/mudler/LocalAI/pull/7746
- fix(amd-gpu): correctly show total and used vram by @mudler in https://github.com/mudler/LocalAI/pull/7761
- fix: add nil checks before mergo.Merge to prevent panic in gallery model installation by @majiayu000 in https://github.com/mudler/LocalAI/pull/7785
- fix: Usage for image generation is incorrect (and causes error in LiteLLM) by @majiayu000 in https://github.com/mudler/LocalAI/pull/7786
- fix: propagate validation errors by @majiayu000 in https://github.com/mudler/LocalAI/pull/7787
- fix: Failed to download checksums.txt when using launch to install localai by @majiayu000 in https://github.com/mudler/LocalAI/pull/7788
- fix(image-gen): fix scrolling issues by @mudler in https://github.com/mudler/LocalAI/pull/7829
- fix(llama.cpp/mmproj): fix loading mmproj in nested sub-dirs different from model path by @mudler in https://github.com/mudler/LocalAI/pull/7832
- fix: Prevent BMI2 instruction crash on AVX-only CPUs by @coffeerunhobby in https://github.com/mudler/LocalAI/pull/7817
- fix: Highly inconsistent agent response to cogito agent calling MCP server - Body "Invalid http method" by @majiayu000 in https://github.com/mudler/LocalAI/pull/7790
- fix(chat/ui): record model name in history for consistency by @mudler in https://github.com/mudler/LocalAI/pull/7845
- fix(ui): fix 404 on API menu link by pointing to index.html by @DEVMANISHOFFL in https://github.com/mudler/LocalAI/pull/7878
- fix: BMI2 crash on AVX-only CPUs (Intel Ivy Bridge/Sandy Bridge) by @coffeerunhobby in https://github.com/mudler/LocalAI/pull/7864
- fix(model): do not assume success when deleting a model process by @jroeber in https://github.com/mudler/LocalAI/pull/7963
- fix(functions): do not duplicate function when valid JSON is inside XML tags by @mudler in https://github.com/mudler/LocalAI/pull/8043
Exciting New Features
- feat: disable force eviction by @mudler in https://github.com/mudler/LocalAI/pull/7725
- feat(api): Allow tracing of requests and responses by @richiejp in https://github.com/mudler/LocalAI/pull/7609
- feat(UI): image generation improvements by @mudler in https://github.com/mudler/LocalAI/pull/7804
- feat(image-gen/UI): move controls to the left, make the page more compact by @mudler in https://github.com/mudler/LocalAI/pull/7823
- feat(function): Add tool streaming, XML Tool Call Parsing Support by @mudler in https://github.com/mudler/LocalAI/pull/7865
New Contributors
- @majiayu000 made their first contribution in https://github.com/mudler/LocalAI/pull/7785
- @coffeerunhobby made their first contribution in https://github.com/mudler/LocalAI/pull/7817
- @DEVMANISHOFFL made their first contribution in https://github.com/mudler/LocalAI/pull/7878
- @jroeber made their first contribution in https://github.com/mudler/LocalAI/pull/7963
- @Divyanshupandey007 made their first contribution in https://github.com/mudler/LocalAI/pull/8050
Full Changelog: https://github.com/mudler/LocalAI/compare/v3.9.0...v3.10.0