llama.cpp

路线图 / 项目状态 / 宣言 / ggml

在纯 C/C++ 中推理 Meta 的 LLaMA 模型（以及其他模型）

重要提示

新的 llama.cpp 包位置：ggml-org/llama.cpp

请将您的容器 URL 更新为：ghcr.io/ggml-org/llama.cpp

更多信息：#11801

最近的 API 变更

描述

llama.cpp 的主要目标是在各种硬件上以最少的设置和最先进的性能实现 LLM 推理 - 本地和云端。

纯 C/C++ 实现，没有任何依赖项
Apple silicon 是头等公民 - 通过 ARM NEON、Accelerate 和 Metal 框架进行优化
对 x86 架构的 AVX、AVX2、AVX512 和 AMX 支持
1.5 位、2 位、3 位、4 位、5 位、6 位和 8 位整数量化，以实现更快的推理并减少内存使用
用于在 NVIDIA GPU 上运行 LLM 的自定义 CUDA 内核（通过 HIP 支持 AMD GPU，通过 MUSA 支持 Moore Threads MTT GPU）
Vulkan 和 SYCL 后端支持
CPU+GPU 混合推理，以部分加速大于总 VRAM 容量的模型

llama.cpp 项目是为 ggml 库开发新功能的主要场所。

模型

通常也支持以下基础模型的微调。

添加对新模型支持的说明：HOWTO-add-model.md

纯文本

多模态

绑定

Python: abetlen/llama-cpp-python
Go: go-skynet/go-llama.cpp
Node.js: withcatai/node-llama-cpp
JS/TS (llama.cpp 服务器客户端): lgrammel/modelfusion
JS/TS (可编程 Prompt Engine CLI): offline-ai/cli
JavaScript/Wasm (在浏览器中工作): tangledgroup/llama-cpp-wasm
Typescript/Wasm (更友好的 API, 可在 npm 上获得): ngxson/wllama
Ruby: yoshoku/llama_cpp.rb
Rust (更多功能): edgenai/llama_cpp-rs
Rust (更友好的 API): mdrokz/rust-llama.cpp
Rust (更直接的绑定): utilityai/llama-cpp-rs
Rust (从 crates.io 自动构建): ShelbyJenkins/llm_client
C#/.NET: SciSharp/LLamaSharp
C#/VB.NET (更多功能 - 社区许可证): LM-Kit.NET
Scala 3: donderom/llm4s
Clojure: phronmophobic/llama.clj
React Native: mybigday/llama.rn
Java: kherud/java-llama.cpp
Zig: deins/llama.cpp.zig
Flutter/Dart: netdur/llama_cpp_dart
Flutter: xuegao-tzx/Fllama
PHP (构建在 llama.cpp 之上的 API 绑定和功能): distantmagic/resonance (更多信息)
Guile Scheme: guile_llama_cpp
Swift srgtuszy/llama-cpp-swift
Swift ShenghaiWang/SwiftLlama

(要将项目列在此处，它应该明确声明它依赖于 llama.cpp)

AI Sublime Text 插件 (MIT)
cztomsik/ava (MIT)
Dot (GPL)
eva (MIT)
iohub/collama (Apache-2.0)
janhq/jan (AGPL)
KanTV (Apache-2.0)
KodiBot (GPL)
llama.vim (MIT)
LARS (AGPL)
Llama Assistant (GPL)
LLMFarm (MIT)
LLMUnity (MIT)
LMStudio (专有)
LocalAI (MIT)
LostRuins/koboldcpp (AGPL)
MindMac (专有)
MindWorkAI/AI-Studio (FSL-1.1-MIT)
Mobile-Artificial-Intelligence/maid (MIT)
Mozilla-Ocho/llamafile (Apache-2.0)
nat/openplayground (MIT)
nomic-ai/gpt4all (MIT)
ollama/ollama (MIT)
oobabooga/text-generation-webui (AGPL)
PocketPal AI (MIT)
psugihara/FreeChat (MIT)
ptsochantaris/emeltal (MIT)
pythops/tenere (AGPL)
ramalama (MIT)
semperai/amica (MIT)
withcatai/catai (MIT)
Autopen (GPL)

工具

akx/ggify – 从 HuggingFace Hub 下载 PyTorch 模型并将它们转换为 GGML
akx/ollama-dl – 从 Ollama 库下载模型，以便直接与 llama.cpp 一起使用
crashr/gppm – 启动利用 NVIDIA Tesla P40 或 P100 GPU 的 llama.cpp 实例，并降低空闲功耗
gpustack/gguf-parser - 检查/验证 GGUF 文件并估算内存使用量
Styled Lines (专有许可，用于 Unity3d 游戏开发的推理部分的异步包装器，带有预构建的移动和 Web 平台包装器以及模型示例)

基础设施

Paddler - 为 llama.cpp 定制的有状态负载均衡器
GPUStack - 管理用于运行 LLM 的 GPU 集群
llama_cpp_canister - 使用 WebAssembly 在 Internet Computer 上的智能合约中运行 llama.cpp
llama-swap - 透明代理，为 llama-server 添加自动模型切换
Kalavai - 以任何规模众包端到端 LLM 部署

游戏

Lucy's Labyrinth - 一个简单的迷宫游戏，由 AI 模型控制的代理会试图欺骗你。

支持的后端

后端	目标设备
Metal	Apple Silicon
BLAS	全部
BLIS	全部
SYCL	Intel 和 Nvidia GPU
MUSA	摩尔线程 MTT GPU
CUDA	Nvidia GPU
HIP	AMD GPU
Vulkan	GPU
CANN	昇腾 NPU
OpenCL	Adreno GPU

构建项目

该项目的主要产品是 llama 库。它的 C 风格接口可以在 include/llama.h 中找到。该项目还包括许多使用 llama 库的示例程序和工具。这些示例从简单的、最小的代码片段到复杂的子项目，例如与 OpenAI 兼容的 HTTP 服务器。获取二进制文件的可能方法

克隆此存储库并在本地构建，请参阅如何构建
在 MacOS 或 Linux 上，通过 brew, flox 或 nix 安装 llama.cpp
使用 Docker 镜像，请参阅 Docker 文档
从发布版本下载预构建的二进制文件

获取和量化模型

Hugging Face 平台托管着许多与 llama.cpp 兼容的 LLM

热门
LLaMA

您可以手动下载 GGUF 文件，或者直接使用来自 Hugging Face 的任何 llama.cpp 兼容模型，方法是使用此 CLI 参数：-hf <user>/<model>[:quant]

下载模型后，使用 CLI 工具在本地运行它 - 见下文。

llama.cpp 要求模型以 GGUF 文件格式存储。其他数据格式的模型可以使用此 repo 中的 convert_*.py Python 脚本转换为 GGUF。

Hugging Face 平台提供了各种在线工具，用于转换、量化和托管带有 llama.cpp 的模型

使用 GGUF-my-repo space 转换为 GGUF 格式并将模型权重量化为更小尺寸
使用 GGUF-my-LoRA space 将 LoRA 适配器转换为 GGUF 格式 (更多信息: #10123)
使用 GGUF-editor space 在浏览器中编辑 GGUF 元数据 (更多信息: #9268)
使用 Inference Endpoints 直接在云端托管 llama.cpp (更多信息: #9669)

要了解有关模型量化的更多信息，请阅读本文档

`llama-cli`

一个 CLI 工具，用于访问和试验 `llama.cpp` 的大部分功能。

在对话模式下运行

具有内置聊天模板的模型将自动激活对话模式。如果未发生这种情况，您可以通过添加 -cnv 并使用 --chat-template NAME 指定合适的聊天模板来手动启用它

llama-cli -m model.gguf

# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!

使用自定义聊天模板在对话模式下运行

# use the "chatml" template (use -h to see the list of supported templates)
llama-cli -m model.gguf -cnv --chat-template chatml

# use a custom template
llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'

运行简单的文本补全

要显式禁用对话模式，请使用 -no-cnv

llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128 -no-cnv

# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.

使用自定义语法约束输出
```
llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'

# {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
```
grammars/ 文件夹包含一些示例语法。要编写自己的语法，请查看 GBNF 指南。

对于编写更复杂的 JSON 语法，请查看 https://grammar.intrinsiclabs.ai/

`llama-server`

一个轻量级，OpenAI API 兼容的 HTTP 服务器，用于服务 LLM。

在端口 8080 上启动具有默认配置的本地 HTTP 服务器

llama-server -m model.gguf --port 8080

# Basic web UI can be accessed via browser: https://:8080
# Chat completion endpoint: https://:8080/v1/chat/completions

支持多用户和并行解码

# up to 4 concurrent requests, each with 4096 max context
llama-server -m model.gguf -c 16384 -np 4

启用推测解码

# the draft.gguf model should be a small variant of the target model.gguf
llama-server -m model.gguf -md draft.gguf

服务嵌入模型

# use the /embedding endpoint
llama-server -m model.gguf --embedding --pooling cls -ub 8192

服务重新排序模型

# use the /reranking endpoint
llama-server -m model.gguf --reranking

使用语法约束所有输出

# custom grammar
llama-server -m model.gguf --grammar-file grammar.gbnf

# JSON
llama-server -m model.gguf --grammar-file grammars/json.gbnf

`llama-perplexity`

一种用于测量给定文本上模型困惑度¹² (和其他质量指标) 的工具。

测量文本文件的困惑度

llama-perplexity -m model.gguf -f file.txt

# [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
# Final estimate: PPL = 5.4007 +/- 0.67339

测量 KL 散度
```
# TODO
```

`llama-bench`

基准测试各种参数的推理性能。

运行默认基准测试

llama-bench -m model.gguf

# Output:
# | model               |       size |     params | backend    | threads |          test |                  t/s |
# | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
# | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
# | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |
#
# build: 3e0ba0e60 (4229)

`llama-run`

运行 `llama.cpp` 模型的综合示例。用于推理。与 RamaLama³ 一起使用。

使用特定提示运行模型（默认情况下从 Ollama 注册表提取）
```
llama-run granite-code
```

`llama-simple`

一个使用 `llama.cpp` 实现应用程序的最小示例。对开发人员很有用。

基本文本补全

llama-simple -m model.gguf

# Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of

贡献

贡献者可以打开 PR
协作者可以将内容推送到 llama.cpp 存储库中的分支，并将 PR 合并到 master 分支中
将根据贡献邀请协作者
非常感谢您在管理问题、PR 和项目方面的任何帮助！
查看 good first issues 了解适合首次贡献的任务
阅读 CONTRIBUTING.md 了解更多信息
请务必阅读此内容：边缘推理
对于那些感兴趣的人，这里有一点背景故事：Changelog podcast

其他文档

开发文档

关于模型的开创性论文和背景

如果你的问题与模型生成质量有关，那么请至少浏览以下链接和论文，以了解 LLaMA 模型的局限性。这在选择合适的模型大小并理解 LLaMA 模型和 ChatGPT 之间重大而细微的差异时尤为重要

LLaMA
- Introducing LLaMA: A foundational, 65-billion-parameter large language model
- LLaMA: Open and Efficient Foundation Language Models
GPT-3
- Language Models are Few-Shot Learners
GPT-3.5 / InstructGPT / ChatGPT
- Aligning language models to follow instructions
- Training language models to follow instructions with human feedback

补全

命令行补全适用于某些环境。

Bash 补全

$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash

可以选择将其添加到你的 .bashrc 或 .bash_profile 以自动加载。例如

$ echo "source ~/.llama-completion.bash" >> ~/.bashrc

llama.cpp

最近的 API 变更

热门话题

描述

纯文本

多模态

支持的后端

构建项目

获取和量化模型

`llama-cli`

一个 CLI 工具，用于访问和试验 `llama.cpp` 的大部分功能。

`llama-server`

一个轻量级，OpenAI API 兼容的 HTTP 服务器，用于服务 LLM。

`llama-perplexity`

一种用于测量给定文本上模型困惑度¹² (和其他质量指标) 的工具。

`llama-bench`

基准测试各种参数的推理性能。

`llama-run`

运行 `llama.cpp` 模型的综合示例。用于推理。与 RamaLama³ 一起使用。

`llama-simple`

一个使用 `llama.cpp` 实现应用程序的最小示例。对开发人员很有用。

贡献

其他文档

开发文档

关于模型的开创性论文和背景

补全

Bash 补全

参考文献

llama.cpp

最近的 API 变更

热门话题

描述

纯文本

多模态

支持的后端

构建项目

获取和量化模型

llama-cli

一个 CLI 工具，用于访问和试验 llama.cpp 的大部分功能。

llama-server

一个轻量级，OpenAI API 兼容的 HTTP 服务器，用于服务 LLM。

llama-perplexity

一种用于测量给定文本上模型困惑度12 (和其他质量指标) 的工具。

llama-bench

基准测试各种参数的推理性能。

llama-run

运行 llama.cpp 模型的综合示例。 用于推理。 与 RamaLama3 一起使用。

llama-simple

一个使用 llama.cpp 实现应用程序的最小示例。 对开发人员很有用。

贡献

其他文档

开发文档

关于模型的开创性论文和背景

补全

Bash 补全

参考文献

脚注

`llama-cli`

一个 CLI 工具，用于访问和试验 `llama.cpp` 的大部分功能。

`llama-server`

`llama-perplexity`

一种用于测量给定文本上模型困惑度¹² (和其他质量指标) 的工具。

`llama-bench`

`llama-run`

运行 `llama.cpp` 模型的综合示例。用于推理。与 RamaLama³ 一起使用。

`llama-simple`

一个使用 `llama.cpp` 实现应用程序的最小示例。对开发人员很有用。