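For more about the project's original goals and implementation, see the llama.cpp repository.

Version 1 of llama.swift provides a simple, clean wrapper around the original LLaMA models and some of their early derivatives.

The future of llama.swift is CameLLM, which provides clean Swift interfaces for running LLMs locally on macOS (and, hopefully, on iOS in the future). CameLLM is still in development; star or watch the main repository for updates.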

git clone https://github.com/alexrozanski/llama.swift.git
cd llama.swift

Obtain the LLaMA model weights and place them in ./models. Running ls should print something like the following:

ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# install Python dependencies
python3 -m pip install torch numpy sentencepiece

# the command-line tools are in `./tools` instead of the repo root like in llama.cpp
cd tools

# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py ../models/7B/ 1

# quantize the model to 4 bits
make
./quantize.sh 7B

When running the larger models, make sure you have enough disk space to store all the intermediate files.

Add llama.swift to your project using Xcode (File > Add Packages...) or by adding it to your project's Package.swift file:

dependencies: [
  .package(url: "https://github.com/alexrozanski/llama.swift.git", .upToNextMajor(from: "1.0.0"))
]
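
For context, here is a minimal sketch of a complete Package.swift that uses the dependency above. The package and target name `MyApp`, the tools version, and the macOS deployment target are placeholders, and the product name `llama` is an assumption inferred from the `import llama` statement used in this README; check the package's own manifest for the actual product name and platform requirements.

```swift
// swift-tools-version:5.7
import PackageDescription

let package = Package(
  name: "MyApp", // hypothetical package name
  platforms: [.macOS(.v13)], // assumption: adjust to what llama.swift actually requires
  dependencies: [
    .package(url: "https://github.com/alexrozanski/llama.swift.git", .upToNextMajor(from: "1.0.0"))
  ],
  targets: [
    .executableTarget(
      name: "MyApp", // hypothetical target name
      dependencies: [
        // Assumption: the library product is named "llama", matching `import llama`.
        .product(name: "llama", package: "llama.swift")
      ]
    )
  ]
)
```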

To generate output from a prompt, first instantiate a LlamaRunner with the URL of the LLaMA model file:

import llama

let url = ... // URL to the ggml-model-q4_0.bin model file
let runner = LlamaRunner(modelURL: url)
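
It can help to confirm that the model file exists before creating the runner. The helper below is only a sketch and is not part of llama.swift: `modelURL(atPath:)` is a hypothetical name, and it uses nothing but Foundation.

```swift
import Foundation

// Hypothetical helper (not part of llama.swift): returns a file URL for the
// model only if a file actually exists at the given path.
func modelURL(atPath path: String) -> URL? {
  guard FileManager.default.fileExists(atPath: path) else {
    return nil
  }
  return URL(fileURLWithPath: path)
}
```

The returned URL, when non-nil, is the kind of value you would pass as `modelURL:` when creating a `LlamaRunner`.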

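Generating output is as simple as calling run() with your prompt on the LlamaRunner instance. Because tokens are generated asynchronously, run() returns an AsyncThrowingStream that you can iterate over to process tokens as they are returned: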

do {
  for try await token in runner.run(with: "Building a website can be done in 10 simple steps:") {
    print(token, terminator: "")
  }
} catch let error {
  // Handle error
}

Note that tokens do not necessarily correspond to single words, and they also include any whitespace and newlines.
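
Because tokens already carry their own whitespace and newlines, the full output can be rebuilt by plain concatenation, without inserting separators. The sketch below illustrates this with a stand-in stream: `makeSampleTokenStream()` and its token values are made up for the example and only mimic the shape of the stream returned by `run()`.

```swift
// Stand-in for the stream returned by `runner.run(with:)`.
// The tokens here are invented for illustration.
func makeSampleTokenStream() -> AsyncThrowingStream<String, Error> {
  AsyncThrowingStream { continuation in
    for token in ["Step", " 1", ":", "\n", "Pick", " a", " domain", "."] {
      continuation.yield(token)
    }
    continuation.finish()
  }
}

// Concatenates tokens exactly as received; no separators are added because
// whitespace and newlines are already part of the tokens.
func collect(_ stream: AsyncThrowingStream<String, Error>) async throws -> String {
  var output = ""
  for try await token in stream {
    output += token
  }
  return output
}
```

Called from an async context, `try await collect(stream)` returns the joined text once the stream finishes.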

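LlamaRunner.run() takes an optional LlamaRunner.Config instance, which lets you control the number of threads inference runs on (default: 8), the maximum number of tokens returned (default: 512), and an optional reverse/negative prompt: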

let prompt = "..."
let config = LlamaRunner.Config(numThreads: 8, numTokens: 20, reversePrompt: "...")
let tokenStream = runner.run(with: prompt, config: config)

do {
  for try await token in tokenStream {
    ...
  }
} catch let error {
  ...
}

LlamaRunner.run() also takes an optional stateChangeHandler closure, which is called whenever the run state changes:

let prompt = "..."
let tokenStream = runner.run(
  with: prompt,
  config: .init(numThreads: 8, numTokens: 20),
  stateChangeHandler: { state in
    switch state {
      case .notStarted:
        // Initial state
        break
      case .initializing:
        // Loading the model and initializing
        break
      case .generatingOutput:
        // Generating tokens
        break
      case .completed:
        // Completed successfully
        break
      case .failed:
        // Failed. This is also the error thrown by the `AsyncThrowingStream` returned from `LlamaRunner.run()`
        break
    }
  })
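
One common use of a state handler is to drive a status label. The sketch below shows that mapping using a stand-in enum: `RunState` is hypothetical and only mirrors the case names listed above, the error payload on `.failed` is an assumption, and the status strings are placeholders.

```swift
// Hypothetical stand-in mirroring the states listed above; it is not the
// library's own type. The associated error on `.failed` is an assumption.
enum RunState {
  case notStarted
  case initializing
  case generatingOutput
  case completed
  case failed(Error)
}

// Maps each state to a short, user-facing status string.
func statusText(for state: RunState) -> String {
  switch state {
  case .notStarted:
    return "Idle"
  case .initializing:
    return "Loading model"
  case .generatingOutput:
    return "Generating"
  case .completed:
    return "Done"
  case .failed(let error):
    return "Failed: \(error)"
  }
}
```

Inside a real stateChangeHandler you would switch over the state value the library passes in and update your UI in the same way.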

If you don't want to use Swift concurrency, there is an alternative version of run() that returns tokens through a tokenHandler closure:

let prompt = "..."
runner.run(
  with: prompt,
  config: ...,
  tokenHandler: { token in
    ...
  },
  stateChangeHandler: ...
)

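The repository contains a barebones command-line tool, llamaTest, which uses the llama framework to run a simple input loop that performs inference on a given input prompt.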

MODEL_PATH=/path/to/ggml-model-q4_0.bin

🦙 llama.swift

🚀 llama.swift → the future

🔨 Setup

⬇️ Installation

Swift Package Manager

👩‍💻 Usage

Swift library

Configuration

State changes

Closure-based API

Other notes

`llamaTest` app

📃 Misc