`swift-embeddings`

Run embedding models locally in Swift using MLTensor. Inspired by mlx-embeddings.

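Supported Model Architectures

BERT (Bidirectional Encoder Representations from Transformers)

Some of the supported models on Hugging Face

NOTE: google-bert/bert-base-uncased is supported, but weightKeyTransform must be provided in LoadConfig: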

let modelBundle = try await Bert.loadModelBundle(
    from: "google-bert/bert-base-uncased",
    loadConfig: LoadConfig(weightKeyTransform: Bert.googleWeightsKeyTransform)
)

XLM-RoBERTa (Cross-lingual Language Model - Robustly Optimized BERT Approach)

Some of the supported models on Hugging Face

CLIP (Contrastive Language-Image Pre-training)

NOTE: only text encoding is supported for now.

Some of the supported models on Hugging Face

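Word2Vec

NOTE: this is a word embedding model. It loads the entire model and keeps it in memory. For a more memory-efficient solution, consider using SQLiteVec.

Some of the supported models on Hugging Face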

Model2Vec

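More information is available here.

Some of the supported models on Hugging Face

Static Embeddings

More information is available here.

Some of the supported models on Hugging Face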

Installation

Add the following to your Package.swift file. In the package dependencies add:

dependencies: [
    .package(url: "https://github.com/jkrukowski/swift-embeddings", from: "0.0.7")
]

In the target dependencies add:

dependencies: [
    .product(name: "Embeddings", package: "swift-embeddings")
]
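
For orientation, the two fragments above might sit in a complete manifest roughly as follows. This is a minimal sketch, not the project's own manifest: the package and target name `MyApp`, the executable target type, and the tools version are illustrative assumptions, and the platform entry assumes a macOS 15 minimum because MLTensor is only available from macOS 15 / iOS 18 onward.

```swift
// swift-tools-version: 6.0
// Assumption: tools version 6.0, so that `.macOS(.v15)` is available.
import PackageDescription

let package = Package(
    name: "MyApp",  // hypothetical package name
    platforms: [
        .macOS(.v15)  // assumption: MLTensor needs macOS 15 or later
    ],
    dependencies: [
        // package dependency, as shown above
        .package(url: "https://github.com/jkrukowski/swift-embeddings", from: "0.0.7")
    ],
    targets: [
        .executableTarget(
            name: "MyApp",  // hypothetical target name
            dependencies: [
                // target dependency, as shown above
                .product(name: "Embeddings", package: "swift-embeddings")
            ]
        )
    ]
)
```

If you target other Apple platforms, the `platforms` list would need matching entries.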

Usage

Encoding

import Embeddings

// load model and tokenizer from Hugging Face
let modelBundle = try await Bert.loadModelBundle(
    from: "sentence-transformers/all-MiniLM-L6-v2"
)

// encode text
let encoded = modelBundle.encode("The cat is black")
let result = await encoded.cast(to: Float.self).shapedArray(of: Float.self).scalars

// print result
print(result)

Batch encoding

import Embeddings
import MLTensorUtils

let texts = [
    "The cat is black",
    "The dog is black",
    "The cat sleeps well"
]
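// load model and tokenizer from Hugging Face
let modelBundle = try await Bert.loadModelBundle(
    from: "sentence-transformers/all-MiniLM-L6-v2"
)

// encode all texts in one batch
let encoded = modelBundle.batchEncode(texts)

// compute the cosine distance of the batch embeddings against themselves
let distance = cosineDistance(encoded, encoded)
let result = await distance.cast(to: Float.self).shapedArray(of: Float.self).scalars

// print result
print(result)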
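
To make the distance step more concrete, here is a plain-Swift sketch of cosine distance for a single pair of vectors. It does not use MLTensor or the `cosineDistance` function from MLTensorUtils. The helper name `cosineDistanceScalar` is hypothetical, the toy vectors stand in for real embeddings, and it assumes the standard definition of cosine distance (1 minus cosine similarity), which may differ in detail from what the library computes on tensors.

```swift
/// Cosine distance between two equal-length vectors: 1 - (a·b) / (|a| * |b|).
/// `cosineDistanceScalar` is a hypothetical helper, not part of swift-embeddings.
func cosineDistanceScalar(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    var dot: Float = 0
    var squaredNormA: Float = 0
    var squaredNormB: Float = 0
    for (x, y) in zip(a, b) {
        dot += x * y
        squaredNormA += x * x
        squaredNormB += y * y
    }
    let denominator = squaredNormA.squareRoot() * squaredNormB.squareRoot()
    // A zero vector has no direction; this sketch defines its distance as 1.
    guard denominator > 0 else { return 1 }
    return 1 - dot / denominator
}

// Toy 3-dimensional vectors standing in for real embeddings.
let reference: [Float] = [1, 2, 3]
let sameDirection: [Float] = [2, 4, 6]
let orthogonal: [Float] = [3, 0, -1]

print(cosineDistanceScalar(reference, sameDirection))  // close to 0
print(cosineDistanceScalar(reference, orthogonal))     // close to 1
```

Under this definition, vectors pointing the same way have a distance near 0, orthogonal vectors near 1, and opposite vectors near 2, so smaller values indicate more similar texts.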

Command Line Demo

To run the command line demo, use the following command:

swift run embeddings-cli <subcommand> [--model-id <model-id>] [--model-file <model-file>] [--text <text>] [--max-length <max-length>]

Subcommands:

bert                    Encode text using BERT model
clip                    Encode text using CLIP model
model2vec               Encode text using Model2Vec model
static-embeddings       Encode text using Static Embeddings model
xlm-roberta             Encode text using XLMRoberta model
word2vec                Encode word using Word2Vec model

Command line options:

--model-id <model-id>                       Id of the model to use
--model-file <model-file>                   Path to the model file (only for `Word2Vec`)
--text <text>                               Text to encode
--max-length <max-length>                   Maximum length of the input (not for `Word2Vec`)
-h, --help                                  Show help information.

Code Formatting

This project uses swift-format. To format the code, run:

swift format . -i -r --configuration .swift-format

Acknowledgements

This project is based on and uses some of the code from:

mlx-embeddings