USearch

更小 & 更快的单文件
用于向量的相似性搜索和聚类引擎 & 🔜 文本

空间 • 二进制 • 概率 • 用户自定义指标
C++ 11 • Python 3 • JavaScript • Java • Rust • C 99 • Objective-C • Swift • C# • GoLang • Wolfram
Linux • MacOS • Windows • iOS • Android • WebAssembly • SQLite3

✅ 快 10 倍 比 FAISS 更快的 HNSW 实现。
✅ 简单且可扩展的单个 C++11 头文件库。
✅ 受谷歌等巨头以及 ClickHouse 和 DuckDB 等数据库信任。
✅ SIMD 优化和带有 JIT 编译的用户自定义指标。
✅ 硬件无关的 f16 & i8 - 半精度和四分之一精度支持。
✅ 从磁盘查看大型索引，无需加载到 RAM 中。
✅ 异构查找、重命名/重新标记和即时删除。
✅ 用于基因组学和化学应用的二进制 Tanimoto 和 Sorensen 系数。
✅ 使用 uint40_t 的空间高效点云，可容纳 4B+ 大小。
✅ 兼容 OpenMP 和自定义“执行器”，以实现细粒度并行。
✅ 语义搜索和连接。
🔄 近乎实时的聚类和子聚类，适用于数万或数百万个集群。

技术见解和相关文章

与 FAISS 的比较

FAISS 是高性能向量搜索引擎的广泛认可的标准。USearch 和 FAISS 都采用了相同的 HNSW 算法，但它们的设计原则显着不同。USearch 结构紧凑，兼容性广泛，且不牺牲性能，主要侧重于用户自定义指标和更少的依赖项。

	FAISS	USearch	提升
索引时间 ⁰
1 亿 96 维 `f32`、`f16`、`i8` 向量	2.6 · 2.6 · 2.6 小时	0.3 · 0.2 · 0.2 小时	9.6 · 10.4 · 10.7 倍
1 亿 1536 维 `f32`、`f16`、`i8` 向量	5.0 · 4.1 · 3.8 小时	2.1 · 1.1 · 0.8 小时	2.3 · 3.6 · 4.4 倍

代码库长度 ¹	84 K SLOC	3 K SLOC	可维护
支持的指标 ²	9 个固定指标	任何指标	可扩展
支持的语言 ³	C++、Python	10 种语言	便携
支持的 ID 类型 ⁴	32 位、64 位	32 位、40 位、64 位	高效
过滤 ⁵	黑名单	任何谓词	可组合
所需依赖项 ⁶	BLAS、OpenMP	-	轻量级
绑定 ⁷	SWIG	原生	低延迟
Python 绑定大小 ⁸	~ 10 MB	< 1 MB	可部署

⁰ 在 Intel Sapphire Rapids 上进行了测试，使用最简单的内积距离、相等的召回率和内存消耗，同时提供了远超的搜索速度。¹ 相对于 faiss/，usearch/ 的代码库更短，使项目更易于维护和审计。² 用户自定义指标允许您为各种应用自定义搜索，从 GIS 到为来自多个 AI 模型的复合嵌入或混合全文和语义搜索创建自定义指标。³ 使用 USearch，您可以在各种编程语言中重用相同的预构建索引。⁴ 40 位整数允许您存储 4B+ 向量，而无需为邻近图中的每个邻居引用分配 8 个字节。⁵ 使用 USearch，索引可以与任意外部容器（如 Bloom 过滤器或第三方数据库）结合使用，以在索引遍历期间过滤掉不相关的键。⁶ 缺少强制性依赖项使 USearch 更加便携。⁷ 原生绑定引入了比更直接的方法更低的调用延迟。⁸ 更轻的绑定使下载和部署更快。

基本功能与 FAISS 相同，如果您曾研究过近似最近邻搜索，则界面必须很熟悉。

# pip install usearch

import numpy as np
from usearch.index import Index

index = Index(ndim=3)               # Default settings for 3D vectors
vector = np.array([0.2, 0.6, 0.4])  # Can be a matrix for batch operations
index.add(42, vector)               # Add one or many vectors in parallel
matches = index.search(vector, 10)  # Find 10 nearest neighbors

assert matches[0].key == 42
assert matches[0].distance <= 0.001
assert np.allclose(index[42], vector, atol=0.1) # Ensure high tolerance in mixed-precision comparisons

始终提供更多设置，并且 API 旨在尽可能灵活。默认的存储/量化级别取决于硬件以提高效率，但建议大多数现代 CPU 使用 bf16。

index = Index(
    ndim=3, # Define the number of dimensions in input vectors
    metric='cos', # Choose 'l2sq', 'ip', 'haversine' or other metric, default = 'cos'
    dtype='bf16', # Store as 'f64', 'f32', 'f16', 'i8', 'b1'..., default = None
    connectivity=16, # Optional: Limit number of neighbors per graph node
    expansion_add=128, # Optional: Control the recall of indexing
    expansion_search=64, # Optional: Control the quality of the search
    multi=False, # Optional: Allow multiple vectors per key, default = False
)

从磁盘序列化和提供 `索引`

USearch 支持多种形式的序列化

序列化到用路径定义的文件中。
序列化到用回调定义的流中，以增量方式序列化或重建。
序列化到固定长度的缓冲区或支持随机访问的内存映射文件中。

后者允许您从外部内存提供索引，使您能够优化服务器选择，以提高索引速度和降低服务成本。这可以在 AWS 和其他公共云上实现 20 倍的成本降低。

index.save("index.usearch")

loaded_copy = index.load("index.usearch")
view = Index.restore("index.usearch", view=True)

other_view = Index(ndim=..., metric=...)
other_view.view("index.usearch")

精确搜索 vs. 近似搜索

当精确的暴力搜索变得过于资源密集时，主要使用近似搜索方法，例如 HNSW。这通常发生在集合中有数百万个条目时。对于较小的集合，我们使用 search 方法提供更直接的方法。

from usearch.index import search, MetricKind, Matches, BatchMatches
import numpy as np

# Generate 10'000 random vectors with 1024 dimensions
vectors = np.random.rand(10_000, 1024).astype(np.float32)
vector = np.random.rand(1024).astype(np.float32)

one_in_many: Matches = search(vectors, vector, 50, MetricKind.L2sq, exact=True)
many_in_many: BatchMatches = search(vectors, vectors, 50, MetricKind.L2sq, exact=True)

如果您传递 exact=True 参数，系统将完全绕过索引，并使用来自 SimSIMD 的 SIMD 优化相似性指标对整个数据集执行暴力搜索。与 Google Colab 中的 FAISS 的 IndexFlatL2 相比，USearch 可能会提供高达 20 倍的性能提升

faiss.IndexFlatL2: 55.3 毫秒。
usearch.index.search: 2.54 毫秒。

用户自定义指标

虽然大多数向量搜索包仅关注两个指标：“内积距离”和“欧几里得距离”，但 USearch 允许任意用户自定义指标。这种灵活性允许您为各种应用自定义搜索，从使用罕见的 Haversine 距离计算地理空间坐标，到为来自多个 AI 模型的复合嵌入（如联合图像-文本嵌入）创建自定义指标。您可以使用 Numba、Cppyy 或 PeachPy 来定义您的自定义指标，即使在 Python 中也是如此。

from numba import cfunc, types, carray
from usearch.index import Index, MetricKind, MetricSignature, CompiledMetric

@cfunc(types.float32(types.CPointer(types.float32), types.CPointer(types.float32)))
def python_inner_product(a, b):
    a_array = carray(a, ndim)
    b_array = carray(b, ndim)
    c = 0.0
    for i in range(ndim):
        c += a_array[i] * b_array[i]
    return 1 - c

metric = CompiledMetric(pointer=python_inner_product.address, kind=MetricKind.IP, signature=MetricSignature.ArrayArray)
index = Index(ndim=ndim, metric=metric, dtype=np.float32)

在 C、C++ 和 Rust 接口中更容易实现类似的效果。此外，与索引高维空间的旧方法（如 KD 树和局部敏感哈希）不同，HNSW 不需要向量长度相同。它们只需要具有可比性。因此，您可以将其应用于模糊应用中，例如搜索相似的集合或模糊文本匹配，使用 GZip 压缩比作为距离函数。

过滤和谓词函数

有时您可能希望将搜索结果与某些外部数据库进行交叉引用，或根据某些标准对其进行过滤。在大多数引擎中，您必须手动执行分页请求，连续过滤结果。在 USearch 中，您只需将谓词函数传递给搜索方法，该函数将直接在图遍历期间应用。在 Rust 中，它看起来像这样

let is_odd = |key: Key| key % 2 == 1;
let query = vec![0.2, 0.1, 0.2, 0.1, 0.3];
let results = index.filtered_search(&query, 10, is_odd).unwrap();
assert!(
    results.keys.iter().all(|&key| key % 2 == 1),
    "All keys must be odd"
);

内存效率、降采样和量化

训练量化模型和降维是加速向量搜索的常用方法。然而，这些方法有时才可靠，可能会显着影响数据的统计属性，并且如果您的分布发生变化，则需要定期调整。相反，我们专注于低精度降采样向量上的高精度算术。相同的索引以及 add 和 search 操作将在 f64_t、f32_t、f16_t、i8_t 和单比特 b1x8_t 表示之间自动降采样或升采样。您可以使用以下命令来检查是否启用了硬件加速

$ python -c 'from usearch.index import Index; print(Index(ndim=768, metric="cos", dtype="f16").hardware_acceleration)'
> sapphire
$ python -c 'from usearch.index import Index; print(Index(ndim=166, metric="tanimoto").hardware_acceleration)'
> ice

在大多数情况下，建议在现代硬件上使用半精度浮点数。启用量化后，“get”类函数将无法恢复原始数据，因此您可能需要在其他地方复制原始向量。当量化为 i8_t 整数时，请注意它仅对类余弦指标有效。作为量化过程的一部分，向量被归一化为单位长度，然后缩放到 [-127, 127] 范围以占据完整的 8 位范围。当量化为 b1x8_t 单比特表示时，请注意它仅对二进制指标（如 Jaccard、Hamming 等）有效。作为量化过程的一部分，大于零的标量分量设置为 true，其余设置为 false。

使用较小的数字类型将为您节省存储向量所需的 RAM，但您也可以压缩形成我们邻近图的邻居列表。默认情况下，使用 32 位 uint32_t 来枚举这些，如果您需要处理超过 40 亿个条目，则这不足够。对于这种情况，我们提供自定义的 uint40_t 类型，它仍然比常用的 8 字节整数节省 37.5% 的空间效率，并且可以扩展到 1 万亿个条目。

用于多索引查找的 `索引`

对于针对数十亿甚至数万亿向量的更大工作负载，并行多索引查找变得非常宝贵。您可以构建多个较小的索引并将它们一起查看，而不是构建一个广泛的索引。

from usearch.index import Indexes

multi_index = Indexes(
    indexes: Iterable[usearch.index.Index] = [...],
    paths: Iterable[os.PathLike] = [...],
    view: bool = False,
    threads: int = 0,
)
multi_index.search(...)

聚类

索引构建完成后，USearch 可以执行 K 最近邻聚类，速度比独立的聚类库（如 SciPy、UMap 和 tSNE）快得多。PCA 降维也是如此。本质上，索引 本身可以看作是一个聚类，允许迭代深化。

clustering = index.cluster(
    min_count=10, # Optional
    max_count=15, # Optional
    threads=..., # Optional
)

# Get the clusters and their sizes
centroid_keys, sizes = clustering.centroids_popularity

# Use Matplotlib to draw a histogram
clustering.plot_centroids_popularity()

# Export a NetworkX graph of the clusters
g = clustering.network

# Get members of a specific cluster
first_members = clustering.members_of(centroid_keys[0])

# Deepen into that cluster, splitting it into more parts, all the same arguments supported
sub_clustering = clustering.subcluster(min_count=..., max_count=...)

生成的聚类与 K-Means 或其他传统方法并不完全相同，但用途相同。或者，在 100 万个点的数据集上使用 Scikit-Learn，人们可能预计查询需要几分钟到几小时不等，具体取决于您要突出显示的集群数量。对于 50'000 个集群，USearch 和传统聚类方法之间的性能差异很容易达到 100 倍。

连接、一对一、一对多和多对多映射

近年来，一个重要的问题是 AI 将如何改变数据库和数据管理的世界。大多数数据库仍在努力实现高质量的模糊搜索，而他们知道的唯一类型的连接是确定性的。“连接”与搜索每个条目不同，它需要一对一映射，禁止在单独的搜索结果之间发生冲突。

精确搜索	模糊搜索	语义搜索？
精确连接	模糊连接？	语义连接？？

使用 USearch，可以实现亚二次复杂度近似、模糊和语义连接。这在数据库管理软件常见的任何模糊匹配任务中都很有用。

men = Index(...)
women = Index(...)
pairs: dict = men.join(women, max_proposals=0, exact=False)

在帖子中阅读更多内容：用于语义搜索的组合稳定婚姻 💍

功能

到目前为止，所有绑定都支持核心功能。更广泛的功能根据请求移植。在某些情况下，例如批量操作，功能对等没有意义，因为宿主语言具有完整的多线程功能，并且 USearch 索引结构在设计上是并发的，因此用户可以以最适合其应用程序的方式实现批处理/调度/负载平衡。

	C++ 11	Python 3	C 99	Java	JavaScript	Rust	GoLang	Swift
添加、搜索、删除	✅	✅	✅	✅	✅	✅	✅	✅
保存、加载、查看	✅	✅	✅	✅	✅	✅	✅	✅
用户自定义指标	✅	✅	✅	❌	❌	❌	❌	❌
批量操作	❌	✅	❌	❌	✅	❌	❌	❌
过滤器谓词	✅	❌	✅	❌	❌	✅	❌	❌
连接	✅	✅	❌	❌	❌	❌	❌	❌
可变长度向量	✅	❌	❌	❌	❌	❌	❌	❌
4B+ 容量	✅	❌	❌	❌	❌	❌	❌	❌

应用示例

USearch + UForm + UCall = 多模态语义搜索

AI 的应用数量不断增长，但最酷的经典想法之一是将其用于语义搜索。可以使用编码器模型（如多模态 UForm）和 Web 编程框架（如 UCall），并仅用 20 行 Python 代码构建一个文本到图像的搜索平台。

from ucall import Server
from uform import get_model, Modality
from usearch.index import Index

import numpy as np
import PIL as pil

processors, models = get_model('unum-cloud/uform3-image-text-english-small')
model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]

server = Server()
index = Index(ndim=256)

@server
def add(key: int, photo: pil.Image.Image):
    image = processor_image(photo)
    vector = model_image(image)
    index.add(key, vector.flatten(), copy=True)

@server
def search(query: str) -> np.ndarray:
    tokens = processor_text(query)
    vector = model_text(tokens)
    matches = index.search(vector.flatten(), 3)
    return matches.keys

server.run()

类似的体验也可以在其他语言和客户端实现，从而消除网络延迟。对于 Swift 和 iOS，请查看 ashvardanian/SwiftSemanticSearch 存储库。

GitHub 上提供了更完整的 Streamlit 演示。我们预处理了一些常用的数据集，清理了图像，生成了向量，并预构建了索引。

数据集	模态	图像	下载
Unsplash	图像和描述	2.5 万	HuggingFace / Unum
概念字幕	图像和描述	3 百万	HuggingFace / Unum
Arxiv	标题和摘要	2 百万	HuggingFace / Unum

USearch + RDKit = 分子搜索

比较分子图并搜索相似结构既昂贵又缓慢。它可以看作是 NP 完全子图同构问题的特例。幸运的是，存在特定于领域的近似方法。化学中常用的方法是从 SMILES 生成结构，然后将其哈希为二进制指纹。后者可以使用二进制相似性指标（如 Tanimoto 系数）进行搜索。以下是使用 RDKit 包的示例。

from usearch.index import Index, MetricKind
from rdkit import Chem
from rdkit.Chem import AllChem

import numpy as np

molecules = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO')]
encoder = AllChem.GetRDKitFPGenerator()

fingerprints = np.vstack([encoder.GetFingerprint(x) for x in molecules])
fingerprints = np.packbits(fingerprints, axis=1)

index = Index(ndim=2048, metric=MetricKind.Tanimoto)
keys = np.arange(len(molecules))

index.add(keys, fingerprints)
matches = index.search(fingerprints, 10)

该方法用于构建“USearch 分子”，这是最大的化学信息学数据集之一，包含 70 亿个小分子和 280 亿个指纹。

USearch + POI 坐标 = GIS 应用

与向量和分子搜索类似，USearch 可用于地理空间信息系统。Haversine 距离开箱即用，但您也可以定义更复杂的关系，例如考虑地球扁率的 Vincenty 公式。

from numba import cfunc, types, carray
import math

# Define the dimension as 2 for latitude and longitude
ndim = 2

# Signature for the custom metric
signature = types.float32(
    types.CPointer(types.float32),
    types.CPointer(types.float32))

# WGS-84 ellipsoid parameters
a = 6378137.0  # major axis in meters
f = 1 / 298.257223563  # flattening
b = (1 - f) * a  # minor axis

def vincenty_distance(a_ptr, b_ptr):
    a_array = carray(a_ptr, ndim)
    b_array = carray(b_ptr, ndim)
    lat1, lon1, lat2, lon2 = a_array[0], a_array[1], b_array[0], b_array[1]
    L, U1, U2 = lon2 - lon1, math.atan((1 - f) * math.tan(lat1)), math.atan((1 - f) * math.tan(lat2))
    sinU1, cosU1, sinU2, cosU2 = math.sin(U1), math.cos(U1), math.sin(U2), math.cos(U2)
    lambda_, iterLimit = L, 100
    while iterLimit > 0:
        iterLimit -= 1
        sinLambda, cosLambda = math.sin(lambda_), math.cos(lambda_)
        sinSigma = math.sqrt((cosU2 * sinLambda) ** 2 + (cosU1 * sinU2 - sinU1 * cosU2 * cosLambda) ** 2)
        if sinSigma == 0: return 0.0  # Co-incident points
        cosSigma, sigma = sinU1 * sinU2 + cosU1 * cosU2 * cosLambda, math.atan2(sinSigma, cosSigma)
        sinAlpha, cos2Alpha = cosU1 * cosU2 * sinLambda / sinSigma, 1 - (cosU1 * cosU2 * sinLambda / sinSigma) ** 2
        cos2SigmaM = cosSigma - 2 * sinU1 * sinU2 / cos2Alpha if not math.isnan(cosSigma - 2 * sinU1 * sinU2 / cos2Alpha) else 0  # Equatorial line
        C = f / 16 * cos2Alpha * (4 + f * (4 - 3 * cos2Alpha))
        lambda_, lambdaP = L + (1 - C) * f * (sinAlpha * (sigma + C * sinSigma * (cos2SigmaM + C * cosSigma * (-1 + 2 * cos2SigmaM ** 2)))), lambda_
        if abs(lambda_ - lambdaP) <= 1e-12: break
    if iterLimit == 0: return float('nan')  # formula failed to converge
    u2 = cos2Alpha * (a ** 2 - b ** 2) / (b ** 2)
    A = 1 + u2 / 16384 * (4096 + u2 * (-768 + u2 * (320 - 175 * u2)))
    B = u2 / 1024 * (256 + u2 * (-128 + u2 * (74 - 47 * u2)))
    deltaSigma = B * sinSigma * (cos2SigmaM + B / 4 * (cosSigma * (-1 + 2 * cos2SigmaM ** 2) - B / 6 * cos2SigmaM * (-3 + 4 * sinSigma ** 2) * (-3 + 4 * cos2SigmaM ** 2)))
    s = b * A * (sigma - deltaSigma)
    return s / 1000.0  # Distance in kilometers

# Example usage:
index = Index(ndim=ndim, metric=CompiledMetric(
    pointer=vincenty_distance.address,
    kind=MetricKind.Haversine,
    signature=MetricSignature.ArrayArray,
))

集成和用户

引用

@software{Vardanian_USearch_2023,
doi = {10.5281/zenodo.7949416},
author = {Vardanian, Ash},
title = {{USearch by Unum Cloud}},
url = {https://github.com/unum-cloud/usearch},
version = {2.17.2},
year = {2023},
month = oct,
}