StringZilla 🦖

由于低效的字符串操作，全球每年至少浪费 1 亿美元。典型的代码库逐字符处理字符串，导致过多的分支和数据依赖，忽略了现代 CPU 90% 的潜力。LibC 则有所不同。它尝试利用 SIMD 指令来加速某些操作，并且经常被高级语言、运行时和数据库使用。但它并不完美。1️⃣ 首先，即使在常见的硬件上，包括超过 10 亿的 64 位 ARM CPU，strstr 和 memmem 等常用函数也只能达到 CPU 吞吐量的 1/3。2️⃣ 其次，SIMD 的覆盖范围不一致：正向扫描的加速并不能保证反向搜索的速度。3️⃣ 最后，大多数高级语言并不总是能使用 LibC，因为字符串通常不是以 NULL 结尾，或者可能在字符串中间包含 Unicode “零” 字符。这就是创建 StringZilla 的原因。旨在提供可预测的高性能，并可移植到任何现代平台、操作系统和编程语言。

StringZilla 是字符串库中的 GodZilla，使用 SIMD 和 SWAR 来加速现代 CPU 上的字符串操作。它比 C、C++、Python 和其他语言中的默认字符串库甚至其他 SIMD 加速的字符串库快 10 倍，同时涵盖广泛的功能。它可以加速精确和模糊字符串匹配、编辑距离计算、排序、延迟计算的范围以避免内存分配，甚至随机字符串生成器。

🐂 C : 在 C 99 中将 LibC 的 <string.h> 升级到 <stringzilla.h>
🐉 C++: 在 C++ 11 中将 STL 的 <string> 升级到 <stringzilla.hpp>
🐍 Python: 将你的 str 升级到更快的 Str
🍎 Swift: 使用 String+StringZilla 扩展
🦀 Rust: 使用 StringZilla traits crate
🐚 Shell: 使用 sz_ 前缀加速常用 CLI 工具
📚 研究人员？跳转到算法 & 设计决策
💡 考虑贡献？寻找 “good first issues”
🤝 并查看指南以设置环境
想要更多绑定或功能？请告诉我！

目标用户？

对于解析大型数据集的数据工程师，例如 CommonCrawl、RedPajama 或 LAION。
对于优化其应用程序和服务中的字符串的软件工程师。
对于寻找 USearch 的编辑距离的生物信息学家和搜索工程师。
对于优化 LIKE、ORDER BY 和 GROUP BY 操作的 DBMS 开发人员。
对于需要 SWAR 基线用于字符串处理功能的硬件设计人员。
对于学习 SIMD/SWAR 在非数据并行操作中的应用的学生。

性能

C	C++	Python	StringZilla
查找文本中随机单词的第一次出现，≅ 5 个字节长
`strstr` ¹ x86: 7.4 · arm: 2.0 GB/s	`.find` x86: 2.9 · arm: 1.6 GB/s	`.find` x86: 1.1 · arm: 0.6 GB/s	`sz_find` x86: 10.6 · arm: 7.1 GB/s
查找文本中随机单词的最后一次出现，≅ 5 个字节长
⚪	`.rfind` x86: 0.5 · arm: 0.4 GB/s	`.rfind` x86: 0.9 · arm: 0.5 GB/s	`sz_rfind` x86: 10.8 · arm: 6.7 GB/s
拆分由 `\n` 或 `\r` 分隔的行 ²
`strcspn` ¹ x86: 5.42 · arm: 2.19 GB/s	`.find_first_of` x86: 0.59 · arm: 0.46 GB/s	`re.finditer` x86: 0.06 · arm: 0.02 GB/s	`sz_find_charset` x86: 4.08 · arm: 3.22 GB/s
查找 6 个空格中的任何一个的最后一次出现 ²
⚪	`.find_last_of` x86: 0.25 · arm: 0.25 GB/s	⚪	`sz_rfind_charset` x86: 0.43 · arm: 0.23 GB/s
来自给定字母表的随机字符串，20 个字节长 ⁵
`rand() % n` x86: 18.0 · arm: 9.4 MB/s	`std::uniform_int_distribution` x86: 47.2 · arm: 20.4 MB/s	`join(random.choices(...))` x86: 13.3 · arm: 5.9 MB/s	`sz_generate` x86: 56.2 · arm: 25.8 MB/s
使用查找表转换映射字符
⚪	`std::transform` x86: 3.81 · arm: 2.65 GB/s	`str.translate` x86: 260.0 · arm: 140.0 MB/s	`sz_look_up_transform` x86: 21.2 · arm: 8.5 GB/s
获取排序后的顺序，≅ 8 百万个英文单词 ⁶
`qsort_r` x86: 3.55 · arm: 5.77 s	`std::sort` x86: 2.79 · arm: 4.02 s	`numpy.argsort` x86: 7.58 · arm: 13.00 s	`sz_sort` x86: 1.91 · arm: 2.37 s
莱文斯坦编辑距离，≅ 5 个字节长
⚪	⚪	通过 `jellyfish` ³ x86: 1,550 · arm: 2,220 ns	`sz_edit_distance` x86: 99 · arm: 180 ns
Needleman-Wunsch 对齐分数，≅ 10 K 个氨基酸长
⚪	⚪	通过 `biopython` ⁴ x86: 257 · arm: 367 ms	`sz_alignment_score` x86: 73 · arm: 177 ms

StringZilla 具有许多功能，其中大部分功能都包含在 C、C++、Python 和其他语言的基准测试中。你可以在 ./scripts 目录中找到这些基准测试，使用说明列在 CONTRIBUTING.md 文件中。值得注意的是，如果 CPU 支持未对齐加载，即使是 64 位 SWAR 后端也比标准库快。

大多数基准测试是在 1 GB 的英语文本语料库上进行的，平均字长为 6 个字符。该代码使用 GCC 12 和 glibc v2.35 编译。在基于 Arm 的 Graviton3 AWS c7g 实例和 r7iz Intel Sapphire Rapids 上执行的基准测试。大多数现代基于 Arm 的 64 位 CPU 将具有类似的相对加速。x86 CPU 的差异会更大。¹ 与其他库不同，LibC 要求字符串以 NULL 结尾。² ASCII 集中有六个空格： \t\n\v\f\r。 Python 和其他标准库有专门的函数用于这些空格。³ 大多数 Python 字符串库也是用 C 实现的。⁴ 与 BioPython 的其余部分不同，对齐分数计算是用 C 实现的。⁵ 所有模运算均使用 uint8_t 进行，以便编译器有更多优化机会。 C++ STL 和 StringZilla 基准测试使用 64 位梅森旋转算法作为生成器。对于 C、C++ 和 StringZilla，使用了字符串的就地更新。在 Python 中，每个字符串都必须分配为一个新对象，这不太公平。⁶ 与普遍的看法相反，Python 的默认 sorted 函数比 C 和 C++ 标准库运行得更快。这适用于大型字符串列表或元组，但一旦你需要更复杂的逻辑（例如按字符串键对字典进行排序或生成“排序顺序”排列），它就会失败。后者在数据库引擎中非常常见，并且与 numpy.argsort 最相似。当前的 StringZilla 解决方案至少可以快 4 倍，而不会损失通用性。

功能

StringZilla 与大多数现代 CPU 兼容，并提供广泛的功能。

适用于小端和大端体系结构。
适用于 32 位和 64 位硬件体系结构。
与 ASCII 和 UTF-8 编码兼容。

并非所有功能都可在所有绑定中使用。如果你需要尚未实现的功能，请考虑贡献。

	成熟度	C 99	C++ 11	Python	Swift	Rust
子字符串搜索	🌳	✅	✅	✅	✅	✅
字符集搜索	🌳	✅	✅	✅	✅	✅
编辑距离	🧐	✅	✅	✅	✅	⚪
小字符串类	🧐	✅	✅	❌	❌	⚪
排序 & 序列操作	🚧	✅	✅	✅	⚪	⚪
惰性范围，压缩数组	🧐	⚪	✅	✅	⚪	⚪
哈希 & 指纹	🚧	✅	✅	⚪	⚪	⚪

🌳 部分已在生产中使用。 🧐 部分处于 Beta 阶段。 🚧 部分正在积极开发中，并且可能会在后续版本中中断。 ✅ 已实现。 ⚪ 正在考虑。 ❌ 不打算实现。

快速开始：Python 🐍

Python 绑定可在 PyPI 上获得，可以使用 pip 安装。你可以立即使用以下命令检查已安装的版本和使用的硬件功能

pip install stringzilla
python -c "import stringzilla; print(stringzilla.__version__)"
python -c "import stringzilla; print(stringzilla.__capabilities__)"

基本用法

如果你使用过 Python 的 str、bytes、bytearray、memoryview 类，你就会知道会发生什么。 StringZilla 的 Str 类是这些类的混合体，为字节数组提供类似 str 的接口。

from stringzilla import Str, File

text_from_str = Str('some-string') # no copies, just a view
text_from_bytes = Str(b'some-array') # no copies, just a view
text_from_file = Str(File('some-file.txt')) # memory-mapped file

import numpy as np
alphabet_array = np.arange(ord("a"), ord("z"), dtype=np.uint8)
text_from_array = Str(memoryview(alphabet_array))

File 类从持久内存中内存映射一个文件，而无需将其副本加载到 RAM 中。该文件的内容将保持不变，并且映射可以由多个 Python 进程同时共享。一个标准的数据集预处理用例是将 Common Crawl 等大型文本数据集映射到内存中，生成子进程，并在它们之间分配作业。

基本操作

长度：len(text) -> int
索引：text[42] -> str
切片：text[42:46] -> Str
子字符串检查：'substring' in text -> bool
哈希：hash(text) -> int
字符串转换：str(text) -> str

高级操作

import sys

x: bool = text.contains('substring', start=0, end=sys.maxsize)
x: int = text.find('substring', start=0, end=sys.maxsize)
x: int = text.count('substring', start=0, end=sys.maxsize, allowoverlap=False)
x: str = text.decode(encoding='utf-8', errors='strict')
x: Strs = text.split(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.rsplit(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.splitlines(keeplinebreaks=False, maxsplit=sys.maxsize)

重要的是要注意，最后一个函数的行为与 Python 的 str.splitlines 略有不同。本机版本匹配 \n、\r、\v 或 \x0b、\f 或 \x0c、\x1c、\x1d、\x1e、\x85、\r\n、\u2028、\u2029，包括 3 个两个字节长的 runes。 StringZilla 版本仅匹配 \n、\v、\f、\r、\x1c、\x1d、\x1e、\x85，避免了两个字节长的 runes。

字符集操作

Python 字符串本身不支持字符集操作。这迫使人们使用正则表达式，这既慢又难以阅读。为了避免需要 re.finditer，StringZilla 提供了以下接口

x: int = text.find_first_of('chars', start=0, end=sys.maxsize)
x: int = text.find_last_of('chars', start=0, end=sys.maxsize)
x: int = text.find_first_not_of('chars', start=0, end=sys.maxsize)
x: int = text.find_last_not_of('chars', start=0, end=sys.maxsize)
x: Strs = text.split_charset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.rsplit_charset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)

你还可以使用查找表 (LUT) 转换字符串，将其映射到不同的字符集。这将导致一个副本 - str 用于 str 输入，bytes 用于其他类型。

x: str = text.translate('chars', {}, start=0, end=sys.maxsize, inplace=False)
x: bytes = text.translate(b'chars', {}, start=0, end=sys.maxsize, inplace=False)

出于效率原因，请将 LUT 作为字符串或字节对象传递，而不是作为字典传递。这在处理二进制数据的高吞吐量应用程序中非常有用，包括生物信息学和图像处理。这是一个例子

import stringzilla as sz
look_up_table = bytes(range(256)) # Identity LUT
image = open("/image/path.jpeg", "rb").read()
sz.translate(image, look_up_table, inplace=True)

集合级别操作

一旦分割成一个 Strs 对象，你可以对切片进行排序、打乱和重组，而且内存占用极小。如果所有块都位于连续的内存区域中，则每个块的内存开销可以低至 4 字节。

lines: Strs = text.split(separator='\n') # 4 bytes per line overhead for under 4 GB of text
batch: Strs = lines.sample(seed=42) # 10x faster than `random.choices`
lines.shuffle(seed=42) # or shuffle all lines in place and shard with slices
# WIP: lines.sort() # explodes to 16 bytes per line overhead for any length text
# WIP: sorted_order: tuple = lines.argsort() # similar to `numpy.argsort`

在使用 RedPajama 处理 200 亿个带注释的英文文档时，你只需要 160 GB 的 RAM，而不是 TB 级别。加载后，数据将被内存映射，并且可以在多个 Python 进程之间重用，而无需复制。当然，你可以使用切片来导航数据集并在多个 worker 之间分片它。

lines[::3] # every third line
lines[1::1] # every odd line
lines[:-100:-1] # last 100 lines in reverse order

迭代器和内存效率

Python 的操作（如 split() 和 readlines()）会立即物化一个已复制部分的 list。对于大型数据集来说，这可能非常耗费内存。StringZilla 通过将现有内存区域视为子字符串来节省大量内存，但通过使用延迟计算的迭代器可以节省更多内存。

x: SplitIterator[Str] = text.split_iter(separator=' ', keepseparator=False)
x: SplitIterator[Str] = text.rsplit_iter(separator=' ', keepseparator=False)
x: SplitIterator[Str] = text.split_charset_iter(separator='chars', keepseparator=False)
x: SplitIterator[Str] = text.rsplit_charset_iter(separator='chars', keepseparator=False)

对于分词，StringZilla 的内存效率可以轻松地比原生 Python 类高 10 倍。使用延迟操作，实际上变得免费。

import stringzilla as sz
%load_ext memory_profiler

text = open("enwik9.txt", "r").read() # 1 GB, mean word length 7.73 bytes
%memit text.split() # increment: 8670.12 MiB (152 ms)
%memit sz.split(text) # increment: 530.75 MiB (25 ms)
%memit sum(1 for _ in sz.split_iter(text)) # increment: 0.00 MiB

底层 Python API

除了调用 Str 和 Strs 类的方法外，你还可以直接在 str 和 bytes 实例上调用全局函数。假设 StringZilla CPython 绑定没有使用任何像 SWIG 或 PyBind 这样的中间工具实现，则调用延迟应该与原生类相似。

import stringzilla as sz

contains: bool = sz.contains("haystack", "needle", start=0, end=sys.maxsize)
offset: int = sz.find("haystack", "needle", start=0, end=sys.maxsize)
count: int = sz.count("haystack", "needle", start=0, end=sys.maxsize, allowoverlap=False)

编辑距离

assert sz.edit_distance("apple", "aple") == 1 # skip one ASCII character
assert sz.edit_distance("αβγδ", "αγδ") == 2 # skip two bytes forming one rune
assert sz.edit_distance_unicode("αβγδ", "αγδ") == 1 # one unicode rune

几个 Python 库提供了编辑距离计算。它们中的大多数是用 C 实现的，但并不总是像 StringZilla 那样快。以大约 10,000 个字符长的 1,000 个长蛋白为例，仅计算 100 个距离

JellyFish：62.3 秒
EditDistance：32.9 秒
StringZilla：0.8 秒

此外，你可以传递自定义的替换矩阵来计算 Needleman-Wunsch 比对分数。这项任务在生物信息学和计算生物学中非常常见。它在 BioPython 中原生支持，并且它的 BLOSUM 矩阵可以转换为 StringZilla 的格式。或者，你可以使用 NumPy 构建一个任意的 256x256 成本矩阵。根据参数，结果可能等于负的 Levenshtein 距离。

import numpy as np
import stringzilla as sz

costs = np.zeros((256, 256), dtype=np.int8)
costs.fill(-1)
np.fill_diagonal(costs, 0)

assert sz.alignment_score("first", "second", substitution_matrix=costs, gap_score=-1) == -sz.edit_distance(a, b)

使用与 Levenshtein 距离基准测试相同的蛋白质

BioPython：25.8 秒
StringZilla：7.8 秒

§ 从 BioPython 转换为 StringZilla 的示例。

import numpy as np
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = 1
aligner.extend_gap_score = 1

# Convert the matrix to NumPy
subs_packed = np.array(aligner.substitution_matrix).astype(np.int8)
subs_reconstructed = np.zeros((256, 256), dtype=np.int8)

# Initialize all banned characters to a the largest possible penalty
subs_reconstructed.fill(127)
for packed_row, packed_row_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
    for packed_column, packed_column_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
        reconstructed_row = ord(packed_row_aminoacid)
        reconstructed_column = ord(packed_column_aminoacid)
        subs_reconstructed[reconstructed_row, reconstructed_column] = subs_packed[packed_row, packed_column]

# Let's pick two examples for of tri-peptides (made of 3 aminoacids)
glutathione = "ECG" # Need to rebuild human tissue?
thyrotropin_releasing_hormone = "QHP" # Or to regulate your metabolism?

assert sz.alignment_score(
    glutathione,
    thyrotropin_releasing_hormone, 
    substitution_matrix=subs_reconstructed, 
    gap_score=1) == aligner.score(glutathione, thyrotropin_releasing_hormone) # Equal to 6

序列化

文件系统

类似于如何使用 File 读取大型文件，其他接口可以用于更快地将字符串转储到磁盘。Str 类具有 write_to 将字符串写入文件，以及 offset_within 获取较大字符串中子字符串视图的整数偏移量以进行导航。

web_archive = Str("<html>...</html><html>...</html>")
_, end_tag, next_doc = web_archive.partition("</html>") # or use `find`
next_doc_offset = next_doc.offset_within(web_archive)
web_archive.write_to("next_doc.html") # no GIL, no copies, just a view

PyArrow

Str 很容易转换为 PyArrow 缓冲区。

from pyarrow import foreign_buffer
from stringzilla import Str

original = "hello"
view = Str(native)
arrow = foreign_buffer(view.address, view.nbytes, view)

这意味着你可以将 Str 转换为 pyarrow.Buffer，将 Strs 转换为 pyarrow.Array，而无需额外的复制。

快速入门：C/C++ 🛠️

C 库是仅包含头文件的，因此你只需将 stringzilla.h 头文件复制到你的项目中即可。同样适用于 C++，你将复制 stringzilla.hpp 头文件。或者，将其添加为子模块，并将其包含在你的构建系统中。

git submodule add https://github.com/ashvardanian/stringzilla.git

或者使用纯 CMake 方法

FetchContent_Declare(stringzilla GIT_REPOSITORY https://github.com/ashvardanian/stringzilla.git)
FetchContent_MakeAvailable(stringzilla)

最后但同样重要的是，你也可以将其作为库安装，并链接它。这种方法对于内联不太好，但为最先进的 CPU 功能带来了动态运行时调度。

C 99 及更新版本的基本用法

存在一个稳定的 C 99 接口，其中所有函数名称都以 sz_ 为前缀。大多数接口都有很好的文档记录，并附带自解释的名称和示例。在某些情况下，硬件特定的重载是可用的，例如 sz_find_avx512 或 sz_find_neon。两者都是 sz_find 的伴侣，第一个用于支持 AVX-512 的 x86 CPU，第二个用于支持 Arm NEON 的 CPU。

#include <stringzilla/stringzilla.h>

// Initialize your haystack and needle
sz_string_view_t haystack = {your_text, your_text_length};
sz_string_view_t needle = {your_subtext, your_subtext_length};

// Perform string-level operations
sz_size_t substring_position = sz_find(haystack.start, haystack.length, needle.start, needle.length);
sz_size_t substring_position = sz_find_avx512(haystack.start, haystack.length, needle.start, needle.length);
sz_size_t substring_position = sz_find_neon(haystack.start, haystack.length, needle.start, needle.length);

// Hash strings
sz_u64_t hash = sz_hash(haystack.start, haystack.length);

// Perform collection level operations
sz_sequence_t array = {your_order, your_count, your_get_start, your_get_length, your_handle};
sz_sort(&array, &your_config);

§ 从 LibC 映射到 StringZilla。

根据设计，StringZilla 与 LibC 有几个显着差异

所有字符串都应该有一个长度，并且不一定是空终止的。
每个操作都有一个反向顺序的对应项。

这样 sz_find 和 sz_rfind 类似于 LibC 中的 strstr 和 strrstr。类似地，sz_find_byte 和 sz_rfind_byte 替换 memchr 和 memrchr。sz_find_charset 映射到 strspn 和 strcspn，而 sz_rfind_charset 在 LibC 中没有兄弟。

LibC 功能	StringZilla 等效项
`memchr(haystack, needle, haystack_length)`, `strchr`	`sz_find_byte(haystack, haystack_length, needle)`
`memrchr(haystack, needle, haystack_length)`	`sz_rfind_byte(haystack, haystack_length, needle)`
`memcmp`, `strcmp`	`sz_order`, `sz_equal`
`strlen(haystack)`	`sz_find_byte(haystack, haystack_length, needle)`
`strcspn(haystack, needles)`	`sz_rfind_charset(haystack, haystack_length, needles_bitset)`
`strspn(haystack, needles)`	`sz_find_charset(haystack, haystack_length, needles_bitset)`
`memmem(haystack, haystack_length, needle, needle_length)`, `strstr`	`sz_find(haystack, haystack_length, needle, needle_length)`
`memcpy(destination, source, destination_length)`	`sz_copy(destination, source, destination_length)`
`memmove(destination, source, destination_length)`	`sz_move(destination, source, destination_length)`
`memset(destination, value, destination_length)`	`sz_fill(destination, destination_length, value)`

C++ 11 及更新版本的基本用法

在 ashvardanian::stringzilla 命名空间中有一个稳定的 C++ 11 接口可用。它带有两个类似 STL 的类：string_view 和 string。第一个是字符串的非拥有视图，第二个是具有小字符串优化的可变字符串。

#include <stringzilla/stringzilla.hpp>

namespace sz = ashvardanian::stringzilla;

sz::string haystack = "some string";
sz::string_view needle = sz::string_view(haystack).substr(0, 4);

auto substring_position = haystack.find(needle); // Or `rfind`
auto hash = std::hash<sz::string_view>{}(haystack); // Compatible with STL's `std::hash`

haystack.end() - haystack.begin() == haystack.size(); // Or `rbegin`, `rend`
haystack.find_first_of(" \v\t") == 4; // Or `find_last_of`, `find_first_not_of`, `find_last_not_of`
haystack.starts_with(needle) == true; // Or `ends_with`
haystack.remove_prefix(needle.size()); // Why is this operation in-place?!
haystack.contains(needle) == true; // STL has this only from C++ 23 onwards
haystack.compare(needle) == 1; // Or `haystack <=> needle` in C++ 20 and beyond

StringZilla 还提供了用于自动类型解析的字符串字面量，类似于 STL

using sz::literals::operator""_sz;
using std::literals::operator""sv;

auto a = "some string"; // char const *
auto b = "some string"sv; // std::string_view
auto b = "some string"_sz; // sz::string_view

内存所有权和小字符串优化

StringZilla 中的大多数操作都不假定任何内存所有权。但是，除了只读的搜索类操作外，StringZilla 还为内存拥有的字符串 "类" 提供了最小的 C 和 C++ 实现。像其他高效的字符串实现一样，它使用小字符串优化 (SSO) 来避免短字符串的堆分配。

typedef union sz_string_t {
    struct internal {
        sz_ptr_t start;
        sz_u8_t length;
        char chars[SZ_STRING_INTERNAL_SPACE]; /// Ends with a null-terminator.
    } internal;

    struct external {
        sz_ptr_t start;
        sz_size_t length;        
        sz_size_t space; /// The length of the heap-allocated buffer.
        sz_size_t padding;
    } external;

} sz_string_t;

正如可以看到的，如果短字符串适合 internal.chars 数组，则可以将其保留在堆栈上。在 2015 年之前，GCC 字符串实现只有 8 个字节，并且只能容纳 7 个字符。如今，不同的 STL 实现对小字符串优化有不同的阈值。与 GCC 类似，StringZilla 的大小为 32 字节，并且与 Clang 类似，它可以将 22 个字符放在堆栈上。如果想避免分支，我们的布局可能是首选。如果你使用不同的编译器，你可能想使用简单的 Gist 检查其 SSO 缓冲区大小。

	GCC 13 中的 `libstdc++`	Clang 17 中的 `libc++`	StringZilla
`sizeof(std::string)`	32	24	32
小字符串容量	15	22	22

自此，这种设计已被移植到许多高级编程语言。例如，Swift 可以在 String 实例本身中存储 15 个字节。StringZilla 在 C 级别实现 SSO，提供 sz_string_t 联合体和用于基本操作的简单 API。

sz_memory_allocator_t allocator;
sz_string_t string;

// Init and make sure we are on stack
sz_string_init(&string);
sz_string_is_on_stack(&string); // == sz_true_k

// Optionally pre-allocate space on the heap for future insertions.
sz_string_grow(&string, 100, &allocator); // == sz_true_k

// Append, erase, insert into the string.
sz_string_expand(&string, 0, "_Hello_", 7, &allocator); // == sz_true_k
sz_string_expand(&string, SZ_SIZE_MAX, "world", 5, &allocator); // == sz_true_k
sz_string_erase(&string, 0, 1);

// Unpacking & introspection.
sz_ptr_t string_start;
sz_size_t string_length;
sz_size_t string_space;
sz_bool_t string_is_external;
sz_string_unpack(string, &string_start, &string_length, &string_space, &string_is_external);
sz_equal(string_start, "Hello_world", 11); // == sz_true_k

// Reclaim some memory.
sz_string_shrink_to_fit(&string, &allocator); // == sz_true_k
sz_string_free(&string, &allocator);

与传统的 C 字符串不同，sz_string_t 允许包含空字符。为了安全地打印这些，也将 string_length 传递给 printf。

printf("%.*s\n", (int)string_length, string_start);

C 标准库有什么问题？

StringZilla 不是 C 标准库的直接替代品。它旨在成为一个更安全和更现代的替代品。从概念上讲

LibC 字符串预计是以 null 结尾的，因此为了在较大字符串的切片上使用高效的 LibC 实现，你必须复制它们，这比原始字符串操作更昂贵。
LibC 功能是不对称的 - 你可以找到字符串中字符的第一次和最后一次出现，但你不能找到子字符串的最后一次出现。
LibC 函数名称通常非常短且难以理解。
LibC 缺少诸如哈希之类的关键功能，并且不为不太关键但相关的操作（如模糊匹配）提供原语。

必须说一下它对 UTF8 的支持。除了单字节 char 类型之外，LibC 还提供 wchar_t

wchar_t 的大小在各个平台上不一致。在 Windows 上，它通常是 16 位（适用于 UTF-16），而在类 Unix 系统上，它通常是 32 位（适用于 UTF-32）。这种不一致可能会导致编写跨平台代码时的可移植性问题。
wchar_t 旨在以固定宽度格式（UTF-16 或 UTF-32）表示宽字符。相比之下，UTF-8 是一种可变长度编码，其中每个字符可以占用 1 到 4 个字节。这种根本区别意味着 wchar_t 和 UTF-8 是不兼容的。

StringZilla 部分解决了这些问题。

C++ 标准库有什么问题？

C++ 代码	评估结果	调用的签名
`"Loose"s.replace(2, 2, "vath"s, 1)`	`"Loathe"` 🤢	`(pos1, count1, str2, pos2)`
`"Loose"s.replace(2, 2, "vath", 1)`	`"Love"` 🥰	`(pos1, count1, str2, count2)`

StringZilla 旨在成为 C++ 标准模板库的直接替代品。也就是说，STL 字符串的一些设计决策极具争议、容易出错且成本高昂。最值得注意的是

replace、insert、erase 和类似函数的参数顺序是不可能猜测的。
仅为范围的一侧抛出 substr 类函数的边界检查异常。
在 substr 类函数中返回字符串副本会导致大量的内存分配。
通过 push_back 类函数进行增量构造会经历太多的分支。
string 和 string_view 方法之间的不一致，例如缺少 remove_prefix 和 remove_suffix。

检查以下断言集，验证 std::string 规范。期望普通开发人员记住 std::string::replace 的 14 个重载是不现实的。

using str = std::string;

assert(str("hello world").substr(6) == "world");
assert(str("hello world").substr(6, 100) == "world"); // 106 is beyond the length of the string, but its OK
assert_throws(str("hello world").substr(100), std::out_of_range);   // 100 is beyond the length of the string
assert_throws(str("hello world").substr(20, 5), std::out_of_range); // 20 is beyond the length of the string
assert_throws(str("hello world").substr(-1, 5), std::out_of_range); // -1 casts to unsigned without any warnings...
assert(str("hello world").substr(0, -1) == "hello world");          // -1 casts to unsigned without any warnings...

assert(str("hello").replace(1, 2, "123") == "h123lo");
assert(str("hello").replace(1, 2, str("123"), 1) == "h23lo");
assert(str("hello").replace(1, 2, "123", 1) == "h1lo");
assert(str("hello").replace(1, 2, "123", 1, 1) == "h2lo");
assert(str("hello").replace(1, 2, str("123"), 1, 1) == "h2lo");
assert(str("hello").replace(1, 2, 3, 'a') == "haaalo");
assert(str("hello").replace(1, 2, {'a', 'b'}) == "hablo");

为了避免这些问题，StringZilla 提供了一个替代的一致接口。它支持有符号参数，并且每个函数不超过 3 个参数。可以使用 SZ_SAFETY_OVER_COMPATIBILITY=1 有条件地禁用标准 API 和我们的替代方案。启用后，标准中的 ~~主观上~~ 危险的重载将被禁用。

using str = sz::string;

str("a:b").front(1) == "a"; // no checks, unlike `substr`
str("a:b").front(2) == "2"; // take first 2 characters
str("a:b").back(-1) == "b"; // accepting negative indices
str("a:b").back(-2) == ":b"; // similar to Python's `"a:b"[-2:]`
str("a:b").sub(1, -1) == ":"; // similar to Python's `"a:b"[1:-1]`
str("a:b").sub(-2, -1) == ":"; // similar to Python's `"a:b"[-2:-1]`
str("a:b").sub(-2, 1) == ""; // similar to Python's `"a:b"[-2:1]`
"a:b"_sz[{-2, -1}] == ":"; // works on views and overloads `operator[]`

假设 StringZilla 是一个仅包含头文件的库，你可以在某些翻译单元中使用完整的 API，并在其他翻译单元中逐渐过渡到更安全的受限 API。额外的好处 - 所有边界检查都是无分支的，因此它具有恒定的成本，并且不会损害你的分支预测器。

超越 C++ 标准库 - 从 Python 学习

Python 可以说是数据科学中最流行的编程语言。部分原因是其标准接口的简单性。StringZilla 将其中一些功能带到了 C++ 中。

内容检查：isalnum、isalpha、isascii、isdigit、islower、isspace、isupper。
修剪字符集：lstrip, rstrip, strip。
修剪字符串匹配：remove_prefix, remove_suffix。
搜索结果范围：splitlines, split, rsplit。
非重叠子字符串匹配的数量：count。
分割：partition, rpartition。

例如，在解析文档时，将其分割成子字符串通常很有用。通常情况下，之后你会计算跳过的部分的长度、偏移量以及剩余部分的长度。这会导致大量的指针运算，并且容易出错。 StringZilla 提供了一个方便的 partition 函数，它返回一个包含三个字符串视图的元组，从而使代码更简洁。

auto parts = haystack.partition(':'); // Matching a character
auto [before, match, after] = haystack.partition(':'); // Structure unpacking
auto [before, match, after] = haystack.partition(sz::char_set(":;")); // Character-set argument
auto [before, match, after] = haystack.partition(" : "); // String argument
auto [before, match, after] = haystack.rpartition(sz::whitespaces_set()); // Split around the last whitespace

将这些与 split 函数结合使用，可以轻松解析 CSV 文件或 HTTP 标头。

for (auto line : haystack.split("\r\n")) {
    auto [key, _, value] = line.partition(':');
    headers[key.strip()] = value.strip();
}

Python 标准库中也不存在其他一些扩展。让我们按类别浏览 C++ 功能。

分割和范围.
无分配连接字符串.
随机生成.
编辑距离和模糊搜索.

即使是 Python 的原生 str 类，也无法使用某些 StringZilla 接口。这里展示了最有用的一些功能。

text.hash(); // -> 64 bit unsigned integer 
text.ssize(); // -> 64 bit signed length to avoid `static_cast<std::ssize_t>(text.size())`
text.contains_only(" \w\t"); // == text.find_first_not_of(sz::char_set(" \w\t")) == npos;
text.contains(sz::whitespaces_set()); // == text.find(sz::char_set(sz::whitespaces_set())) != npos;

// Simpler slicing than `substr`
text.front(10); // -> sz::string_view
text.back(10); // -> sz::string_view

// Safe variants, which clamp the range into the string bounds
using sz::string::cap;
text.front(10, cap) == text.front(std::min(10, text.size()));
text.back(10, cap) == text.back(std::min(10, text.size()));

// Character set filtering
text.lstrip(sz::whitespaces_set()).rstrip(sz::newlines_set()); // like Python
text.front(sz::whitespaces_set()); // all leading whitespaces
text.back(sz::digits_set()); // all numerical symbols forming the suffix

// Incremental construction
using sz::string::unchecked;
text.push_back('x'); // no surprises here
text.push_back('x', unchecked); // no bounds checking, Rust style
text.try_push_back('x'); // returns `false` if the string is full and the allocation failed

sz::concatenate(text, "@", domain, ".", tld); // No allocations

分割和范围

最常见的用例之一是将字符串拆分为子字符串的集合。这通常会导致 StackOverflow 查找和如下所示的代码片段。

std::vector<std::string> lines = split(haystack, "\r\n"); // string delimiter
std::vector<std::string> words = split(lines, ' '); // character delimiter

这些为每个字符串和临时向量分配内存。每次分配的成本可能比在字符上进行串行 for 循环的成本高出几个数量级。为了避免这些，StringZilla 提供了延迟计算的范围，与 Range-v3 库兼容。

for (auto line : haystack.split("\r\n"))
    for (auto word : line.split(sz::char_set(" \w\t.,;:!?")))
        std::cout << word << std::endl;

其中每一个也可以反向使用。如果你想要在 xxx 中同时包含 xx，它也允许交错匹配。调试指针偏移不是一个令人愉快的练习，所以请记住以下函数。

haystack.[r]find_all(needle, interleaving)
haystack.[r]find_all(sz::char_set(""))
haystack.[r]split(needle)
haystack.[r]split(sz::char_set(""))

对于$N$匹配项，拆分函数将报告$N+1$匹配项，可能包括空字符串。范围也有一些方便的方法

range.size(); // -> std::size_t
range.empty(); // -> bool
range.template to<std::set<std::sting>>(); 
range.template to<std::vector<std::sting_view>>();

无分配连接字符串

另一个常见的字符串操作是连接。 STL 提供了 std::string::operator+ 和 std::string::append，但如果执行多次调用，它们效率不高。

std::string name, domain, tld;
auto email = name + "@" + domain + "." + tld; // 4 allocations

有效的方法是预先分配内存并将字符串复制到其中。

std::string email;
email.reserve(name.size() + domain.size() + tld.size() + 2);
email.append(name), email.append("@"), email.append(domain), email.append("."), email.append(tld);

这很繁琐且容易出错。 StringZilla 提供了一个更方便的 concatenate 函数，它接受可变数量的参数。它还重载了 operator| 以延迟连接字符串，无需任何分配。

auto email = sz::concatenate(name, "@", domain, ".", tld);   // 0 allocations
auto email = name | "@" | domain | "." | tld;                // 0 allocations
sz::string email = name | "@" | domain | "." | tld;          // 1 allocations

随机生成

软件开发人员经常需要生成随机字符串以进行测试。 STL 提供了 std::generate 和 std::random_device，可以与 StringZilla 一起使用。

sz::string random_string(std::size_t length, char const *alphabet, std::size_t cardinality) {
    sz::string result(length, '\0');
    static std::random_device seed_source; // Expensive to construct - due to system calls
    static std::mt19937 generator(seed_source()); // Also expensive - due to the state size
    std::uniform_int_distribution<std::size_t> distribution(0, cardinality);
    std::generate(result.begin(), result.end(), [&]() { return alphabet[distribution(generator)]; });
    return result;
}

繁琐且缓慢。 StringZilla 提供了一个 C 原生方法 - sz_generate 和一个方便的 C++ 包装器 - sz::generate。与 Python 类似，它还定义了常用的字符集。

auto protein = sz::string::random(300, "ARNDCQEGHILKMFPSTWYV"); // static method
auto dna = sz::basic_string<custom_allocator>::random(3_000_000_000, "ACGT");

dna.randomize("ACGT"); // `noexcept` pre-allocated version
dna.randomize(&std::rand, "ACGT"); // pass any generator, like `std::mt19937`

char uuid[36];
sz::randomize(sz::string_span(uuid, 36), "0123456789abcdef-"); // Overwrite any buffer

批量替换

在文本处理中，通常需要替换字符串中所有出现的特定子字符串或字符集。标准库函数可能无法为执行批量替换提供最有效或最方便的方法，尤其是在处理大型字符串或对性能要求很高的应用程序时。

haystack.replace_all(needle_string, replacement_string)
haystack.replace_all(sz::char_set(""), replacement_string)
haystack.try_replace_all(needle_string, replacement_string)
haystack.try_replace_all(sz::char_set(""), replacement_string)
haystack.transform(sz::look_up_table::identity())
haystack.transform(sz::look_up_table::identity(), haystack.data())

莱文斯坦编辑距离和对齐分数

莱文斯坦和汉明编辑距离为字节字符串和 UTF-8 字符串提供。后者将输出 Unicode 代码点（而不是字节）中的距离。 Needleman-Wunsch 对齐分数仅针对字节字符串定义。

// Count number of substitutions in same length strings
sz::hamming_distance(first, second[, upper_bound]) -> std::size_t;
sz::hamming_distance_utf8(first, second[, upper_bound]) -> std::size_t;

// Count number of insertions, deletions and substitutions
sz::edit_distance(first, second[, upper_bound[, allocator]]) -> std::size_t;
sz::edit_distance_utf8(first, second[, upper_bound[, allocator]]) -> std::size_t;

// Substitution-parametrized Needleman-Wunsch global alignment score
std::int8_t costs[256][256]; // Substitution costs matrix
sz::alignment_score(first, second, costs[, gap_score[, allocator]) -> std::ptrdiff_t;

C 和 C++ 中的排序

LibC 提供了 qsort，而 STL 提供了 std::sort。两者都有自己的特性。 LibC 标准无法将上下文传递给比较函数，这只能通过特定于平台的扩展来实现。这些在每个操作系统上都有不同的参数顺序。

// Linux: https://linux.die.net/man/3/qsort_r
void qsort_r(void *elements, size_t count, size_t element_width, 
    int (*compare)(void const *left, void const *right, void *context),
    void *context);
// MacOS and FreeBSD: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/qsort_r.3.html
void qsort_r(void *elements, size_t count, size_t element_width, 
    void *context,
    int (*compare)(void *context, void const *left, void const *right));
// Windows conflicts with ISO `qsort_s`: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/qsort-s?view=msvc-170
void qsort_s(id *elements, size_t count, size_t element_width, 
    int (*compare)(void *context, void const *left, void const *right),
    void *context);

C++ 通用算法也不完美。该标准不能保证 std::sort 不会分配任何内存。如果你在嵌入式设备、实时设备或每个节点上有 100 多个 CPU 核心上运行，你可能需要避免这种情况。 StringZilla 不解决一般情况，但希望提高字符串的性能。使用 sz_sort 或高级 sz::sorted_order，它可以用于排序任何可转换为 sz::string_view 的元素集合。

std::vector<std::string> data({"c", "b", "a"});
std::vector<std::size_t> order = sz::sorted_order(data); //< Simple shortcut

// Or, taking care of memory allocation:
sz::sorted_order(data.begin(), data.end(), order.data(), [](auto const &x) -> sz::string_view { return x; });

带有字符串键的标准 C++ 容器

C++ 标准模板库提供了几个关联容器，通常与字符串键一起使用。

std::map<std::string, int, std::less<std::string>> sorted_words;
std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>> words;

这些容器的性能通常受到字符串键的性能限制，尤其是在读取时。 StringZilla 可用于通过覆盖默认比较器和哈希函数来加速具有 std::string 键的容器。

std::map<std::string, int, sz::string_view_less> sorted_words;
std::unordered_map<std::string, int, sz::string_view_hash, sz::string_view_equal_to> words;

或者，更好的方法是使用 sz::string 类作为键。将自动选择正确的哈希函数和比较器，并且如果键很短，性能增益将更加明显。

std::map<sz::string, int> sorted_words;
std::unordered_map<sz::string, int> words;

编译设置和调试

SZ_DEBUG:

为了获得最大的性能，C 库在 Release 版本中不执行任何边界检查。在 C++ 中，边界检查仅发生在 STL std::string 会执行它的地方。如果你想要启用更积极的边界检查，请在包含头文件之前定义 SZ_DEBUG。如果未显式设置，将从构建类型推断出来。

SZ_USE_X86_AVX512, SZ_USE_X86_AVX2, SZ_USE_ARM_NEON:

可以显式禁用某些系列的 SIMD 指令以实现兼容性。默认值在编译时推断出来。

SZ_DYNAMIC_DISPATCH:

默认情况下，StringZilla 是一个仅标头的库。但是如果你在不同代的设备上运行，那么一次性为所有支持的代预编译库并在运行时分派是有意义的。此标志就是这样做的，用于生成 stringzilla.so 共享库以及 Python 绑定。

SZ_USE_MISALIGNED_LOADS:

默认情况下，StringZilla 避免未对齐的加载。如果支持，它会将许多字节级操作替换为字级操作。从类似 char 的类型转换为类似 uint64_t 的类型可以显着加速串行 (SWAR) 后端。因此，如果你要为某些嵌入式设备构建，请考虑启用它。

SZ_AVOID_LIBC 和 SZ_OVERRIDE_LIBC

当使用 C 仅标头库时，可以禁用 LibC 的使用。这可能会影响模糊硬件平台上的类型解析系统。此外，可以让 stringzilla 用它自己的实现覆盖常见的符号，比如 memcpy 和 memset。在这种情况下，你可以使用 LD_PRELOAD 技巧来优先考虑它的符号而不是来自 LibC 的符号，并加速现有的字符串密集型应用程序，而无需重新编译它们。它还增加了一层安全性，因为对于像 memcpy(NULL, NULL, 0) 这样的 NULL 输入，stringzilla 不是未定义的。

SZ_AVOID_STL 和 SZ_SAFETY_OVER_COMPATIBILITY

当使用 C++ 接口时，可以禁用从 std::string 到 sz::string 以及从 sz::string 到 std::string 的隐式转换。如果不需要，则会排除 <string> 和 <string_view> 标头，从而减少编译时间。此外，如果 STL 兼容性优先级较低，则可以通过禁用主观上容易出错的重载来使 API 更安全。

CMake 用户的 STRINGZILLA_BUILD_SHARED, STRINGZILLA_BUILD_TEST, STRINGZILLA_BUILD_BENCHMARK, STRINGZILLA_TARGET_ARCH

编译测试和基准测试时，可以显式设置目标硬件架构。它与 GCC 的 -march 标志同义，用于启用/禁用适当的指令集。如果不需要，也可以禁用共享库构建。

快速入门：Rust 🦀

StringZilla 可用作 Rust crate，文档可在 docs.rs/stringzilla 上获得。要在你的项目中使用最新的 crate 版本，请将以下内容添加到你的 Cargo.toml 中

[dependencies]
stringzilla = ">=3"

或者，如果你想使用来自仓库的最新预发布版本

[dependencies]
stringzilla = { git = "https://github.com/ashvardanian/stringzilla", branch = "main-dev" }

安装后，所有功能都可以通过 stringzilla 命名空间获得。许多界面对于 memchr crate 的用户来说会很熟悉。

use stringzilla::sz;

// Identical to `memchr::memmem::find` and `memchr::memmem::rfind` functions
sz::find("Hello, world!", "world") // 7
sz::rfind("Hello, world!", "world") // 7

// Generalizations of `memchr::memrchr[123]`
sz::find_char_from("Hello, world!", "world") // 2
sz::rfind_char_from("Hello, world!", "world") // 11

与 memchr 不同，stringzilla 的吞吐量在正常和反向搜索中都很高。它也没有对字符集的大小施加任何约束，而 memchr 只允许 1、2 或 3 个字符。除了全局函数之外，stringzilla 还提供了一个 StringZilla 扩展特性

use stringzilla::StringZilla;

let my_string: String = String::from("Hello, world!");
let my_str = my_string.as_str();
let my_cow_str = Cow::from(&my_string);

// Use the generic function with a String
assert_eq!(my_string.sz_find("world"), Some(7));
assert_eq!(my_string.sz_rfind("world"), Some(7));
assert_eq!(my_string.sz_find_char_from("world"), Some(2));
assert_eq!(my_string.sz_rfind_char_from("world"), Some(11));
assert_eq!(my_string.sz_find_char_not_from("world"), Some(0));
assert_eq!(my_string.sz_rfind_char_not_from("world"), Some(12));

// Same works for &str and Cow<'_, str>
assert_eq!(my_str.sz_find("world"), Some(7));
assert_eq!(my_cow_str.as_ref().sz_find("world"), Some(7));

该库还公开了字节数组和 UTF-8 字符串的莱文斯坦和汉明编辑距离，以及 Needleman-Wunch 对齐分数。

use stringzilla::sz;

// Handling arbitrary byte arrays:
sz::edit_distance("Hello, world!", "Hello, world?"); // 1
sz::hamming_distance("Hello, world!", "Hello, world?"); // 1
sz::alignment_score("Hello, world!", "Hello, world?", sz::unary_substitution_costs(), -1); // -1

// Handling UTF-8 strings:
sz::hamming_distance_utf8("αβγδ", "αγγδ") // 1
sz::edit_distance_utf8("façade", "facade") // 1

快速入门：Swift 🍏

StringZilla 可以作为 Swift Package Manager 中的依赖项添加。在你的 Package.swift 文件中，添加以下内容

dependencies: [
    .package(url: "https://github.com/ashvardanian/stringzilla")
]

该软件包目前仅涵盖最基本的功能，但计划扩展到涵盖完整的 C++ API。

var s = "Hello, world! Welcome to StringZilla. 👋"
s[s.findFirst(substring: "world")!...] // "world! Welcome to StringZilla. 👋")    
s[s.findLast(substring: "o")!...] // "o StringZilla. 👋")
s[s.findFirst(characterFrom: "aeiou")!...] // "ello, world! Welcome to StringZilla. 👋")
s[s.findLast(characterFrom: "aeiou")!...] // "a. 👋")
s[s.findFirst(characterNotFrom: "aeiou")!...] // "Hello, world! Welcome to StringZilla. 👋"
s.editDistance(from: "Hello, world!")! // 29

算法 & 设计决策 📚

StringZilla 旨在优化一些最慢的字符串操作。然而，一些流行的操作，例如相等比较和相对顺序检查，几乎总是在任一字符串的最初几个字节上完成。在此类操作中，向量化几乎毫无用处，除非考虑巨大且非常相似的字符串。 StringZilla 也实现了这些操作，但不会产生显着的加速。

精确子字符串搜索

子字符串搜索算法通常分为：基于比较、基于自动机和基于位并行。不同的系列对于不同的字母表大小和针头长度是有效的。每个字符需要的操作越多 - SIMD 就越有效。针头越长 - 跳跃表就越有效。 StringZilla 对不同的针头长度和后端使用不同的精确子字符串搜索算法

当没有 SIMD 可用时 - 在 64 位字上使用 SWAR（寄存器内的 SIMD）算法。
具有 Raita 启发式变体的 Boyer-Moore-Horspool (BMH) 算法用于较长的针头。
SIMD 算法是随机化的，以查看针头的不同部分。

在非常短的针头上，特别是 1-4 个字符长，使用 SIMD 暴力破解是最快的解决方案。在中等长度的针头上，位并行算法是有效的，因为字符掩码适合 32 位或 64 位字。无论哪种方式，如果针头小于 64 字节长，在 haystack 遍历中，我们仍然会获取每个 CPU 缓存行。因此，提高性能的唯一方法是减少比较次数。下面的代码片段显示了 StringZilla 如何为长度为 2 的针头完成此操作。

StringZilla/include/stringzilla/stringzilla.h

第 1585 到 1637 行，在 266c017

 
      /**  
    
      * @brief 两个 64 位整数之间的 2Byte 级别相等比较。  
    
      * @return 64 位整数，其中每个 2byte 中的最高位表示匹配。  
    
       */  
    
     SZ_INTERNAL sz_u64_vec_t _sz_u64_each_2byte_equal(sz_u64_vec_t a, sz_u64_vec_t b) { 
    
      sz_u64_vec_t vec; 
    
     vec.u64 = ~(a.u64 ^ b.u64); 
    
      // 如果每个 2byte 中的每个位都已设置，则匹配有效。  
    
      // 为此，取每个 2byte 的底部 15 位，将它们加一，  
    
      // 如果这会将最高位设置为 1，那么所有 15 位也都是 1。  
    
     vec.u64 = ((vec.u64 & 0x7FFF7FFF7FFF7FFFull) + 0x0001000100010001ull) & ((vec.u64 & 0x8000800080008000ull)); 
    
      return vec; 
    
      }  
    
      /**  
    
      * @brief 在任意长度的 haystack (大海捞针) 中查找 @b 两个字符的 needle (针) 的首次出现。  
    
      * 此实现使用与硬件无关的 SWAR 技术，一次处理 8 个可能的偏移量。  
    
       */  
    
     SZ_INTERNAL sz_cptr_t _sz_find_2byte_serial(sz_cptr_t h, sz_size_t h_length, sz_cptr_t n) { 
    
      // 这是一个内部方法，并且保证 haystack 的长度至少为 2 个字节。  
    
      sz_assert(h_length >= 2 && "haystack 太短了。"); 
    
      sz_cptr_t const h_end = h + h_length; 
    
     #if !SZ_USE_MISALIGNED_LOADS 
    
      // 处理未对齐的头部，以避免在未对齐的 64 位加载时出现 UB (Undefined Behavior，未定义行为)。  
    
      for (; ((sz_size_t)h & 7ull) && h + 2 <= h_end; ++h) 
    
      if ((h[0] == n[0]) + (h[1] == n[1]) == 2) return h; 
    
     #endif 
    
      sz_u64_vec_t h_even_vec, h_odd_vec, n_vec, matches_even_vec, matches_odd_vec; 
    
     n_vec.u64 = 0; 
    
     n_vec.u8s[0] = n[0], n_vec.u8s[1] = n[1]; 
    
     n_vec.u64 *= 0x0001000100010001ull; // 广播 
    
      // 此代码模拟超标量执行，一次分析 8 个偏移量。  
    
      for (; h + 9 <= h_end; h += 8) { 
    
     h_even_vec.u64 = *(sz_u64_t *)h; 
    
     h_odd_vec.u64 = (h_even_vec.u64 >> 8) | ((sz_u64_t)h[8] << 56); 
    
     matches_even_vec = _sz_u64_each_2byte_equal(h_even_vec, n_vec); 
    
     matches_odd_vec = _sz_u64_each_2byte_equal(h_odd_vec, n_vec); 
    
     matches_even_vec.u64 >>= 8; 
    
      if (matches_even_vec.u64 + matches_odd_vec.u64) { 
    
      sz_u64_t match_indicators = matches_even_vec.u64 | matches_odd_vec.u64; 
    
      return h + sz_u64_ctz(match_indicators) / 8; 
    
      }  
    
      }  
    
      for (; h + 2 <= h_end; ++h) 
    
      if ((h[0] == n[0]) + (h[1] == n[1]) == 2) return h; 
    
      return SZ_NULL; 
    
      }

更进一步，对于更长的 needle，Boyer-Moore (BM) 及其变体通常是最佳选择。它有两个表：好后缀移位和坏字符移位。常见的选择是使用简化的 BMH 算法，该算法仅使用坏字符移位表，从而减少了预处理时间。对于长度高达 256 字节的中等长度的 needle，我们也这样做。这样，堆栈分配的移位表仍然很小。

StringZilla/include/stringzilla/stringzilla.h

Lines 1774 to 1825 in 46e957c

 
      /**  
    
      * @brief 用于精确匹配长度不超过 @b 256 字节模式的 Boyer-Moore-Horspool 算法。  
    
      * 使用 Raita 启发式方法来匹配模式的前两个字符、最后一个字符和中间字符。  
    
       */  
    
     SZ_INTERNAL sz_cptr_t _sz_find_horspool_upto_256bytes_serial(sz_cptr_t h_chars, sz_size_t h_length, // 
    
      sz_cptr_t n_chars, sz_size_t n_length) { 
    
      sz_assert(n_length <= 256 && "该模式太长了。"); 
    
      // 几种流行的字符串匹配算法都使用坏字符移位表。  
    
      // Boyer Moore: https://www-igm.univ-mlv.fr/~lecroq/string/node14.html  
    
      // Quick Search: https://www-igm.univ-mlv.fr/~lecroq/string/node19.html  
    
      // Smith: https://www-igm.univ-mlv.fr/~lecroq/string/node21.html  
    
      union {  
    
      { 
    
      sz_u8_t jumps[256]; 
    
     sz_u64_vec_t vecs[64]; 
    
      } bad_shift_table;  
    
      // 让我们使用 SWAR 将表初始化为字符串的总长度。 
    
      sz_u8_t const *h = (sz_u8_t const *)h_chars; 
    
      {  
    
      sz_u8_t const *n = (sz_u8_t const *)n_chars; 
    
     sz_u64_vec_t n_length_vec; 
    
     n_length_vec.u64 = n_length; 
    
      n_length_vec.u64 *= 0x0101010101010101ull; // 广播 
    
      for (sz_size_t i = 0; i != 64; ++i) bad_shift_table.vecs[i].u64 = n_length_vec.u64; 
    
      }  
    
      for (sz_size_t i = 0; i + 1 < n_length; ++i) bad_shift_table.jumps[n[i]] = (sz_u8_t)(n_length - i - 1);  
    
      // 另一个常见的启发式方法是从字符串的不同部分匹配几个字符。  
    
      // Raita 建议使用模式的前两个字符、最后一个字符和中间字符。 
    
      sz_u32_vec_t h_vec, n_vec;  
    
      // 选择值得比较的 needle 部分。 
    
      sz_size_t offset_first, offset_mid, offset_last; 
    
      _sz_locate_needle_anomalies(n_chars, n_length, &offset_first, &offset_mid, &offset_last);  
    
     // 将这些字符广播到一个无符号整数中。 
    
     n_vec.u8s[0] = n[offset_first]; 
    
     n_vec.u8s[1] = n[offset_first + 1]; 
    
     n_vec.u8s[2] = n[offset_mid]; 
    
      n_vec.u8s[3] = n[offset_last];  
    
      // 扫描整个 haystack，跳过最后的 `n_length - 1` 个字节。 
    
     for (sz_size_t i = 0; i <= h_length - n_length;) { 
    
     h_vec.u8s[0] = h[i + offset_first]; 
    
     h_vec.u8s[1] = h[i + offset_first + 1]; 
    
     h_vec.u8s[2] = h[i + offset_mid]; 
    
      h_vec.u8s[3] = h[i + offset_last]; 
    
     if (h_vec.u32 == n_vec.u32 && sz_equal((sz_cptr_t)h + i, n_chars, n_length)) return (sz_cptr_t)h + i; 
    
      }  
    
      return SZ_NULL; 
    
      }

在 C++ 标准库中，std::string::find 函数使用带有 Raita 启发式的 BMH 算法。在比较整个字符串之前，它会匹配第一个字符、最后一个字符和中间字符。非常实用，但对于重复字符可能很慢。 StringZilla 的 SWAR 和 SIMD 后端都有一个廉价的预处理步骤，我们可以在其中定位唯一字符。这使得该库在处理非英语语料库时更加实用。

StringZilla/include/stringzilla/stringzilla.h

Lines 1398 to 1431 in 46e957c

 
      * @brief 选择搜索 needle 中最有趣的字符的偏移量。  
    
       *  
    
      * 如果我们匹配错误的字符，搜索吞吐量可能会显着下降。  
    
      * 假设 needle 是 "aXaYa"，并且我们正在比较第一个字符、第二个字符和最后一个字符。  
    
      * 如果我们使用 SIMD 并且一次比较许多偏移量，那么在每个寄存器中与 "a" 进行比较是一种浪费。  
    
       *  
    
      * 同样，在处理 UTF8 输入时，我们知道每个字符代码的较低位携带更多信息。  
    
      * 例如，西里尔字母的大写 [А, Я] 的代码范围在 [0x0410, 0x042F] 中，  
    
      * 小写 [а, я] 的代码范围在 [0x0430, 0x044F] 中。扫描用俄语编写的文本，一半的  
    
      * 字节将绝对不携带任何值，并且等于 0x04。  
    
       */  
    
     SZ_INTERNAL void _sz_locate_needle_anomalies(sz_cptr_t start, sz_size_t length, // 
    
      sz_size_t *first, sz_size_t *second, sz_size_t *third) { 
    
     *first = 0; 
    
     *second = length / 2; 
    
     *third = length - 1; 
    
      //  
    
      int has_duplicates = // 
    
     start[*first] == start[*second] || // 
    
     start[*first] == start[*third] || // 
    
     start[*second] == start[*third]; 
    
      // 循环遍历字母以找到非冲突的变体。  
    
      if (length > 3 && has_duplicates) { 
    
      // 将中点向左旋转，直到找到与第一个字符不同的字符。  
    
      for (; start[*second] == start[*first] && *second; --(*second)) {} 
    
      // 将中点向右旋转，直到找到与第一个字符不同的字符。  
    
      for (; start[*second] == start[*first] && *second + 1 < *third; ++(*second)) {} 
    
      // 将第三个（最后一个）点向左旋转，直到找到不同的字符。  
    
      for (; (start[*third] == start[*second] || start[*third] == start[*first]) && *third > (*second + 1); 
    
     --(*third)) {} 
    
      }  
    
      }

所有这些仍然具有$O(hn)$最坏情况的复杂度。要保证$O(h)$最坏情况的时间复杂度，Apostolico-Giancarlo (AG) 算法添加了一个额外的跳跃表。预处理阶段是$O(n+sigma)$在时间和空间上。在遍历时，执行从$(h/n)$至$(3h/2)$比较。然而，它在现代 CPU 上并不实用。如果必须找到许多匹配项，那么更简单的想法，Galil 规则可能是一个更相关的优化。

先前考虑过且已弃用的其他算法

用于较长 needle 的 Apostolico-Giancarlo 算法。控制流太复杂，无法进行有效的向量化。
用于短 needle 的基于 Shift-Or 的 Bitap 算法。比 SWAR 慢。
SIMD 后端中基于 Horspool 风格的坏字符检查。仅对非常长的 needle 以及 needle 和 haystack 之间非常不均匀的字符分布有效。需要更快的“字符在集合中”检查才能推广。

§ 阅读材料。Java 中的精确字符串匹配算法。用于子字符串搜索的 SIMD 友好的算法。

莱文斯坦编辑距离

莱文斯坦距离是字符串最著名的编辑距离，用于检查将一个字符串转换为另一个字符串需要多少次插入、删除和替换。它广泛用于近似字符串匹配、拼写检查和生物信息学。

莱文斯坦距离的计算成本是$O(n * m)$，其中$n$和$m$是字符串参数的长度。要计算这个值，最简单的方法是需要$O(n * m)$用于存储“莱文斯坦矩阵”的空间，该矩阵的右下角将包含莱文斯坦距离。该算法对矩阵的计算由苏联数学家 Vladimir Levenshtein (1965年)、Taras Vintsyuk (1968年) 以及美国计算机科学家 Robert Wagner、David Sankoff、Michael J. Fischer 在随后的几年中同时研究/发现。已知有几种优化方法。

空间优化: 该矩阵可以在$O(min(n,m))$空间内计算，只需存储矩阵的最后两行即可。
分治法: 可以应用 Hirschberg 算法将计算分解为子任务。
自动机: 如果其中一个字符串不发生变化，并且需要进行多次比较，那么莱文斯坦自动机可能有效。
Shift-Or: 位并行算法将矩阵转置为位矩阵，并对其执行按位操作。

最后一种方法非常强大且高效，并被出色的 RapidFuzz 库使用。它不如其他方法那样广为人知，它源自 Baeza-Yates-Gonnet 算法，由 Manber 和 Wu 在 1990 年代扩展到有界编辑距离搜索，并由 Gene Myers 在 1999 年和 Heikki Hyyro 在 2002 年至 2004 年间进一步扩展。

StringZilla 引入了一种不同的方法，广泛用于 Unum 的内部组合优化库中。该方法不会改变琐碎操作的数量，而是以不同的顺序执行它们，从而消除计算插入成本时发生的数据依赖性。这使得内核内并行性和单个请求的潜在多核评估的向量化效果更好。

下一个设计目标

将快速遍历推广到矩形矩阵。
将 x86 AVX-512 解决方案移植到 Arm NEON。

§ 阅读材料。使用 SIMD 友好的遍历顺序实现更快的莱文斯坦距离。

用于生物信息学的 Needleman-Wunsch 比对得分

生物信息学领域研究生物结构的各种表示形式。“主要”表示形式通常是稀疏字母表上的字符串

DNA 序列，其中字母表为 {A, C, G, T}，范围从短读的约 100 个字符到人类基因组的 30 亿个字符。
RNA 序列，其中字母表为 {A, C, G, U}，范围从 tRNA 的约 50 个字符到 mRNA 的数千个字符。
蛋白质，其中字母表由 22 个氨基酸组成，范围从二肽的 2 个字符到 Titin（最长的蛋白质）的 35,000 个字符。

表示形式越短，研究人员可能越想使用自定义替换矩阵。这意味着两个字符之间的替换成本对于所有对可能并不相同。

StringZilla 采用相当高效的两行 Wagner-Fisher 算法作为 Needleman-Wunsch 得分的基线串行实现。它支持多达 256 个字符的任意字母表，并且可以与 BLOSUM、PAM 或其他替换矩阵一起使用。它还使用 SIMD 进行替换查找的硬件加速。然而，这尚未打破插入成本的数据依赖性，其中 80% 的时间被浪费。解决这个问题后，SIMD 实现将比串行实现快 5 倍。

内存复制、填充和移动

关于计算机花费在复制内存上的时间以及该操作在 LibC 中的实现方式，已经有很多文章。有趣的是，该操作仍然可以改进，因为大多数汇编实现都使用了过时的指令。即使是面向性能的 STL 替代品，例如 Meta 的 Folly v2024.09.23 也专注于 AVX2，而没有利用 AVX-512 或 SVE 中的新掩码指令。

在 AVX-512 中，StringZilla 使用非临时存储来避免在处理非常大的字符串时造成缓存污染。此外，它分别处理 target 缓冲区的未对齐头部和尾部，确保大副本中的写入始终与缓存行边界对齐。这对于 AVX2 和 AVX-512 后端都是如此。

StringZilla 还包含更智能但效率较低的算法的“草案”，这些算法可以最大限度地减少未对齐加载的数量，执行混洗和排列。这是一个未来研究的课题，因为性能增益尚未令人满意。

§ 阅读材料。Nadav Rotem 的 memset 基准测试。Sergey Slotin 的缓存关联性。

随机生成

从不同的字母表生成随机字符串是一种非常常见的操作。StringZilla 接受任意的伪随机数生成器来产生噪声，并接受一个字符数组来进行采样。对采样进行了优化，以避免整数除法，这在现代 CPU 上是一种代价高昂的操作。为此，使用了一个 768 字节长的查找表来执行 2 次查找、1 次乘法、2 次移位和 2 次累加。

StringZilla/include/stringzilla/stringzilla.h

266c017 中的第 2490 到 2533 行

 
      /**  
    
      * @brief 使用两个小型查找表（总共 768 字节）来加速除以一个小型  
    
      * 无符号整数。执行两次查找、一次乘法、两次移位和两次累加。  
    
       *  
    
      * @param divisor 大于 1 的整数值。  
    
      * @param number 要除的整数值。  
    
       */  
    
     SZ_INTERNAL sz_u8_t sz_u8_divide(sz_u8_t number, sz_u8_t divisor) { 
    
      static sz_u16_t const multipliers[256] = { 
    
      0, 0, 0, 21846, 0, 39322, 21846, 9363, 0, 50973, 39322, 29790, 21846, 15124, 9363, 4370,  
    
      0, 57826, 50973, 44841, 39322, 34329, 29790, 25645, 21846, 18351, 15124, 12137, 9363, 6780, 4370, 2115,  
    
      0, 61565, 57826, 54302, 50973, 47824, 44841, 42011, 39322, 36765, 34329, 32006, 29790, 27671, 25645, 23705,  
    
      21846, 20063, 18351, 16706, 15124, 13602, 12137, 10725, 9363, 8049, 6780, 5554, 4370, 3224, 2115, 1041,  
    
      0, 63520, 61565, 59668, 57826, 56039, 54302, 52614, 50973, 49377, 47824, 46313, 44841, 43407, 42011, 40649,  
    
      39322, 38028, 36765, 35532, 34329, 33154, 32006, 30885, 29790, 28719, 27671, 26647, 25645, 24665, 23705, 22766,  
    
      21846, 20945, 20063, 19198, 18351, 17520, 16706, 15907, 15124, 14356, 13602, 12863, 12137, 11424, 10725, 10038,  
    
      9363, 8700, 8049, 7409, 6780, 6162, 5554, 4957, 4370, 3792, 3224, 2665, 2115, 1573, 1041, 517,  
    
      0, 64520, 63520, 62535, 61565, 60609, 59668, 58740, 57826, 56926, 56039, 55164, 54302, 53452, 52614, 51788,  
    
      50973, 50169, 49377, 48595, 47824, 47063, 46313, 45572, 44841, 44120, 43407, 42705, 42011, 41326, 40649, 39982,  
    
      39322, 38671, 38028, 37392, 36765, 36145, 35532, 34927, 34329, 33738, 33154, 32577, 32006, 31443, 30885, 30334,  
    
      29790, 29251, 28719, 28192, 27671, 27156, 26647, 26143, 25645, 25152, 24665, 24182, 23705, 23233, 22766, 22303,  
    
      21846, 21393, 20945, 20502, 20063, 19628, 19198, 18772, 18351, 17933, 17520, 17111, 16706, 16305, 15907, 15514,  
    
      15124, 14738, 14356, 13977, 13602, 13231, 12863, 12498, 12137, 11779, 11424, 11073, 10725, 10380, 10038, 9699,  
    
      9363, 9030, 8700, 8373, 8049, 7727, 7409, 7093, 6780, 6470, 6162, 5857, 5554, 5254, 4957, 4662,  
    
      4370, 4080, 3792, 3507, 3224, 2943, 2665, 2388, 2115, 1843, 1573, 1306, 1041, 778, 517, 258,  
    
      };  
    
      // 可以使用一次加法并计算尾随零来避免此表。  
    
      static sz_u8_t const shifts[256] = { 
    
      0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, //  
    
      4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, //  
    
      5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, //  
    
      6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, //  
    
      6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //  
    
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //  
    
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //  
    
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //  
    
      };  
    
      sz_u32_t multiplier = multipliers[divisor]; 
    
      sz_u8_t shift = shifts[divisor]; 
    
      sz_u16_t q = (sz_u16_t)((multiplier * number) >> 16); 
    
      sz_u16_t t = ((number - q) >> 1) + q; 
    
      return (sz_u8_t)(t >> shift); 
    
      }

排序

对于字符串的词典排序，StringZilla 使用“混合-混合”方法，其中$O(n * log(n))$和。

基数排序，用于将第一个字节导出到连续缓冲区以实现局部性。
对部分排序的块进行 IntroSort，以平衡效率和最坏情况下的性能。
1. IntroSort 从快速排序开始。
2. 如果递归深度超过某个阈值，它将切换到堆排序。

正在开发一种更好的算法。请查看 #173 以获取设计目标和进度更新。

哈希

警告

哈希函数在密码学上不安全，并且目前正在积极开发中。它们可能会在未来的次要版本中发生变化。

为您的应用程序选择正确的哈希算法对于性能和安全性都至关重要。在 StringZilla 中，一个 64 位滚动哈希函数被重复用于字符串哈希和子字符串哈希，即 Rabin 风格的指纹。滚动哈希计算具有不同窗口大小的哈希所需的时间相同，并且更新速度很快。然而，这些不是完美的哈希，并且冲突很频繁。StringZilla 尝试使用 SIMD，但性能尚未令人满意。在 Intel Sapphire Rapids 上，对于 N 路并行变体，可以预期以下数字。

4 路 AVX2 吞吐量，采用 64 位整数乘法（无原生支持）：0.28 GB/s。
4 路 AVX2 吞吐量，采用 32 位整数乘法：0.54 GB/s。
4 路 AVX-512DQ 吞吐量，采用 64 位整数乘法：0.46 GB/s。
4 路 AVX-512 吞吐量，采用 32 位整数乘法：0.58 GB/s。
8 路 AVX-512 吞吐量，采用 32 位整数乘法：0.11 GB/s。

下一个设计目标

尝试 gear-hash 和其他滚动方法。

为什么不用 CRC32？

循环冗余校验 32 是计算机科学中最常用的哈希函数之一。它在 x86 和 Arm 上都具有硬件支持，支持 8 位、16 位、32 位和 64 位字。0x1EDC6F41 多项式用于 iSCSI、Btrfs、ext4，而 0x04C11DB7 用于 SATA、以太网、Zlib、PNG。对于 Arm，支持多个多项式。然而，它对于大数据用例来说有些限制，这些用例通常必须处理超过 40 亿个字符串，从而使冲突不可避免。此外，现有的 SIMD 方法很棘手，将通用计算与专用指令相结合，以在每个周期中利用更多的硅。

§ 阅读材料。方法的全面推导 x86 上 4 KB 缓冲区的更快计算比较不同的查找表优秀的开源实现。Peter Cawley 的实现 Stephan Brumme 的实现

其他现代替代方案

Austin Appleby 在 2008 年推出的 MurmurHash 是最著名的非加密哈希之一。它具有非常短的实现，并且能够生成 32 位和 128 位哈希。Google 在 2011 年推出的 CityHash 和 xxHash 在此基础上进行了改进，更好地利用了现代 CPU 的超标量特性，并生成 64 位和 128 位哈希。

与 MD5、SHA 和 BLAKE 算法不同，这些函数都不是加密的。大多数加密哈希都基于 Merkle-Damgård 结构，并且不能抵抗长度扩展攻击。当前最先进的技术可能是 BLAKE3 算法。它对各种攻击具有抵抗力，可以每 CPU 周期处理 2 个字节，并且为 C 和 Rust 提供了非常优化的官方实现。它具有与 BLAKE2 相同的 128 位安全级别，并通过减少混合轮数和以 1 KiB 块处理数据来获得其性能提升，这对于较长的字符串来说非常有用，但可能会导致短字符串的性能不佳。

所有提到的库都经过了广泛的测试，并且被认为是生产就绪的。它们肯定可以加速您的应用程序，但下游混合器也可能如此。例如，当构造哈希表时，哈希会进一步缩小以寻址表桶。如果混合器失去熵，则哈希函数带来的性能提升可能会丢失。一个例子是二次幂模，这是一种常见的混合器，但已知它很弱。一种替代方法是 Daniel Lemire 的 fastrange。另一种是使用黄金比例的斐波那契哈希技巧，StringZilla 中也使用了它。

Unicode、UTF-8 和宽字符

大多数 StringZilla 操作都是字节级别的，因此它可以直接处理 ASCII 和 UTF8 内容。在某些情况下，例如编辑距离计算，字节级别评估和字符级别评估的结果可能会有所不同。因此，StringZilla 提供了以下函数来处理 Unicode：

sz_edit_distance_utf8 - 计算两个 UTF-8 字符串之间的莱文斯坦距离。
sz_hamming_distance_utf8 - 计算两个 UTF-8 字符串之间的汉明距离。

然而，Java、JavaScript、Python 2、C# 和 Objective-C 使用宽字符 (wchar) - 两个字节长的代码，而不是更合理的固定长度 UTF32 或可变长度 UTF8。这导致了各种偏移量计数问题，尤其是在处理四字节长的 Unicode 字符时。因此，如果您来自这些环境，请考虑使用 simdutf 进行转码。

贡献 👾

请查看贡献指南，了解如何设置开发环境并为本项目做出贡献的更多详细信息。如果您喜欢这个项目，您可能还会喜欢 USearch、UCall、UForm 和 SimSIMD。🤗

如果您喜欢字符串并重视效率，您可能还会喜欢以下项目：

simdutf - 转码 UTF8、UTF16 以及 UTF32 LE 和 BE。
hyperscan - 使用 SIMD 加速的正则表达式。
pyahocorasick - Python 中的 Aho-Corasick 算法。
rapidfuzz - C++ 和 Python 中的快速字符串匹配。

如果您正在寻找有关此主题的更多阅读材料，请考虑以下内容：

许可证 📜

您可以根据自己的喜好，在 Apache 2.0 或 Three-clause BSD 许可证下自由使用该项目。

	/**
	* @brief 两个 64 位整数之间的 2Byte 级别相等比较。
	* @return 64 位整数，其中每个 2byte 中的最高位表示匹配。
	*/
	SZ_INTERNAL sz_u64_vec_t _sz_u64_each_2byte_equal(sz_u64_vec_t a, sz_u64_vec_t b) {
	sz_u64_vec_t vec;
	vec.u64 = ~(a.u64 ^ b.u64);
	// 如果每个 2byte 中的每个位都已设置，则匹配有效。
	// 为此，取每个 2byte 的底部 15 位，将它们加一，
	// 如果这会将最高位设置为 1，那么所有 15 位也都是 1。
	vec.u64 = ((vec.u64 & 0x7FFF7FFF7FFF7FFFull) + 0x0001000100010001ull) & ((vec.u64 & 0x8000800080008000ull));
	return vec;
	}

	/**
	* @brief 在任意长度的 haystack (大海捞针) 中查找 @b 两个字符的 needle (针) 的首次出现。
	* 此实现使用与硬件无关的 SWAR 技术，一次处理 8 个可能的偏移量。
	*/
	SZ_INTERNAL sz_cptr_t _sz_find_2byte_serial(sz_cptr_t h, sz_size_t h_length, sz_cptr_t n) {

	// 这是一个内部方法，并且保证 haystack 的长度至少为 2 个字节。
	sz_assert(h_length >= 2 && "haystack 太短了。");
	sz_cptr_t const h_end = h + h_length;

	#if !SZ_USE_MISALIGNED_LOADS
	// 处理未对齐的头部，以避免在未对齐的 64 位加载时出现 UB (Undefined Behavior，未定义行为)。
	for (; ((sz_size_t)h & 7ull) && h + 2 <= h_end; ++h)
	if ((h[0] == n[0]) + (h[1] == n[1]) == 2) return h;
	#endif

	sz_u64_vec_t h_even_vec, h_odd_vec, n_vec, matches_even_vec, matches_odd_vec;
	n_vec.u64 = 0;
	n_vec.u8s[0] = n[0], n_vec.u8s[1] = n[1];
	n_vec.u64 *= 0x0001000100010001ull; // 广播

	// 此代码模拟超标量执行，一次分析 8 个偏移量。
	for (; h + 9 <= h_end; h += 8) {
	h_even_vec.u64 = (sz_u64_t )h;
	h_odd_vec.u64 = (h_even_vec.u64 >> 8) \| ((sz_u64_t)h[8] << 56);
	matches_even_vec = _sz_u64_each_2byte_equal(h_even_vec, n_vec);
	matches_odd_vec = _sz_u64_each_2byte_equal(h_odd_vec, n_vec);

	matches_even_vec.u64 >>= 8;
	if (matches_even_vec.u64 + matches_odd_vec.u64) {
	sz_u64_t match_indicators = matches_even_vec.u64 \| matches_odd_vec.u64;
	return h + sz_u64_ctz(match_indicators) / 8;
	}
	}

	for (; h + 2 <= h_end; ++h)
	if ((h[0] == n[0]) + (h[1] == n[1]) == 2) return h;
	return SZ_NULL;
	}

	/**
	* @brief 用于精确匹配长度不超过 @b 256 字节模式的 Boyer-Moore-Horspool 算法。
	* 使用 Raita 启发式方法来匹配模式的前两个字符、最后一个字符和中间字符。
	*/
	SZ_INTERNAL sz_cptr_t _sz_find_horspool_upto_256bytes_serial(sz_cptr_t h_chars, sz_size_t h_length, //
	sz_cptr_t n_chars, sz_size_t n_length) {
	sz_assert(n_length <= 256 && "该模式太长了。");
	// 几种流行的字符串匹配算法都使用坏字符移位表。
	// Boyer Moore: https://www-igm.univ-mlv.fr/~lecroq/string/node14.html
	// Quick Search: https://www-igm.univ-mlv.fr/~lecroq/string/node19.html
	// Smith: https://www-igm.univ-mlv.fr/~lecroq/string/node21.html
	union {
	{
	sz_u8_t jumps[256];
	sz_u64_vec_t vecs[64];

	} bad_shift_table;
	// 让我们使用 SWAR 将表初始化为字符串的总长度。
	sz_u8_t const h = (sz_u8_t const )h_chars;
	{
	sz_u8_t const n = (sz_u8_t const )n_chars;
	sz_u64_vec_t n_length_vec;
	n_length_vec.u64 = n_length;
	n_length_vec.u64 *= 0x0101010101010101ull; // 广播
	for (sz_size_t i = 0; i != 64; ++i) bad_shift_table.vecs[i].u64 = n_length_vec.u64;
	}

	for (sz_size_t i = 0; i + 1 < n_length; ++i) bad_shift_table.jumps[n[i]] = (sz_u8_t)(n_length - i - 1);
	// 另一个常见的启发式方法是从字符串的不同部分匹配几个字符。
	// Raita 建议使用模式的前两个字符、最后一个字符和中间字符。

	sz_u32_vec_t h_vec, n_vec;
	// 选择值得比较的 needle 部分。
	sz_size_t offset_first, offset_mid, offset_last;

	_sz_locate_needle_anomalies(n_chars, n_length, &offset_first, &offset_mid, &offset_last);
	// 将这些字符广播到一个无符号整数中。
	n_vec.u8s[0] = n[offset_first];
	n_vec.u8s[1] = n[offset_first + 1];
	n_vec.u8s[2] = n[offset_mid];

	n_vec.u8s[3] = n[offset_last];
	// 扫描整个 haystack，跳过最后的 `n_length - 1` 个字节。
	for (sz_size_t i = 0; i <= h_length - n_length;) {
	h_vec.u8s[0] = h[i + offset_first];
	h_vec.u8s[1] = h[i + offset_first + 1];
	h_vec.u8s[2] = h[i + offset_mid];
	h_vec.u8s[3] = h[i + offset_last];
	if (h_vec.u32 == n_vec.u32 && sz_equal((sz_cptr_t)h + i, n_chars, n_length)) return (sz_cptr_t)h + i;
	}
	return SZ_NULL;
	}

	* @brief 选择搜索 needle 中最有趣的字符的偏移量。
	*
	* 如果我们匹配错误的字符，搜索吞吐量可能会显着下降。
	* 假设 needle 是 "aXaYa"，并且我们正在比较第一个字符、第二个字符和最后一个字符。
	* 如果我们使用 SIMD 并且一次比较许多偏移量，那么在每个寄存器中与 "a" 进行比较是一种浪费。
	*
	* 同样，在处理 UTF8 输入时，我们知道每个字符代码的较低位携带更多信息。
	* 例如，西里尔字母的大写 [А, Я] 的代码范围在 [0x0410, 0x042F] 中，
	* 小写 [а, я] 的代码范围在 [0x0430, 0x044F] 中。扫描用俄语编写的文本，一半的
	* 字节将绝对不携带任何值，并且等于 0x04。
	*/
	SZ_INTERNAL void _sz_locate_needle_anomalies(sz_cptr_t start, sz_size_t length, //
	sz_size_t first, sz_size_t second, sz_size_t *third) {
	*first = 0;
	*second = length / 2;
	*third = length - 1;

	//
	int has_duplicates = //
	start[first] == start[second] \|\| //
	start[first] == start[third] \|\| //
	start[second] == start[third];

	// 循环遍历字母以找到非冲突的变体。
	if (length > 3 && has_duplicates) {
	// 将中点向左旋转，直到找到与第一个字符不同的字符。
	for (; start[second] == start[first] && second; --(second)) {}
	// 将中点向右旋转，直到找到与第一个字符不同的字符。
	for (; start[second] == start[first] && second + 1 < third; ++(*second)) {}
	// 将第三个（最后一个）点向左旋转，直到找到不同的字符。
	for (; (start[third] == start[second] \|\| start[third] == start[first]) && third > (second + 1);
	--(*third)) {}
	}
	}

	/**
	* @brief 使用两个小型查找表（总共 768 字节）来加速除以一个小型
	* 无符号整数。执行两次查找、一次乘法、两次移位和两次累加。
	*
	* @param divisor 大于 1 的整数值。
	* @param number 要除的整数值。
	*/
	SZ_INTERNAL sz_u8_t sz_u8_divide(sz_u8_t number, sz_u8_t divisor) {
	static sz_u16_t const multipliers[256] = {
	0, 0, 0, 21846, 0, 39322, 21846, 9363, 0, 50973, 39322, 29790, 21846, 15124, 9363, 4370,
	0, 57826, 50973, 44841, 39322, 34329, 29790, 25645, 21846, 18351, 15124, 12137, 9363, 6780, 4370, 2115,
	0, 61565, 57826, 54302, 50973, 47824, 44841, 42011, 39322, 36765, 34329, 32006, 29790, 27671, 25645, 23705,
	21846, 20063, 18351, 16706, 15124, 13602, 12137, 10725, 9363, 8049, 6780, 5554, 4370, 3224, 2115, 1041,
	0, 63520, 61565, 59668, 57826, 56039, 54302, 52614, 50973, 49377, 47824, 46313, 44841, 43407, 42011, 40649,
	39322, 38028, 36765, 35532, 34329, 33154, 32006, 30885, 29790, 28719, 27671, 26647, 25645, 24665, 23705, 22766,
	21846, 20945, 20063, 19198, 18351, 17520, 16706, 15907, 15124, 14356, 13602, 12863, 12137, 11424, 10725, 10038,
	9363, 8700, 8049, 7409, 6780, 6162, 5554, 4957, 4370, 3792, 3224, 2665, 2115, 1573, 1041, 517,
	0, 64520, 63520, 62535, 61565, 60609, 59668, 58740, 57826, 56926, 56039, 55164, 54302, 53452, 52614, 51788,
	50973, 50169, 49377, 48595, 47824, 47063, 46313, 45572, 44841, 44120, 43407, 42705, 42011, 41326, 40649, 39982,
	39322, 38671, 38028, 37392, 36765, 36145, 35532, 34927, 34329, 33738, 33154, 32577, 32006, 31443, 30885, 30334,
	29790, 29251, 28719, 28192, 27671, 27156, 26647, 26143, 25645, 25152, 24665, 24182, 23705, 23233, 22766, 22303,
	21846, 21393, 20945, 20502, 20063, 19628, 19198, 18772, 18351, 17933, 17520, 17111, 16706, 16305, 15907, 15514,
	15124, 14738, 14356, 13977, 13602, 13231, 12863, 12498, 12137, 11779, 11424, 11073, 10725, 10380, 10038, 9699,
	9363, 9030, 8700, 8373, 8049, 7727, 7409, 7093, 6780, 6470, 6162, 5857, 5554, 5254, 4957, 4662,
	4370, 4080, 3792, 3507, 3224, 2943, 2665, 2388, 2115, 1843, 1573, 1306, 1041, 778, 517, 258,
	};
	// 可以使用一次加法并计算尾随零来避免此表。
	static sz_u8_t const shifts[256] = {
	0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, //
	4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, //
	5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, //
	6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, //
	6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //
	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //
	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //
	7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //
	};
	sz_u32_t multiplier = multipliers[divisor];
	sz_u8_t shift = shifts[divisor];

	sz_u16_t q = (sz_u16_t)((multiplier * number) >> 16);
	sz_u16_t t = ((number - q) >> 1) + q;
	return (sz_u8_t)(t >> shift);
	}