SymSpell 的 Swift 实现:拼写校正 & 模糊搜索:通过对称删除拼写校正算法,速度提升 100 万倍
描述来自 https://github.com/wolfgarbe/SymSpell/
对称删除拼写校正算法降低了给定 Damerau-Levenshtein 距离的编辑候选生成和字典查找的复杂性。它比标准方法(使用删除 + 转置 + 替换 + 插入)快六个数量级,并且与语言无关。
与其他算法相反,仅需要删除操作,不需要转置 + 替换 + 插入。输入词的转置 + 替换 + 插入被转换为字典词的删除操作。替换和插入的成本很高,并且与语言相关:例如,中文有 70,000 个 Unicode 汉字!
速度来自廉价的仅删除编辑候选生成和预计算。一个平均 5 个字母的单词在最大编辑距离为 3 的范围内,大约有 300 万个可能的拼写错误,但 SymSpell 只需要生成 25 个删除操作即可覆盖所有错误,无论是在预计算时还是在查找时。神奇!
Lookup 提供了非常快速的单字拼写校正。
支持复合词感知的多字输入字符串的自动拼写校正。
lookup()
假定每个输入字符串都是单个词。lookupCompound()
还支持复合词拆分/分解,包括三种情况:
拆分错误、连接错误、替换错误、转置错误、删除错误和插入错误可以混合在同一个词中。
大型文档集合使手动校正不可行,需要无监督的、完全自动的拼写校正。在传统的单个词的拼写校正中,用户会看到多个拼写校正建议。对于长篇多字文本的自动拼写校正,算法本身必须做出明智的选择。
- whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixthgrade and ins pired him
+ where is the love he had dated for much of the past who couldn't read in sixth grade and inspired him (9 edits)
- in te dhird qarter oflast jear he hadlearned ofca sekretplan
+ in the third quarter of last year he had learned of a secret plan (9 edits)
- the bigjest playrs in te strogsommer film slatew ith plety of funn
+ the biggest players in the strong summer film slate with plenty of fun (9 edits)
- Can yu readthis messa ge despite thehorible sppelingmsitakes
+ can you read this message despite the horrible spelling mistakes (9 edits)
WordSegmentation 通过在适当的位置插入缺失的空格,将字符串分割成单词。
- thequickbrownfoxjumpsoverthelazydog
+ the quick brown fox jumps over the lazy dog
- itwasabrightcolddayinaprilandtheclockswerestrikingthirteen
+ it was a bright cold day in april and the clocks were striking thirteen
- itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness
+ it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
当前的实现基于原始的 SymSpell,但使用 Swift 最佳实践和现代范例来实现相同的结果,甚至具有更好的性能。
let symSpell = SymSpell(maxDictionaryEditDistance: 2, prefixLength: 3)
if let path = Bundle.main.url(forResource: "frequency_dictionary_en_82_765", withExtension: "txt") {
try? symSpell.loadDictionary(from: path, termIndex: 0, countIndex: 1, termCount: 82765)
}
let results = symSpell.lookup("intermedaite")
print(results.first?.term)